Optimal Binning for Numerical Variables with Fisher’s Exact Test: optimal_binning_numerical_fetb

Implements a supervised, monotonic, optimal binning procedure for a numeric predictor against a binary target. Starting from equal‑frequency pre‑bins, the algorithm iteratively merges the pair of adjacent bins whose class composition is most similar according to the two‑tailed Fisher’s Exact Test, and enforces a monotone Weight of Evidence (WoE) profile. It is intended for scorecard development, churn modelling and other logistic‑regression applications that need robust, information‑preserving discretisation.

Usage

optimal_binning_numerical_fetb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target: Integer (0/1) vector, length $N$.
feature: Numeric vector, length $N$.
min_bins: Minimum number of final bins (default 3).
max_bins: Maximum number of final bins (default 5).
max_n_prebins: Maximum number of pre‑bins created before optimisation (default 20).
convergence_threshold: Absolute tolerance on the change in total IV between successive iterations, used as the convergence criterion (default 1e-6).
max_iterations: Upper limit on the combined number of merge and monotonicity iterations (default 1000).

Value

A named list:

id: Bin index (1‑based).
bin: Character vector of interval labels of the form "(lo; hi]".
woe, iv: WoE and IV per bin.
count, count_pos, count_neg: Bin frequencies.
cutpoints: Numeric vector of internal cut‑points $c_1,\dots,c_{B-1}$.
converged: Logical flag indicating whether the convergence criterion was met before max_iterations was reached.
iterations: Number of iterations executed.
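
As an illustration of how the returned elements fit together, the following sketch assigns new observations to bins and looks up their WoE. It assumes the right‑closed interval convention "(lo; hi]" described above. The list res here is only a stand‑in built from the values printed in Examples, and new_x is hypothetical.

```r
# Stand-in for a fitted result; values copied from the printed output in Examples
res <- list(
  cutpoints = c(31.89857, 41.22994, 49.62642, 64.09529),
  woe = c(0.49626425, 0.09325460, 0.03041461, -0.05407541, -0.29115469)
)

new_x <- c(25, 41.22994, 50, 70)   # hypothetical new observations

# left.open = TRUE yields right-closed intervals (lo; hi];
# adding 1 converts the 0-based interval count to a 1-based bin id
bin_id  <- findInterval(new_x, res$cutpoints, left.open = TRUE) + 1L
new_woe <- res$woe[bin_id]

print(data.frame(x = new_x, id = bin_id, woe = new_woe))
```

A value exactly equal to a cut‑point falls into the lower bin, which is why 41.22994 is assigned to bin 2 in this sketch.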

Details

Notation
$(x_i,\,y_i),\; i=1,\dots,N$ are observations with $y_i\in\{0,1\}$. A cut‑point vector $c=(c_0=-\infty < c_1 < \dots < c_{B-1} < c_B=+\infty)$ induces bins $I_b=(c_{b-1},c_b],\; b=1,\dots,B$. For each bin, collect the counts of positives $a_b$ and negatives $b_b$: $$(a_b,b_b)=\Bigl(\sum_{x_i\in I_b}y_i,\;\sum_{x_i\in I_b}(1-y_i)\Bigr).$$
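
The notation can be made concrete with a few lines of base R. This is a minimal sketch on hypothetical toy data (x, y and cuts are invented for illustration); it only shows how the per‑bin counts follow from a cut‑point vector.

```r
# Hypothetical toy data; `cuts` holds the internal cut-points c_1, ..., c_{B-1}
x    <- c(1.2, 3.4, 5.0, 5.0, 7.7, 9.1, 10.3, 12.8)
y    <- c(0L,  1L,  0L,  1L,  0L,  1L,  1L,   1L)
cuts <- c(5, 9.1)

# right = TRUE builds the right-closed intervals (c_{b-1}, c_b]
bin <- cut(x, breaks = c(-Inf, cuts, Inf), right = TRUE)

a_b <- as.vector(tapply(y, bin, sum))        # positives per bin
b_b <- as.vector(tapply(1L - y, bin, sum))   # negatives per bin

print(data.frame(bin = levels(bin), a = a_b, b = b_b))
```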

Algorithm

Pre‑binning. Create up to max_n_prebins equal‑frequency bins from the ordered feature. Because every later merge removes one bin, this caps the number of merges at max_n_prebins − 1.
Fisher merge loop. While $B>$ max_bins, merge the adjacent pair $(I_j,I_{j+1})$ that maximises the point probability $p_j$ of its 2×2 table $(a_j,b_j,a_{j+1},b_{j+1})$. With $n_j=a_j+b_j$, this is the hypergeometric probability $$p_j=\frac{\binom{a_j+a_{j+1}}{a_j}\binom{b_j+b_{j+1}}{b_j}}{\binom{n_j+n_{j+1}}{n_j}}.$$
Monotonicity. After every merge, check the WoE sequence $w_1,\dots,w_B$. If it violates the target trend (for an ascending trend, some $b$ with $w_b>w_{b+1}$; for a descending trend, some $b$ with $w_b<w_{b+1}$), merge the offending pair and repeat the check locally.
Convergence. Stop when $|IV_{t+1}-IV_t|<$ convergence_threshold, where $IV_t$ is the total IV after iteration $t$, or when max_iterations is reached.
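
The steps above can be sketched in base R. This is an illustrative approximation, not the package's implementation. The helper names (woe_of, merge_pair, fetb_sketch), the 0.5 continuity correction and the rule used to pick the trend direction are all assumptions. The IV convergence check and the iteration cap are omitted for brevity.

```r
# Sketch only: all helper names below are hypothetical
woe_of <- function(pos, neg, eps = 0.5) {
  # eps is an assumed continuity correction that avoids log(0) in sparse bins
  log(((pos + eps) / sum(pos + eps)) / ((neg + eps) / sum(neg + eps)))
}

merge_pair <- function(tab, j) {
  # Collapse rows j and j + 1 of a B x 2 count matrix (columns: pos, neg)
  tab[j, ] <- tab[j, ] + tab[j + 1L, ]
  tab[-(j + 1L), , drop = FALSE]
}

fetb_sketch <- function(y, x, min_bins = 3L, max_bins = 5L, max_n_prebins = 20L) {
  # 1. Pre-binning: equal-frequency cut-points from empirical quantiles
  probs <- seq(0, 1, length.out = max_n_prebins + 1L)
  cuts  <- unique(quantile(x, probs[-c(1L, length(probs))], names = FALSE))
  bin   <- cut(x, breaks = c(-Inf, cuts, Inf), right = TRUE)
  tab   <- cbind(pos = as.vector(tapply(y, bin, sum)),
                 neg = as.vector(tapply(1 - y, bin, sum)))

  # 2. Fisher merge loop: merge the adjacent pair whose 2 x 2 table has the
  #    largest hypergeometric point probability
  while (nrow(tab) > max_bins) {
    j <- seq_len(nrow(tab) - 1L)
    p <- dhyper(tab[j, "pos"],
                tab[j, "pos"] + tab[j + 1L, "pos"],   # positives in the pair
                tab[j, "neg"] + tab[j + 1L, "neg"],   # negatives in the pair
                tab[j, "pos"] + tab[j, "neg"])        # size of the left bin
    k    <- which.max(p)
    tab  <- merge_pair(tab, k)
    cuts <- cuts[-k]   # cut-point k separated bins k and k + 1
  }

  # 3. Monotonicity: merge offending neighbours until the WoE sequence is
  #    monotone or only min_bins bins remain
  repeat {
    w <- woe_of(tab[, "pos"], tab[, "neg"])
    d <- diff(w)
    # assumed direction rule: sign of the overall WoE change
    bad <- if (w[length(w)] >= w[1L]) which(d < 0) else which(d > 0)
    if (length(bad) == 0L || nrow(tab) <= min_bins) break
    tab  <- merge_pair(tab, bad[1L])
    cuts <- cuts[-bad[1L]]
  }

  list(cutpoints = cuts, count_pos = unname(tab[, "pos"]),
       count_neg = unname(tab[, "neg"]), woe = unname(w))
}

# Usage on simulated data where the event rate increases with x
set.seed(1)
x   <- rnorm(500)
y   <- rbinom(500, 1, plogis(x))
out <- fetb_sketch(y, x, max_n_prebins = 10L)
str(out)
```

Using dhyper() follows the point‑probability wording of the merge step. A variant based on the two‑tailed p‑value would call fisher.test() on each 2×2 table instead.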

Complexity

Pre‑binning: $O(N\log N)$, dominated by a single sort of the feature.
Merge loop: worst‑case $O(B^2)$ with $B\le$ max_n_prebins.
Memory: $O(B)$.

Formulae
$$ \mathrm{WoE}_b = \log\left( \frac{a_b / T_1}{b_b / T_0}\right), \qquad \mathrm{IV} = \sum_{b=1}^{B} \left(\frac{a_b}{T_1}-\frac{b_b}{T_0}\right)\mathrm{WoE}_b$$ where $T_1=\sum_b a_b$ and $T_0=\sum_b b_b$ are the total numbers of positives and negatives.
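
As a worked check of these formulae, the sketch below applies them to the per‑bin counts printed in Examples. The plain formula gives about 0.493 for the first bin, whereas the printed value is 0.496. The small gap may come from a smoothing adjustment inside the implementation, but that is a guess and is not documented here.

```r
# Counts copied from the printed output in Examples
count_pos <- c(17, 53, 89, 132, 20)
count_neg <- c(23, 107, 191, 308, 60)

T1 <- sum(count_pos)   # total positives
T0 <- sum(count_neg)   # total negatives

p_pos <- count_pos / T1
p_neg <- count_neg / T0

woe  <- log(p_pos / p_neg)        # WoE per bin, unsmoothed
iv_b <- (p_pos - p_neg) * woe     # per-bin IV contributions
iv   <- sum(iv_b)                 # total IV

print(round(woe, 3))
print(iv)
```

Each term of the IV sum is non‑negative, because the difference of the two shares and the log of their ratio always have the same sign.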

References

Fisher, R. A. (1922) On the interpretation of $\chi^2$ from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85 (1), 87‑94.
Siddiqi, N. (2012) Credit Risk Scorecards. Wiley.
Navas‑Palencia, G. (2019) optbinning documentation – Numerical FETB.
Hand, D. J., & Adams, N. M. (2015) Supervised Classification in High Dimensions. Springer (Ch. 4, discretisation).
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013) Applied Logistic Regression (3rd ed.). Wiley.

Author

Lopes, J. E.

Examples

# \donttest{
set.seed(2025)
N  <- 1000
y  <- rbinom(N, 1, 0.3)             # 30 % positives
x  <- rnorm(N, mean = 50, sd = 10)  # numeric predictor
res <- optimal_binning_numerical_fetb(y, x,
        min_bins = 2, max_bins = 6, max_n_prebins = 25)
print(res)
#> $id
#> [1] 1 2 3 4 5
#> 
#> $bin
#> [1] "(-inf; 31.8986]"    "(31.8986; 41.2299]" "(41.2299; 49.6264]"
#> [4] "(49.6264; 64.0953]" "(64.0953; inf]"    
#> 
#> $woe
#> [1]  0.49626425  0.09325460  0.03041461 -0.05407541 -0.29115469
#> 
#> $iv
#> [1] 0.0108370609 0.0014169813 0.0002601131 0.0012699106 0.0064345803
#> 
#> $count
#> [1]  40 160 280 440  80
#> 
#> $count_pos
#> [1]  17  53  89 132  20
#> 
#> $count_neg
#> [1]  23 107 191 308  60
#> 
#> $cutpoints
#> [1] 31.89857 41.22994 49.62642 64.09529
#> 
#> $converged
#> [1] TRUE
#> 
#> $iterations
#> [1] 0
#> 
# }