Optimal Binning for Numerical Variables with Fisher’s Exact Test
optimal_binning_numerical_fetb.Rd
Implements a supervised, monotonic, optimal binning procedure for numeric predictors against a binary target. The algorithm iteratively merges the pair of adjacent bins whose class composition is most similar according to the two‑tailed Fisher’s Exact Test, and guarantees a monotone Weight of Evidence (WoE) profile. It is designed for scorecard development, churn modelling and any logistic‑regression application that requires robust, information‑preserving discretisation.
Usage
optimal_binning_numerical_fetb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
max_n_prebins = 20L,
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer (0/1) vector, length \(N\).
- feature
Numeric vector, length \(N\).
- min_bins
Minimum number of final bins (default 3).
- max_bins
Maximum number of final bins (default 5).
- max_n_prebins
Maximum number of pre‑bins created before optimisation (default 20).
- convergence_threshold
Absolute tolerance for the change in total IV, used as the convergence criterion (default 1e-6).
- max_iterations
Safety cap on the number of merge and monotonicity iterations (default 1000).
Value
A named list:
- id
Bin index (1‑based).
- bin
Character vector of interval labels of the form "(lo; hi]".
- woe, iv
WoE and IV per bin.
- count, count_pos, count_neg
Bin frequencies.
- cutpoints
Numeric vector of internal cut‑points \(c_1,\dots,c_{B-1}\).
- converged
Logical flag.
- iterations
Number of iterations executed.
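As an illustration of how the returned elements might be used, the sketch below maps new observations to bins and WoE values. It assumes the right‑closed interval convention "(lo; hi]" described above; the cut‑points and WoE values are copied from the example output on this page, and the variable names (x_new, bin_id, x_woe) are hypothetical. It relies only on base R and is not a helper provided by the package.

```r
# Values copied from the example output on this page; in practice they
# would come from res$cutpoints and res$woe.
cutpoints <- c(31.89857, 41.22994, 49.62642, 64.09529)
woe <- c(0.49626425, 0.09325460, 0.03041461, -0.05407541, -0.29115469)

# Hypothetical new observations (the second sits exactly on a cut-point).
x_new <- c(25, 31.89857, 45, 70)

# With left.open = TRUE, findInterval() returns i such that
# cutpoints[i] < x <= cutpoints[i + 1], i.e. right-closed bins.
# Adding 1 converts this to the 1-based bin index.
bin_id <- findInterval(x_new, cutpoints, left.open = TRUE) + 1L

# Replace each observation by the WoE of its bin.
x_woe <- woe[bin_id]

print(data.frame(x_new, bin_id, x_woe))
```

A value equal to a cut‑point falls into the lower bin, which is consistent with the "(lo; hi]" labels.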
Details
Notation
\((x_i,\,y_i),\; i=1,\dots,N\) are observations with
\(y_i\in\{0,1\}\). A cut‑point vector
\(c=(c_0=-\infty < c_1 < \dots < c_{B-1} < c_B=+\infty)\)
induces bins \(I_b=(c_{b-1},c_b],\; b=1,\dots,B\). For each bin collect
contingency counts
$$(a_b,b_b)=\Bigl(\sum_{x_i\in I_b}y_i,\;\sum_{x_i\in I_b}(1-y_i)\Bigr).$$
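A minimal base‑R sketch of this notation, using hypothetical toy data and cut‑points, shows how the per‑bin counts \((a_b, b_b)\) can be tabulated:

```r
# Hypothetical toy data; `cuts` plays the role of c_1, ..., c_{B-1}.
x <- c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7, 3.9, 2.9)
y <- c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L)
cuts <- c(2.5, 4.0)

# Right-closed intervals (c_{b-1}, c_b], with c_0 = -Inf and c_B = +Inf.
bin <- cut(x, breaks = c(-Inf, cuts, Inf), right = TRUE)

a <- tapply(y, bin, sum)        # positives per bin: a_b
b <- tapply(1L - y, bin, sum)   # negatives per bin: b_b

counts <- data.frame(bin = levels(bin),
                     a = as.integer(a),
                     b = as.integer(b))
print(counts)
```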
Algorithm
Pre‑binning. Create up to max_n_prebins equal‑frequency bins from the ordered feature. This bounds the cost of all subsequent steps.
Fisher merge loop. While \(B>\) max_bins, merge the adjacent pair \((I_j,I_{j+1})\) that maximises the point probability of its 2×2 table, \(p_j = P\{ \text{table }(a_j,b_j,a_{j+1},b_{j+1})\}\).
Monotonicity. After every merge, check the WoE sequence \(w_1,\dots,w_B\). If it violates monotonicity (\(\exists\,b:\,w_b>w_{b+1}\) for an ascending trend, or \(w_b<w_{b+1}\) for a descending one), merge the offending pair and restart the check locally.
Convergence. Stop when \(|IV_{t+1}-IV_t|<\) convergence_threshold or the iteration cap max_iterations is reached.
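The steps above can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the helper names (merge_at, woe_of, fetb_sketch) are hypothetical, the point probability is taken to be the hypergeometric probability of the observed 2×2 table (computed with dhyper), the trend direction is assumed to be given by the first and last WoE values, the monotonicity check runs once after the merge loop rather than after every merge, and the IV convergence test and any zero‑count handling are omitted.

```r
# Collapse rows j and j+1 of a counts matrix with columns a (pos), b (neg).
merge_at <- function(tab, j) {
  tab[j, ] <- tab[j, ] + tab[j + 1L, ]
  tab[-(j + 1L), , drop = FALSE]
}

# Unsmoothed WoE per bin (assumes no bin has a zero count).
woe_of <- function(tab) {
  log((tab[, "a"] / sum(tab[, "a"])) / (tab[, "b"] / sum(tab[, "b"])))
}

fetb_sketch <- function(a, b, max_bins = 5L, min_bins = 2L) {
  tab <- cbind(a = a, b = b)
  # Fisher merge loop: merge the adjacent pair whose 2x2 table has the
  # highest hypergeometric point probability.
  while (nrow(tab) > max_bins) {
    j <- seq_len(nrow(tab) - 1L)
    p <- dhyper(tab[j, "a"],                      # positives in bin j
                tab[j, "a"] + tab[j + 1L, "a"],   # positives in the pair
                tab[j, "b"] + tab[j + 1L, "b"],   # negatives in the pair
                tab[j, "a"] + tab[j, "b"])        # size of bin j
    tab <- merge_at(tab, which.max(p))
  }
  # Monotonicity: merge the first offending pair until none remains.
  while (nrow(tab) > min_bins) {
    w <- woe_of(tab)
    ascending <- w[length(w)] >= w[1L]            # assumed trend rule
    bad <- if (ascending) which(diff(w) < 0) else which(diff(w) > 0)
    if (length(bad) == 0L) break
    tab <- merge_at(tab, bad[1L])
  }
  tab
}

# Deterministic toy data: the positive rate rises from 20% to 70%.
x <- 1:600
y <- as.integer((x %% 10) < ceiling(x / 100) + 1)

# Pre-binning: 12 equal-frequency bins from the empirical quantiles.
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 13)))
breaks[c(1L, length(breaks))] <- c(-Inf, Inf)
pre <- cut(x, breaks, right = TRUE)
a0 <- as.integer(tapply(y, pre, sum))
b0 <- as.integer(tapply(1L - y, pre, sum))

final <- fetb_sketch(a0, b0, max_bins = 4L)
print(final)
print(woe_of(final))
```

Because merging two adjacent bins pools their counts, the totals of positives and negatives are preserved at every step; only the number of bins changes.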
Complexity
Pre‑binning: \(O(N\log N)\) (sort) but done once.
Merge loop: worst‑case \(O(B^2)\) with \(B\le\) max_n_prebins.
Memory: \(O(B)\).
Formulae
$$ \mathrm{WoE}_b = \log\!\left(
\frac{a_b / T_1}{\,b_b / T_0}\right)\!, \qquad
\mathrm{IV} = \sum_{b=1}^{B}
\left(\frac{a_b}{T_1}-\frac{b_b}{T_0}\right)\mathrm{WoE}_b$$
where \(T_1=\sum_b a_b\) and \(T_0=\sum_b b_b\) are the total numbers of positives and negatives.
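As a worked check of these formulae, the sketch below evaluates them in base R on the count_pos and count_neg vectors printed in the example output on this page. It applies the formulae exactly as written, so the results may differ somewhat from the printed woe and iv values if the implementation applies smoothing or other adjustments not described here.

```r
# Counts copied from the example output on this page.
a <- c(17, 53, 89, 132, 20)     # count_pos, a_b
b <- c(23, 107, 191, 308, 60)   # count_neg, b_b

T1 <- sum(a)                    # total positives
T0 <- sum(b)                    # total negatives

woe <- log((a / T1) / (b / T0))          # WoE_b
iv_bin <- (a / T1 - b / T0) * woe        # per-bin IV contribution
iv_total <- sum(iv_bin)                  # total IV

print(round(woe, 4))
print(round(iv_bin, 4))
print(iv_total)
```

Each per‑bin contribution is non‑negative because the two factors always share the same sign.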
References
Fisher, R. A. (1922) On the interpretation of \(\chi^2\) from contingency
tables, and the calculation of P. Journal of the Royal Statistical Society, 85 (1), 87‑94.
Siddiqi, N. (2012) Credit Risk Scorecards. Wiley.
Navas‑Palencia, G. (2019) optbinning documentation – Numerical FETB.
Hand, D. J., & Adams, N. M. (2015) Supervised Classification in
High Dimensions. Springer (Ch. 4, discretisation).
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013)
Applied Logistic Regression (3rd ed.). Wiley.
Examples
# \donttest{
set.seed(2025)
N <- 1000
y <- rbinom(N, 1, 0.3) # 30 % positives
x <- rnorm(N, mean = 50, sd = 10) # numeric predictor
res <- optimal_binning_numerical_fetb(y, x,
  min_bins = 2, max_bins = 6, max_n_prebins = 25)
print(res)
#> $id
#> [1] 1 2 3 4 5
#>
#> $bin
#> [1] "(-inf; 31.8986]" "(31.8986; 41.2299]" "(41.2299; 49.6264]"
#> [4] "(49.6264; 64.0953]" "(64.0953; inf]"
#>
#> $woe
#> [1] 0.49626425 0.09325460 0.03041461 -0.05407541 -0.29115469
#>
#> $iv
#> [1] 0.0108370609 0.0014169813 0.0002601131 0.0012699106 0.0064345803
#>
#> $count
#> [1] 40 160 280 440 80
#>
#> $count_pos
#> [1] 17 53 89 132 20
#>
#> $count_neg
#> [1] 23 107 191 308 60
#>
#> $cutpoints
#> [1] 31.89857 41.22994 49.62642 64.09529
#>
#> $converged
#> [1] TRUE
#>
#> $iterations
#> [1] 0
#>
# }