Implements a supervised, monotonic optimal binning procedure for numeric predictors against a binary target. The algorithm iteratively merges the pair of adjacent bins whose class composition is most similar according to the two‑tailed Fisher’s Exact Test, and guarantees a monotone Weight of Evidence (WoE) profile. Designed for scorecard development, churn modelling and other logistic‑regression applications that require robust, information‑preserving discretisation.

Usage

optimal_binning_numerical_fetb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

Integer (0/1) vector, length \(N\).

feature

Numeric vector, length \(N\).

min_bins

Minimum number of final bins (default 3).

max_bins

Maximum number of final bins (default 5).

max_n_prebins

Maximum number of pre‑bins created before optimisation (default 20).

convergence_threshold

Absolute tolerance for change in total IV used as convergence criterion (default 1e-6).

max_iterations

Safety cap for merge + monotonicity iterations (default 1000).

Value

A named list:

id

Bin index (1‑based).

bin

Character vector "(lo; hi]" describing intervals.

woe, iv

WoE and IV per bin.

count, count_pos, count_neg

Bin frequencies.

cutpoints

Numeric vector of internal cut‑points \(c_1,\dots,c_{B-1}\).

converged

Logical flag.

iterations

Number of iterations executed.
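
The returned cutpoints can be reused to bin new observations. The following is a minimal sketch, not part of the package: it copies the cutpoints and woe values printed in the Examples section, and x_new is a hypothetical vector introduced only for illustration. Because the bins are right‑closed, base R's cut() is called with right = TRUE.

```r
# Values copied from the printed example output (rounded as displayed).
cutpoints <- c(31.89857, 41.22994, 49.62642, 64.09529)
woe       <- c(0.49626425, 0.09325460, 0.03041461, -0.05407541, -0.29115469)

# Hypothetical new observations to score.
x_new <- c(25, 31.89857, 45, 64.1, 80)

# Right-closed intervals (c_{b-1}, c_b], padded with -Inf and +Inf.
bin_id <- cut(x_new, breaks = c(-Inf, cutpoints, Inf),
              right = TRUE, labels = FALSE)
bin_id            # 1 1 3 5 5

# Replace each observation by the WoE of its bin.
x_woe <- woe[bin_id]
```

A value exactly equal to a cutpoint falls into the lower bin, which matches the "(lo; hi]" labels.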

Details

Notation
\((x_i,\,y_i),\; i=1,\dots,N\) are observations with \(y_i\in\{0,1\}\). A cut‑point vector \(c=(c_0=-\infty < c_1 < \dots < c_{B-1} < c_B=+\infty)\) induces bins \(I_b=(c_{b-1},c_b],\; b=1,\dots,B\). For each bin collect contingency counts $$(a_b,b_b)=\Bigl(\sum_{x_i\in I_b}y_i,\;\sum_{x_i\in I_b}(1-y_i)\Bigr).$$
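
As an illustration of this notation, the sketch below tabulates \((a_b, b_b)\) for a given cut‑point vector in base R. The data and the cut‑points in cuts are hypothetical, and this is not the package's internal code.

```r
set.seed(1)                               # hypothetical data
x <- rnorm(200, mean = 50, sd = 10)       # numeric feature
y <- rbinom(200, 1, 0.3)                  # binary target

cuts <- c(40, 50, 60)                     # hypothetical internal cut-points c_1..c_{B-1}

# Bins I_b = (c_{b-1}, c_b] with c_0 = -Inf and c_B = +Inf.
bin <- cut(x, breaks = c(-Inf, cuts, Inf), right = TRUE)

# Cross-tabulate bin membership against the target.
tab <- table(bin, y)
a_b <- tab[, "1"]                         # positives per bin
b_b <- tab[, "0"]                         # negatives per bin
counts <- cbind(a = a_b, b = b_b)
counts
```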

Algorithm

  1. Pre‑binning. Create up to max_n_prebins equal‑frequency bins from the ordered feature. This bounds subsequent complexity.

  2. Fisher merge loop. While \(B>\)max_bins, merge the adjacent pair \((I_j,I_{j+1})\) that maximises the point probability \(p_j = P\{ \text{table }(a_j,b_j,a_{j+1},b_{j+1})\}\) of the 2×2 table formed by the two bins' counts, i.e. the pair whose class compositions are most similar.

  3. Monotonicity. After every merge, if the WoE sequence \(w_1,\dots,w_B\) violates monotonicity (\(\exists\,b:\,w_b>w_{b+1}\) for an ascending trend, or \(w_b<w_{b+1}\) for a descending one), merge the offending pair and restart the check locally.

  4. Convergence. Stop when \(|IV_{t+1}-IV_t|<\)convergence_threshold or the iteration cap is reached.
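
Step 2 can be sketched in a few lines of base R. This is a simplified illustration under stated assumptions, not the package's implementation: the pre‑bin counts a and b are hypothetical, pair_prob() and merge_once() are helper names invented here, and the point probability of each 2×2 table is taken as the hypergeometric probability given by dhyper(). The monotonicity and convergence steps are left out.

```r
# Hypothetical pre-bin counts for six ordered bins.
a <- c(12, 15, 14, 30, 33, 50)   # positives per bin
b <- c(88, 85, 86, 70, 67, 50)   # negatives per bin

# Point probability of the 2x2 table built from adjacent bins j and j + 1:
# hypergeometric probability of a[j] positives given the table margins.
pair_prob <- function(a, b, j) {
  dhyper(a[j],
         m = a[j] + a[j + 1],    # positives in the pair
         n = b[j] + b[j + 1],    # negatives in the pair
         k = a[j] + b[j])        # size of bin j
}

# Merge the adjacent pair with the largest point probability.
merge_once <- function(a, b) {
  p <- vapply(seq_len(length(a) - 1),
              function(j) pair_prob(a, b, j), numeric(1))
  j <- which.max(p)
  a[j] <- a[j] + a[j + 1]
  b[j] <- b[j] + b[j + 1]
  list(a = a[-(j + 1)], b = b[-(j + 1)])
}

max_bins <- 4
while (length(a) > max_bins) {
  step <- merge_once(a, b)
  a <- step$a
  b <- step$b
}
rbind(a, b)
```

Each merge only adds counts together, so the totals of positives and negatives are unchanged while the number of bins falls to max_bins.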

Complexity

  • Pre‑binning: \(O(N\log N)\) for the sort, performed once.

  • Merge loop: worst‑case \(O(B^2)\) with \(B\le\)max_n_prebins.

  • Memory: \(O(B)\).

Formulae
$$ \mathrm{WoE}_b = \log\!\left( \frac{a_b / T_1}{\,b_b / T_0}\right)\!, \qquad \mathrm{IV} = \sum_{b=1}^{B} \left(\frac{a_b}{T_1}-\frac{b_b}{T_0}\right)\mathrm{WoE}_b$$ where \(T_1=\sum_b a_b\) and \(T_0=\sum_b b_b\).
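
The formulae translate directly into vectorised R. The sketch below uses hypothetical bin counts pos and neg (standing in for \(a_b\) and \(b_b\)) and applies the definitions as written, with no smoothing or zero‑count handling, so values reported by the function itself may differ if the implementation adds such adjustments.

```r
# Hypothetical per-bin counts for four ordered bins.
pos <- c(30, 45, 60, 90)      # a_b: positives per bin
neg <- c(170, 155, 140, 110)  # b_b: negatives per bin

T1 <- sum(pos)                # total positives
T0 <- sum(neg)                # total negatives

dist_pos <- pos / T1          # share of all positives falling in each bin
dist_neg <- neg / T0          # share of all negatives falling in each bin

woe      <- log(dist_pos / dist_neg)       # WoE per bin
iv       <- (dist_pos - dist_neg) * woe    # IV contribution per bin
iv_total <- sum(iv)                        # total IV

# Monotonicity check on the WoE sequence (ascending or descending).
is_monotone <- all(diff(woe) > 0) || all(diff(woe) < 0)

data.frame(pos, neg, woe, iv)
```

Each IV term is non‑negative because the difference of the two shares and the logarithm of their ratio always have the same sign.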

References

Fisher, R. A. (1922) On the interpretation of \(\chi^2\) from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85 (1), 87‑94.
Siddiqi, N. (2012) Credit Risk Scorecards. Wiley.
Navas‑Palencia, G. (2019) optbinning documentation – Numerical FETB.
Hand, D. J., & Adams, N. M. (2015) Supervised Classification in High Dimensions. Springer (Ch. 4, discretisation).
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013) Applied Logistic Regression (3rd ed.). Wiley.

Author

Lopes, J. E.

Examples

# \donttest{
set.seed(2025)
N  <- 1000
y  <- rbinom(N, 1, 0.3)             # 30 % positives
x  <- rnorm(N, mean = 50, sd = 10)  # numeric predictor
res <- optimal_binning_numerical_fetb(y, x,
        min_bins = 2, max_bins = 6, max_n_prebins = 25)
print(res)
#> $id
#> [1] 1 2 3 4 5
#> 
#> $bin
#> [1] "(-inf; 31.8986]"    "(31.8986; 41.2299]" "(41.2299; 49.6264]"
#> [4] "(49.6264; 64.0953]" "(64.0953; inf]"    
#> 
#> $woe
#> [1]  0.49626425  0.09325460  0.03041461 -0.05407541 -0.29115469
#> 
#> $iv
#> [1] 0.0108370609 0.0014169813 0.0002601131 0.0012699106 0.0064345803
#> 
#> $count
#> [1]  40 160 280 440  80
#> 
#> $count_pos
#> [1]  17  53  89 132  20
#> 
#> $count_neg
#> [1]  23 107 191 308  60
#> 
#> $cutpoints
#> [1] 31.89857 41.22994 49.62642 64.09529
#> 
#> $converged
#> [1] TRUE
#> 
#> $iterations
#> [1] 0
#> 
# }