Optimal Binning for Numerical Variables using Unsupervised Decision Trees
This function implements an optimal binning algorithm for numerical variables using an Unsupervised Decision Tree (UDT) approach with Weight of Evidence (WoE) and Information Value (IV) criteria. The algorithm creates bins that maximize the predictive power of the feature while maintaining interpretability.
Usage
optimal_binning_numerical_udt(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  laplace_smoothing = 0.5,
  monotonicity_direction = "none",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency of observations in each bin as a proportion (default: 0.05).
- max_n_prebins
Maximum number of pre-bins for initial discretization (default: 20).
- laplace_smoothing
Smoothing parameter for WoE calculation to handle zero counts (default: 0.5).
- monotonicity_direction
Specify monotonicity constraint: "none", "increasing", "decreasing", or "auto" (default: "none").
- convergence_threshold
Threshold for convergence of the optimization process (default: 1e-6).
- max_iterations
Maximum number of iterations for the optimization process (default: 1000).
Value
A list containing binning details:
- id
A numeric vector of bin identifiers.
- bin
A character vector of bin intervals.
- woe
A numeric vector of Weight of Evidence values for each bin.
- iv
A numeric vector of Information Value for each bin.
- event_rate
A numeric vector of event rates (proportion of positives) for each bin.
- count
An integer vector of total observations in each bin.
- count_pos
An integer vector of positive observations in each bin.
- count_neg
An integer vector of negative observations in each bin.
- cutpoints
A numeric vector of cut points between bins (used in the scoring sketch after this list).
- total_iv
The total Information Value of the binning.
- gini
The Gini coefficient measuring discrimination power.
- ks
The Kolmogorov-Smirnov statistic measuring separation.
- converged
A logical value indicating whether the algorithm converged.
- iterations
An integer value of the number of iterations run.
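The returned cutpoints and woe vectors can be used to score new observations. A minimal sketch, assuming result is the list returned by the function and new_feature is hypothetical new data (neither is defined by the package itself):

# Minimal scoring sketch (illustrative, not part of the package):
# map new values to bins via the returned interior cut points, then to WoE.
# With m bins there are m - 1 interior cut points, so findInterval()
# returns 0..(m - 1) and adding 1L yields a valid bin index 1..m.
new_feature <- rnorm(100)                                  # hypothetical new data
bin_index <- findInterval(new_feature, result$cutpoints) + 1L
scored_woe <- result$woe[bin_index]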
Details
The Unsupervised Decision Tree (UDT) binning algorithm discretizes a continuous variable into bins that maximize the Information Value (IV) while respecting constraints on the number and size of bins.
The algorithm follows these main steps (a simplified sketch of the first two steps follows the list):

1. Initial discretization using an entropy-based decision tree approach.
2. Merging of rare bins based on the bin_cutoff parameter.
3. Bin optimization using IV and WoE criteria.
4. Optional enforcement of monotonicity in WoE across bins.
5. Adjustment of the number of bins to fall within the specified range.
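For intuition, here is a minimal R sketch of steps 1 and 2, with quantile-based pre-bins standing in for the entropy-based tree splits; prebin_and_merge is a hypothetical helper, not the package's internal implementation:

# Illustrative sketch of steps 1-2; quantile pre-bins stand in for the
# entropy-based tree used by the actual algorithm.
prebin_and_merge <- function(feature, max_n_prebins = 20L, bin_cutoff = 0.05) {
  # Step 1 (simplified): candidate cut points from quantiles
  probs <- seq(0, 1, length.out = max_n_prebins + 1L)
  cuts <- unique(quantile(feature, probs = probs, na.rm = TRUE))
  bins <- cut(feature, breaks = cuts, include.lowest = TRUE)
  # Step 2: repeatedly merge the rarest bin into a neighbour until every
  # bin holds at least bin_cutoff of the observations
  repeat {
    freq <- table(bins) / length(feature)
    # stop at two bins even if one is still rare
    if (all(freq >= bin_cutoff) || length(freq) <= 2L) break
    rare <- which.min(freq)
    # drop an interior boundary of the rarest bin (never an outer edge)
    drop_idx <- min(rare + 1L, length(cuts) - 1L)
    cuts <- cuts[-drop_idx]
    bins <- cut(feature, breaks = cuts, include.lowest = TRUE)
  }
  list(cutpoints = cuts, bins = bins)
}

Steps 3-5 then search over merges of these pre-bins, maximizing total IV under the monotonicity and bin-count constraints formalized below.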
The mathematical formulation of the optimization problem is:
$$ \max_{\{c_1, c_2, \ldots, c_{m-1}\}} \sum_{i=1}^{m} (p_i - q_i) \cdot \ln\left(\frac{p_i + \epsilon}{q_i + \epsilon}\right) $$

Subject to:

- \(min\_bins \leq m \leq max\_bins\)
- \(\frac{n_i}{n} \geq bin\_cutoff\) for all \(i\)
- Optionally, \(WoE_1 \leq WoE_2 \leq \ldots \leq WoE_m\) (for increasing monotonicity)

Where:

- \(p_i = \frac{n_{i,1}}{n_1}\) is the proportion of positive observations in bin \(i\)
- \(q_i = \frac{n_{i,0}}{n_0}\) is the proportion of negative observations in bin \(i\)
- \(\epsilon\) is the Laplace smoothing parameter
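As a concrete illustration, the objective can be evaluated directly from bin counts. A worked computation in R with illustrative counts, applying the smoothing exactly as written in the formula above:

count_pos <- c(30, 80, 150)      # n_{i,1}: positives in each bin (illustrative)
count_neg <- c(270, 220, 250)    # n_{i,0}: negatives in each bin (illustrative)
eps <- 0.5                       # laplace_smoothing

p <- count_pos / sum(count_pos)  # p_i: share of all positives falling in bin i
q <- count_neg / sum(count_neg)  # q_i: share of all negatives falling in bin i

woe <- log((p + eps) / (q + eps))  # per-bin Weight of Evidence
iv_terms <- (p - q) * woe          # per-bin IV contributions
total_iv <- sum(iv_terms)          # the quantity being maximized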
The algorithm includes special handling for missing values (NA/NaN) and extreme values (±Inf), as well as proper treatment of variables with very few unique values.
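How exactly these special values are routed is implementation-specific; a hedged sketch of one common pattern (an assumption, not necessarily this function's internal behavior) is to set aside non-finite values before binning:

# Hypothetical pre-processing: keep only finite values for the core
# binning and count the rest, which could form a separate special bin.
split_finite <- function(feature, target) {
  finite <- is.finite(feature)   # FALSE for NA, NaN, -Inf and Inf
  list(
    feature_finite = feature[finite],
    target_finite  = target[finite],
    n_special      = sum(!finite)
  )
}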
References
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and Its Applications. SIAM.
Siddiqi, N. (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. Wiley.
Examples
if (FALSE) { # \dontrun{
# Generate sample data
set.seed(123)
n <- 10000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 * feature))
# Apply optimal binning
result <- optimal_binning_numerical_udt(
  target, feature,
  min_bins = 3,
  max_bins = 5,
  monotonicity_direction = "auto",
  laplace_smoothing = 0.5
)
# View binning results
print(result)
} # }