Optimal Binning for Numerical Variables using Unsupervised Decision Trees
This function implements an optimal binning algorithm for numerical variables using an Unsupervised Decision Tree (UDT) approach with Weight of Evidence (WoE) and Information Value (IV) criteria. The algorithm creates bins that maximize the predictive power of the feature while maintaining interpretability.
Usage
optimal_binning_numerical_udt(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  laplace_smoothing = 0.5,
  monotonicity_direction = "none",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency of observations in each bin as a proportion (default: 0.05).
- max_n_prebins
Maximum number of pre-bins for initial discretization (default: 20).
- laplace_smoothing
Smoothing parameter for WoE calculation to handle zero counts (default: 0.5).
- monotonicity_direction
Specify monotonicity constraint: "none", "increasing", "decreasing", or "auto" (default: "none").
- convergence_threshold
Threshold for convergence of the optimization process (default: 1e-6).
- max_iterations
Maximum number of iterations for the optimization process (default: 1000).
Value
A list containing binning details:
- id
A numeric vector of bin identifiers.
- bin
A character vector of bin intervals.
- woe
A numeric vector of Weight of Evidence values for each bin.
- iv
A numeric vector of Information Value for each bin.
- event_rate
A numeric vector of event rates (proportion of positives) for each bin.
- count
An integer vector of total observations in each bin.
- count_pos
An integer vector of positive observations in each bin.
- count_neg
An integer vector of negative observations in each bin.
- cutpoints
A numeric vector of cut points between bins (used in the scoring sketch after this list).
- total_iv
The total Information Value of the binning.
- gini
The Gini coefficient measuring discrimination power.
- ks
The Kolmogorov-Smirnov statistic measuring separation.
- converged
A logical value indicating whether the algorithm converged.
- iterations
An integer value of the number of iterations run.
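The returned cutpoints and woe vectors can be used to score new observations. A minimal sketch, assuming result is the list returned by the function and new_feature is hypothetical new data (neither is defined by the package itself):

# Minimal scoring sketch (illustrative, not part of the package):
# map new values to bins via the returned interior cut points, then to WoE.
# With m bins there are m - 1 interior cut points, so findInterval()
# returns 0..(m - 1) and adding 1L yields a valid bin index 1..m.
new_feature <- rnorm(100)                                  # hypothetical new data
bin_index <- findInterval(new_feature, result$cutpoints) + 1L
scored_woe <- result$woe[bin_index]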
Details
The Unsupervised Decision Tree (UDT) binning algorithm discretizes a continuous variable into bins that maximize the Information Value (IV) while respecting constraints on the number and size of bins.
The algorithm follows these main steps (a simplified sketch of the first two steps follows the list):

1. Initial discretization using an entropy-based decision tree approach.
2. Merging of rare bins based on the bin_cutoff parameter.
3. Bin optimization using IV and WoE criteria.
4. Optional enforcement of monotonicity in WoE across bins.
5. Adjustment of the number of bins to fall within the specified range.
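For intuition, here is a minimal R sketch of steps 1 and 2, with quantile-based pre-bins standing in for the entropy-based tree splits; prebin_and_merge is a hypothetical helper, not the package's internal implementation:

# Illustrative sketch of steps 1-2; quantile pre-bins stand in for the
# entropy-based tree used by the actual algorithm.
prebin_and_merge <- function(feature, max_n_prebins = 20L, bin_cutoff = 0.05) {
  # Step 1 (simplified): candidate cut points from quantiles
  probs <- seq(0, 1, length.out = max_n_prebins + 1L)
  cuts <- unique(quantile(feature, probs = probs, na.rm = TRUE))
  bins <- cut(feature, breaks = cuts, include.lowest = TRUE)
  # Step 2: repeatedly merge the rarest bin into a neighbour until every
  # bin holds at least bin_cutoff of the observations
  repeat {
    freq <- table(bins) / length(feature)
    # stop at two bins even if one is still rare
    if (all(freq >= bin_cutoff) || length(freq) <= 2L) break
    rare <- which.min(freq)
    # drop an interior boundary of the rarest bin (never an outer edge)
    drop_idx <- min(rare + 1L, length(cuts) - 1L)
    cuts <- cuts[-drop_idx]
    bins <- cut(feature, breaks = cuts, include.lowest = TRUE)
  }
  list(cutpoints = cuts, bins = bins)
}

Steps 3-5 then search over merges of these pre-bins, maximizing total IV under the monotonicity and bin-count constraints formalized below.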
The mathematical formulation of the optimization problem is:
$$ \max_{\{c_1, c_2, \ldots, c_{m-1}\}} \sum_{i=1}^{m} (p_i - q_i) \cdot \ln\left(\frac{p_i + \epsilon}{q_i + \epsilon}\right) $$

Subject to:

- \(min\_bins \leq m \leq max\_bins\)
- \(\frac{n_i}{n} \geq bin\_cutoff\) for all \(i\)
- Optionally, \(WoE_1 \leq WoE_2 \leq \ldots \leq WoE_m\) (for increasing monotonicity)

Where:

- \(p_i = \frac{n_{i,1}}{n_1}\) is the proportion of positive observations in bin \(i\)
- \(q_i = \frac{n_{i,0}}{n_0}\) is the proportion of negative observations in bin \(i\)
- \(\epsilon\) is the Laplace smoothing parameter
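As a concrete illustration, the objective can be evaluated directly from bin counts. A worked computation in R with illustrative counts, applying the smoothing exactly as written in the formula above:

count_pos <- c(30, 80, 150)      # n_{i,1}: positives in each bin (illustrative)
count_neg <- c(270, 220, 250)    # n_{i,0}: negatives in each bin (illustrative)
eps <- 0.5                       # laplace_smoothing

p <- count_pos / sum(count_pos)  # p_i: share of all positives falling in bin i
q <- count_neg / sum(count_neg)  # q_i: share of all negatives falling in bin i

woe <- log((p + eps) / (q + eps))  # per-bin Weight of Evidence
iv_terms <- (p - q) * woe          # per-bin IV contributions
total_iv <- sum(iv_terms)          # the quantity being maximized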
The algorithm includes special handling for missing values (NA/NaN) and extreme values (±Inf), as well as proper treatment of variables with very few unique values.
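How exactly these special values are routed is implementation-specific; a hedged sketch of one common pattern (an assumption, not necessarily this function's internal behavior) is to set aside non-finite values before binning:

# Hypothetical pre-processing: keep only finite values for the core
# binning and count the rest, which could form a separate special bin.
split_finite <- function(feature, target) {
  finite <- is.finite(feature)   # FALSE for NA, NaN, -Inf and Inf
  list(
    feature_finite = feature[finite],
    target_finite  = target[finite],
    n_special      = sum(!finite)
  )
}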
References
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849-15854.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and Its Applications. SIAM.
Siddiqi, N. (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards. Wiley.
Examples
if (FALSE) { # \dontrun{
# Generate sample data
set.seed(123)
n <- 10000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 * feature))
# Apply optimal binning
result <- optimal_binning_numerical_udt(
  target, feature,
  min_bins = 3,
  max_bins = 5,
  monotonicity_direction = "auto",
  laplace_smoothing = 0.5
)
# View binning results
print(result)
} # }