Optimal Binning for Numerical Variables using Divergence Measures and Information Value
optimal_binning_numerical_dmiv.Rd
Performs optimal binning for numerical variables using various divergence measures as proposed by Zeng (2013). This method transforms continuous features into discrete bins by maximizing the statistical divergence between distributions of positive and negative cases, while maintaining interpretability constraints.
Usage
optimal_binning_numerical_dmiv(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
is_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_method = "woe1",
divergence_method = "l2"
)
Arguments
- target
An integer binary vector (0 or 1) representing the target variable.
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins to generate (default: 3).
- max_bins
Maximum number of bins to generate (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins generated before optimization (default: 20).
- is_monotonic
Logical value indicating whether to enforce monotonicity in WoE (default: TRUE).
- convergence_threshold
Convergence threshold for divergence measure change (default: 1e-6).
- max_iterations
Maximum number of iterations allowed for optimization (default: 1000).
- bin_method
Method for WoE calculation, either 'woe' (traditional) or 'woe1' (Zeng's) (default: 'woe1').
- divergence_method
Divergence measure to optimize. Options:
'he': Hellinger Discrimination
'kl': Kullback-Leibler Divergence
'tr': Triangular Discrimination
'klj': J-Divergence (symmetric KL)
'sc': Chi-Square Symmetric Divergence
'js': Jensen-Shannon Divergence
'l1': L1 metric (Manhattan distance)
'l2': L2 metric (Euclidean distance) - Default
'ln': L-infinity metric (Maximum distance)
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with the intervals of each bin (e.g., "(-Inf; 0]", "(0; +Inf)").
- woe
Numeric vector with the Weight of Evidence values for each bin.
- divergence
Numeric vector with the divergence measure contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the number of positive observations in each bin.
- count_neg
Integer vector with the number of negative observations in each bin.
- cutpoints
Numeric vector of cut points between bins (excluding infinity).
- converged
Logical value indicating whether the algorithm converged.
- iterations
Number of iterations executed by the optimization algorithm.
- total_divergence
The total divergence measure of the binning solution.
- bin_method
The WoE calculation method used ('woe' or 'woe1').
- divergence_method
The divergence measure used for optimization.
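The per-bin fields above are parallel vectors of equal length. As a small usage sketch (assuming result holds the list returned by optimal_binning_numerical_dmiv(), as in the Examples), they can be gathered into a single summary table:

# Illustrative only: collect the per-bin fields of a result into one data.frame
bins <- data.frame(
  id         = result$id,
  bin        = result$bin,
  woe        = result$woe,
  divergence = result$divergence,
  count      = result$count,
  count_pos  = result$count_pos,
  count_neg  = result$count_neg
)
bins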
Details
This implementation is based on the theoretical framework from Zeng (2013) "Metric Divergence Measures and Information Value in Credit Scoring", which explores various divergence measures for optimal binning in credit scoring applications.
The algorithm extends traditional optimal binning by:
Supporting multiple divergence measures including true metric distances (L1, L2, L-infinity)
Offering choice between traditional WoE and Zeng's corrected WOE1 formula
Optimizing bin boundaries to maximize the chosen divergence measure
Ensuring monotonicity when requested, with direction determined by divergence maximization
The mathematical formulations of the divergence measures include:
$$\text{Hellinger: } h(P||Q) = \frac{1}{2}\sum_{i=1}^{n}(\sqrt{p_i} - \sqrt{q_i})^2$$
$$\text{Kullback-Leibler: } D(P||Q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right)$$
$$\text{J-Divergence: } J(P||Q) = \sum_{i=1}^{n}(p_i - q_i)\ln\left(\frac{p_i}{q_i}\right)$$
$$\text{Triangular: } \Delta(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2}{p_i + q_i}$$
$$\text{Chi-Square (symmetric): } \psi(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2(p_i + q_i)}{p_i q_i}$$
$$\text{Jensen-Shannon: } I(P||Q) = \frac{1}{2}\left[\sum_{i=1}^{n} p_i \ln\left(\frac{2p_i}{p_i+q_i}\right) + \sum_{i=1}^{n} q_i \ln\left(\frac{2q_i}{p_i+q_i}\right)\right]$$
$$\text{L1: } L_1(P||Q) = \sum_{i=1}^{n}|p_i - q_i|$$
$$\text{L2: } L_2(P||Q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$$
$$\text{L-infinity: } L_\infty(P||Q) = \max_{1 \leq i \leq n}|p_i - q_i|$$
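As an illustration only (not the package's internal routine), these measures can be evaluated directly in R from the per-bin proportions \(p_i\) and \(q_i\); the helper and the example vectors below are assumptions for demonstration:

# Illustrative only: divergence measures computed from the per-bin
# distributions of positives (p) and negatives (q); this helper is an
# assumption for demonstration and is not exported by the package.
divergence <- function(p, q,
                       method = c("he", "kl", "klj", "tr", "sc", "js", "l1", "l2", "ln")) {
  method <- match.arg(method)
  switch(method,
    he  = 0.5 * sum((sqrt(p) - sqrt(q))^2),
    kl  = sum(p * log(p / q)),
    klj = sum((p - q) * log(p / q)),
    tr  = sum((p - q)^2 / (p + q)),
    sc  = sum((p - q)^2 * (p + q) / (p * q)),
    js  = 0.5 * (sum(p * log(2 * p / (p + q))) + sum(q * log(2 * q / (p + q)))),
    l1  = sum(abs(p - q)),
    l2  = sqrt(sum((p - q)^2)),
    ln  = max(abs(p - q))
  )
}

# Hypothetical per-bin shares of positives and negatives (each sums to 1)
p <- c(0.10, 0.20, 0.30, 0.40)
q <- c(0.40, 0.30, 0.20, 0.10)
divergence(p, q, "l2")   # Euclidean distance between the two distributions
divergence(p, q, "js")   # Jensen-Shannon divergence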
WoE calculation methods (see the numerical sketch after the definitions below): $$\text{Traditional WoE: } \ln\left(\frac{g_i/P}{b_i/N}\right)$$ $$\text{Zeng's WOE1: } \ln\left(\frac{g_i}{b_i}\right)$$
Where:
\(p_i, q_i\): Proportions of positive/negative cases in bin i (so that \(p_i = g_i/P\) and \(q_i = b_i/N\))
\(g_i, b_i\): Count of positive/negative cases in bin i
\(P, N\): Total positive/negative cases
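A minimal numerical sketch of the two WoE variants, using hypothetical bin counts (assumed values, not taken from the package):

# Illustrative only: traditional WoE vs. Zeng's WOE1 from hypothetical bin counts
g <- c(50, 120, 200, 130)   # positives per bin (assumed values)
b <- c(450, 380, 300, 370)  # negatives per bin (assumed values)
P <- sum(g); N <- sum(b)

woe  <- log((g / P) / (b / N))  # traditional WoE
woe1 <- log(g / b)              # Zeng's WOE1
cbind(woe, woe1)                # the two columns differ by the constant log(N / P)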
References
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
n <- 10000
feature <- rnorm(n)
# Create target with logistic relationship
target <- rbinom(n, 1, plogis(0.5 * feature))
# Apply optimal binning with default L2 metric and WOE1
result <- optimal_binning_numerical_dmiv(target, feature)
print(result)
# Try with J-Divergence and traditional WoE
result_j <- optimal_binning_numerical_dmiv(
target = target,
feature = feature,
divergence_method = "klj",
bin_method = "woe"
)
# Compare results from different metrics
l1_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "l1")
l2_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "l2")
ln_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "ln")
# Compare total divergence values
cat("L1 total divergence:", l1_result$total_divergence, "\n")
cat("L2 total divergence:", l2_result$total_divergence, "\n")
cat("L-infinity total divergence:", ln_result$total_divergence, "\n")
} # }