Optimal Binning for Numerical Variables using Divergence Measures and Information Value
optimal_binning_numerical_dmiv.Rd
Performs optimal binning for numerical variables using various divergence measures as proposed by Zeng (2013). This method transforms continuous features into discrete bins by maximizing the statistical divergence between distributions of positive and negative cases, while maintaining interpretability constraints.
Usage
optimal_binning_numerical_dmiv(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
is_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_method = "woe1",
divergence_method = "l2"
)
Arguments
- target
An integer binary vector (0 or 1) representing the target variable.
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins to generate (default: 3).
- max_bins
Maximum number of bins to generate (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins generated before optimization (default: 20).
- is_monotonic
Logical value indicating whether to enforce monotonicity in WoE (default: TRUE).
- convergence_threshold
Convergence threshold for divergence measure change (default: 1e-6).
- max_iterations
Maximum number of iterations allowed for optimization (default: 1000).
- bin_method
Method for WoE calculation, either 'woe' (traditional) or 'woe1' (Zeng's) (default: 'woe1').
- divergence_method
Divergence measure to optimize. Options:
'he': Hellinger Discrimination
'kl': Kullback-Leibler Divergence
'tr': Triangular Discrimination
'klj': J-Divergence (symmetric KL)
'sc': Chi-Square Symmetric Divergence
'js': Jensen-Shannon Divergence
'l1': L1 metric (Manhattan distance)
'l2': L2 metric (Euclidean distance) - Default
'ln': L-infinity metric (Maximum distance)
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with the intervals of each bin (e.g., "(-Inf; 0]", "(0; +Inf)").
- woe
Numeric vector with the Weight of Evidence values for each bin.
- divergence
Numeric vector with the divergence measure contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the number of positive observations in each bin.
- count_neg
Integer vector with the number of negative observations in each bin.
- cutpoints
Numeric vector of cut points between bins (excluding infinity).
- converged
Logical value indicating whether the algorithm converged.
- iterations
Number of iterations executed by the optimization algorithm.
- total_divergence
The total divergence measure of the binning solution.
- bin_method
The WoE calculation method used ('woe' or 'woe1').
- divergence_method
The divergence measure used for optimization.
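The per-bin fields above are parallel vectors of equal length. As a small usage sketch (assuming result holds the list returned by optimal_binning_numerical_dmiv(), as in the Examples), they can be gathered into a single summary table:

# Illustrative only: collect the per-bin fields of a result into one data.frame
bins <- data.frame(
  id         = result$id,
  bin        = result$bin,
  woe        = result$woe,
  divergence = result$divergence,
  count      = result$count,
  count_pos  = result$count_pos,
  count_neg  = result$count_neg
)
bins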
Details
This implementation is based on the theoretical framework from Zeng (2013) "Metric Divergence Measures and Information Value in Credit Scoring", which explores various divergence measures for optimal binning in credit scoring applications.
The algorithm extends traditional optimal binning by:
Supporting multiple divergence measures including true metric distances (L1, L2, L-infinity)
Offering choice between traditional WoE and Zeng's corrected WOE1 formula
Optimizing bin boundaries to maximize the chosen divergence measure
Ensuring monotonicity when requested, with direction determined by divergence maximization
The mathematical formulations of the divergence measures include:
$$\text{Hellinger: } h(P||Q) = \frac{1}{2}\sum_{i=1}^{n}(\sqrt{p_i} - \sqrt{q_i})^2$$
$$\text{Kullback-Leibler: } D(P||Q) = \sum_{i=1}^{n} p_i \ln\left(\frac{p_i}{q_i}\right)$$
$$\text{J-Divergence: } J(P||Q) = \sum_{i=1}^{n}(p_i - q_i)\ln\left(\frac{p_i}{q_i}\right)$$
$$\text{Triangular: } \Delta(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2}{p_i + q_i}$$
$$\text{Chi-Square (symmetric): } \psi(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2(p_i + q_i)}{p_i q_i}$$
$$\text{Jensen-Shannon: } I(P||Q) = \frac{1}{2}\left[\sum_{i=1}^{n} p_i \ln\left(\frac{2p_i}{p_i+q_i}\right) + \sum_{i=1}^{n} q_i \ln\left(\frac{2q_i}{p_i+q_i}\right)\right]$$
$$\text{L1: } L_1(P||Q) = \sum_{i=1}^{n}|p_i - q_i|$$
$$\text{L2: } L_2(P||Q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}$$
$$\text{L-infinity: } L_\infty(P||Q) = \max_{1 \leq i \leq n}|p_i - q_i|$$
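As an illustration only (not the package's internal routine), these measures can be evaluated directly in R from the per-bin proportions \(p_i\) and \(q_i\); the helper and the example vectors below are assumptions for demonstration:

# Illustrative only: divergence measures computed from the per-bin
# distributions of positives (p) and negatives (q); this helper is an
# assumption for demonstration and is not exported by the package.
divergence <- function(p, q,
                       method = c("he", "kl", "klj", "tr", "sc", "js", "l1", "l2", "ln")) {
  method <- match.arg(method)
  switch(method,
    he  = 0.5 * sum((sqrt(p) - sqrt(q))^2),
    kl  = sum(p * log(p / q)),
    klj = sum((p - q) * log(p / q)),
    tr  = sum((p - q)^2 / (p + q)),
    sc  = sum((p - q)^2 * (p + q) / (p * q)),
    js  = 0.5 * (sum(p * log(2 * p / (p + q))) + sum(q * log(2 * q / (p + q)))),
    l1  = sum(abs(p - q)),
    l2  = sqrt(sum((p - q)^2)),
    ln  = max(abs(p - q))
  )
}

# Hypothetical per-bin shares of positives and negatives (each sums to 1)
p <- c(0.10, 0.20, 0.30, 0.40)
q <- c(0.40, 0.30, 0.20, 0.10)
divergence(p, q, "l2")   # Euclidean distance between the two distributions
divergence(p, q, "js")   # Jensen-Shannon divergence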
WoE calculation methods (see the numerical sketch after the definitions below): $$\text{Traditional WoE: } \ln\left(\frac{g_i/P}{b_i/N}\right)$$ $$\text{Zeng's WOE1: } \ln\left(\frac{g_i}{b_i}\right)$$
Where:
\(p_i, q_i\): Proportions of positive/negative cases in bin i (so that \(p_i = g_i/P\) and \(q_i = b_i/N\))
\(g_i, b_i\): Count of positive/negative cases in bin i
\(P, N\): Total positive/negative cases
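A minimal numerical sketch of the two WoE variants, using hypothetical bin counts (assumed values, not taken from the package):

# Illustrative only: traditional WoE vs. Zeng's WOE1 from hypothetical bin counts
g <- c(50, 120, 200, 130)   # positives per bin (assumed values)
b <- c(450, 380, 300, 370)  # negatives per bin (assumed values)
P <- sum(g); N <- sum(b)

woe  <- log((g / P) / (b / N))  # traditional WoE
woe1 <- log(g / b)              # Zeng's WOE1
cbind(woe, woe1)                # the two columns differ by the constant log(N / P)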
References
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
n <- 10000
feature <- rnorm(n)
# Create target with logistic relationship
target <- rbinom(n, 1, plogis(0.5 * feature))
# Apply optimal binning with default L2 metric and WOE1
result <- optimal_binning_numerical_dmiv(target, feature)
print(result)
# Try with J-Divergence and traditional WoE
result_j <- optimal_binning_numerical_dmiv(
target = target,
feature = feature,
divergence_method = "klj",
bin_method = "woe"
)
# Compare results from different metrics
l1_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "l1")
l2_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "l2")
ln_result <- optimal_binning_numerical_dmiv(target, feature, divergence_method = "ln")
# Compare total divergence values
cat("L1 total divergence:", l1_result$total_divergence, "\n")
cat("L2 total divergence:", l2_result$total_divergence, "\n")
cat("L-infinity total divergence:", ln_result$total_divergence, "\n")
} # }