Optimal Binning for Categorical Variables using Divergence Measures (V2)
optimal_binning_categorical_dmiv.Rd
Performs optimal binning for categorical variables using various divergence measures, as proposed by Zeng (2013). This is Version 2, which fixes potential crashes and improves performance over V1. The method transforms categorical features into discrete bins by maximizing the statistical divergence between the distributions of positive and negative cases while maintaining interpretability constraints.
Usage
optimal_binning_categorical_dmiv(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_method = "woe1",
divergence_method = "l2"
)
Arguments
- target
An integer binary vector (0 or 1) representing the target variable. Cannot contain NAs.
- feature
A character vector of categorical feature values.
NA values will be treated as a distinct category "NA".
- min_bins
Minimum number of bins to generate (default: 3, must be >= 2).
- max_bins
Maximum number of bins to generate (default: 5).
- bin_cutoff
Minimum frequency fraction threshold used for rare-category handling in V1. V2 primarily uses max_n_prebins and min_prebin_count for initial handling, so this cutoff is less relevant now but is kept for interface compatibility (default: 0.05).
- max_n_prebins
Maximum number of initial bins before merging starts. If the number of unique categories exceeds this, categories with counts below min_prebin_count (currently hardcoded as 5) are grouped into a "PREBIN_OTHER" bin (default: 20).
- bin_separator
String separator for concatenating category names in bins (default: "%;%").
- convergence_threshold
Convergence threshold for the change in minimum divergence between iterations during merging (default: 1e-6).
- max_iterations
Maximum number of merging iterations allowed (default: 1000).
- bin_method
Method for WoE calculation: either 'woe' (traditional) or 'woe1' (Zeng's smoothed variant) (default: 'woe1').
- divergence_method
Divergence measure to optimize for merging (lower is better). Options:
'he': Hellinger Distance
'kl': Symmetrized Kullback-Leibler Divergence
'tr': Triangular Discrimination
'klj': J-Divergence (Symmetric KL)
'sc': Symmetric Chi-Square Divergence
'js': Jensen-Shannon Divergence
'l1': L1 metric (Manhattan distance) on local bin proportions (see Details)
'l2': L2 metric (Euclidean distance) on local bin proportions - Default
'ln': L-infinity metric (maximum distance) on local bin proportions
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with the categories in each bin (or "PREBIN_OTHER").
- woe
Numeric vector with the Weight of Evidence values for each bin.
- divergence
Numeric vector with the divergence measure contribution of each bin. For L2/L-infinity, this holds an intermediate value, (p - n)^2 or |p - n| respectively, where p and n are the bin's proportions of the total positives and negatives.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the number of positive observations in each bin.
- count_neg
Integer vector with the number of negative observations in each bin.
- converged
Logical value indicating whether the merging algorithm converged before reaching max_iterations or max_bins.
- iterations
Number of merging iterations executed.
- total_divergence
The total divergence measure of the final binning solution (calculated correctly for all methods, including L2/L-inf).
- bin_method
The WoE calculation method used ('woe' or 'woe1').
- divergence_method
The divergence measure used for optimization.
Details
This implementation (V2) addresses stability and performance issues found in V1. It follows the framework from Zeng (2013) using various divergence measures. The max_n_prebins parameter is now functional, grouping rare categories up front when cardinality is high. The similarity matrix update logic during merging and splitting has been corrected and optimized, and the calculation and reporting of the L2/L-infinity total divergence are corrected.
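As an illustration of the pre-binning rule described above, the sketch below pools low-count categories into "PREBIN_OTHER" before merging begins. The counts vector is hypothetical and this is a conceptual sketch, not the package's internal implementation:
# Conceptual sketch of the pre-binning rule (illustrative only).
# counts is a hypothetical frequency table of categories.
counts <- c(A = 120, B = 80, RARE1 = 3, RARE2 = 2)
min_prebin_count <- 5  # hardcoded threshold described above
rare <- names(counts)[counts < min_prebin_count]
grouped <- ifelse(names(counts) %in% rare, "PREBIN_OTHER", names(counts))
print(grouped)  # "A" "B" "PREBIN_OTHER" "PREBIN_OTHER"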
Formulas for the divergence measures (where P = (p_1, ..., p_n) and Q = (q_1, ..., q_n) are discrete distributions):
$$Hellinger: h(P||Q) = \frac{1}{2}\sum_{i=1}^{n}(\sqrt{p_i} - \sqrt{q_i})^2$$
$$Symmetric KL: D_S(P||Q) = D(P||Q) + D(Q||P)$$
$$J-Divergence: J(P||Q) = D_S(P||Q)$$
$$Triangular: \Delta(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2}{p_i + q_i}$$
$$Symmetric Chi-Square: \psi(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2(p_i + q_i)}{p_iq_i}$$
$$Jensen-Shannon: JSD(P||Q) = \frac{1}{2}D(P||M) + \frac{1}{2}D(Q||M), M=\frac{P+Q}{2}$$
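For reference, these distribution-level measures can be computed directly. The sketch below is a minimal illustration, not the package's internal implementation; it assumes P and Q are per-bin proportion vectors summing to 1 and uses a small epsilon to guard the logarithms:
# Minimal sketch of the divergence measures above (illustrative only).
divergences <- function(P, Q, eps = 1e-10) {
  P <- pmax(P, eps); Q <- pmax(Q, eps)
  kl <- function(A, B) sum(A * log(A / B))  # Kullback-Leibler D(A||B)
  M <- (P + Q) / 2
  list(
    he  = 0.5 * sum((sqrt(P) - sqrt(Q))^2),    # Hellinger
    klj = kl(P, Q) + kl(Q, P),                 # symmetric KL / J-divergence
    tr  = sum((P - Q)^2 / (P + Q)),            # triangular discrimination
    sc  = sum((P - Q)^2 * (P + Q) / (P * Q)),  # symmetric chi-square
    js  = 0.5 * kl(P, M) + 0.5 * kl(Q, M)      # Jensen-Shannon
  )
}
divergences(c(0.2, 0.3, 0.5), c(0.3, 0.3, 0.4))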
$$L1 (local proportions): L_1(p_1, p_2) = | \frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2} | + | \frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2} |$$
Here g_i and b_i denote the positive and negative counts of bin i. In V2, the L1/L2/L-infinity merging criteria are computed on these local bin proportions (V1 computed L1 on the global proportions p_i/P vs n_i/N).
$$L2 (local proportions): L_2(p_1, p_2) = \sqrt{ (\frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2})^2 + (\frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2})^2 }$$
$$L-infinity (local proportions): L_\infty(p_1, p_2) = \max ( | \frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2} |, | \frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2} | ) $$
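The local-proportion distances can be sketched the same way; g1/b1 and g2/b2 below stand for the positive/negative counts of two candidate bins (an illustration, not the internal code). Note that the two components are always equal, since b_i/(g_i+b_i) = 1 - g_i/(g_i+b_i):
# Local-proportion L1/L2/L-infinity distance between two bins
# (illustrative sketch; g = positives, b = negatives per bin).
bin_distance <- function(g1, b1, g2, b2, method = c("l1", "l2", "ln")) {
  method <- match.arg(method)
  p1 <- g1 / (g1 + b1)  # local positive proportion, bin 1
  p2 <- g2 / (g2 + b2)  # local positive proportion, bin 2
  d <- c(abs(p1 - p2), abs((1 - p1) - (1 - p2)))  # the two equal components
  switch(method, l1 = sum(d), l2 = sqrt(sum(d^2)), ln = max(d))
}
bin_distance(30, 70, 45, 55, method = "l2")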
WoE Methods:
$$Traditional WoE: \ln(\frac{g_i/G}{b_i/B})$$
$$Zeng's WOE1: \ln(\frac{g_i + 0.5}{b_i + 0.5})$$ (with 0.5 smoothing to keep the logarithm defined for empty counts)
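Both WoE variants follow directly from the bin counts. A minimal sketch, assuming g and b are per-bin positive/negative count vectors and G and B the column totals (illustrative, not the internal code):
# Traditional WoE vs Zeng's WOE1 from per-bin counts (illustrative).
woe_traditional <- function(g, b, G = sum(g), B = sum(b)) log((g / G) / (b / B))
woe1_zeng <- function(g, b) log((g + 0.5) / (b + 0.5))
woe_traditional(c(80, 20), c(40, 60))
woe1_zeng(c(80, 20), c(40, 60))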
References
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Generate sample data
set.seed(123)
n <- 1000
categories <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "RARE1", "RARE2")
feature <- sample(categories, n, replace = TRUE, prob = c(rep(0.09, 10), 0.05, 0.05))
feature[sample(1:n, 50)] <- NA # Add some NAs
# Create target with different distribution per category
base_probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.05, 0.99)
target <- integer(n) # target must be an integer binary vector
for (i in 1:n) {
if (is.na(feature[i])) {
target[i] <- rbinom(1, 1, 0.5) # Assign random target for NA category
} else {
cat_idx <- match(feature[i], categories)
target[i] <- rbinom(1, 1, base_probs[cat_idx])
}
}
# Apply optimal binning V2 with the default L2 metric and WOE1
result_v2 <- optimal_binning_categorical_dmiv(target, feature, max_bins = 4)
print(result_v2)
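# Alternative configuration using options documented above:
# Jensen-Shannon divergence with the traditional WoE formulation
result_js <- optimal_binning_categorical_dmiv(
  target,
  feature,
  max_bins = 4,
  divergence_method = "js",
  bin_method = "woe"
)
print(result_js$total_divergence)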
# Test with high cardinality and max_n_prebins
set.seed(456)
n_high <- 5000
categories_high <- paste0("CAT_", 1:100)
feature_high <- sample(categories_high, n_high, replace = TRUE)
target_high <- rbinom(n_high, 1, runif(n_high, 0.1, 0.9)) # Random target
result_prebin <- optimal_binning_categorical_dmiv(
target_high,
feature_high,
max_n_prebins = 15, # Force pre-binning
max_bins = 5
)
print(result_prebin)
print(result_prebin$bin) # Check for PREBIN_OTHER
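# Inspect the convergence diagnostics documented under Value
print(result_prebin$converged)
print(result_prebin$iterations)
print(result_prebin$total_divergence)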
} # }