Optimal Binning for Categorical Variables using Divergence Measures (V2)
optimal_binning_categorical_dmiv.Rd
Performs optimal binning for categorical variables using various divergence measures, as proposed by Zeng (2013). This is Version 2, which fixes potential crashes and improves performance over V1. The method transforms categorical features into discrete bins by maximizing the statistical divergence between the distributions of positive and negative cases while maintaining interpretability constraints.
Usage
optimal_binning_categorical_dmiv(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_method = "woe1",
divergence_method = "l2"
)
Arguments
- target
An integer binary vector (0 or 1) representing the target variable. Cannot contain NAs.
- feature
A character vector of categorical feature values.
NA values will be treated as a distinct category "NA".
- min_bins
Minimum number of bins to generate (default: 3, must be >= 2).
- max_bins
Maximum number of bins to generate (default: 5).
- bin_cutoff
Minimum frequency fraction threshold used for rare-category handling in V1. V2 primarily uses max_n_prebins and min_prebin_count for initial handling, so this cutoff is less relevant now but is kept for interface compatibility (default: 0.05).
- max_n_prebins
Maximum number of initial bins before merging starts. If the number of unique categories exceeds this, categories with counts below min_prebin_count (currently hardcoded as 5) are grouped into a "PREBIN_OTHER" bin (default: 20).
- bin_separator
String separator for concatenating category names in bins (default: "%;%").
- convergence_threshold
Convergence threshold for the change in minimum divergence between iterations during merging (default: 1e-6).
- max_iterations
Maximum number of merging iterations allowed (default: 1000).
- bin_method
Method for WoE calculation: either 'woe' (traditional) or 'woe1' (Zeng's smoothed variant) (default: 'woe1').
- divergence_method
Divergence measure to optimize for merging (lower is better). Options:
'he': Hellinger Distance
'kl': Symmetrized Kullback-Leibler Divergence
'tr': Triangular Discrimination
'klj': J-Divergence (Symmetric KL)
'sc': Symmetric Chi-Square Divergence
'js': Jensen-Shannon Divergence
'l1': L1 metric (Manhattan distance) on local bin proportions (see Details)
'l2': L2 metric (Euclidean distance) on local bin proportions - Default
'ln': L-infinity metric (maximum distance) on local bin proportions
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with the categories in each bin (or "PREBIN_OTHER").
- woe
Numeric vector with the Weight of Evidence values for each bin.
- divergence
Numeric vector with the divergence measure contribution of each bin. For L2/L-infinity, this holds an intermediate value, (p - n)^2 or |p - n| respectively, where p and n are the bin's proportions of the total positives and negatives.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the number of positive observations in each bin.
- count_neg
Integer vector with the number of negative observations in each bin.
- converged
Logical value indicating whether the merging algorithm converged before reaching max_iterations or max_bins.
- iterations
Number of merging iterations executed.
- total_divergence
The total divergence measure of the final binning solution (calculated correctly for all methods, including L2/L-inf).
- bin_method
The WoE calculation method used ('woe' or 'woe1').
- divergence_method
The divergence measure used for optimization.
Details
This implementation (V2) addresses stability and performance issues found in V1. It follows the framework from Zeng (2013) using various divergence measures. The max_n_prebins parameter is now functional, grouping rare categories up front when cardinality is high. The similarity matrix update logic during merging and splitting has been corrected and optimized, and the calculation and reporting of the L2/L-infinity total divergence are corrected.
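As an illustration of the pre-binning rule described above, the sketch below pools low-count categories into "PREBIN_OTHER" before merging begins. The counts vector is hypothetical and this is a conceptual sketch, not the package's internal implementation:
# Conceptual sketch of the pre-binning rule (illustrative only).
# counts is a hypothetical frequency table of categories.
counts <- c(A = 120, B = 80, RARE1 = 3, RARE2 = 2)
min_prebin_count <- 5  # hardcoded threshold described above
rare <- names(counts)[counts < min_prebin_count]
grouped <- ifelse(names(counts) %in% rare, "PREBIN_OTHER", names(counts))
print(grouped)  # "A" "B" "PREBIN_OTHER" "PREBIN_OTHER"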
Formulas for the divergence measures (where P = (p_1, ..., p_n) and Q = (q_1, ..., q_n) are discrete distributions):
$$Hellinger: h(P||Q) = \frac{1}{2}\sum_{i=1}^{n}(\sqrt{p_i} - \sqrt{q_i})^2$$
$$Symmetric KL: D_S(P||Q) = D(P||Q) + D(Q||P)$$
$$J-Divergence: J(P||Q) = D_S(P||Q)$$
$$Triangular: \Delta(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2}{p_i + q_i}$$
$$Symmetric Chi-Square: \psi(P||Q) = \sum_{i=1}^{n}\frac{(p_i - q_i)^2(p_i + q_i)}{p_iq_i}$$
$$Jensen-Shannon: JSD(P||Q) = \frac{1}{2}D(P||M) + \frac{1}{2}D(Q||M), M=\frac{P+Q}{2}$$
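For reference, these distribution-level measures can be computed directly. The sketch below is a minimal illustration, not the package's internal implementation; it assumes P and Q are per-bin proportion vectors summing to 1 and uses a small epsilon to guard the logarithms:
# Minimal sketch of the divergence measures above (illustrative only).
divergences <- function(P, Q, eps = 1e-10) {
  P <- pmax(P, eps); Q <- pmax(Q, eps)
  kl <- function(A, B) sum(A * log(A / B))  # Kullback-Leibler D(A||B)
  M <- (P + Q) / 2
  list(
    he  = 0.5 * sum((sqrt(P) - sqrt(Q))^2),    # Hellinger
    klj = kl(P, Q) + kl(Q, P),                 # symmetric KL / J-divergence
    tr  = sum((P - Q)^2 / (P + Q)),            # triangular discrimination
    sc  = sum((P - Q)^2 * (P + Q) / (P * Q)),  # symmetric chi-square
    js  = 0.5 * kl(P, M) + 0.5 * kl(Q, M)      # Jensen-Shannon
  )
}
divergences(c(0.2, 0.3, 0.5), c(0.3, 0.3, 0.4))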
$$L1 (local proportions): L_1(p_1, p_2) = | \frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2} | + | \frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2} |$$
Here g_i and b_i denote the positive and negative counts of bin i. In V2, the L1/L2/L-infinity merging criteria are computed on these local bin proportions (V1 computed L1 on the global proportions p_i/P vs n_i/N).
$$L2 (local proportions): L_2(p_1, p_2) = \sqrt{ (\frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2})^2 + (\frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2})^2 }$$
$$L-infinity (local proportions): L_\infty(p_1, p_2) = \max ( | \frac{g_1}{g_1+b_1} - \frac{g_2}{g_2+b_2} |, | \frac{b_1}{g_1+b_1} - \frac{b_2}{g_2+b_2} | ) $$
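The local-proportion distances can be sketched the same way; g1/b1 and g2/b2 below stand for the positive/negative counts of two candidate bins (an illustration, not the internal code). Note that the two components are always equal, since b_i/(g_i+b_i) = 1 - g_i/(g_i+b_i):
# Local-proportion L1/L2/L-infinity distance between two bins
# (illustrative sketch; g = positives, b = negatives per bin).
bin_distance <- function(g1, b1, g2, b2, method = c("l1", "l2", "ln")) {
  method <- match.arg(method)
  p1 <- g1 / (g1 + b1)  # local positive proportion, bin 1
  p2 <- g2 / (g2 + b2)  # local positive proportion, bin 2
  d <- c(abs(p1 - p2), abs((1 - p1) - (1 - p2)))  # the two equal components
  switch(method, l1 = sum(d), l2 = sqrt(sum(d^2)), ln = max(d))
}
bin_distance(30, 70, 45, 55, method = "l2")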
WoE Methods:
$$Traditional WoE: \ln(\frac{g_i/G}{b_i/B})$$
$$Zeng's WOE1: \ln(\frac{g_i + 0.5}{b_i + 0.5})$$ (with 0.5 smoothing to keep the logarithm defined for empty counts)
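Both WoE variants follow directly from the bin counts. A minimal sketch, assuming g and b are per-bin positive/negative count vectors and G and B the column totals (illustrative, not the internal code):
# Traditional WoE vs Zeng's WOE1 from per-bin counts (illustrative).
woe_traditional <- function(g, b, G = sum(g), B = sum(b)) log((g / G) / (b / B))
woe1_zeng <- function(g, b) log((g + 0.5) / (b + 0.5))
woe_traditional(c(80, 20), c(40, 60))
woe1_zeng(c(80, 20), c(40, 60))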
References
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Generate sample data
set.seed(123)
n <- 1000
categories <- c("A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "RARE1", "RARE2")
feature <- sample(categories, n, replace = TRUE, prob = c(rep(0.09, 10), 0.05, 0.05))
feature[sample(1:n, 50)] <- NA # Add some NAs
# Create target with different distribution per category
base_probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.05, 0.99)
target <- integer(n) # target must be an integer binary vector
for (i in 1:n) {
if (is.na(feature[i])) {
target[i] <- rbinom(1, 1, 0.5) # Assign random target for NA category
} else {
cat_idx <- match(feature[i], categories)
target[i] <- rbinom(1, 1, base_probs[cat_idx])
}
}
# Apply optimal binning V2 with the default L2 metric and WOE1
result_v2 <- optimal_binning_categorical_dmiv(target, feature, max_bins = 4)
print(result_v2)
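# Alternative configuration using options documented above:
# Jensen-Shannon divergence with the traditional WoE formulation
result_js <- optimal_binning_categorical_dmiv(
  target,
  feature,
  max_bins = 4,
  divergence_method = "js",
  bin_method = "woe"
)
print(result_js$total_divergence)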
# Test with high cardinality and max_n_prebins
set.seed(456)
n_high <- 5000
categories_high <- paste0("CAT_", 1:100)
feature_high <- sample(categories_high, n_high, replace = TRUE)
target_high <- rbinom(n_high, 1, runif(n_high, 0.1, 0.9)) # Random target
result_prebin <- optimal_binning_categorical_dmiv(
target_high,
feature_high,
max_n_prebins = 15, # Force pre-binning
max_bins = 5
)
print(result_prebin)
print(result_prebin$bin) # Check for PREBIN_OTHER
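# Inspect the convergence diagnostics documented under Value
print(result_prebin$converged)
print(result_prebin$iterations)
print(result_prebin$total_divergence)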
} # }