
Optimal Binning for Categorical Variables using Divergence Measures
Source: R/obc_dmiv.R
Performs supervised discretization of categorical variables using a divergence-based hierarchical merging algorithm. This implementation supports multiple information-theoretic and metric divergence measures, as described by Zeng (2013), enabling flexible optimization of binning structures for credit scoring and binary classification tasks.
Usage
ob_categorical_dmiv(
feature,
target,
min_bins = 3,
max_bins = 5,
bin_cutoff = 0.05,
max_n_prebins = 20,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000,
bin_method = "woe1",
divergence_method = "l2"
)
Arguments
- feature
A character vector or factor representing the categorical predictor variable to be binned. Missing values are automatically converted to the category "NA".
- target
An integer vector of binary outcomes (0/1) corresponding to each observation in feature. Missing values are not permitted.
- min_bins
Integer. Minimum number of bins to produce. Must be >= 2. If the final number of bins after merging falls below this threshold, the algorithm will attempt to split bins. Defaults to 3.
- max_bins
Integer. Maximum number of bins to produce. Must be >= min_bins. The algorithm performs hierarchical merging until this constraint is satisfied. Defaults to 5.
- bin_cutoff
Numeric. Frequency threshold for rare category handling. Categories with relative frequency below this value are candidates for pre-binning. Must be in (0, 1). Defaults to 0.05.
- max_n_prebins
Integer. Maximum number of initial bins before the main merging phase. When unique categories exceed this limit, rare categories are pre-merged into an "other" bin. Must be >= 2. Defaults to 20.
- bin_separator
Character string used to concatenate category names when multiple categories are merged into a single bin. Defaults to "%;%".
- convergence_threshold
Numeric. Convergence tolerance for the iterative merging process. Merging stops when the change in minimum divergence between iterations falls below this threshold. Must be > 0. Defaults to 1e-6.
- max_iterations
Integer. Maximum number of merge operations allowed. Prevents infinite loops in edge cases. Must be > 0. Defaults to 1000.
- bin_method
Character string specifying the Weight of Evidence calculation method. Must be one of:
"woe": Traditional WoE: \(\ln\left(\frac{p_i/P}{n_i/N}\right)\)
"woe1": Smoothed WoE (Zeng): \(\ln\left(\frac{g_i + 0.5}{b_i + 0.5}\right)\)
The smoothed variant provides numerical stability for sparse bins; a worked sketch follows the argument list. Defaults to "woe1".
- divergence_method
Character string specifying the divergence measure used to determine bin similarity. Must be one of:
"he": Hellinger Distance: \(\sum_i (\sqrt{p_i} - \sqrt{n_i})^2\)
"kl": Symmetrized Kullback-Leibler Divergence
"klj": Jeffreys J-Divergence: \(\sum_i (p_i - n_i)\ln(p_i/n_i)\)
"tr": Triangular Discrimination: \(\sum_i (p_i - n_i)^2/(p_i + n_i)\)
"sc": Symmetric Chi-Square: \(\sum_i (p_i - n_i)^2(p_i + n_i)/(p_i n_i)\)
"js": Jensen-Shannon Divergence
"l1": L1 Metric (Manhattan Distance): \(\sum_i |p_i - n_i|\)
"l2": L2 Metric (Euclidean Distance): \(\sqrt{\sum_i (p_i - n_i)^2}\)
"ln": L-infinity Metric (Chebyshev Distance): \(\max_i |p_i - n_i|\)
Defaults to "l2".
Value
A list containing the binning results with the following components:
id: Integer vector of bin identifiers (1-indexed)
bin: Character vector of bin labels (merged category names)
woe: Numeric vector of Weight of Evidence values per bin
divergence: Numeric vector of divergence contribution per bin
count: Integer vector of total observations per bin
count_pos: Integer vector of positive cases (target = 1) per bin
count_neg: Integer vector of negative cases (target = 0) per bin
converged: Logical indicating algorithm convergence
iterations: Integer count of merge operations performed
total_divergence: Numeric total divergence of the binning solution
bin_method: Character string of the WoE method used
divergence_method: Character string of the divergence measure used
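The components are parallel vectors and can be combined into a lookup table. A minimal sketch (using only the documented fields and the default bin_separator) that maps bin-level WoE back onto the raw feature:

# result: a return value of ob_categorical_dmiv(); feature: the input vector
cats <- strsplit(result$bin, "%;%", fixed = TRUE)
lookup <- data.frame(
  category = unlist(cats),
  woe      = rep(result$woe, lengths(cats)),
  stringsAsFactors = FALSE
)
feature_woe <- lookup$woe[match(as.character(feature), lookup$category)]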
Details
The algorithm implements a hierarchical agglomerative approach where bins
are iteratively merged based on minimum pairwise divergence until the
max_bins constraint is satisfied or convergence is achieved.
Algorithm Workflow:
1. Input validation and frequency computation
2. Pre-binning of rare categories (if unique categories > max_n_prebins)
3. Initialization of the pairwise divergence matrix
4. Iterative merging of the most similar bin pairs
5. Splitting of heterogeneous bins (if bins < min_bins)
6. Final metric computation and WoE-based sorting
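A minimal sketch of the core merging step (illustrative R only, not the package's implementation), assuming each bin is a list of class counts and div() computes the selected divergence between two bins:

merge_once <- function(bins, div) {
  # Locate the pair of bins with minimum pairwise divergence ...
  best <- c(1, 2); best_d <- Inf
  for (i in 1:(length(bins) - 1)) {
    for (j in (i + 1):length(bins)) {
      d <- div(bins[[i]], bins[[j]])
      if (d < best_d) { best_d <- d; best <- c(i, j) }
    }
  }
  # ... and merge it by pooling the two bins' counts
  bins[[best[1]]] <- Map(`+`, bins[[best[1]]], bins[[best[2]]])
  bins[-best[2]]
}
# Repeated until length(bins) <= max_bins or the change in best_d
# falls below convergence_threshold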
Divergence Measure Selection: The choice of divergence measure affects the binning structure (a worked comparison follows this list):
Information-theoretic measures ("kl", "js", "klj"): Emphasize distributional differences; sensitive to rare events
Metric measures ("l1", "l2", "ln"): Provide geometric interpretation; robust to outliers
Chi-square family ("sc", "tr"): Balance between information content and robustness
Hellinger distance ("he"): Bounded measure; suitable for probability distributions
Pre-binning Strategy:
When the number of unique categories exceeds max_n_prebins, categories
with fewer than 5 observations are aggregated into a special "PREBIN_OTHER"
bin to control computational complexity.
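A minimal sketch of that rule (the 5-observation threshold and the "PREBIN_OTHER" label come from this section; the actual implementation differs):

tab  <- table(feature)
rare <- names(tab)[tab < 5]
feature_pre <- ifelse(feature %in% rare, "PREBIN_OTHER", as.character(feature))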
References
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271. doi:10.1155/2013/848271
Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
Lin, J. (1991). Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
See also
ob_categorical_cm for ChiMerge-based categorical binning
Examples
# \donttest{
# Example 1: Basic usage with synthetic credit data
set.seed(42)
n <- 1000
# Simulate occupation categories with varying default rates
occupations <- c(
"Engineer", "Doctor", "Teacher", "Sales",
"Manager", "Clerk", "Other"
)
default_probs <- c(0.05, 0.03, 0.08, 0.15, 0.07, 0.12, 0.20)
feature <- sample(occupations, n,
replace = TRUE,
prob = c(0.15, 0.10, 0.20, 0.18, 0.12, 0.15, 0.10)
)
target <- sapply(feature, function(x) {
rbinom(1, 1, default_probs[which(occupations == x)])
})
# Apply optimal binning with L2 divergence
result <- ob_categorical_dmiv(feature, target,
min_bins = 2,
max_bins = 4,
divergence_method = "l2"
)
# Examine binning results
print(data.frame(
bin = result$bin,
woe = round(result$woe, 3),
count = result$count,
event_rate = round(result$count_pos / result$count, 3)
))
#> bin woe count event_rate
#> 1 Doctor -3.780 111 0.018
#> 2 Engineer%;%Teacher -2.562 355 0.070
#> 3 Manager%;%Clerk -2.086 275 0.109
#> 4 Sales%;%Other -1.524 259 0.178
# Example 2: Comparing divergence methods
result_js <- ob_categorical_dmiv(feature, target,
divergence_method = "js",
max_bins = 4
)
result_kl <- ob_categorical_dmiv(feature, target,
divergence_method = "kl",
max_bins = 4
)
cat("Jensen-Shannon bins:", length(result_js$bin), "\n")
#> Jensen-Shannon bins: 4
cat("Kullback-Leibler bins:", length(result_kl$bin), "\n")
#> Kullback-Leibler bins: 4
# Example 3: High cardinality feature with pre-binning
set.seed(123)
postal_codes <- paste0("ZIP_", sprintf("%03d", 1:50))
feature_high_card <- sample(postal_codes, 2000, replace = TRUE)
target_high_card <- rbinom(2000, 1, 0.1)
result_prebin <- ob_categorical_dmiv(
feature_high_card,
target_high_card,
max_n_prebins = 15,
max_bins = 5
)
#> Info: Number of unique categories (50) exceeds max_n_prebins (15). Pre-binning rare categories.
#> Info: Pre-binning reduced categories from 50 to 50 initial bins.
#> Info: Converged after 1 iterations (divergence change < threshold).
cat("Final bins after pre-binning:", length(result_prebin$bin), "\n")
#> Final bins after pre-binning: 49
cat("Algorithm converged:", result_prebin$converged, "\n")
#> Algorithm converged: TRUE
# }