Optimal Binning for Categorical Variables using ChiMerge
optimal_binning_categorical_cm.Rd
Implements optimal binning for categorical variables using the ChiMerge algorithm (Kerber, 1992), optionally extended with the Chi2 algorithm (Liu & Setiono, 1995), and calculates Weight of Evidence (WoE) and Information Value (IV) for the resulting bins. This is Version 4, incorporating corrections from a previous code review.
Usage
optimal_binning_categorical_cm(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)
Arguments
- target
Integer vector of binary target values (0 or 1). Cannot contain NAs.
- feature
Character vector of categorical feature values. NA values are treated as a distinct category "NA".
- min_bins
Minimum number of bins (default: 3; must be >= 2).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction a category needs in order to avoid being merged during the initial handle_rare_categories step (default: 0.05). Note: the main merging is driven by chi-square statistics.
- max_n_prebins
Maximum number of bins allowed after the initial pre-binning/rare-category handling step, before the main ChiMerge/Chi2 loop (default: 20). Merging stops once this limit is reached, even if the statistical thresholds have not been met.
- bin_separator
Separator string for concatenating category names in bins (default: "%;%").
- convergence_threshold
Threshold for convergence based on the absolute difference in minimum chi-square between iterations during bin merging (default: 1e-6).
- max_iterations
Maximum number of iterations allowed for the bin merging loop (default: 1000).
- chi_merge_threshold
Significance level for the chi-square test used in merging decisions (default: 0.05, corresponding to 95% confidence). Lower values raise the critical chi-square value, making merges more likely.
- use_chi2_algorithm
Logical; whether to use the enhanced Chi2 algorithm, which runs multiple ChiMerge phases with decreasing significance levels (default: FALSE).
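The multi-phase idea behind the Chi2 variant can be sketched as follows. This is illustrative only; the actual phase schedule and merge routine are internal to the package, and the alpha values shown are hypothetical:

```r
# Hypothetical schedule of decreasing significance levels for Chi2 phases
for (alpha in c(0.5, 0.1, 0.05, 0.01)) {
  critical <- qchisq(1 - alpha, df = 1)  # critical value for a 2x2 table
  # ... run one ChiMerge pass, merging adjacent bin pairs whose
  #     chi-square statistic falls below `critical` ...
}
```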
Value
A list containing:
- id
Vector of numeric IDs (1-based) for each final bin.
- bin
Vector of character strings representing the final bins (concatenated category names).
- woe
Vector of numeric Weight of Evidence (WoE) values for each bin.
- iv
Vector of numeric Information Value (IV) contributions for each bin.
- count
Vector of integer total counts (observations) for each bin.
- count_pos
Vector of integer positive class counts for each bin.
- count_neg
Vector of integer negative class counts for each bin.
- converged
Boolean indicating whether the merging algorithm converged (either reached target bins, statistical threshold, or convergence threshold).
- iterations
Integer number of merging iterations performed.
- algorithm
Character string indicating the algorithm used ("ChiMerge" or "Chi2").
Details
This function implements categorical variable binning based on chi-square statistics.
The core logic follows the ChiMerge approach: bins are sorted by WoE, and the pair of adjacent bins with the lowest chi-square statistic is merged whenever that statistic falls below the critical value derived from chi_merge_threshold.
The optional Chi2 algorithm applies multiple rounds of ChiMerge with decreasing significance levels.
Monotonicity of WoE across the final bins is enforced by merging adjacent bins that violate the trend.
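As a sketch of the merge criterion (not the package's internal code), the chi-square statistic for a pair of adjacent bins, each summarized by its positive and negative counts, can be computed like this:

```r
# Illustrative only: 2x2 chi-square statistic for two adjacent bins.
# The function name and counts are hypothetical.
chi_sq_pair <- function(pos1, neg1, pos2, neg2) {
  obs <- matrix(c(pos1, neg1, pos2, neg2), nrow = 2, byrow = TRUE)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected under independence
  sum((obs - expd)^2 / expd)
}

# Merge if the statistic is below the critical value at chi_merge_threshold:
critical <- qchisq(1 - 0.05, df = 1)   # ~3.841 for a 2x2 table
chi_sq_pair(30, 70, 35, 65) < critical # TRUE: the pair would be merged
```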
Weight of Evidence (WoE) is calculated as:
\(WoE_i = \ln\left(\frac{p_{pos,i}}{p_{neg,i}}\right)\)
Information Value (IV) is calculated as:
\(IV = \sum_{i} (p_{pos,i} - p_{neg,i}) \times WoE_i\)
where \(p_{pos,i}\) and \(p_{neg,i}\) are the proportions of positive and negative observations in bin i, relative to the total positive and negative observations, respectively.
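These formulas can be reproduced with a few lines of R. The bin counts below are made up for illustration and do not come from the package:

```r
# Hypothetical counts for three bins: positives and negatives per bin
pos <- c(20, 50, 30)
neg <- c(120, 180, 100)

p_pos <- pos / sum(pos)   # share of all positives falling in each bin
p_neg <- neg / sum(neg)   # share of all negatives falling in each bin

woe <- log(p_pos / p_neg)           # WoE_i = ln(p_pos_i / p_neg_i)
iv  <- sum((p_pos - p_neg) * woe)   # total IV summed across bins
```

The per-bin `woe` values correspond to the woe element of the returned list, and `iv` corresponds to sum(result$iv).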
V4 includes fixes for stability and corrects the initialization and usage of the internal chi-square cache.
References
Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In AAAI'92.
Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In TAI'95.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Example data
set.seed(123)
target <- sample(0:1, 500, replace = TRUE, prob = c(0.7, 0.3))
feature <- sample(LETTERS[1:8], 500, replace = TRUE)
feature[sample(1:500, 20)] <- NA # Add some NAs
# Run optimal binning with ChiMerge (V4)
result_v4 <- optimal_binning_categorical_cm(target, feature,
                                            min_bins = 3, max_bins = 6,
                                            chi_merge_threshold = 0.05)
print(result_v4)
# Check total IV
print(sum(result_v4$iv))
# Run using the Chi2 algorithm variant
result_chi2_v4 <- optimal_binning_categorical_cm(target, feature,
                                                 min_bins = 3, max_bins = 6,
                                                 use_chi2_algorithm = TRUE)
print(result_chi2_v4)
} # }