Optimal Binning for Categorical Variables using ChiMerge
optimal_binning_categorical_cm.Rd
Implements optimal binning for categorical variables using the ChiMerge algorithm (Kerber, 1992), optionally extended with the Chi2 algorithm (Liu & Setiono, 1995), and calculates Weight of Evidence (WoE) and Information Value (IV) for the resulting bins. This is Version 4, incorporating corrections from a previous code review.
Usage
optimal_binning_categorical_cm(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)
Arguments
- target
Integer vector of binary target values (0 or 1). Cannot contain NAs.
- feature
Character vector of categorical feature values. NA values are treated as a distinct category "NA".
- min_bins
Minimum number of bins (default: 3; must be >= 2).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction a category needs in order to avoid being merged during the initial handle_rare_categories step (default: 0.05). Note: the main merging is driven by chi-square statistics.
- max_n_prebins
Maximum number of bins allowed after the initial pre-binning/rare-category handling step, before the main ChiMerge/Chi2 loop (default: 20). Merging stops once this limit is reached, even if the statistical thresholds have not been met.
- bin_separator
Separator string for concatenating category names in bins (default: "%;%").
- convergence_threshold
Threshold for convergence based on the absolute difference in minimum chi-square between iterations during bin merging (default: 1e-6).
- max_iterations
Maximum number of iterations allowed for the bin merging loop (default: 1000).
- chi_merge_threshold
Significance level for the chi-square test used in merging decisions (default: 0.05, corresponding to 95% confidence). Lower values raise the critical chi-square value, making merges more likely.
- use_chi2_algorithm
Logical; whether to use the enhanced Chi2 algorithm, which runs multiple ChiMerge phases with decreasing significance levels (default: FALSE).
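The multi-phase idea behind the Chi2 variant can be sketched as follows. This is illustrative only; the actual phase schedule and merge routine are internal to the package, and the alpha values shown are hypothetical:

```r
# Hypothetical schedule of decreasing significance levels for Chi2 phases
for (alpha in c(0.5, 0.1, 0.05, 0.01)) {
  critical <- qchisq(1 - alpha, df = 1)  # critical value for a 2x2 table
  # ... run one ChiMerge pass, merging adjacent bin pairs whose
  #     chi-square statistic falls below `critical` ...
}
```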
Value
A list containing:
- id
Vector of numeric IDs (1-based) for each final bin.
- bin
Vector of character strings representing the final bins (concatenated category names).
- woe
Vector of numeric Weight of Evidence (WoE) values for each bin.
- iv
Vector of numeric Information Value (IV) contributions for each bin.
- count
Vector of integer total counts (observations) for each bin.
- count_pos
Vector of integer positive class counts for each bin.
- count_neg
Vector of integer negative class counts for each bin.
- converged
Boolean indicating whether the merging algorithm converged (either reached target bins, statistical threshold, or convergence threshold).
- iterations
Integer number of merging iterations performed.
- algorithm
Character string indicating the algorithm used ("ChiMerge" or "Chi2").
Details
This function implements categorical variable binning based on chi-square statistics.
The core logic follows the ChiMerge approach: bins are sorted by WoE, and the pair of adjacent bins with the lowest chi-square statistic is merged whenever that statistic falls below the critical value derived from chi_merge_threshold.
The optional Chi2 algorithm applies multiple rounds of ChiMerge with decreasing significance levels.
Monotonicity of WoE across the final bins is enforced by merging adjacent bins that violate the trend.
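As a sketch of the merge criterion (not the package's internal code), the chi-square statistic for a pair of adjacent bins, each summarized by its positive and negative counts, can be computed like this:

```r
# Illustrative only: 2x2 chi-square statistic for two adjacent bins.
# The function name and counts are hypothetical.
chi_sq_pair <- function(pos1, neg1, pos2, neg2) {
  obs <- matrix(c(pos1, neg1, pos2, neg2), nrow = 2, byrow = TRUE)
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)  # expected under independence
  sum((obs - expd)^2 / expd)
}

# Merge if the statistic is below the critical value at chi_merge_threshold:
critical <- qchisq(1 - 0.05, df = 1)   # ~3.841 for a 2x2 table
chi_sq_pair(30, 70, 35, 65) < critical # TRUE: the pair would be merged
```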
Weight of Evidence (WoE) is calculated as:
\(WoE_i = \ln\left(\frac{p_{pos,i}}{p_{neg,i}}\right)\)
Information Value (IV) is calculated as:
\(IV = \sum_{i} (p_{pos,i} - p_{neg,i}) \times WoE_i\)
where \(p_{pos,i}\) and \(p_{neg,i}\) are the proportions of positive and negative observations in bin i, relative to the total positive and negative observations, respectively.
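These formulas can be reproduced with a few lines of R. The bin counts below are made up for illustration and do not come from the package:

```r
# Hypothetical counts for three bins: positives and negatives per bin
pos <- c(20, 50, 30)
neg <- c(120, 180, 100)

p_pos <- pos / sum(pos)   # share of all positives falling in each bin
p_neg <- neg / sum(neg)   # share of all negatives falling in each bin

woe <- log(p_pos / p_neg)           # WoE_i = ln(p_pos_i / p_neg_i)
iv  <- sum((p_pos - p_neg) * woe)   # total IV summed across bins
```

The per-bin `woe` values correspond to the woe element of the returned list, and `iv` corresponds to sum(result$iv).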
V4 includes fixes for stability and corrects the initialization and usage of the internal chi-square cache.
References
Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In AAAI'92.
Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In TAI'95.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Example data
set.seed(123)
target <- sample(0:1, 500, replace = TRUE, prob = c(0.7, 0.3))
feature <- sample(LETTERS[1:8], 500, replace = TRUE)
feature[sample(1:500, 20)] <- NA # Add some NAs
# Run optimal binning with ChiMerge (V4)
result_v4 <- optimal_binning_categorical_cm(target, feature,
                                            min_bins = 3, max_bins = 6,
                                            chi_merge_threshold = 0.05)
print(result_v4)
# Check total IV
print(sum(result_v4$iv))
# Run using the Chi2 algorithm variant
result_chi2_v4 <- optimal_binning_categorical_cm(target, feature,
                                                 min_bins = 3, max_bins = 6,
                                                 use_chi2_algorithm = TRUE)
print(result_chi2_v4)
} # }