
Optimal Binning for Categorical Variables using Enhanced ChiMerge Algorithm
Source:R/obc_cm.R
ob_categorical_cm.RdPerforms supervised discretization of categorical variables using an enhanced implementation of the ChiMerge algorithm (Kerber, 1992) with optional Chi2 extension (Liu & Setiono, 1995). This method optimally groups categorical levels based on their relationship with a binary target variable to maximize predictive power while maintaining statistical significance.
Usage
ob_categorical_cm(
feature,
target,
min_bins = 3,
max_bins = 5,
bin_cutoff = 0.05,
max_n_prebins = 20,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000,
chi_merge_threshold = 0.05,
use_chi2_algorithm = FALSE
)Arguments
- feature
A character vector or factor representing the categorical predictor variable to be binned.
- target
An integer vector of binary outcomes (0/1) corresponding to each observation in
feature.- min_bins
Integer. Minimum number of bins to produce. Must be >= 2. Defaults to 3.
- max_bins
Integer. Maximum number of bins to produce. Must be >=
min_bins. Defaults to 5.- bin_cutoff
Numeric. Threshold for treating categories as rare. Categories with frequency <
bin_cutoffwill be merged with their most similar neighbors. Value must be in (0, 1). Defaults to 0.05.- max_n_prebins
Integer. Maximum number of initial pre-bins before merging. Controls computational complexity. Must be >= 2. Defaults to 20.
- bin_separator
String. Separator used when combining multiple categories into a single bin label. Defaults to "%;%".
- convergence_threshold
Numeric. Convergence tolerance for iterative merging process. Smaller values require stricter convergence. Must be > 0. Defaults to 1e-6.
- max_iterations
Integer. Maximum iterations for the merging algorithm. Prevents infinite loops. Must be > 0. Defaults to 1000.
- chi_merge_threshold
Numeric. Statistical significance level (p-value) for chi-square tests during merging. Higher values create fewer bins. Value must be in (0, 1). Defaults to 0.05.
- use_chi2_algorithm
Logical. If TRUE, uses the Chi2 variant which performs multi-pass merging with decreasing significance thresholds. Defaults to FALSE.
Value
A list containing binning results with the following components:
id: Integer vector of bin identifiers (1:n_bins)bin: Character vector of bin labels (merged category names)woe: Numeric vector of Weight of Evidence for each biniv: Numeric vector of Information Value contribution per bincount: Integer vector of total observations per bincount_pos: Integer vector of positive cases per bincount_neg: Integer vector of negative cases per binconverged: Logical indicating if algorithm convergediterations: Integer count of algorithm iterations performedalgorithm: Character string identifying algorithm usedwarnings: Character vector of any warnings encounteredmetadata: List with additional diagnostic information:total_iv: Total Information Value of the binned variablen_bins: Final number of bins producedunique_categories: Number of unique input categoriestotal_obs: Total number of observations processedexecution_time_ms: Processing time in millisecondsmonotonic: Direction of WoE monotonicity ("increasing"/"decreasing")
Details
The algorithm implements two main approaches:
1. Standard ChiMerge: Iteratively merges adjacent bins with lowest chi-square statistics until all remaining pairs are statistically distinguishable at the specified significance level.
2. Chi2 Algorithm (when use_chi2_algorithm = TRUE): Performs
multiple passes with decreasing significance thresholds (0.5 → 0.001),
creating more robust binning structures particularly for noisy data.
Key features include:
Rare category handling through pre-merging
Monotonicity enforcement of Weight of Evidence
Numerical stability with underflow protection
Efficient chi-square caching for performance
Comprehensive input validation and error handling
Information Value interpretation:
< 0.02: Predictive power not useful
0.02-0.1: Weak predictive power
0.1-0.3: Medium predictive power
0.3-0.5: Strong predictive power
> 0.5: Suspiciously high (potential overfitting)
References
Kerber, R. (1992). ChiMerge: Discretization of numeric attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence (pp. 123-128).
Liu, B., & Setiono, R. (1995). Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (pp. 372-377).
Examples
# Example 1: Basic usage with synthetic data
set.seed(123)
n <- 1000
categories <- c("A", "B", "C", "D", "E", "F", "G", "H")
feature <- sample(categories, n, replace = TRUE, prob = c(
0.2, 0.15, 0.15,
0.1, 0.1, 0.1,
0.1, 0.1
))
# Create target with some association to categories
probs <- c(0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8, 0.85) # increasing probability
target <- sapply(seq_along(feature), function(i) {
cat_idx <- which(categories == feature[i])
rbinom(1, 1, probs[cat_idx])
})
result <- ob_categorical_cm(feature, target)
print(result[c("bin", "woe", "iv", "count")])
#> $bin
#> [1] "A" "B" "C%;%D" "E%;%G" "F%;%H"
#>
#> $woe
#> [1] -1.18847425 -0.80239061 -0.01771012 0.87051615 1.37170669
#>
#> $iv
#> [1] 2.801835e-01 1.052373e-01 7.275865e-05 1.299588e-01 3.041406e-01
#>
#> $count
#> [1] 198 159 242 195 206
#>
# View metadata
print(paste("Total IV:", round(result$metadata$total_iv, 3)))
#> [1] "Total IV: 0.82"
print(paste("Algorithm converged:", result$converged))
#> [1] "Algorithm converged: TRUE"
# Example 2: Using Chi2 algorithm for more conservative binning
result_chi2 <- ob_categorical_cm(feature, target,
use_chi2_algorithm = TRUE,
max_bins = 6
)
# Compare number of bins
cat("Standard ChiMerge bins:", result$metadata$n_bins, "\n")
#> Standard ChiMerge bins: 5
cat("Chi2 algorithm bins:", result_chi2$metadata$n_bins, "\n")
#> Chi2 algorithm bins: 6