Implements optimal binning for categorical variables using the ChiMerge algorithm (Kerber, 1992) and optionally the Chi2 algorithm (Liu & Setiono, 1995), calculating Weight of Evidence (WoE) and Information Value (IV) for the resulting bins. This is version 4, incorporating corrections from a previous code review.

Usage

optimal_binning_categorical_cm(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)

Arguments

target

Integer vector of binary target values (0 or 1). Cannot contain NAs.

feature

Character vector of categorical feature values. NA values will be treated as a distinct category "NA".

min_bins

Minimum number of bins (default: 3, must be >= 2).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum frequency fraction for a category to potentially avoid being merged in the initial handle_rare_categories step (default: 0.05). Note: The main merging uses chi-square statistics.

max_n_prebins

Maximum number of bins allowed after the initial pre-binning/rare handling step, before the main ChiMerge/Chi2 loop (default: 20). Merging stops if this limit is reached and statistical thresholds aren't met.

bin_separator

Separator string for concatenating category names in bins (default: "%;%").

convergence_threshold

Threshold for convergence based on the absolute difference in minimum chi-square between iterations during bin merging (default: 1e-6).

max_iterations

Maximum number of iterations allowed for the bin merging loop (default: 1000).

chi_merge_threshold

Significance level for the chi-square test used in merging decisions (default: 0.05, corresponding to 95% confidence). Lower values raise the critical chi-square value and therefore lead to more merging.

use_chi2_algorithm

Boolean indicating whether to use the enhanced Chi2 algorithm which involves multiple ChiMerge phases with decreasing significance levels (default: FALSE).

Value

A list containing:

id

Vector of numeric IDs (1-based) for each final bin.

bin

Vector of character strings representing the final bins (concatenated category names).

woe

Vector of numeric Weight of Evidence (WoE) values for each bin.

iv

Vector of numeric Information Value (IV) contributions for each bin.

count

Vector of integer total counts (observations) for each bin.

count_pos

Vector of integer positive class counts for each bin.

count_neg

Vector of integer negative class counts for each bin.

converged

Boolean indicating whether the merging algorithm converged (it reached the target number of bins, the statistical threshold, or the convergence threshold).

iterations

Integer number of merging iterations performed.

algorithm

Character string indicating the algorithm used ("ChiMerge" or "Chi2").
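
The returned vectors align by position, so a category-to-WoE lookup can be built directly from them. A minimal sketch (the sample result below is illustrative, not real output of the function):

```r
# Illustrative result in the documented shape (bin labels use the
# default "%;%" separator; values are made up for demonstration)
result <- list(
  bin = c("A%;%B", "C", "D%;%E"),
  woe = c(-0.41, 0.10, 0.65)
)

# Split each bin label on the separator and map every category to its WoE
woe_map <- unlist(lapply(seq_along(result$bin), function(i) {
  cats <- strsplit(result$bin[i], "%;%", fixed = TRUE)[[1]]
  setNames(rep(result$woe[i], length(cats)), cats)
}))

woe_map[c("A", "C", "E")]  # -0.41  0.10  0.65
```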

Details

This function implements categorical variable binning based on chi-square statistics. The core logic follows the ChiMerge approach, iteratively merging adjacent bins (sorted by WoE) that have the lowest chi-square statistic below a specified critical value (derived from chi_merge_threshold). The optional Chi2 algorithm applies multiple rounds of ChiMerge with varying significance levels. Monotonicity of WoE across the final bins is enforced by merging adjacent bins that violate the trend.
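
The merge criterion described above can be sketched with base R. The counts below are illustrative; the comparison against the critical value derived from chi_merge_threshold (alpha) is the decision the algorithm makes for the adjacent pair with the lowest statistic:

```r
# 2x2 contingency table for a pair of adjacent bins (illustrative counts)
bin1 <- c(pos = 30, neg = 70)
bin2 <- c(pos = 35, neg = 65)
tab <- rbind(bin1, bin2)

# Pearson chi-square statistic without continuity correction
chi_sq <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)

# Critical value implied by chi_merge_threshold; df = 1 for a binary target
alpha <- 0.05
critical <- qchisq(1 - alpha, df = 1)  # about 3.84

chi_sq < critical  # TRUE here, so ChiMerge would merge this pair
```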

Weight of Evidence (WoE) for bin i is calculated as:

\(WoE_i = \ln\left(\frac{p_{pos,i}}{p_{neg,i}}\right)\)

Information Value (IV) is calculated as:

\(IV = \sum_{i} (p_{pos,i} - p_{neg,i}) \times WoE_i\)

where \(p_{pos,i}\) and \(p_{neg,i}\) are the proportions of positive and negative observations in bin i relative to the total positive and negative observations, respectively.
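
The two formulas can be checked with a short worked calculation (bin counts below are illustrative):

```r
# Illustrative per-bin counts for a binary target
count_pos <- c(20, 50, 30)    # positives in each bin
count_neg <- c(120, 180, 100) # negatives in each bin

# Share of all positives / negatives falling in each bin
p_pos <- count_pos / sum(count_pos)
p_neg <- count_neg / sum(count_neg)

woe <- log(p_pos / p_neg)
iv  <- sum((p_pos - p_neg) * woe)

round(woe, 3)  # -0.405  0.105  0.182
round(iv, 4)   # 0.0549
```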

V4 includes fixes for stability and corrects the initialization and usage of the internal chi-square cache.

References

  • Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In AAAI'92.

  • Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In TAI'95.

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Examples

if (FALSE) { # \dontrun{
# Example data
set.seed(123)
target <- sample(0:1, 500, replace = TRUE, prob = c(0.7, 0.3))
feature <- sample(LETTERS[1:8], 500, replace = TRUE)
feature[sample(1:500, 20)] <- NA # Add some NAs

# Run optimal binning with ChiMerge (V4)
result_v4 <- optimal_binning_categorical_cm(target, feature,
                                           min_bins = 3, max_bins = 6,
                                           chi_merge_threshold = 0.05)
print(result_v4)

# Check total IV
print(sum(result_v4$iv))

# Run using the Chi2 algorithm variant
result_chi2_v4 <- optimal_binning_categorical_cm(target, feature,
                                               min_bins = 3, max_bins = 6,
                                               use_chi2_algorithm = TRUE)
print(result_chi2_v4)
} # }