Implements optimal binning for numerical variables using the ChiMerge algorithm (Kerber, 1992) and Chi2 algorithm (Liu & Setiono, 1995), calculating Weight of Evidence (WoE) and Information Value (IV) for resulting bins.

Usage

optimal_binning_numerical_cm(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  init_method = "equal_frequency",
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)

Arguments

target

Integer vector of binary target values (0 or 1)

feature

Numeric vector of feature values to bin

min_bins

Minimum number of bins (default: 3)

max_bins

Maximum number of bins (default: 5)

bin_cutoff

Minimum fraction of observations that a bin must contain (default: 0.05)

max_n_prebins

Maximum number of initial bins before merging (default: 20)

convergence_threshold

Convergence threshold on the change in IV between iterations (default: 1e-6)

max_iterations

Maximum number of iterations (default: 1000)

init_method

Method for initial binning: "equal_width" or "equal_frequency" (default: "equal_frequency")

chi_merge_threshold

Significance level for chi-square test (default: 0.05)

use_chi2_algorithm

Whether to use the enhanced Chi2 algorithm (default: FALSE)

Value

A list containing:

  • id: Vector of numeric IDs for each bin

  • bin: Vector of bin names (intervals)

  • woe: Vector of Weight of Evidence values for each bin

  • iv: Vector of Information Value contributions, one per bin

  • count: Vector of total counts for each bin

  • count_pos: Vector of positive class counts for each bin

  • count_neg: Vector of negative class counts for each bin

  • cutpoints: Vector of bin boundaries for prediction

  • converged: Logical indicating whether the algorithm converged

  • iterations: Number of iterations run

  • total_iv: Total Information Value of the feature

  • monotonic: Logical indicating whether WoE is monotonic across the bins

  • algorithm: Which algorithm was used (ChiMerge or Chi2)

  • requested_min_bins: Minimum bins requested in the function call

  • requested_max_bins: Maximum bins requested in the function call
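As an illustration of how cutpoints and woe might be used together for prediction, here is a minimal sketch. The values below are hypothetical stand-ins for result$cutpoints and result$woe, and the sketch assumes that cutpoints holds only the interior boundaries and that bins are closed on the left; check both assumptions against the bin labels of an actual result.

```r
# Hypothetical values standing in for result$cutpoints and result$woe
cutpoints <- c(-0.8, 0.1, 0.9)
woe <- c(-0.55, -0.12, 0.20, 0.61)

new_feature <- c(-2.0, 0.0, 0.5, 3.0)

# findInterval() counts how many cutpoints are <= each value, so adding 1
# gives a 1-based bin index (assumes left-closed intervals)
bin_id <- findInterval(new_feature, cutpoints) + 1L

# Look up the WoE of the bin each new value falls into
new_woe <- woe[bin_id]
```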

Details

The ChiMerge algorithm (Kerber, 1992) uses chi-square statistics to determine when to merge adjacent bins. The chi-square statistic is calculated as:

$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

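where \(O_{ij}\) and \(E_{ij}\) are the observed and expected frequencies for bin \(i\) (one of the two adjacent bins being compared) and class \(j\). The expected frequency is the bin total multiplied by the class total, divided by the number of observations in the two bins.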
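The statistic can be computed by hand for a single pair of adjacent bins. The sketch below is illustrative only: chi_square_pair is a hypothetical helper, not a function exported by the package, and the counts are made up.

```r
# Chi-square statistic for two adjacent bins (rows) and two classes (columns).
# `chi_square_pair` is a hypothetical helper, not part of the package.
chi_square_pair <- function(neg, pos) {
  observed <- cbind(neg = neg, pos = pos)   # 2 x 2 table: bins x classes
  # Expected count = row total * column total / grand total
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  terms <- (observed - expected)^2 / expected
  terms[expected == 0] <- 0                 # empty cells contribute nothing
  sum(terms)
}

# Adjacent bins with similar event rates give a small statistic
chi_similar <- chi_square_pair(neg = c(40, 42), pos = c(10, 11))

# Adjacent bins with clearly different event rates give a large one
chi_different <- chi_square_pair(neg = c(45, 20), pos = c(5, 30))
```

A small value suggests the two bins have similar class distributions and are candidates for merging.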

The Chi2 algorithm (Liu & Setiono, 1995) extends ChiMerge with automated threshold determination and feature selection capabilities.

Weight of Evidence (WoE) is calculated as:

$$\text{WoE} = \ln\left(\frac{P(X|Y=1)}{P(X|Y=0)}\right)$$

Information Value (IV) for each bin is calculated as:

$$\text{IV} = \left(P(X|Y=1) - P(X|Y=0)\right) \times \text{WoE}$$
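The two formulas can be reproduced directly from per-bin class counts. The counts below are hypothetical; in practice they would correspond to the count_pos and count_neg elements of the returned list. This sketch applies no smoothing, so a bin with a zero count would produce an infinite WoE.

```r
# Hypothetical per-bin counts of positives and negatives
count_pos <- c(30, 45, 60, 80)
count_neg <- c(170, 155, 140, 120)

# Class-conditional bin distributions P(X | Y = 1) and P(X | Y = 0)
dist_pos <- count_pos / sum(count_pos)
dist_neg <- count_neg / sum(count_neg)

# Per-bin Weight of Evidence and Information Value
woe <- log(dist_pos / dist_neg)
iv <- (dist_pos - dist_neg) * woe

# Total IV of the feature is the sum over bins
total_iv <- sum(iv)
```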

The algorithm works by:

  1. Creating initial bins based on the specified method (equal frequency or equal width)

  2. Enforcing the maximum bin count constraint if needed

  3. Iteratively merging adjacent bins with the lowest chi-square statistic

  4. Merging bins with frequency below bin_cutoff

  5. Enforcing monotonicity of WoE across bins

  6. Enforcing the bin count constraints a final time

  7. Calculating WoE and IV for the final bins
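The core of the procedure (steps 1 and 3) can be sketched in a few lines of base R. This is a simplified illustration, not the package's implementation: chimerge_sketch and pair_chi2 are hypothetical names, only equal-frequency pre-binning is shown, and the bin_cutoff, min_bins, and monotonicity steps are omitted.

```r
# Hypothetical helper: chi-square statistic for two adjacent bins
pair_chi2 <- function(pos, neg) {
  observed <- cbind(pos, neg)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum(ifelse(expected > 0, (observed - expected)^2 / expected, 0))
}

# Simplified ChiMerge: equal-frequency pre-bins, then greedy merging
chimerge_sketch <- function(target, feature, max_bins = 5L,
                            n_prebins = 20L, alpha = 0.05) {
  # Step 1: equal-frequency pre-binning on quantile breaks
  probs <- seq(0, 1, length.out = n_prebins + 1L)
  breaks <- unique(quantile(feature, probs, names = FALSE))
  inner <- breaks[-c(1L, length(breaks))]   # interior cutpoints only
  bin <- findInterval(feature, inner) + 1L
  n_bins <- length(inner) + 1L
  pos <- tabulate(bin[target == 1], nbins = n_bins)
  neg <- tabulate(bin[target == 0], nbins = n_bins)

  # Critical value for a 2 x 2 table (1 degree of freedom)
  critical <- qchisq(1 - alpha, df = 1)

  # Step 3: merge the most similar adjacent pair until every pair differs
  # significantly and the bin count is within max_bins
  while (length(pos) > 1L) {
    k <- length(pos)
    chi <- vapply(seq_len(k - 1L), function(i) {
      pair_chi2(pos[i:(i + 1L)], neg[i:(i + 1L)])
    }, numeric(1))
    i <- which.min(chi)
    if (chi[i] >= critical && k <= max_bins) break
    pos[i] <- pos[i] + pos[i + 1L]
    pos <- pos[-(i + 1L)]
    neg[i] <- neg[i] + neg[i + 1L]
    neg <- neg[-(i + 1L)]
    inner <- inner[-i]                      # drop the boundary between them
  }
  list(cutpoints = inner, count_pos = pos, count_neg = neg)
}

# Usage on simulated data
set.seed(123)
n <- 1000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 * feature))
fit <- chimerge_sketch(target, feature, max_bins = 5L)
```

Merging always removes the boundary between the two bins with the lowest statistic, so the remaining cutpoints stay sorted and the counts always sum to the number of observations.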

The chi_merge_threshold parameter controls the statistical significance level for merging. A value of 0.05 corresponds to a 95% confidence level.
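For reference, a significance level can be converted into a chi-square critical value with qchisq. The sketch below assumes one degree of freedom, which is what a comparison of two bins across two classes gives; whether the package performs exactly this conversion internally is an assumption.

```r
chi_merge_threshold <- 0.05

# Degrees of freedom for a 2 x 2 table: (2 - 1) * (2 - 1) = 1
critical_value <- qchisq(1 - chi_merge_threshold, df = 1)

# Adjacent bins whose chi-square statistic falls below critical_value are
# not significantly different at this level and are candidates for merging
```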

References:

  • Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, pages 123-128.

  • Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, pages 388-391.

  • Zeng, G. (2014). A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences, 8(65), 3229-3242.

Examples

if (FALSE) { # \dontrun{
# Example data
set.seed(123)
n <- 1000
feature <- rnorm(n)
# Target with some relationship to feature
target <- rbinom(n, 1, plogis(0.5 * feature))

# Run optimal binning with ChiMerge
result <- optimal_binning_numerical_cm(target, feature, min_bins = 3, max_bins = 6)

# Use Chi2 algorithm instead
result_chi2 <- optimal_binning_numerical_cm(target, feature, min_bins = 3, 
                                           max_bins = 6, use_chi2_algorithm = TRUE)

# View results
print(result)
} # }