Optimal Binning for Numerical Variables using ChiMerge
Implements optimal binning for numerical variables using the ChiMerge algorithm (Kerber, 1992) or, optionally, the Chi2 algorithm (Liu & Setiono, 1995), and calculates Weight of Evidence (WoE) and Information Value (IV) for the resulting bins.
Usage
optimal_binning_numerical_cm(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  init_method = "equal_frequency",
  chi_merge_threshold = 0.05,
  use_chi2_algorithm = FALSE
)
Arguments
- target
Integer vector of binary target values (0 or 1)
- feature
Numeric vector of feature values to bin
- min_bins
Minimum number of bins (default: 3)
- max_bins
Maximum number of bins (default: 5)
- bin_cutoff
Minimum relative frequency (share of all observations) for a bin; bins below this share are merged (default: 0.05)
- max_n_prebins
Maximum number of initial bins before merging (default: 20)
- convergence_threshold
Threshold for convergence in IV difference (default: 1e-6)
- max_iterations
Maximum number of iterations (default: 1000)
- init_method
Method for initial binning: "equal_width" or "equal_frequency" (default: "equal_frequency")
- chi_merge_threshold
Significance level for chi-square test (default: 0.05)
- use_chi2_algorithm
Whether to use the enhanced Chi2 algorithm (default: FALSE)
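The two init_method options can be illustrated with a short base R sketch. The helper initial_edges() below is hypothetical and is not part of the package; it only shows the general idea of equal-width versus equal-frequency pre-binning, and the function's internal pre-binning may differ in detail.

```r
# Hypothetical helper (not part of the package): pre-bin edges for a numeric vector.
initial_edges <- function(x, n_bins, method = c("equal_frequency", "equal_width")) {
  method <- match.arg(method)
  if (method == "equal_width") {
    # Edges evenly spaced between the minimum and the maximum.
    edges <- seq(min(x), max(x), length.out = n_bins + 1)
  } else {
    # Edges at evenly spaced quantiles, so each bin holds a similar count.
    edges <- quantile(x, probs = seq(0, 1, length.out = n_bins + 1), names = FALSE)
  }
  unique(edges)  # ties in the data can produce duplicate edges
}

set.seed(1)
x <- rexp(1000)  # skewed data makes the difference visible

ew <- initial_edges(x, 5, "equal_width")
ef <- initial_edges(x, 5, "equal_frequency")

counts_ew <- table(cut(x, ew, include.lowest = TRUE))
counts_ef <- table(cut(x, ef, include.lowest = TRUE))
print(counts_ew)  # uneven counts on skewed data
print(counts_ef)  # roughly equal counts
```

On skewed data, equal-width bins tend to leave the upper bins sparsely populated, which is one reason equal-frequency pre-binning is a common default.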
Value
A list containing:
id: Vector of numeric IDs for each bin
bin: Vector of bin names (intervals)
woe: Vector of Weight of Evidence values for each bin
iv: Vector of Information Value for each bin
count: Vector of total counts for each bin
count_pos: Vector of positive class counts for each bin
count_neg: Vector of negative class counts for each bin
cutpoints: Vector of bin boundaries for prediction
converged: Boolean indicating whether the algorithm converged
iterations: Number of iterations run
total_iv: Total Information Value of the feature
monotonic: Boolean indicating if the bins have monotonic WoE
algorithm: Which algorithm was used (ChiMerge or Chi2)
requested_min_bins: Minimum bins requested in the function call
requested_max_bins: Maximum bins requested in the function call
Details
The ChiMerge algorithm (Kerber, 1992) uses chi-square statistics to determine when to merge adjacent bins. The chi-square statistic is calculated as:
$$\chi^2 = \sum_{i=1}^{2}\sum_{j=1}^{2} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
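where \(O_{ij}\) is the observed frequency and \(E_{ij}\) is the expected frequency for bin i and class j. The index i runs over the two adjacent bins being compared and j over the two target classes; under the hypothesis that class is independent of bin, \(E_{ij}\) is the row total for bin i times the column total for class j, divided by the combined count of both bins.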
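The statistic can be computed for a single pair of adjacent bins in a few lines of base R. The helper chi_square_pair() is a hypothetical name used for illustration, not a function exported by the package, and the counts are made up.

```r
# Hypothetical helper: chi-square statistic for two adjacent bins.
# Each bin is given as c(neg = count of 0s, pos = count of 1s).
chi_square_pair <- function(bin_a, bin_b) {
  observed <- rbind(bin_a, bin_b)  # 2 x 2 table: bins by classes
  # Expected counts under independence: row total * column total / grand total.
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  terms <- (observed - expected)^2 / expected
  sum(terms[expected > 0])         # skip cells whose expected count is zero
}

# Bins with clearly different event rates give a large statistic.
chi_different <- chi_square_pair(c(neg = 40, pos = 10), c(neg = 20, pos = 30))

# Bins with identical event rates give a statistic of zero.
chi_same <- chi_square_pair(c(neg = 10, pos = 10), c(neg = 20, pos = 20))

print(c(different = chi_different, same = chi_same))
```

A low value means the two bins have similar class distributions, which is why the pair with the lowest statistic is the first candidate for merging.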
The Chi2 algorithm (Liu & Setiono, 1995) extends ChiMerge with automated threshold determination and feature selection capabilities.
Weight of Evidence (WoE) is calculated as:
$$WoE = \ln\left(\frac{P(X|Y=1)}{P(X|Y=0)}\right)$$
Information Value (IV) for each bin is calculated as:
$$IV = \left(P(X|Y=1) - P(X|Y=0)\right) \times WoE$$
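A small worked example of the WoE and IV formulas, using made-up bin counts. Here \(P(X|Y=1)\) is taken as the share of all positive cases that fall in a bin, and \(P(X|Y=0)\) as the share of all negative cases. The sketch applies no smoothing, so a bin with zero positives or zero negatives would produce an infinite WoE; how the package handles that case is not shown here.

```r
# Made-up counts per bin, for illustration only.
count_pos <- c(30, 50, 120)    # positives (target == 1) in each bin
count_neg <- c(270, 250, 280)  # negatives (target == 0) in each bin

# Share of each class that falls into each bin.
dist_pos <- count_pos / sum(count_pos)  # P(X | Y = 1)
dist_neg <- count_neg / sum(count_neg)  # P(X | Y = 0)

woe <- log(dist_pos / dist_neg)         # Weight of Evidence per bin
iv  <- (dist_pos - dist_neg) * woe      # Information Value per bin
total_iv <- sum(iv)                     # Information Value of the feature

print(data.frame(count_pos, count_neg, woe, iv))
print(total_iv)
```

Each per-bin IV term is non-negative, because the difference of the two shares and the logarithm of their ratio always have the same sign.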
The algorithm works by:
1. Creating initial bins based on the specified method (equal frequency or equal width)
2. Enforcing the maximum bin count constraint if needed
3. Iteratively merging the pair of adjacent bins with the lowest chi-square statistic
4. Merging bins with frequency below bin_cutoff
5. Enforcing monotonicity of WoE across bins
6. Enforcing the bin count constraints a final time
7. Calculating WoE and IV for the final bins
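The core merging step can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: chimerge_sketch() and pair_chi2() are hypothetical names, the stopping rule is one plausible reading of ChiMerge, and the bin_cutoff, min_bins, and monotonicity steps are omitted.

```r
# Chi-square statistic for two adjacent bins (two rows of a counts matrix).
pair_chi2 <- function(a, b) {
  observed <- rbind(a, b)
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum(((observed - expected)^2 / expected)[expected > 0])
}

# Hypothetical simplified merge loop. `counts` has one row per bin, ordered by
# feature value, and one column per target class.
chimerge_sketch <- function(counts, max_bins, threshold) {
  while (nrow(counts) > 1) {
    chi <- vapply(seq_len(nrow(counts) - 1),
                  function(i) pair_chi2(counts[i, ], counts[i + 1, ]),
                  numeric(1))
    k <- which.min(chi)  # most similar adjacent pair
    # Stop once the bin limit is met and all adjacent pairs differ significantly.
    if (nrow(counts) <= max_bins && chi[k] >= threshold) break
    counts[k, ] <- counts[k, ] + counts[k + 1, ]  # merge bin k + 1 into bin k
    counts <- counts[-(k + 1), , drop = FALSE]
  }
  counts  # row names still refer to the first pre-bin of each merged bin
}

set.seed(123)
feature <- rnorm(1000)
target <- rbinom(1000, 1, plogis(0.5 * feature))

# 20 equal-frequency pre-bins, then class counts per pre-bin.
edges <- unique(quantile(feature, probs = seq(0, 1, length.out = 21), names = FALSE))
prebins <- cut(feature, edges, include.lowest = TRUE)
counts <- unclass(table(prebins, target))

merged <- chimerge_sketch(counts, max_bins = 5, threshold = qchisq(0.95, df = 1))
print(merged)
```

Merging only adjacent rows keeps the bins contiguous along the feature axis, and the class totals are preserved at every step because a merge only adds two rows together.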
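The chi_merge_threshold parameter sets the significance level of the chi-square test that decides whether adjacent bins are merged. A value of 0.05 corresponds to a 95% confidence level: adjacent bins whose chi-square statistic is not significant at that level are treated as similar and are candidates for merging.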
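For a binary target, comparing two adjacent bins gives a 2 x 2 table with one degree of freedom. Assuming the significance level is converted to a chi-square critical value in the usual way (an assumption about the internals, not something stated above), the conversion looks like this in base R:

```r
alpha <- 0.05  # same meaning as chi_merge_threshold

# Critical value of the chi-square distribution with 1 degree of freedom.
critical_value <- qchisq(1 - alpha, df = 1)
print(critical_value)  # approximately 3.84

# p-value for a hypothetical observed statistic of 2.5.
p_value <- pchisq(2.5, df = 1, lower.tail = FALSE)

# 2.5 is below the critical value, so the pair would not be considered
# significantly different and would be a candidate for merging.
print(p_value > alpha)
```

A smaller significance level raises the critical value, so more adjacent pairs fall below it and more merging takes place.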
References:
Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In Proceedings of the Tenth National Conference on Artificial Intelligence, AAAI'92, pages 123-128.
Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, pages 388-391.
Zeng, G. (2014). A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences, 8(65), 3229-3242.
Examples
if (FALSE) { # \dontrun{
  # Example data
  set.seed(123)
  n <- 1000
  feature <- rnorm(n)

  # Target with some relationship to feature
  target <- rbinom(n, 1, plogis(0.5 * feature))

  # Run optimal binning with ChiMerge
  result <- optimal_binning_numerical_cm(target, feature, min_bins = 3, max_bins = 6)

  # Use Chi2 algorithm instead
  result_chi2 <- optimal_binning_numerical_cm(target, feature, min_bins = 3,
                                              max_bins = 6, use_chi2_algorithm = TRUE)

  # View results
  print(result)
} # }