Categorical Optimal Binning with Greedy Merge Binning
optimal_binning_categorical_gmb.Rd
Implements optimal binning for categorical variables using a Greedy Merge approach, calculating Weight of Evidence (WoE) and Information Value (IV).
Usage
optimal_binning_categorical_gmb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
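For orientation, a minimal call could look like the sketch below; the simulated data, category names, and event rates are made up for illustration and assume the package exporting this function is loaded:

set.seed(42)
feature <- sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE)
rate <- c(A = 0.05, B = 0.10, C = 0.20, D = 0.35, E = 0.50)
target <- rbinom(1000, 1, rate[feature])   # binary target whose rate varies by category

result <- optimal_binning_categorical_gmb(target, feature,
                                          min_bins = 3L, max_bins = 5L)
result$bin       # merged category labels such as "A%;%B"
result$woe       # one WoE value per bin
result$total_iv  # overall predictive strength of the binning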
Arguments
- target
Integer vector of binary target values (0 or 1).
- feature
Character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum relative frequency a category must have to remain in its own bin; rarer categories are merged (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before merging (default: 20).
- bin_separator
Separator used to join category names when bins are merged (default: "%;%").
- convergence_threshold
Convergence threshold for the iterative optimization (default: 1e-6).
- max_iterations
Maximum number of iterations (default: 1000).
Value
A list with the following elements:
id: Numeric vector of bin identifiers.
bin: Character vector of bin names (merged categories).
woe: Numeric vector of Weight of Evidence values for each bin.
iv: Numeric vector of Information Value for each bin.
count: Integer vector of total count for each bin.
count_pos: Integer vector of positive class count for each bin.
count_neg: Integer vector of negative class count for each bin.
total_iv: Total Information Value of the binning.
converged: Logical indicating whether the algorithm converged.
iterations: Integer indicating the number of iterations performed.
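Because the per-bin vectors are parallel, the result is convenient to inspect as a single table. A minimal sketch, assuming result holds the list returned by a call like the one shown under Usage:

# Assemble the parallel per-bin vectors into one table for inspection.
bin_table <- data.frame(
  id        = result$id,
  bin       = result$bin,
  woe       = result$woe,
  iv        = result$iv,
  count     = result$count,
  count_pos = result$count_pos,
  count_neg = result$count_neg
)
print(bin_table)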
Details
The Greedy Merge Binning (GMB) algorithm finds an optimal binning solution by iteratively merging adjacent bins to maximize Information Value (IV) while respecting constraints on the number of bins.
The Weight of Evidence (WoE) measures the predictive power of a bin and is defined as:
$$WoE_i = \ln\left(\frac{n^+_i/N^+}{n^-_i/N^-}\right)$$
where:
\(n^+_i\) is the number of positive cases in bin i
\(n^-_i\) is the number of negative cases in bin i
\(N^+\) is the total number of positive cases
\(N^-\) is the total number of negative cases
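As a quick numeric check of the formula, with illustrative counts:

n_pos_i <- 30;  N_pos <- 400   # positives in bin i / positives overall
n_neg_i <- 170; N_neg <- 600   # negatives in bin i / negatives overall
log((n_pos_i / N_pos) / (n_neg_i / N_neg))  # about -1.33: bin i is underrepresented among positives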
The Information Value (IV) quantifies the predictive power of the entire binning and is defined as:
$$IV = \sum_{i=1}^{n} (p_i - q_i) \times WoE_i$$
where:
\(p_i = n^+_i/N^+\) is the proportion of positive cases in bin i
\(q_i = n^-_i/N^-\) is the proportion of negative cases in bin i
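Continuing in the same style, the total IV can be computed directly from per-bin counts (illustrative numbers again):

pos <- c(50, 150, 200)     # positives per bin
neg <- c(300, 200, 100)    # negatives per bin
p <- pos / sum(pos)        # p_i: share of all positives in each bin
q <- neg / sum(neg)        # q_i: share of all negatives in each bin
sum((p - q) * log(p / q))  # total IV, about 0.89 here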
This algorithm applies Bayesian smoothing to the WoE calculations to improve stability, particularly with small samples or rare categories; the smoothing adds pseudo-counts derived from the overall population prevalence.
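The exact pseudo-count scheme is internal to the implementation; the sketch below shows one plausible prevalence-based variant, where the function name, strength parameter, and formula are assumptions for illustration only:

# Hypothetical prevalence-based smoothing; the package's actual
# pseudo-count formula may differ.
smoothed_woe <- function(pos, neg, total_pos, total_neg, strength = 0.5) {
  prior_pos <- strength * total_pos / (total_pos + total_neg)  # pseudo-positives
  prior_neg <- strength - prior_pos                            # pseudo-negatives
  p <- (pos + prior_pos) / (total_pos + prior_pos)
  q <- (neg + prior_neg) / (total_neg + prior_neg)
  log(p / q)
}
# A rare bin with zero negatives now gets a finite WoE instead of +Inf:
smoothed_woe(pos = 3, neg = 0, total_pos = 400, total_neg = 600)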
The algorithm proceeds through the following main steps; the greedy merge of step 3 is sketched after the list.
1. Initialize one bin per unique category.
2. Merge rare categories whose frequency falls below bin_cutoff.
3. Iteratively merge the pair of adjacent bins whose merge yields the largest IV.
4. Stop merging once the number of bins is at most max_bins, never dropping below min_bins.
5. Enforce monotonicity of WoE values across bins.
6. Calculate the final WoE and IV for each bin.
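The merging loop of step 3 can be sketched as follows. This is illustrative only: the actual implementation also handles ties, pre-binning, monotonicity, and convergence checks. Bins are assumed pre-sorted by event rate so that "adjacent" is meaningful, and all counts are assumed positive.

greedy_merge <- function(bins, max_bins, total_pos, total_neg) {
  bin_iv <- function(b) {
    p <- b[["pos"]] / total_pos
    q <- b[["neg"]] / total_neg
    (p - q) * log(p / q)
  }
  total_iv <- function(bs) sum(vapply(bs, bin_iv, numeric(1)))
  while (length(bins) > max_bins) {
    best_iv <- -Inf
    best_j  <- 1L
    for (j in seq_len(length(bins) - 1L)) {   # try every adjacent pair
      cand <- bins
      cand[[j]] <- cand[[j]] + cand[[j + 1L]] # merge counts of pair j, j+1
      cand[[j + 1L]] <- NULL
      iv <- total_iv(cand)
      if (iv > best_iv) { best_iv <- iv; best_j <- j }
    }
    bins[[best_j]] <- bins[[best_j]] + bins[[best_j + 1L]]  # keep the best merge
    bins[[best_j + 1L]] <- NULL
  }
  bins
}
bins <- list(c(pos = 5, neg = 40), c(pos = 20, neg = 60),
             c(pos = 35, neg = 45), c(pos = 60, neg = 35))
greedy_merge(bins, max_bins = 3L, total_pos = 120, total_neg = 180)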
Edge cases are handled as follows:
- Empty strings in feature are rejected during input validation.
- Extremely imbalanced datasets (fewer than 5 samples in either class) produce a warning.
- Ties in IV improvement during merging are resolved by preferring the more balanced of the candidate bins.
- Monotonicity violations are addressed with an adaptive threshold based on the average gap between WoE values.
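The adaptive threshold is not fully specified in this documentation; one plausible reading, sketched below with a hypothetical function name and scale factor, tolerates a WoE decrease only when it is small relative to the average absolute gap between consecutive bins:

is_monotone <- function(woe, scale = 0.1) {
  gaps <- diff(woe)
  tol <- scale * mean(abs(gaps))  # adaptive tolerance from the average WoE gap
  all(gaps >= -tol)               # allow only tiny decreases
}
is_monotone(c(-0.9, -0.3, 0.2, 0.7))  # TRUE: increasing WoE
is_monotone(c(-0.9, 0.5, 0.1, 0.7))   # FALSE: sizable dip at the third bin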
References
Beltrami, M., Mach, M., & Dall'Aglio, M. (2021). Monotonic Optimal Binning Algorithm for Credit Risk Modeling. Risks, 9(3), 58.
Siddiqi, N. (2006). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons.
García-Magariño, I., Medrano, C., Lombas, A. S., & Barrasa, A. (2019). A hybrid approach with agent-based simulation and clustering for sociograms. Information Sciences, 499, 47-61.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Lin, X., Wang, G., & Zhang, T. (2022). Efficient monotonic binning for predictive modeling in high-dimensional spaces. Knowledge-Based Systems, 235, 107629.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.