Implements optimal binning for categorical variables using a Greedy Merge approach, calculating Weight of Evidence (WoE) and Information Value (IV).

Usage

optimal_binning_categorical_gmb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

Integer vector of binary target values (0 or 1).

feature

Character vector of categorical feature values.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum relative frequency required for a category to form its own bin; rarer categories are merged (default: 0.05).

max_n_prebins

Maximum number of pre-bins before merging (default: 20).

bin_separator

Separator used for merging category names (default: "%;%").

convergence_threshold

Threshold for convergence (default: 1e-6).

max_iterations

Maximum number of iterations (default: 1000).

Value

A list with the following elements (a usage sketch follows the list):

  • id: Numeric vector of bin identifiers.

  • bin: Character vector of bin names (merged categories).

  • woe: Numeric vector of Weight of Evidence values for each bin.

  • iv: Numeric vector of Information Value for each bin.

  • count: Integer vector of total count for each bin.

  • count_pos: Integer vector of positive class count for each bin.

  • count_neg: Integer vector of negative class count for each bin.

  • total_iv: Total Information Value of the binning.

  • converged: Logical indicating whether the algorithm converged.

  • iterations: Integer indicating the number of iterations performed.
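
For instance, assuming the bin names are simply the member categories joined by bin_separator, the returned WoE values can be mapped back onto the original feature along these lines (result and feature below are hypothetical stand-ins for the function's output and input):

# Hypothetical output (structure as documented above) and input
result <- list(bin = c("A%;%C", "B", "D"), woe = c(0.40, -0.25, 0.10))
feature <- c("A", "B", "A", "C", "D")

# Split merged bin names back into member categories and build a lookup
cats <- strsplit(result$bin, "%;%", fixed = TRUE)
lookup <- setNames(rep(result$woe, lengths(cats)), unlist(cats))

feature_woe <- unname(lookup[feature])  # WoE-encoded feature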

Details

The Greedy Merge Binning (GMB) algorithm finds an optimal binning solution by iteratively merging adjacent bins to maximize Information Value (IV) while respecting constraints on the number of bins.

The Weight of Evidence (WoE) measures the predictive power of a bin and is defined as:

$$WoE_i = \ln\left(\frac{n^+_i/N^+}{n^-_i/N^-}\right)$$

where:

  • \(n^+_i\) is the number of positive cases in bin i

  • \(n^-_i\) is the number of negative cases in bin i

  • \(N^+\) is the total number of positive cases

  • \(N^-\) is the total number of negative cases

The Information Value (IV) quantifies the predictive power of the entire binning and is defined as:

$$IV = \sum_{i=1}^{n} (p_i - q_i) \times WoE_i$$

where:

  • \(p_i = n^+_i/N^+\) is the proportion of positive cases in bin i

  • \(q_i = n^-_i/N^-\) is the proportion of negative cases in bin i
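
Both quantities follow directly from the bin counts. A minimal sketch in base R, using hypothetical counts (and omitting the Bayesian smoothing described below):

# Hypothetical bin counts: positives (n+_i) and negatives (n-_i)
n_pos <- c(30L, 25L, 45L)
n_neg <- c(70L, 40L, 20L)

p <- n_pos / sum(n_pos)         # p_i = n+_i / N+
q <- n_neg / sum(n_neg)         # q_i = n-_i / N-

woe <- log(p / q)               # WoE per bin
total_iv <- sum((p - q) * woe)  # total IV of the binning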

This algorithm applies Bayesian smoothing to the WoE calculations to improve stability, particularly with small samples or rare categories. The smoothing adds pseudo-counts derived from the overall population prevalence.
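
The exact pseudo-count scheme is internal to the implementation; one common prevalence-based variant, shown here purely as an illustrative assumption, is:

# Same hypothetical counts as in the previous sketch
n_pos <- c(30L, 25L, 45L)
n_neg <- c(70L, 40L, 0L)  # a zero count to show the effect of smoothing

# Assumed scheme (alpha and the formula are illustrative, not the
# package's internals): add alpha pseudo-counts split according to the
# overall prevalence pi = N+ / (N+ + N-)
alpha <- 0.5
pi_hat <- sum(n_pos) / (sum(n_pos) + sum(n_neg))

p_s <- (n_pos + alpha * pi_hat) / sum(n_pos + alpha * pi_hat)
q_s <- (n_neg + alpha * (1 - pi_hat)) / sum(n_neg + alpha * (1 - pi_hat))
woe_smoothed <- log(p_s / q_s)  # finite even for the zero-count bin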

The algorithm includes the following main steps:

  1. Initialize bins with each unique category.

  2. Merge rare categories based on the bin_cutoff.

  3. Iteratively merge the adjacent pair of bins whose merge yields the highest total IV (see the sketch after this list).

  4. Stop merging once the number of bins falls within the [min_bins, max_bins] range.

  5. Ensure monotonicity of WoE values across bins.

  6. Calculate final WoE and IV for each bin.
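
A simplified sketch of the greedy merge step (step 3 alone, ignoring bin_cutoff handling, monotonicity enforcement, and convergence checks; function names are illustrative):

# Total IV for a candidate binning; assumes every bin has both classes
iv_total <- function(np, nn) {
  p <- np / sum(np)
  q <- nn / sum(nn)
  sum((p - q) * log(p / q))
}

# Greedily merge the adjacent pair whose merge gives the highest total IV
greedy_merge <- function(n_pos, n_neg, max_bins) {
  while (length(n_pos) > max_bins) {
    best_iv <- -Inf
    best_j <- 1L
    for (j in seq_len(length(n_pos) - 1L)) {
      np <- n_pos; nn <- n_neg
      np[j] <- np[j] + np[j + 1L]
      nn[j] <- nn[j] + nn[j + 1L]
      cand <- iv_total(np[-(j + 1L)], nn[-(j + 1L)])
      if (cand > best_iv) { best_iv <- cand; best_j <- j }
    }
    n_pos[best_j] <- n_pos[best_j] + n_pos[best_j + 1L]
    n_neg[best_j] <- n_neg[best_j] + n_neg[best_j + 1L]
    n_pos <- n_pos[-(best_j + 1L)]
    n_neg <- n_neg[-(best_j + 1L)]
  }
  list(n_pos = n_pos, n_neg = n_neg)
}

greedy_merge(c(5L, 10L, 30L, 25L), c(20L, 15L, 25L, 10L), max_bins = 2L)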

Edge cases are handled as follows:

  • Empty strings in feature are rejected during input validation

  • Extremely imbalanced datasets (< 5 samples in either class) produce a warning

  • When merging bins, ties in IV improvement are resolved by preferring more balanced bins

  • Monotonicity violations are addressed with an adaptive threshold based on average WoE gaps

References

  • Beltrami, M., Mach, M., & Dall'Aglio, M. (2021). Monotonic Optimal Binning Algorithm for Credit Risk Modeling. Risks, 9(3), 58.

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring (Vol. 3). John Wiley & Sons.

  • García-Magariño, I., Medrano, C., Lombas, A. S., & Barrasa, A. (2019). A hybrid approach with agent-based simulation and clustering for sociograms. Information Sciences, 499, 47-61.

  • Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.

  • Lin, X., Wang, G., & Zhang, T. (2022). Efficient monotonic binning for predictive modeling in high-dimensional spaces. Knowledge-Based Systems, 235, 107629.

  • Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.

Author

Lopes, J. E.

Examples

if (FALSE) { # \dontrun{
# Example data
target <- c(1L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L)
feature <- c("A", "B", "A", "C", "B", "D", "C", "A", "D", "B")

# Run optimal binning
result <- optimal_binning_categorical_gmb(target, feature, min_bins = 2, max_bins = 4)

# View results
print(result)
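
# Inspect total predictive power and per-bin WoE
result$total_iv
result$woe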
} # }