Optimal Binning for Categorical Variables using MILP
optimal_binning_categorical_milp.Rd
Performs optimal binning for categorical variables using an approach inspired by Mixed Integer Linear Programming (MILP), with enhanced statistical robustness. It creates optimal bins for a categorical feature based on its relationship with a binary target variable, maximizing predictive power while respecting user-defined constraints. The implementation includes Bayesian smoothing, which improves stability on small samples, and similarity-based merging strategies.
Usage
optimal_binning_categorical_milp(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of total observations for a bin to avoid being merged (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before the optimization process (default: 20).
- bin_separator
Separator used to join categories within a bin (default: "%;%").
- convergence_threshold
Threshold for convergence of total Information Value (default: 1e-6).
- max_iterations
Maximum number of iterations for the optimization process (default: 1000).
Value
A list containing the following elements:
id: Numeric vector of bin identifiers.
bin: Character vector of bin categories.
woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
iv: Numeric vector of Information Value (IV) for each bin.
count: Integer vector of total observations in each bin.
count_pos: Integer vector of positive target observations in each bin.
count_neg: Integer vector of negative target observations in each bin.
total_iv: Total Information Value of the binning.
converged: Logical indicating whether the algorithm converged.
iterations: Integer indicating the number of iterations run.
Details
This enhanced version of the Optimal Binning algorithm for categorical variables implements several key improvements over traditional approaches:
Mathematical Framework:
The Weight of Evidence (WoE) with Bayesian smoothing is calculated as:
$$WoE_i = \ln\left(\frac{p_i^*}{q_i^*}\right)$$
where:
\(p_i^* = \frac{n_i^+ + \alpha \cdot \pi}{N^+ + \alpha}\) is the smoothed proportion of events in bin i
\(q_i^* = \frac{n_i^- + \alpha \cdot (1-\pi)}{N^- + \alpha}\) is the smoothed proportion of non-events in bin i
\(\pi = \frac{N^+}{N^+ + N^-}\) is the overall event rate
\(\alpha\) is the prior strength parameter (default: 0.5)
\(n_i^+\) is the count of events in bin i
\(n_i^-\) is the count of non-events in bin i
\(N^+\) is the total number of events
\(N^-\) is the total number of non-events
The Information Value (IV) for each bin is calculated as:
$$IV_i = (p_i^* - q_i^*) \times WoE_i$$
The total IV is:
$$IV_{total} = \sum_{i=1}^{k} |IV_i|$$
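As a numeric check of the formulas above, the following sketch (in Python for illustration only; it is not the package's implementation) computes the Bayesian-smoothed WoE and IV for two toy bins:

```python
import math

def smoothed_woe_iv(pos, neg, total_pos, total_neg, alpha=0.5):
    """WoE and IV for one bin with Bayesian smoothing (prior strength alpha)."""
    pi = total_pos / (total_pos + total_neg)            # overall event rate
    p = (pos + alpha * pi) / (total_pos + alpha)        # smoothed event proportion
    q = (neg + alpha * (1 - pi)) / (total_neg + alpha)  # smoothed non-event proportion
    woe = math.log(p / q)
    return woe, (p - q) * woe

# Two toy bins as (events, non-events); totals: 60 events, 140 non-events
bins = [(40, 60), (20, 80)]
N_pos, N_neg = 60, 140
total_iv = sum(abs(smoothed_woe_iv(p, n, N_pos, N_neg)[1]) for p, n in bins)
```

A bin with an event rate above the overall rate gets a positive WoE, one below gets a negative WoE, and the smoothing terms keep both proportions strictly positive even when a bin has zero events or zero non-events.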
Algorithm Phases:
1. Initialization: Create bins for each unique category with comprehensive statistics.
2. Pre-binning: Reduce to max_n_prebins by merging similar bins based on event rates.
3. Rare Category Merging: Combine categories with frequency below bin_cutoff using a similarity-based strategy.
4. Monotonicity Enforcement: Ensure a monotonic relationship in WoE across bins using adaptive thresholds.
5. Bin Optimization: Iteratively merge bins to maximize IV while respecting constraints.
6. Solution Tracking: Maintain the best solution found during optimization.
Key Enhancements:
Bayesian smoothing for robust estimation of WoE with small samples
Similarity-based bin merging rather than just adjacent bins
Adaptive monotonicity enforcement with violation severity prioritization
Best solution tracking to ensure optimal results
Comprehensive handling of edge cases and rare categories
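Similarity-based merging means any pair of bins with close event rates may be combined, not only adjacent ones. A hypothetical sketch of that strategy (Python for illustration; the package's actual merging also tracks IV and the other constraints):

```python
def merge_similar_bins(bins, max_bins, sep="%;%"):
    """bins: list of (categories, pos, neg). Merge closest-rate pairs
    until at most max_bins remain."""
    bins = [list(b) for b in bins]
    rate = lambda b: b[1] / (b[1] + b[2])  # bin event rate
    while len(bins) > max_bins:
        # find the pair (any pair, not just adjacent) with closest event rates
        i, j = min(
            ((a, b) for a in range(len(bins)) for b in range(a + 1, len(bins))),
            key=lambda ij: abs(rate(bins[ij[0]]) - rate(bins[ij[1]])),
        )
        merged = [bins[i][0] + sep + bins[j][0],   # join labels with bin_separator
                  bins[i][1] + bins[j][1],
                  bins[i][2] + bins[j][2]]
        bins = [b for k, b in enumerate(bins) if k not in (i, j)] + [merged]
    return bins

out = merge_similar_bins([("A", 10, 90), ("B", 12, 88), ("C", 50, 50)], max_bins=2)
```

Here "A" and "B" (event rates 0.10 and 0.12) are merged first, leaving "C" (rate 0.50) intact, which preserves more signal than merging by adjacency alone.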
References
Belotti, P., Kirches, C., Leyffer, S., Linderoth, J., Luedtke, J., & Mahajan, A. (2013). Mixed-integer nonlinear optimization. Acta Numerica, 22, 1-131.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. SSRN Electronic Journal. doi:10.2139/ssrn.2978774
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and its Applications. SIAM.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Examples
if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
n <- 1000
target <- sample(0:1, n, replace = TRUE)
feature <- sample(LETTERS[1:10], n, replace = TRUE)
# Run optimal binning
result <- optimal_binning_categorical_milp(target, feature, min_bins = 2, max_bins = 4)
# Print results
print(result)
# Handle rare categories with lower threshold
result2 <- optimal_binning_categorical_milp(
target, feature,
bin_cutoff = 0.02,
min_bins = 2,
max_bins = 5
)
} # }