Optimal Binning for Categorical Variables using Simulated Annealing — optimal_binning_categorical

Performs optimal binning for categorical variables using an enhanced Simulated Annealing approach. This implementation maximizes Information Value (IV) while maintaining monotonicity in the bins, using Bayesian smoothing for robust estimation and adaptive temperature scheduling for better convergence.

Usage

optimal_binning_categorical_sab(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  initial_temperature = 1,
  cooling_rate = 0.995,
  max_iterations = 1000L,
  convergence_threshold = 1e-06,
  adaptive_cooling = TRUE
)

Arguments

target: An integer vector of binary target values (0 or 1).
feature: A character vector of categorical feature values.
min_bins: Minimum number of bins (default: 3).
max_bins: Maximum number of bins (default: 5).
bin_cutoff: Minimum proportion of observations in a bin (default: 0.05).
max_n_prebins: Maximum number of pre-bins (default: 20).
bin_separator: Separator string for merging categories (default: "%;%").
initial_temperature: Initial temperature for Simulated Annealing (default: 1.0).
cooling_rate: Cooling rate for Simulated Annealing (default: 0.995).
max_iterations: Maximum number of iterations for Simulated Annealing (default: 1000).
convergence_threshold: Threshold for convergence (default: 1e-6).
adaptive_cooling: Whether to use adaptive cooling schedule (default: TRUE).

Value

A list containing the following elements:

id: Numeric vector of bin identifiers.
bin: Character vector of bin names.
woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
iv: Numeric vector of Information Value (IV) for each bin.
count: Integer vector of total counts for each bin.
count_pos: Integer vector of positive counts for each bin.
count_neg: Integer vector of negative counts for each bin.
total_iv: Total Information Value of the binning.
converged: Logical value indicating whether the algorithm converged.
iterations: Integer value indicating the number of iterations run.

Details

This enhanced version of the Simulated Annealing Binning (SAB) algorithm implements several key improvements over traditional approaches:

Mathematical Framework:

The Weight of Evidence (WoE) with Bayesian smoothing is calculated as:

$$WoE_i = \ln\left(\frac{p_i^*}{q_i^*}\right)$$

where:

$p_i^* = \frac{n_i^+ + \alpha \cdot \pi}{N^+ + \alpha}$ is the smoothed proportion of events in bin i
$q_i^* = \frac{n_i^- + \alpha \cdot (1-\pi)}{N^- + \alpha}$ is the smoothed proportion of non-events in bin i
$\pi = \frac{N^+}{N^+ + N^-}$ is the overall event rate
$\alpha$ is the prior strength parameter (default: 0.5)
$n_i^+$ is the count of events in bin i
$n_i^-$ is the count of non-events in bin i
$N^+$ is the total number of events
$N^-$ is the total number of non-events

The Information Value (IV) for each bin is calculated as:

$$IV_i = (p_i^* - q_i^*) \times WoE_i$$

Simulated Annealing:

The algorithm uses an enhanced version of Simulated Annealing with these key features:

Multiple neighborhood generation strategies for better exploration
Adaptive temperature scheduling to escape local optima
Periodic restarting from the best known solution
Smart initialization using event rates for better starting points

The probability of accepting a worse solution is calculated as:

$$P(accept) = \exp\left(\frac{\Delta IV}{T}\right)$$

where $\Delta IV$ is the change in Information Value and $T$ is the current temperature.

Algorithm Phases:

Initialization: Create initial bin assignments using a kmeans-like strategy based on event rates
Optimization: Apply Simulated Annealing to find the optimal assignment of categories to bins
Monotonicity Enforcement: Ensure the final solution has monotonic bin event rates

Key Features:

Bayesian smoothing for robust estimation with small samples
Multiple neighbor generation strategies for better search space exploration
Adaptive temperature scheduling to escape local optima
Smart initialization for better starting points
Strong monotonicity enforcement
Comprehensive handling of edge cases

References

Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220(4598), 671-680.
Belotti, T., Crook, J. (2009). Credit Scoring with Macroeconomic Variables Using Survival Analysis. Journal of the Operational Research Society, 60(12), 1699-1707.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. arXiv preprint arXiv:1711.05095.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The annals of applied statistics, 2(4), 1360-1383.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.

Examples

if (FALSE) { # \dontrun{
# Basic usage
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)
result <- optimal_binning_categorical_sab(target, feature)
print(result)

# Adjust simulated annealing parameters
result2 <- optimal_binning_categorical_sab(
  target, feature,
  min_bins = 2,
  max_bins = 4,
  initial_temperature = 2.0,
  cooling_rate = 0.99,
  max_iterations = 2000
)
} # }