
Performs optimal binning for categorical variables using an enhanced Simulated Annealing approach. This implementation maximizes Information Value (IV) while keeping bin event rates monotonic, using Bayesian smoothing for robust estimation and adaptive temperature scheduling for better convergence.

Usage

optimal_binning_categorical_sab(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  initial_temperature = 1,
  cooling_rate = 0.995,
  max_iterations = 1000L,
  convergence_threshold = 1e-06,
  adaptive_cooling = TRUE
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A character vector of categorical feature values.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum proportion of observations in a bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins (default: 20).

bin_separator

Separator string for merging categories (default: "%;%").

initial_temperature

Initial temperature for Simulated Annealing (default: 1.0).

cooling_rate

Cooling rate for Simulated Annealing (default: 0.995).

max_iterations

Maximum number of iterations for Simulated Annealing (default: 1000).

convergence_threshold

Threshold for convergence (default: 1e-6).

adaptive_cooling

Whether to use adaptive cooling schedule (default: TRUE).

Value

A list containing the following elements:

  • id: Numeric vector of bin identifiers.

  • bin: Character vector of bin names.

  • woe: Numeric vector of Weight of Evidence (WoE) values for each bin.

  • iv: Numeric vector of Information Value (IV) for each bin.

  • count: Integer vector of total counts for each bin.

  • count_pos: Integer vector of positive counts for each bin.

  • count_neg: Integer vector of negative counts for each bin.

  • total_iv: Total Information Value of the binning.

  • converged: Logical value indicating whether the algorithm converged.

  • iterations: Integer value indicating the number of iterations run.

Details

This enhanced version of the Simulated Annealing Binning (SAB) algorithm implements several key improvements over traditional approaches:

Mathematical Framework:

The Weight of Evidence (WoE) with Bayesian smoothing is calculated as:

$$WoE_i = \ln\left(\frac{p_i^*}{q_i^*}\right)$$

where:

  • \(p_i^* = \frac{n_i^+ + \alpha \cdot \pi}{N^+ + \alpha}\) is the smoothed proportion of events in bin i

  • \(q_i^* = \frac{n_i^- + \alpha \cdot (1-\pi)}{N^- + \alpha}\) is the smoothed proportion of non-events in bin i

  • \(\pi = \frac{N^+}{N^+ + N^-}\) is the overall event rate

  • \(\alpha\) is the prior strength parameter (default: 0.5)

  • \(n_i^+\) is the count of events in bin i

  • \(n_i^-\) is the count of non-events in bin i

  • \(N^+\) is the total number of events

  • \(N^-\) is the total number of non-events

The Information Value (IV) for each bin is calculated as:

$$IV_i = (p_i^* - q_i^*) \times WoE_i$$
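
To make the smoothing concrete, the minimal R sketch below reproduces the two formulas above for hypothetical per-bin counts; smoothed_woe_iv, counts_pos, and counts_neg are illustrative names and are not part of the package API.

# Sketch of the smoothed WoE/IV formulas above (illustrative only)
smoothed_woe_iv <- function(counts_pos, counts_neg, alpha = 0.5) {
  N_pos  <- sum(counts_pos)                    # N^+ : total events
  N_neg  <- sum(counts_neg)                    # N^- : total non-events
  pi_hat <- N_pos / (N_pos + N_neg)            # overall event rate
  p_star <- (counts_pos + alpha * pi_hat) / (N_pos + alpha)        # smoothed event proportions
  q_star <- (counts_neg + alpha * (1 - pi_hat)) / (N_neg + alpha)  # smoothed non-event proportions
  woe <- log(p_star / q_star)
  iv  <- (p_star - q_star) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

# Example with three hypothetical bins
smoothed_woe_iv(counts_pos = c(30, 50, 20), counts_neg = c(300, 400, 200))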

Simulated Annealing:

The algorithm uses an enhanced version of Simulated Annealing with these key features:

  • Multiple neighborhood generation strategies for better exploration

  • Adaptive temperature scheduling to escape local optima

  • Periodic restarting from the best known solution

  • Smart initialization using event rates for better starting points

The probability of accepting a worse solution is calculated as:

$$P(accept) = \exp\left(\frac{\Delta IV}{T}\right)$$

where \(\Delta IV\) is the change in Information Value and \(T\) is the current temperature.
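
A minimal R sketch of this acceptance rule is given below; accept_move is an illustrative helper, not an exported function, and the package's internal bookkeeping may differ.

# Metropolis-style acceptance rule from the formula above (illustrative only)
accept_move <- function(delta_iv, temperature) {
  if (delta_iv >= 0) return(TRUE)             # improving moves are always accepted
  runif(1) < exp(delta_iv / temperature)      # worse moves accepted with probability exp(delta IV / T)
}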

Algorithm Phases:

  1. Initialization: Create initial bin assignments using a kmeans-like strategy based on event rates (a simplified sketch follows this list)

  2. Optimization: Apply Simulated Annealing to find the optimal assignment of categories to bins

  3. Monotonicity Enforcement: Ensure the final solution has monotonic bin event rates
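
For intuition about the initialization phase (step 1 above), the hedged sketch below assigns categories to starting bins by ordered event rate; init_bins_by_event_rate is a hypothetical helper that only approximates the kmeans-like strategy used internally.

# Illustrative initialization: order categories by event rate and cut into bins
init_bins_by_event_rate <- function(feature, target, n_bins = 3L) {
  rate <- tapply(target, feature, mean)                   # event rate per category
  ord  <- names(sort(rate))                               # categories ordered by event rate
  idx  <- cut(seq_along(ord), breaks = n_bins, labels = FALSE)
  setNames(idx, ord)                                      # bin index for each category
}

set.seed(123)
target  <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)
init_bins_by_event_rate(feature, target, n_bins = 3L)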

Key Features:

  • Bayesian smoothing for robust estimation with small samples

  • Multiple neighbor generation strategies for better search space exploration

  • Adaptive temperature scheduling to escape local optima

  • Smart initialization for better starting points

  • Strong monotonicity enforcement (see the sketch after this list)

  • Comprehensive handling of edge cases
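
To make the monotonicity requirement concrete, the sketch below checks whether per-bin event rates are monotone in bin order; is_monotonic is an illustrative helper, and the actual enforcement step goes further by ensuring the final solution satisfies this condition.

# Sketch of a monotonicity check on bin event rates (illustrative only)
is_monotonic <- function(count_pos, count) {
  rate <- count_pos / count                       # event rate per bin, in bin order
  all(diff(rate) >= 0) || all(diff(rate) <= 0)    # non-decreasing or non-increasing
}

is_monotonic(count_pos = c(10, 25, 60), count = c(100, 120, 150))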

References

  • Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671-680.

  • Bellotti, T., & Crook, J. (2009). Credit scoring with macroeconomic variables using survival analysis. Journal of the Operational Research Society, 60(12), 1699-1707.

  • Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. arXiv preprint arXiv:1711.05095.

  • Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.

  • Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.

Examples

if (FALSE) { # \dontrun{
# Basic usage
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)
result <- optimal_binning_categorical_sab(target, feature)
print(result)

# Adjust simulated annealing parameters
result2 <- optimal_binning_categorical_sab(
  target, feature,
  min_bins = 2,
  max_bins = 4,
  initial_temperature = 2.0,
  cooling_rate = 0.99,
  max_iterations = 2000
)
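
# Hedged sketch: build a category -> WoE lookup from the fitted bins.
# Assumes bin labels are merged categories joined by bin_separator ("%;%").
cats_per_bin <- strsplit(result$bin, "%;%", fixed = TRUE)
woe_map <- setNames(rep(result$woe, lengths(cats_per_bin)), unlist(cats_per_bin))
feature_woe <- woe_map[feature]   # WoE-encoded feature for downstream modelling
head(feature_woe)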
} # }