Optimal Binning for Categorical Variables using Simulated Annealing
optimal_binning_categorical_sab.Rd
Performs optimal binning for categorical variables using an enhanced Simulated Annealing approach. This implementation maximizes Information Value (IV) while maintaining monotonicity in the bins, using Bayesian smoothing for robust estimation and adaptive temperature scheduling for better convergence.
Usage
optimal_binning_categorical_sab(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
initial_temperature = 1,
cooling_rate = 0.995,
max_iterations = 1000L,
convergence_threshold = 1e-06,
adaptive_cooling = TRUE
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of observations in a bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins (default: 20).
- bin_separator
Separator string for merging categories (default: "%;%").
- initial_temperature
Initial temperature for Simulated Annealing (default: 1.0).
- cooling_rate
Cooling rate for Simulated Annealing (default: 0.995).
- max_iterations
Maximum number of iterations for Simulated Annealing (default: 1000).
- convergence_threshold
Threshold for convergence (default: 1e-6).
- adaptive_cooling
Whether to use adaptive cooling schedule (default: TRUE).
Value
A list containing the following elements:
id: Numeric vector of bin identifiers.
bin: Character vector of bin names.
woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
iv: Numeric vector of Information Value (IV) for each bin.
count: Integer vector of total counts for each bin.
count_pos: Integer vector of positive counts for each bin.
count_neg: Integer vector of negative counts for each bin.
total_iv: Total Information Value of the binning.
converged: Logical value indicating whether the algorithm converged.
iterations: Integer value indicating the number of iterations run.
Details
This enhanced version of the Simulated Annealing Binning (SAB) algorithm implements several key improvements over traditional approaches:
Mathematical Framework:
The Weight of Evidence (WoE) with Bayesian smoothing is calculated as:
$$WoE_i = \ln\left(\frac{p_i^*}{q_i^*}\right)$$
where:
\(p_i^* = \frac{n_i^+ + \alpha \cdot \pi}{N^+ + \alpha}\) is the smoothed proportion of events in bin i
\(q_i^* = \frac{n_i^- + \alpha \cdot (1-\pi)}{N^- + \alpha}\) is the smoothed proportion of non-events in bin i
\(\pi = \frac{N^+}{N^+ + N^-}\) is the overall event rate
\(\alpha\) is the prior strength parameter (default: 0.5)
\(n_i^+\) is the count of events in bin i
\(n_i^-\) is the count of non-events in bin i
\(N^+\) is the total number of events
\(N^-\) is the total number of non-events
The Information Value (IV) for each bin is calculated as:
$$IV_i = (p_i^* - q_i^*) \times WoE_i$$
Simulated Annealing:
The algorithm uses an enhanced version of Simulated Annealing with these key features:
Multiple neighborhood generation strategies for better exploration
Adaptive temperature scheduling to escape local optima
Periodic restarting from the best known solution
Smart initialization using event rates for better starting points
The probability of accepting a worse solution is calculated as:
$$P(accept) = \exp\left(\frac{\Delta IV}{T}\right)$$
where \(\Delta IV\) is the change in Information Value and \(T\) is the current temperature.
Algorithm Phases:
Initialization: Create initial bin assignments using a kmeans-like strategy based on event rates
Optimization: Apply Simulated Annealing to find the optimal assignment of categories to bins
Monotonicity Enforcement: Ensure the final solution has monotonic bin event rates
Key Features:
Bayesian smoothing for robust estimation with small samples
Multiple neighbor generation strategies for better search space exploration
Adaptive temperature scheduling to escape local optima
Smart initialization for better starting points
Strong monotonicity enforcement
Comprehensive handling of edge cases
References
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. science, 220(4598), 671-680.
Belotti, T., Crook, J. (2009). Credit Scoring with Macroeconomic Variables Using Survival Analysis. Journal of the Operational Research Society, 60(12), 1699-1707.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. arXiv preprint arXiv:1711.05095.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The annals of applied statistics, 2(4), 1360-1383.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Examples
if (FALSE) { # \dontrun{
# Basic usage
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)
result <- optimal_binning_categorical_sab(target, feature)
print(result)
# Adjust simulated annealing parameters
result2 <- optimal_binning_categorical_sab(
target, feature,
min_bins = 2,
max_bins = 4,
initial_temperature = 2.0,
cooling_rate = 0.99,
max_iterations = 2000
)
} # }