Optimal Binning for Categorical Variables using Sliding Window Binning (SWB)
optimal_binning_categorical_swb.Rd
This function performs optimal binning for categorical variables using a Sliding Window Binning (SWB) approach. The goal is to produce bins with strong predictive power, as measured by Information Value (IV), while keeping the Weight of Evidence (WoE) monotonic across bins. This implementation adds statistical robustness through Laplace smoothing of the WoE estimates and uses the Jensen-Shannon divergence to measure bin similarity.
Usage
optimal_binning_categorical_swb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector with the categories of the explanatory variable.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency to consider a category as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before merging (default: 20).
- bin_separator
Separator used when concatenating category names in each bin (default: "%;%").
- convergence_threshold
Threshold for IV convergence (default: 1e-6).
- max_iterations
Maximum number of iterations for optimization (default: 1000).
Value
A list containing:
id: Numeric identifiers for each bin.
bin: String vector with the names of the bins.
woe: Numeric vector with WoE values for each bin.
iv: Numeric vector with IV values for each bin.
count: Integer vector with the total count in each bin.
count_pos: Integer vector with the count of positives (target=1) in each bin.
count_neg: Integer vector with the count of negatives (target=0) in each bin.
event_rate: Numeric vector with the event rate (proportion of target=1) in each bin.
converged: Logical value indicating whether the algorithm converged.
iterations: Integer value indicating how many iterations were executed.
total_iv: Total Information Value across all bins.
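A minimal usage sketch follows; the simulated data, the seed, and the object name result are illustrative, and it assumes the package exporting this function is loaded:

set.seed(42)
feature <- sample(c("A", "B", "C", "D", "E", "F"), size = 1000, replace = TRUE)
target  <- rbinom(1000, size = 1, prob = ifelse(feature %in% c("A", "B"), 0.30, 0.10))

result <- optimal_binning_categorical_swb(target, feature, min_bins = 3L, max_bins = 5L)

# Per-bin summary and overall predictive power
data.frame(bin = result$bin, count = result$count, woe = result$woe,
           iv = result$iv, event_rate = result$event_rate)
result$total_iv
result$converged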
Details
Statistical Methodology
The Sliding Window Binning (SWB) algorithm for categorical variables optimizes binning based on the statistical concepts of Weight of Evidence (WoE) and Information Value (IV):
Weight of Evidence measures the predictive power of a bin: $$WoE_i = \ln\left(\frac{P(X \in Bin_i | Y = 1)}{P(X \in Bin_i | Y = 0)}\right)$$
With Laplace smoothing applied for robustness: $$WoE_i = \ln\left(\frac{(n_{i+} + \alpha)/(n_{+} + 2\alpha)}{(n_{i-} + \alpha)/(n_{-} + 2\alpha)}\right)$$
Where:
\(n_{i+}\) is the number of positive cases (target=1) in bin i
\(n_{i-}\) is the number of negative cases (target=0) in bin i
\(n_{+}\) is the total number of positive cases
\(n_{-}\) is the total number of negative cases
\(\alpha\) is the Laplace smoothing parameter (default: 0.5)
Information Value measures the overall predictive power: $$IV_i = \left(P(X \in Bin_i | Y = 1) - P(X \in Bin_i | Y = 0)\right) \times WoE_i$$ $$IV_{total} = \sum_{i=1}^{k} IV_i$$
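For illustration only, the sketch below reproduces the smoothed WoE and IV calculations from per-bin counts; the counts are made up, and \(\alpha = 0.5\) matches the default noted above (the function computes all of this internally):

# Illustrative counts of positives (target = 1) and negatives (target = 0) per bin
count_pos <- c(40, 25, 10)
count_neg <- c(160, 225, 290)
alpha     <- 0.5                              # Laplace smoothing parameter

# Smoothed distributions P(X in Bin_i | Y = 1) and P(X in Bin_i | Y = 0)
dist_pos <- (count_pos + alpha) / (sum(count_pos) + 2 * alpha)
dist_neg <- (count_neg + alpha) / (sum(count_neg) + 2 * alpha)

woe      <- log(dist_pos / dist_neg)          # WoE_i
iv       <- (dist_pos - dist_neg) * woe       # IV_i
total_iv <- sum(iv)                           # IV_total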
Algorithm Steps
1. Initialize bins for each category, grouping rare categories (relative frequency below bin_cutoff); see the sketch after these steps.
2. Variables with only 1-2 levels receive special handling: no optimization is performed, only the metrics are calculated.
3. For variables with more levels:
   a. Sort bins by WoE value.
   b. Iteratively merge similar bins based on Jensen-Shannon divergence and IV loss.
   c. Enforce monotonicity of WoE across bins.
   d. Optimize until the constraints (min_bins, max_bins) are satisfied.
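A simplified sketch of the pre-binning step 1; the helper name group_rare_categories and the "Other" label are illustrative only, and in the function itself grouped category names appear in the output bin names concatenated with bin_separator:

# Group categories whose relative frequency falls below bin_cutoff into one pre-bin
group_rare_categories <- function(feature, bin_cutoff = 0.05, other_label = "Other") {
  freq <- table(feature) / length(feature)
  rare <- names(freq)[freq < bin_cutoff]
  ifelse(feature %in% rare, other_label, feature)
}

# Each remaining category then starts as its own pre-bin; steps 3b-3d merge and reorder them.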
Bin Similarity Measurement
Bins are merged based on statistical similarity measured using Jensen-Shannon divergence: $$JS(P||Q) = \frac{1}{2}KL(P||M) + \frac{1}{2}KL(Q||M)$$
Where:
\(KL\) is the Kullback-Leibler divergence
\(M = \frac{1}{2}(P+Q)\) is the midpoint distribution
\(P\) and \(Q\) are the event rate distributions of two bins
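A minimal sketch of this similarity measure, with illustrative function and variable names:

# Jensen-Shannon divergence between two discrete distributions p and q
js_divergence <- function(p, q) {
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))  # Kullback-Leibler divergence
  m  <- (p + q) / 2                                           # midpoint distribution M
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Example: event/non-event distributions of two candidate bins
js_divergence(c(0.20, 0.80), c(0.24, 0.76))  # small divergence -> strong merge candidates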
References
Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV): A novel implementation for predictive modeling in credit scoring. Expert Systems with Applications, 183, 115351.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.