Optimal Binning for Categorical Variables using a Sketch-based Algorithm
optimal_binning_categorical_sketch.Rd
This function performs optimal binning for categorical variables using a sketch-based approach, combining a Count-Min Sketch for frequency estimation with Weight of Evidence (WoE) and Information Value (IV) methods.
Usage
optimal_binning_categorical_sketch(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L,
sketch_width = 2000L,
sketch_depth = 5L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of observations a category must have to be kept as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before merging (default: 20).
- bin_separator
String used to separate category names when merging bins (default: "%;%").
- convergence_threshold
Threshold for convergence in optimization (default: 1e-6).
- max_iterations
Maximum number of iterations for optimization (default: 1000).
- sketch_width
Width of the Count-Min Sketch (default: 2000).
- sketch_depth
Depth of the Count-Min Sketch (default: 5).
Value
A list containing:
- id: Numeric identifiers for each bin
- bin: A character vector of bin labels
- woe: A numeric vector of Weight of Evidence values for each bin
- iv: A numeric vector of Information Value contributions for each bin
- count: An integer vector of total counts for each bin
- count_pos: An integer vector of positive target counts for each bin
- count_neg: An integer vector of negative target counts for each bin
- event_rate: A numeric vector of event rates for each bin
- converged: A logical value indicating whether the algorithm converged
- iterations: An integer indicating the number of iterations run
- total_iv: The total Information Value of the binning
Details
The algorithm uses a Count-Min Sketch data structure to efficiently approximate frequency counts for categorical variables, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
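For intuition, here is a minimal, illustrative Count-Min Sketch in base R. It is a didactic sketch, not the package's internal implementation; the hash function, seed scheme, and helper names are invented for the example.

# Illustrative Count-Min Sketch: a depth x width matrix of counters,
# one simple seeded hash per row (hash scheme invented for this example).
cms_new <- function(width = 2000L, depth = 5L) {
  list(width = width, depth = depth,
       table = matrix(0L, nrow = depth, ncol = width),
       seeds = seq_len(depth) * 31L)
}
cms_hash <- function(x, seed, width) {
  h <- seed
  for (ch in utf8ToInt(x)) h <- (h * 131L + ch) %% width
  h + 1L
}
cms_update <- function(s, x) {
  for (d in seq_len(s$depth)) {
    j <- cms_hash(x, s$seeds[d], s$width)
    s$table[d, j] <- s$table[d, j] + 1L
  }
  s
}
# Query takes the minimum across rows: estimates can only overshoot the
# true count (due to hash collisions), never undercount it.
cms_query <- function(s, x) {
  min(vapply(seq_len(s$depth),
             function(d) s$table[d, cms_hash(x, s$seeds[d], s$width)],
             integer(1)))
}

s <- cms_new(width = 50L, depth = 3L)
for (x in c("A", "B", "A", "C", "A")) s <- cms_update(s, x)
cms_query(s, "A")  # >= 3; equals 3 unless collisions inflate it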
Statistical Background
Weight of Evidence (WoE) measures the predictive power of a categorical level:
$$WoE_i = \ln\left(\frac{P(X_i | Y = 1)}{P(X_i | Y = 0)}\right) = \ln\left(\frac{n_{i+}/n_+}{n_{i-}/n_-}\right)$$
Where:
- \(n_{i+}\) is the number of events (Y=1) in bin i
- \(n_{i-}\) is the number of non-events (Y=0) in bin i
- \(n_+\) is the total number of events
- \(n_-\) is the total number of non-events
Information Value (IV) measures the predictive power of the entire variable:
$$IV_i = \left(\frac{n_{i+}}{n_+} - \frac{n_{i-}}{n_-}\right) \times WoE_i$$
$$IV_{total} = \sum_{i=1}^{n} IV_i$$
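As a concrete illustration of these formulas, the following base R snippet computes WoE and IV for three hypothetical bins (the counts are made up):

# Worked WoE/IV example with made-up counts for three bins
count_pos <- c(80, 50, 20)    # events (Y = 1) per bin
count_neg <- c(120, 250, 480) # non-events (Y = 0) per bin

dist_pos <- count_pos / sum(count_pos)  # n_{i+} / n_+
dist_neg <- count_neg / sum(count_neg)  # n_{i-} / n_-

woe <- log(dist_pos / dist_neg)         # WoE_i
iv  <- (dist_pos - dist_neg) * woe      # IV_i
total_iv <- sum(iv)                     # IV_total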
Algorithm Steps
The algorithm performs the following steps (an illustrative sketch of the monotonicity step appears after the list):
1. Input validation and preprocessing
2. Building frequency sketches for the data using a Count-Min Sketch
3. Initial pre-binning based on frequency estimates ("heavy hitters")
4. Enforcing the minimum bin size (bin_cutoff)
5. Calculating initial Weight of Evidence (WoE) and Information Value (IV)
6. Enforcing monotonicity of WoE across bins
7. Optimizing the number of bins through iterative merging guided by statistical divergence
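The monotonicity step (step 6) can be realized in several ways; one plausible approach, shown purely as an illustration and not as the package's exact merging rule, merges adjacent bins until WoE is monotone:

# Illustrative monotonicity enforcement: merge the first adjacent pair of
# bins that breaks the dominant WoE trend, recompute, repeat until monotone.
# (Zero counts would need the Laplace smoothing described below.)
enforce_monotonic_woe <- function(count_pos, count_neg) {
  repeat {
    woe <- log((count_pos / sum(count_pos)) / (count_neg / sum(count_neg)))
    diffs <- diff(woe)
    if (all(diffs >= 0) || all(diffs <= 0) || length(woe) <= 2) break
    dir <- sign(sum(diffs))            # dominant direction of the trend
    i <- which(sign(diffs) != dir)[1]  # first violating adjacent pair
    count_pos[i] <- count_pos[i] + count_pos[i + 1]
    count_neg[i] <- count_neg[i] + count_neg[i + 1]
    count_pos <- count_pos[-(i + 1)]
    count_neg <- count_neg[-(i + 1)]
  }
  list(count_pos = count_pos, count_neg = count_neg, woe = woe)
}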
Statistical Improvements
The implementation uses several statistical enhancements (the divergence computation is sketched after this list):
- Laplace smoothing in the WoE and IV calculations to handle rare events
- Jensen-Shannon divergence for measuring statistical similarity between bins
- An adaptive merging strategy that alternates between IV loss and statistical divergence
- Conservative frequency estimation based on the Count-Min Sketch's overestimate-only guarantee
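For example, the Jensen-Shannon divergence between the (event, non-event) profiles of two candidate bins can be computed as follows; the function name is illustrative, and the small eps term plays the role of the Laplace smoothing mentioned above:

# Illustrative Jensen-Shannon divergence between two bins' (event, non-event)
# count vectors; eps smooths zero counts (in the spirit of Laplace smoothing).
js_divergence <- function(p, q, eps = 1e-10) {
  p <- (p + eps) / sum(p + eps)
  q <- (q + eps) / sum(q + eps)
  m <- (p + q) / 2
  kl <- function(a, b) sum(a * log(a / b))  # Kullback-Leibler divergence
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Bins with similar event profiles have low divergence -> good merge candidates
js_divergence(c(pos = 30, neg = 170), c(pos = 28, neg = 172))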
Because sketch counts are approximate, bin counts may differ slightly from those produced by the exact algorithm; the trade-off is significantly lower memory usage and faster processing on large datasets.
References
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV) implementations for predictive modeling in credit risk assessment. Journal of Credit Risk.
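Examples
A minimal usage sketch on simulated data (the call assumes the function is loaded from the package this page documents; all data below is made up):

# Simulate a categorical feature whose event rate varies by category
set.seed(42)
n <- 10000
feature <- sample(LETTERS[1:10], n, replace = TRUE)
target <- rbinom(n, 1, ifelse(feature %in% c("A", "B"), 0.3, 0.1))

result <- optimal_binning_categorical_sketch(
  target, feature,
  min_bins = 3L, max_bins = 5L,
  sketch_width = 2000L, sketch_depth = 5L
)

result$bin       # merged category labels, joined by "%;%"
result$woe       # WoE per bin
result$total_iv  # overall predictive strength of the binning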