Optimal Binning for Categorical Variables using a Sketch-based Algorithm
optimal_binning_categorical_sketch.Rd
This function performs optimal binning for categorical variables using a sketch-based approach, combining a Count-Min Sketch for frequency estimation with Weight of Evidence (WoE) and Information Value (IV) methods.
Usage
optimal_binning_categorical_sketch(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L,
sketch_width = 2000L,
sketch_depth = 5L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of observations a category must have to be kept as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before merging (default: 20).
- bin_separator
String used to separate category names when merging bins (default: "%;%").
- convergence_threshold
Threshold for convergence in optimization (default: 1e-6).
- max_iterations
Maximum number of iterations for optimization (default: 1000).
- sketch_width
Width of the Count-Min Sketch (default: 2000).
- sketch_depth
Depth of the Count-Min Sketch (default: 5).
Value
A list containing:
- id: Numeric identifiers for each bin
- bin: A character vector of bin labels
- woe: A numeric vector of Weight of Evidence values for each bin
- iv: A numeric vector of Information Value contributions for each bin
- count: An integer vector of total counts for each bin
- count_pos: An integer vector of positive target counts for each bin
- count_neg: An integer vector of negative target counts for each bin
- event_rate: A numeric vector of event rates for each bin
- converged: A logical value indicating whether the algorithm converged
- iterations: An integer indicating the number of iterations run
- total_iv: The total Information Value of the binning
Details
The algorithm uses a Count-Min Sketch data structure to efficiently approximate frequency counts for categorical variables, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
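For intuition, here is a minimal, illustrative Count-Min Sketch in base R. It is a didactic sketch, not the package's internal implementation; the hash function, seed scheme, and helper names are invented for the example.

# Illustrative Count-Min Sketch: a depth x width matrix of counters,
# one simple seeded hash per row (hash scheme invented for this example).
cms_new <- function(width = 2000L, depth = 5L) {
  list(width = width, depth = depth,
       table = matrix(0L, nrow = depth, ncol = width),
       seeds = seq_len(depth) * 31L)
}
cms_hash <- function(x, seed, width) {
  h <- seed
  for (ch in utf8ToInt(x)) h <- (h * 131L + ch) %% width
  h + 1L
}
cms_update <- function(s, x) {
  for (d in seq_len(s$depth)) {
    j <- cms_hash(x, s$seeds[d], s$width)
    s$table[d, j] <- s$table[d, j] + 1L
  }
  s
}
# Query takes the minimum across rows: estimates can only overshoot the
# true count (due to hash collisions), never undercount it.
cms_query <- function(s, x) {
  min(vapply(seq_len(s$depth),
             function(d) s$table[d, cms_hash(x, s$seeds[d], s$width)],
             integer(1)))
}

s <- cms_new(width = 50L, depth = 3L)
for (x in c("A", "B", "A", "C", "A")) s <- cms_update(s, x)
cms_query(s, "A")  # >= 3; equals 3 unless collisions inflate it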
Statistical Background
Weight of Evidence (WoE) measures the predictive power of a categorical level:
$$WoE_i = \ln\left(\frac{P(X_i | Y = 1)}{P(X_i | Y = 0)}\right) = \ln\left(\frac{n_{i+}/n_+}{n_{i-}/n_-}\right)$$
Where:
- \(n_{i+}\) is the number of events (Y=1) in bin i
- \(n_{i-}\) is the number of non-events (Y=0) in bin i
- \(n_+\) is the total number of events
- \(n_-\) is the total number of non-events
Information Value (IV) measures the predictive power of the entire variable:
$$IV_i = \left(\frac{n_{i+}}{n_+} - \frac{n_{i-}}{n_-}\right) \times WoE_i$$
$$IV_{total} = \sum_{i=1}^{n} IV_i$$
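As a concrete illustration of these formulas, the following base R snippet computes WoE and IV for three hypothetical bins (the counts are made up):

# Worked WoE/IV example with made-up counts for three bins
count_pos <- c(80, 50, 20)    # events (Y = 1) per bin
count_neg <- c(120, 250, 480) # non-events (Y = 0) per bin

dist_pos <- count_pos / sum(count_pos)  # n_{i+} / n_+
dist_neg <- count_neg / sum(count_neg)  # n_{i-} / n_-

woe <- log(dist_pos / dist_neg)         # WoE_i
iv  <- (dist_pos - dist_neg) * woe      # IV_i
total_iv <- sum(iv)                     # IV_total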
Algorithm Steps
The algorithm performs the following steps (an illustrative sketch of the monotonicity step appears after the list):
1. Input validation and preprocessing
2. Building frequency sketches for the data using a Count-Min Sketch
3. Initial pre-binning based on frequency estimates ("heavy hitters")
4. Enforcing the minimum bin size (bin_cutoff)
5. Calculating initial Weight of Evidence (WoE) and Information Value (IV)
6. Enforcing monotonicity of WoE across bins
7. Optimizing the number of bins through iterative merging guided by statistical divergence
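The monotonicity step (step 6) can be realized in several ways; one plausible approach, shown purely as an illustration and not as the package's exact merging rule, merges adjacent bins until WoE is monotone:

# Illustrative monotonicity enforcement: merge the first adjacent pair of
# bins that breaks the dominant WoE trend, recompute, repeat until monotone.
# (Zero counts would need the Laplace smoothing described below.)
enforce_monotonic_woe <- function(count_pos, count_neg) {
  repeat {
    woe <- log((count_pos / sum(count_pos)) / (count_neg / sum(count_neg)))
    diffs <- diff(woe)
    if (all(diffs >= 0) || all(diffs <= 0) || length(woe) <= 2) break
    dir <- sign(sum(diffs))            # dominant direction of the trend
    i <- which(sign(diffs) != dir)[1]  # first violating adjacent pair
    count_pos[i] <- count_pos[i] + count_pos[i + 1]
    count_neg[i] <- count_neg[i] + count_neg[i + 1]
    count_pos <- count_pos[-(i + 1)]
    count_neg <- count_neg[-(i + 1)]
  }
  list(count_pos = count_pos, count_neg = count_neg, woe = woe)
}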
Statistical Improvements
The implementation uses several statistical enhancements (the divergence computation is sketched after this list):
- Laplace smoothing in the WoE and IV calculations to handle rare events
- Jensen-Shannon divergence for measuring statistical similarity between bins
- An adaptive merging strategy that alternates between IV loss and statistical divergence
- Conservative frequency estimation based on the Count-Min Sketch's overestimate-only guarantee
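For example, the Jensen-Shannon divergence between the (event, non-event) profiles of two candidate bins can be computed as follows; the function name is illustrative, and the small eps term plays the role of the Laplace smoothing mentioned above:

# Illustrative Jensen-Shannon divergence between two bins' (event, non-event)
# count vectors; eps smooths zero counts (in the spirit of Laplace smoothing).
js_divergence <- function(p, q, eps = 1e-10) {
  p <- (p + eps) / sum(p + eps)
  q <- (q + eps) / sum(q + eps)
  m <- (p + q) / 2
  kl <- function(a, b) sum(a * log(a / b))  # Kullback-Leibler divergence
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Bins with similar event profiles have low divergence -> good merge candidates
js_divergence(c(pos = 30, neg = 170), c(pos = 28, neg = 172))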
Because sketch counts are approximate, bin counts may differ slightly from those produced by the exact algorithm; the trade-off is significantly lower memory usage and faster processing on large datasets.
References
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV) implementations for predictive modeling in credit risk assessment. Journal of Credit Risk.
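Examples
A minimal usage sketch on simulated data (the call assumes the function is loaded from the package this page documents; all data below is made up):

# Simulate a categorical feature whose event rate varies by category
set.seed(42)
n <- 10000
feature <- sample(LETTERS[1:10], n, replace = TRUE)
target <- rbinom(n, 1, ifelse(feature %in% c("A", "B"), 0.3, 0.1))

result <- optimal_binning_categorical_sketch(
  target, feature,
  min_bins = 3L, max_bins = 5L,
  sketch_width = 2000L, sketch_depth = 5L
)

result$bin       # merged category labels, joined by "%;%"
result$woe       # WoE per bin
result$total_iv  # overall predictive strength of the binning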