Optimal Binning for Categorical Variables using Sketch-based Algorithm — ob_categorical

This function performs optimal binning for categorical variables using a Sketch-based algorithm designed for large-scale data processing. It employs probabilistic data structures (Count-Min Sketch) to efficiently estimate category frequencies and event rates, enabling near real-time binning on massive datasets.

Usage

ob_categorical_sketch(
  feature,
  target,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  sketch_width = 2000L,
  sketch_depth = 5L
)

Arguments

feature: A character vector or factor representing the categorical predictor variable. Missing values (NA) will be converted to the string "N/A" and treated as a separate category.
target: An integer vector containing binary outcome values (0 or 1). Must be the same length as feature. Cannot contain missing values.
min_bins: Integer. Minimum number of bins to create. Must be at least 2. Default is 3.
max_bins: Integer. Maximum number of bins to create. Must be greater than or equal to min_bins. Default is 5.
bin_cutoff: Numeric. Minimum relative frequency threshold for categories to be considered "heavy hitters". Categories below this proportion will be grouped together. Value must be between 0 and 1. Default is 0.05 (5%).
max_n_prebins: Integer. Maximum number of initial bins created during pre-binning phase. Controls early-stage complexity. Default is 20.
bin_separator: Character string used to separate category names when multiple categories are merged into a single bin. Default is "%;%".
convergence_threshold: Numeric. Threshold for determining algorithm convergence based on changes in total Information Value. Default is 1e-6.
max_iterations: Integer. Maximum number of iterations for the optimization process. Default is 1000.
sketch_width: Integer. Width of the Count-Min Sketch (number of counters per hash function). Larger values reduce estimation error but increase memory usage. Must be >= 100. Default is 2000.
sketch_depth: Integer. Depth of the Count-Min Sketch (number of hash functions). Larger values reduce collision probability but increase computational overhead. Must be >= 3. Default is 5.

Value

A list containing the results of the optimal binning procedure:

id: Numeric vector of bin identifiers (1 to n_bins)
bin: Character vector of bin labels, which are combinations of original categories separated by bin_separator
woe: Numeric vector of Weight of Evidence values for each bin
iv: Numeric vector of Information Values for each bin
count: Integer vector of total observations in each bin
count_pos: Integer vector of positive outcomes in each bin
count_neg: Integer vector of negative outcomes in each bin
event_rate: Numeric vector of the observed event rate in each bin
total_iv: Numeric scalar. Total Information Value across all bins
converged: Logical. Whether the algorithm converged
iterations: Integer. Number of iterations performed

Details

The Sketch-based algorithm follows these steps:

Frequency Estimation: Uses Count-Min Sketch to approximate the frequency of each category in a single data pass.
Heavy Hitter Detection: Identifies frequently occurring categories (above a threshold defined by bin_cutoff) using sketch estimates.
Pre-binning: Creates initial bins from detected heavy categories, grouping rare categories separately.
Optimization: Applies iterative merging based on statistical divergence measures to optimize Information Value (IV) while respecting bin count constraints (min_bins, max_bins).
Monotonicity Enforcement: Ensures the final binning has monotonic Weight of Evidence (WoE).

Key advantages of this approach:

Memory Efficiency: Uses sub-linear space complexity, independent of dataset size.
Speed: Single-pass algorithm with constant-time updates.
Scalability: Suitable for streaming data or datasets too large to fit in memory.
Approximation: Trades perfect accuracy for significant gains in speed and memory usage.

Mathematical concepts:

The Count-Min Sketch uses multiple hash functions to map items to counters: $$CMS[i][h_i(x)] += 1 \quad \forall i \in \{1,\ldots,d\}$$ where $d$ is the sketch depth and $w$ is the sketch width.

Frequency estimates are obtained by taking the minimum across all counters: $$\hat{f}(x) = \min_{i} CMS[i][h_i(x)]$$

Statistical divergence between bins is measured using Jensen-Shannon divergence: $$JSD(P||Q) = \frac{1}{2} \left[ KL(P||M) + KL(Q||M) \right]$$ where $M = \frac{1}{2}(P+Q)$ and $KL$ is the Kullback-Leibler divergence.

Laplace smoothing is applied to WoE and IV calculations: $$p_{smoothed} = \frac{count + \alpha}{total + 2\alpha}$$

Note

Target variable must contain both 0 and 1 values.
Due to the probabilistic nature of sketches, results may vary slightly between runs. For deterministic results, consider setting fixed random seeds in the underlying C++ code.
Accuracy of frequency estimates depends on sketch_width and sketch_depth. Increase these parameters for higher precision at the cost of memory/computation.
This algorithm is particularly beneficial when dealing with high-cardinality categorical features or streaming data scenarios.
For small to medium datasets, deterministic algorithms like SBLP or MOB may provide more accurate results.

References

Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.

Lin, J., & Keogh, E., Wei, L., & Lonardi, S. (2007). Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2), 107-144.

Examples

# Generate sample data
set.seed(123)
n <- 10000
feature <- sample(letters, n, replace = TRUE, prob = c(rep(0.04, 13), rep(0.02, 13)))
# Create a relationship where early letters have higher probability
target_probs <- ifelse(as.numeric(factor(feature)) <= 10, 0.7, 0.3)
target <- rbinom(n, 1, prob = target_probs)

# Perform sketch-based optimal binning
result <- ob_categorical_sketch(feature, target)
print(result[c("bin", "woe", "iv", "count")])
#> $bin
#> [1] "k%;%z%;%o%;%t"                 "u%;%v%;%y%;%l%;%m"            
#> [3] "n%;%p%;%x%;%w%;%r%;%s%;%q"     "h%;%j%;%f%;%g%;%b%;%d%;%c%;%i"
#> [5] "a%;%e"                        
#> 
#> $woe
#> [1] -1.0042752 -0.8308164 -0.8519467  0.8438619  0.8290098
#> 
#> $iv
#> [1] 0.12617874 0.12118256 0.11522899 0.27409018 0.06727459
#> 
#> $count
#> [1] 1354 1855 1682 4075 1034
#> 

# With custom sketch parameters for higher accuracy
result_high_acc <- ob_categorical_sketch(
  feature = feature,
  target = target,
  min_bins = 3,
  max_bins = 7,
  sketch_width = 4000,
  sketch_depth = 7
)

# Handling missing values
feature_with_na <- feature
feature_with_na[sample(length(feature_with_na), 200)] <- NA
result_na <- ob_categorical_sketch(feature_with_na, target)