
Optimal Binning for Categorical Variables using Sketch-based Algorithm
Source:R/obc_sketch.R
ob_categorical_sketch.RdThis function performs optimal binning for categorical variables using a Sketch-based algorithm designed for large-scale data processing. It employs probabilistic data structures (Count-Min Sketch) to efficiently estimate category frequencies and event rates, enabling near real-time binning on massive datasets.
Usage
ob_categorical_sketch(
feature,
target,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L,
sketch_width = 2000L,
sketch_depth = 5L
)Arguments
- feature
A character vector or factor representing the categorical predictor variable. Missing values (NA) will be converted to the string "N/A" and treated as a separate category.
- target
An integer vector containing binary outcome values (0 or 1). Must be the same length as
feature. Cannot contain missing values.- min_bins
Integer. Minimum number of bins to create. Must be at least 2. Default is 3.
- max_bins
Integer. Maximum number of bins to create. Must be greater than or equal to
min_bins. Default is 5.- bin_cutoff
Numeric. Minimum relative frequency threshold for categories to be considered "heavy hitters". Categories below this proportion will be grouped together. Value must be between 0 and 1. Default is 0.05 (5%).
- max_n_prebins
Integer. Maximum number of initial bins created during pre-binning phase. Controls early-stage complexity. Default is 20.
- bin_separator
Character string used to separate category names when multiple categories are merged into a single bin. Default is "%;%".
- convergence_threshold
Numeric. Threshold for determining algorithm convergence based on changes in total Information Value. Default is 1e-6.
- max_iterations
Integer. Maximum number of iterations for the optimization process. Default is 1000.
- sketch_width
Integer. Width of the Count-Min Sketch (number of counters per hash function). Larger values reduce estimation error but increase memory usage. Must be >= 100. Default is 2000.
- sketch_depth
Integer. Depth of the Count-Min Sketch (number of hash functions). Larger values reduce collision probability but increase computational overhead. Must be >= 3. Default is 5.
Value
A list containing the results of the optimal binning procedure:
idNumeric vector of bin identifiers (1 to n_bins)
binCharacter vector of bin labels, which are combinations of original categories separated by
bin_separatorwoeNumeric vector of Weight of Evidence values for each bin
ivNumeric vector of Information Values for each bin
countInteger vector of total observations in each bin
count_posInteger vector of positive outcomes in each bin
count_negInteger vector of negative outcomes in each bin
event_rateNumeric vector of the observed event rate in each bin
total_ivNumeric scalar. Total Information Value across all bins
convergedLogical. Whether the algorithm converged
iterationsInteger. Number of iterations performed
Details
The Sketch-based algorithm follows these steps:
Frequency Estimation: Uses Count-Min Sketch to approximate the frequency of each category in a single data pass.
Heavy Hitter Detection: Identifies frequently occurring categories (above a threshold defined by
bin_cutoff) using sketch estimates.Pre-binning: Creates initial bins from detected heavy categories, grouping rare categories separately.
Optimization: Applies iterative merging based on statistical divergence measures to optimize Information Value (IV) while respecting bin count constraints (
min_bins,max_bins).Monotonicity Enforcement: Ensures the final binning has monotonic Weight of Evidence (WoE).
Key advantages of this approach:
Memory Efficiency: Uses sub-linear space complexity, independent of dataset size.
Speed: Single-pass algorithm with constant-time updates.
Scalability: Suitable for streaming data or datasets too large to fit in memory.
Approximation: Trades perfect accuracy for significant gains in speed and memory usage.
Mathematical concepts:
The Count-Min Sketch uses multiple hash functions to map items to counters: $$CMS[i][h_i(x)] += 1 \quad \forall i \in \{1,\ldots,d\}$$ where \(d\) is the sketch depth and \(w\) is the sketch width.
Frequency estimates are obtained by taking the minimum across all counters: $$\hat{f}(x) = \min_{i} CMS[i][h_i(x)]$$
Statistical divergence between bins is measured using Jensen-Shannon divergence: $$JSD(P||Q) = \frac{1}{2} \left[ KL(P||M) + KL(Q||M) \right]$$ where \(M = \frac{1}{2}(P+Q)\) and \(KL\) is the Kullback-Leibler divergence.
Laplace smoothing is applied to WoE and IV calculations: $$p_{smoothed} = \frac{count + \alpha}{total + 2\alpha}$$
Note
Target variable must contain both 0 and 1 values.
Due to the probabilistic nature of sketches, results may vary slightly between runs. For deterministic results, consider setting fixed random seeds in the underlying C++ code.
Accuracy of frequency estimates depends on
sketch_widthandsketch_depth. Increase these parameters for higher precision at the cost of memory/computation.This algorithm is particularly beneficial when dealing with high-cardinality categorical features or streaming data scenarios.
For small to medium datasets, deterministic algorithms like SBLP or MOB may provide more accurate results.
References
Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.
Lin, J., & Keogh, E., Wei, L., & Lonardi, S. (2007). Experiencing SAX: a novel symbolic representation of time series. Data Mining and Knowledge Discovery, 15(2), 107-144.
Examples
# Generate sample data
set.seed(123)
n <- 10000
feature <- sample(letters, n, replace = TRUE, prob = c(rep(0.04, 13), rep(0.02, 13)))
# Create a relationship where early letters have higher probability
target_probs <- ifelse(as.numeric(factor(feature)) <= 10, 0.7, 0.3)
target <- rbinom(n, 1, prob = target_probs)
# Perform sketch-based optimal binning
result <- ob_categorical_sketch(feature, target)
print(result[c("bin", "woe", "iv", "count")])
#> $bin
#> [1] "k%;%z%;%o%;%t" "u%;%v%;%y%;%l%;%m"
#> [3] "n%;%p%;%x%;%w%;%r%;%s%;%q" "h%;%j%;%f%;%g%;%b%;%d%;%c%;%i"
#> [5] "a%;%e"
#>
#> $woe
#> [1] -1.0042752 -0.8308164 -0.8519467 0.8438619 0.8290098
#>
#> $iv
#> [1] 0.12617874 0.12118256 0.11522899 0.27409018 0.06727459
#>
#> $count
#> [1] 1354 1855 1682 4075 1034
#>
# With custom sketch parameters for higher accuracy
result_high_acc <- ob_categorical_sketch(
feature = feature,
target = target,
min_bins = 3,
max_bins = 7,
sketch_width = 4000,
sketch_depth = 7
)
# Handling missing values
feature_with_na <- feature
feature_with_na[sample(length(feature_with_na), 200)] <- NA
result_na <- ob_categorical_sketch(feature_with_na, target)