Optimal Binning for Numerical Variables Using a Sketch-Based Algorithm
optimal_binning_numerical_sketch.Rd
This function performs optimal binning for numerical variables using a sketch-based approach, combining KLL Sketch for quantile approximation with Weight of Evidence (WOE) and Information Value (IV) methods.
Usage
optimal_binning_numerical_sketch(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  special_codes = "",
  monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  sketch_k = 200L
)
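A minimal call might look like the sketch below. The simulated data and the `exists()` guard are illustrative assumptions, not part of this page; the guard lets the snippet run even when the function is not loaded in the session.

```r
# Simulated data; names, sizes and distributions are illustrative assumptions.
set.seed(123)
n <- 5000
feature <- rnorm(n, mean = 50, sd = 10)
# Higher feature values make a positive outcome more likely.
target <- rbinom(n, size = 1, prob = plogis((feature - 50) / 10))

# Guarded call: runs only if the function is available in the session.
if (exists("optimal_binning_numerical_sketch", mode = "function")) {
  result <- optimal_binning_numerical_sketch(
    target = as.integer(target),
    feature = feature,
    min_bins = 3L,
    max_bins = 5L,
    sketch_k = 200L
  )
  # Inspect the returned list (bin bounds, WOE, IV, counts, ...).
  str(result)
}
```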
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A numeric vector of feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency for a bin, as a fraction of total observations (default: 0.05).
- special_codes
String of comma-separated special codes to be treated separately (default: "").
- monotonic
Whether to enforce monotonicity of WOE across bins (default: TRUE).
- convergence_threshold
Threshold for convergence in optimization (default: 1e-6).
- max_iterations
Maximum number of iterations for optimization (default: 1000).
- sketch_k
Parameter controlling the accuracy of the KLL sketch; larger values give more accurate quantile estimates at the cost of more memory (default: 200).
Value
A list containing:
id: Numeric identifiers for each bin
bin_lower: Lower bounds of bins
bin_upper: Upper bounds of bins
woe: Weight of Evidence for each bin
iv: Information Value for each bin
count: Total counts for each bin
count_pos: Positive target counts for each bin
count_neg: Negative target counts for each bin
cutpoints: Selected cutting points between bins
converged: Logical value indicating whether the algorithm converged
iterations: Number of iterations run
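To make the relationship between the count, woe and iv elements concrete, the sketch below recomputes WOE and IV from per-bin counts in base R. The counts are made up, and the sign convention (log of positive share over negative share) is an assumption; the function's own convention may differ.

```r
# Hypothetical per-bin counts, standing in for count_pos and count_neg.
count_pos <- c(20, 35, 60, 85)
count_neg <- c(180, 165, 140, 115)

# Share of all positives and of all negatives that fall in each bin.
dist_pos <- count_pos / sum(count_pos)
dist_neg <- count_neg / sum(count_neg)

# Assumed convention: WOE = ln(dist_pos / dist_neg).
woe <- log(dist_pos / dist_neg)

# Per-bin IV contribution; the total IV is the sum over bins.
iv <- (dist_pos - dist_neg) * woe
total_iv <- sum(iv)

print(data.frame(count = count_pos + count_neg, count_pos, count_neg, woe, iv))
```

Each per-bin IV term is non-negative because the difference in shares and the log ratio always have the same sign.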
Details
The algorithm uses a KLL (Karnin-Lang-Liberty) Sketch data structure to efficiently approximate the quantiles of numerical data, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
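The idea behind a KLL sketch can be illustrated with a small, simplified version in base R. All function names here (`kll_new`, `kll_update`, `kll_quantile`, and so on) are hypothetical, and the package's internal implementation may differ. Each level holds a buffer; when a buffer is full it is sorted and every other item is promoted to the next level, where each item counts for twice as many observations.

```r
# A new sketch: accuracy parameter k and one empty bottom-level buffer.
kll_new <- function(k = 200) list(k = k, levels = list(numeric(0)))

# Capacity of level h (1 = bottom) when H levels exist; lower levels are smaller.
kll_capacity <- function(k, h, H, c = 2 / 3) ceiling(k * c^(H - h)) + 1

# Compact the lowest level that has reached its capacity.
kll_compress <- function(s) {
  H <- length(s$levels)
  for (h in seq_len(H)) {
    buf <- sort(s$levels[[h]])
    if (length(buf) >= kll_capacity(s$k, h, H)) {
      if (h == H) s$levels[[H + 1]] <- numeric(0)  # grow the sketch by one level
      m <- length(buf) %/% 2
      # Promote every other item, starting at a random offset (1 or 2).
      keep <- seq.int(from = sample(1:2, 1), by = 2, length.out = m)
      s$levels[[h + 1]] <- c(s$levels[[h + 1]], buf[keep])
      # An unpaired largest item (odd buffer length) stays at this level.
      s$levels[[h]] <- buf[-seq_len(2 * m)]
      break
    }
  }
  s
}

# Add one value; compress once if the total size reaches the total capacity.
kll_update <- function(s, x) {
  s$levels[[1]] <- c(s$levels[[1]], x)
  H <- length(s$levels)
  if (sum(lengths(s$levels)) >= sum(kll_capacity(s$k, seq_len(H), H))) {
    s <- kll_compress(s)
  }
  s
}

# Approximate quantiles: an item at level h represents 2^(h - 1) observations.
kll_quantile <- function(s, probs) {
  items <- unlist(s$levels)
  weights <- rep(2^(seq_along(s$levels) - 1), lengths(s$levels))
  ord <- order(items)
  cum <- cumsum(weights[ord]) / sum(weights)
  sapply(probs, function(p) items[ord][which(cum >= p)[1]])
}

set.seed(7)
x <- rnorm(10000)
s <- kll_new(k = 200)
for (v in x) s <- kll_update(s, v)  # single pass over the data

probs <- c(0.25, 0.5, 0.75)
est <- kll_quantile(s, probs)
print(rbind(sketch = est, exact = quantile(x, probs, names = FALSE)))
```

The sketch keeps only a few hundred values for the 10,000 inputs, yet the total weight it represents still equals the number of observations seen.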
The algorithm performs the following steps:
1. Input validation and preprocessing
2. Building a KLL sketch for the data
3. Extracting candidate cutpoints from the sketch
4. Selecting optimal cutpoints using either dynamic programming (for smaller datasets) or a greedy approach based on Information Value
5. Enforcing minimum bin size (bin_cutoff)
6. Calculating initial Weight of Evidence (WOE) and Information Value (IV)
7. Enforcing monotonicity of WOE across bins (if requested)
8. Optimizing the number of bins through iterative merging
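The core of these steps can be sketched in base R under simplifying assumptions. Exact quantiles from `quantile()` stand in for sketch estimates, and the merge rule (drop the cutpoint between the two adjacent bins with the closest WOE) is one simple heuristic, not necessarily the rule this function uses. The helper names `bin_woe` and `is_monotonic` are hypothetical. The bin_cutoff check and the dynamic-programming selection are omitted.

```r
# Simulated data with an increasing relationship between feature and target.
set.seed(1)
n <- 2000
feature <- runif(n, 0, 100)
target <- rbinom(n, 1, plogis((feature - 50) / 20))

# Candidate cutpoints: exact deciles stand in for sketch-based estimates.
cutpoints <- unique(quantile(feature, probs = seq(0.1, 0.9, by = 0.1), names = FALSE))

# WOE per bin for a set of interior cutpoints. Assumed convention: log of the
# positive share over the negative share, with 0.5 added to avoid log(0).
bin_woe <- function(cuts) {
  bin <- findInterval(feature, cuts) + 1
  pos <- tabulate(bin[target == 1], nbins = length(cuts) + 1) + 0.5
  neg <- tabulate(bin[target == 0], nbins = length(cuts) + 1) + 0.5
  log((pos / sum(pos)) / (neg / sum(neg)))
}

is_monotonic <- function(w) all(diff(w) >= 0) || all(diff(w) <= 0)

# Greedy merging: while WOE is not monotonic and more than min_bins bins remain,
# remove the cutpoint separating the two adjacent bins with the closest WOE.
min_bins <- 3
woe <- bin_woe(cutpoints)
while (!is_monotonic(woe) && length(cutpoints) + 1 > min_bins) {
  drop <- which.min(abs(diff(woe)))
  cutpoints <- cutpoints[-drop]
  woe <- bin_woe(cutpoints)
}

print(cutpoints)
print(woe)
```

The loop stops as soon as WOE is monotonic or the bin count reaches min_bins, so the result can still be non-monotonic if min_bins is hit first.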