Optimal Binning for Numerical Variables Using a Sketch-Based Algorithm — optimal_binning_numerical_sketch

This function performs optimal binning for numerical variables using a sketch-based approach, combining KLL Sketch for quantile approximation with Weight of Evidence (WOE) and Information Value (IV) methods.

Usage

optimal_binning_numerical_sketch(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  special_codes = "",
  monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  sketch_k = 200L
)

Arguments

target: An integer vector of binary target values (0 or 1).
feature: A numeric vector of feature values.
min_bins: Minimum number of bins (default: 3).
max_bins: Maximum number of bins (default: 5).
bin_cutoff: Minimum frequency of a bin, as a proportion of all observations (default: 0.05).
special_codes: Comma-separated string of special codes to be treated separately (default: "").
monotonic: Whether to enforce monotonicity of WOE across bins (default: TRUE).
convergence_threshold: Threshold for convergence in optimization (default: 1e-6).
max_iterations: Maximum number of iterations for optimization (default: 1000).
sketch_k: Parameter controlling the accuracy of the KLL sketch; larger values give more accurate quantile estimates at the cost of more memory (default: 200).

Value

A list containing:

id: Numeric identifiers for each bin
bin_lower: Lower bounds of bins
bin_upper: Upper bounds of bins
woe: Weight of Evidence for each bin
iv: Information Value for each bin
count: Total counts for each bin
count_pos: Positive target counts for each bin
count_neg: Negative target counts for each bin
cutpoints: Selected cutting points between bins
converged: Logical value indicating whether the algorithm converged
iterations: Number of iterations run
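
As a rough illustration of how these fields fit together, the sketch below gathers the per-bin vectors into a data frame and uses the cutpoints to assign new values to bins. The `result` object here is a hand-written mock with placeholder numbers, not real output of the function. The sketch also assumes that the per-bin vectors all have the same length and that bins are right-closed, neither of which is confirmed by this page.

```r
# Mock object using the documented field names.
# All values are placeholders for illustration, not real function output.
result <- list(
  id = 1:3,
  bin_lower = c(-Inf, -0.5, 0.5),
  bin_upper = c(-0.5, 0.5, Inf),
  woe = c(-0.40, 0.05, 0.45),
  iv = c(0.05, 0.001, 0.06),
  count = c(300, 400, 300),
  count_pos = c(120, 205, 185),
  count_neg = c(180, 195, 115),
  cutpoints = c(-0.5, 0.5),
  converged = TRUE,
  iterations = 3L
)

# Collect the per-bin fields into one table (assumes equal-length vectors)
per_bin <- c("id", "bin_lower", "bin_upper", "woe", "iv",
             "count", "count_pos", "count_neg")
bin_table <- as.data.frame(result[per_bin])

# Map new feature values to bins and look up their WOE.
# left.open = TRUE treats bins as (lower, upper]; this closure is an assumption.
new_values <- c(-1.2, 0.0, 0.7)
bin_index <- findInterval(new_values, result$cutpoints, left.open = TRUE) + 1L
new_woe <- result$woe[bin_index]
```

Checking `result$converged` before using the table is a sensible precaution, since the bins may be less reliable if the iteration limit was reached first.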

Details

The algorithm uses a KLL (Karnin-Lang-Liberty) Sketch data structure to efficiently approximate the quantiles of numerical data, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
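
To make the role of the quantile approximation more concrete, the sketch below derives candidate cutpoints from quantiles of the feature. It uses base R's exact `quantile()` as a stand-in for the KLL sketch, so it shows only what the candidates look like, not how the sketch computes them. The grid of probability levels is a hypothetical choice and is not taken from the package.

```r
# Stand-in for sketch-based candidate extraction: exact quantiles from base R.
# A KLL sketch would approximate these values in a single pass instead.
set.seed(42)
feature <- rnorm(10000)

# Hypothetical grid: 19 interior probability levels (0.05, 0.10, ..., 0.95)
probs <- seq(0.05, 0.95, by = 0.05)

# Candidate cutpoints are the distinct quantile values, in increasing order
candidates <- unique(unname(quantile(feature, probs = probs, na.rm = TRUE)))
```

With a real sketch the candidates would carry a small rank error that shrinks as `sketch_k` grows.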

The algorithm performs the following steps:

Input validation and preprocessing
Building a KLL sketch for the data
Extracting candidate cutpoints from the sketch
Selecting optimal cutpoints using either dynamic programming (for smaller datasets) or a greedy approach based on Information Value
Enforcing minimum bin size (bin_cutoff)
Calculating initial Weight of Evidence (WOE) and Information Value (IV)
Enforcing monotonicity of WOE across bins (if requested)
Optimizing the number of bins through iterative merging
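
The WOE and IV calculation and the monotonicity check named in the steps above can be sketched in plain R. This is a minimal illustration under stated assumptions, not the package's implementation. It uses one common convention, WOE = log(share of positives in the bin / share of negatives in the bin) and IV = (share of positives - share of negatives) * WOE. The package's sign convention, smoothing, and interval closure are not specified on this page. The helper `woe_iv_table` and its `eps` argument are hypothetical.

```r
# Hypothetical helper, not part of the package: per-bin WOE and IV
# for a fixed set of cutpoints.
woe_iv_table <- function(target, feature, cutpoints, eps = 1e-10) {
  breaks <- c(-Inf, sort(cutpoints), Inf)
  bin <- cut(feature, breaks = breaks)  # right-closed intervals (a, b]

  # Cross-tabulate bins against the target; empty bins get zero counts
  counts <- table(bin, factor(target, levels = c(0, 1)))
  count_neg <- as.vector(counts[, 1])
  count_pos <- as.vector(counts[, 2])

  # Share of all positives / negatives falling in each bin,
  # floored at eps to avoid log(0) and division by zero
  dist_pos <- pmax(count_pos / sum(count_pos), eps)
  dist_neg <- pmax(count_neg / sum(count_neg), eps)

  woe <- log(dist_pos / dist_neg)
  iv <- (dist_pos - dist_neg) * woe

  data.frame(
    bin_lower = head(breaks, -1),
    bin_upper = tail(breaks, -1),
    count = count_pos + count_neg,
    count_pos = count_pos,
    count_neg = count_neg,
    woe = woe,
    iv = iv
  )
}

# Simulated data in which the positive rate rises with the feature
set.seed(123)
feature <- rnorm(1000)
target <- rbinom(1000, 1, plogis(feature))

tab <- woe_iv_table(target, feature, cutpoints = c(-1, 0, 1))

# WOE is monotonic if it never decreases or never increases across bins
is_monotonic <- all(diff(tab$woe) >= 0) || all(diff(tab$woe) <= 0)
total_iv <- sum(tab$iv)
```

When `is_monotonic` is `FALSE`, a monotonicity step would typically merge adjacent bins and recompute the table until the WOE sequence is ordered or only `min_bins` bins remain.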

Examples

if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)

# Run optimal binning with sketch
result <- optimal_binning_numerical_sketch(target, feature)

# View results
print(result)
} # }