Optimal Binning for Numerical Variables using Sketch-based Algorithm

Implements optimal binning using the **KLL Sketch** (Karnin, Lang, Liberty, 2016), a probabilistic data structure for quantile approximation in data streams. This is the only method in the package that uses a fundamentally different algorithmic approach (streaming algorithms) compared to batch processing methods (MOB, MDLP, etc.).

The sketch-based approach enables:

Sublinear space complexity: O(k log N) vs O(N) for batch methods
Single-pass processing: Suitable for streaming data
Provable approximation guarantees: Quantile error $\epsilon \approx O(1/k)$

The method combines KLL Sketch for candidate generation with either Dynamic Programming (for small N <= 50) or greedy IV-based selection (for larger datasets), followed by monotonicity enforcement via the Pool Adjacent Violators Algorithm (PAVA).

Usage

ob_numerical_sketch(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000,
  sketch_k = 200
)

Arguments

feature: Numeric vector of feature values. Missing values (NA) are not permitted and will trigger an error. Infinite values (Inf, -Inf) and NaN are also not allowed.
target: Integer vector of binary target values (must contain only 0 and 1). Must have the same length as feature. Missing values are not permitted.
min_bins: Minimum number of bins (default: 3). Must be at least 2.
max_bins: Maximum number of bins (default: 5). Must be >= min_bins.
bin_cutoff: Minimum fraction of total observations per bin (default: 0.05). Must be in (0, 1). Bins with fewer observations will be merged with neighbors.
max_n_prebins: Maximum number of pre-bins to generate from quantiles (default: 20). This parameter controls the initial granularity of binning candidates. Higher values provide more flexibility but increase computational cost.
monotonic: Logical flag to enforce WoE monotonicity (default: TRUE). Uses PAVA (Pool Adjacent Violators Algorithm) for enforcement. Direction (increasing/ decreasing) is automatically detected from the data.
convergence_threshold: Convergence threshold for IV change (default: 1e-6). Optimization stops when the change in total IV between iterations falls below this value.
max_iterations: Maximum iterations for bin optimization (default: 1000). Prevents infinite loops in the optimization process.
sketch_k: Integer parameter controlling sketch accuracy (default: 200). Larger values improve quantile precision but increase memory usage. Approximation error: $\epsilon \approx 1/k$ (200 → 0.5% error). Valid range: [10, 1000]. Typical values: 50 (fast), 200 (balanced), 500 (precise).

Value

A list of class c("OptimalBinningSketch", "OptimalBinning") containing:

id: Numeric vector of bin identifiers (1-based indexing).
bin_lower: Numeric vector of lower bin boundaries (inclusive).
bin_upper: Numeric vector of upper bin boundaries (inclusive for last bin, exclusive for others).
woe: Numeric vector of Weight of Evidence values. Monotonic if monotonic = TRUE.
iv: Numeric vector of Information Value contributions per bin.
count: Integer vector of total observations per bin.
count_pos: Integer vector of positive class (target = 1) counts per bin.
count_neg: Integer vector of negative class (target = 0) counts per bin.
cutpoints: Numeric vector of bin split points (length = number of bins - 1). These are the internal boundaries between bins.
converged: Logical flag indicating whether optimization converged.
iterations: Integer number of optimization iterations performed.

Details

Algorithm Overview

The sketch-based binning algorithm executes in four phases:

Phase 1: KLL Sketch Construction

The KLL Sketch maintains a compressed, multi-level representation of the data distribution:

$$\text{Sketch} = \{\text{Compactor}_0, \text{Compactor}_1, \ldots, \text{Compactor}_L\}$$

where each $\text{Compactor}_\ell$ stores items with weight $2^\ell$. When a compactor exceeds capacity $k$ (controlled by sketch_k), it is compacted.

Theoretical Guarantees (Karnin et al., 2016):

For a quantile $q$ with estimated value $\hat{q}$:

$$|\text{rank}(\hat{q}) - q \cdot N| \le \epsilon \cdot N$$

where $\epsilon \approx O(1/k)$ and space complexity is $O(k \log(N/k))$.

Phase 2: Candidate Extraction

Approximately 40 quantiles are extracted from the sketch using a non-uniform grid with higher resolution in distribution tails.

Phase 3: Optimal Cutpoint Selection

For small datasets (N <= 50), Dynamic Programming maximizes total IV. For larger datasets, a greedy IV-based selection is used.

Phase 4: Bin Refinement

Bins are refined through frequency constraint enforcement, monotonicity enforcement (if requested), and bin count optimization to minimize IV loss.

Computational Complexity

Time: $O(N \log k + N \times C + k^2 \times I)$
Space: $O(k \log N)$ for large N

When to Use Sketch-based Binning

Use: Large datasets (N > 10^6) with memory constraints or streaming data
Avoid: Small datasets (N < 1000) where approximation error may dominate

References

Karnin, Z., Lang, K., & Liberty, E. (2016). "Optimal Quantile Approximation in Streams". Proceedings of the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 71-78. doi:10.1109/FOCS.2016.20
Greenwald, M., & Khanna, S. (2001). "Space-efficient online computation of quantile summaries". ACM SIGMOD Record, 30(2), 58-66. doi:10.1145/376284.375670
Barlow, R. E., Bartholomew, D. J., Bremner, J. M., & Brunk, H. D. (1972). Statistical Inference Under Order Restrictions. Wiley.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley. doi:10.1002/9781119201731

Author

Lopes, J. E.

Examples

# \donttest{
# Example 1: Basic usage with simulated data
set.seed(123)
feature <- rnorm(500, mean = 100, sd = 20)
target <- rbinom(500, 1, prob = plogis((feature - 100) / 20))

result <- ob_numerical_sketch(
  feature = feature,
  target = target,
  min_bins = 3,
  max_bins = 5
)

# Display results
print(data.frame(
  Bin = result$id,
  Count = result$count,
  WoE = round(result$woe, 4),
  IV = round(result$iv, 4)
))
#>   Bin Count     WoE     IV
#> 1   1   198 -1.1237 0.4490
#> 2   2    24  0.0391 0.0001
#> 3   3   278  0.7424 0.2952

# Example 2: Comparing different sketch_k values
set.seed(456)
x <- rnorm(1000, 50, 15)
y <- rbinom(1000, 1, prob = 0.3)

result_k50 <- ob_numerical_sketch(x, y, sketch_k = 50)
result_k200 <- ob_numerical_sketch(x, y, sketch_k = 200)

cat("K=50 IV:", sum(result_k50$iv), "\n")
#> K=50 IV: 0.0100206 
cat("K=200 IV:", sum(result_k200$iv), "\n")
#> K=200 IV: 0.006318715 
# }