
This function performs optimal binning for numerical variables using a sketch-based approach, combining KLL Sketch for quantile approximation with Weight of Evidence (WOE) and Information Value (IV) methods.

Usage

optimal_binning_numerical_sketch(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  special_codes = "",
  monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  sketch_k = 200L
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A numeric vector of feature values.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum relative frequency of a bin, as a proportion of all observations; bins below this share are merged (default: 0.05).

special_codes

Comma-separated string of special codes to be treated separately from the regular bins (default: "").

monotonic

Whether to enforce monotonicity of WOE across bins (default: TRUE).

convergence_threshold

Threshold for convergence in optimization (default: 1e-6).

max_iterations

Maximum number of iterations for optimization (default: 1000).

sketch_k

Parameter controlling the accuracy of the KLL sketch; larger values give more accurate quantile estimates at the cost of more memory (default: 200).

Value

A list containing:

  • id: Numeric identifiers for each bin

  • bin_lower: Lower bounds of bins

  • bin_upper: Upper bounds of bins

  • woe: Weight of Evidence for each bin

  • iv: Information Value for each bin

  • count: Total counts for each bin

  • count_pos: Positive target counts for each bin

  • count_neg: Negative target counts for each bin

  • cutpoints: Selected cutting points between bins

  • converged: Logical value indicating whether the algorithm converged

  • iterations: Number of iterations run
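The returned cutpoints and WOE values can be used to score new observations. The sketch below is an illustration only: the `result` list is a hand-written placeholder that mimics the documented fields (its numbers are not output from the function), and it assumes bins are open on the left and closed on the right, which this page does not state.

```r
# Placeholder standing in for the function's return value (values are made up)
result <- list(
  id = 1:3,
  woe = c(-0.42, 0.05, 0.61),
  cutpoints = c(-0.5, 0.7)
)

new_values <- c(-2.0, -0.5, 0.1, 0.7, 3.2)

# Assumption: bins are (lower, upper], so a value equal to a cutpoint
# falls into the bin below it
bin_index <- findInterval(new_values, result$cutpoints, left.open = TRUE) + 1L

# Look up the WOE of the bin each value falls into
scored <- data.frame(
  value = new_values,
  bin = result$id[bin_index],
  woe = result$woe[bin_index]
)
print(scored)
```

If the actual bins turn out to be closed on the left instead, dropping `left.open = TRUE` gives the other convention.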

Details

The algorithm uses a KLL (Karnin-Lang-Liberty) Sketch data structure to efficiently approximate the quantiles of numerical data, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
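To make the role of the quantiles concrete, the following sketch derives candidate cutpoints from evenly spaced quantiles. It uses base R's exact `quantile()` as a stand-in for the KLL sketch, which would return approximations of the same values without holding all data in memory. The probability grid is an assumption chosen for illustration, not the grid the function uses.

```r
set.seed(42)
feature <- rnorm(10000)

# Assumed probability grid: every 5th percentile, excluding the extremes
probs <- seq(0.05, 0.95, by = 0.05)

# Exact quantiles stand in for the sketch's approximate quantiles;
# unique() drops duplicates that appear when the data has ties
candidates <- unique(unname(quantile(feature, probs = probs)))

length(candidates)
head(candidates)
```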

The algorithm performs the following steps:

  1. Input validation and preprocessing

  2. Building a KLL sketch for the data

  3. Extracting candidate cutpoints from the sketch

  4. Selecting optimal cutpoints using either dynamic programming (for smaller datasets) or a greedy approach based on Information Value

  5. Enforcing minimum bin size (bin_cutoff)

  6. Calculating initial Weight of Evidence (WOE) and Information Value (IV)

  7. Enforcing monotonicity of WOE across bins (if requested)

  8. Optimizing the number of bins through iterative merging
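Steps 6 and 7 can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the helper names (`woe_iv_table`, `is_monotonic`, `enforce_monotonic`) are hypothetical, WOE is taken as the log ratio of the positive to the negative distribution (some references use the inverse), a smoothing constant of 0.5 guards against empty classes, and monotonicity is restored by greedily removing the cutpoint between the two adjacent bins with the closest WOE.

```r
# Per-bin counts, WOE and IV for a given set of cutpoints
woe_iv_table <- function(target, feature, cutpoints) {
  breaks <- c(-Inf, sort(cutpoints), Inf)
  bin <- cut(feature, breaks = breaks, right = TRUE)
  count_pos <- as.numeric(tapply(target == 1, bin, sum))
  count_neg <- as.numeric(tapply(target == 0, bin, sum))
  count_pos[is.na(count_pos)] <- 0  # tapply returns NA for empty bins
  count_neg[is.na(count_neg)] <- 0
  eps <- 0.5                        # assumed smoothing constant
  dist_pos <- (count_pos + eps) / sum(count_pos + eps)
  dist_neg <- (count_neg + eps) / sum(count_neg + eps)
  woe <- log(dist_pos / dist_neg)
  iv <- (dist_pos - dist_neg) * woe
  data.frame(bin = levels(bin), count = count_pos + count_neg,
             count_pos, count_neg, woe, iv)
}

# TRUE if x is entirely non-decreasing or entirely non-increasing
is_monotonic <- function(x) {
  d <- diff(x)
  all(d >= 0) || all(d <= 0)
}

# Greedy merge: remove cutpoints until WOE is monotonic or min_bins is reached
enforce_monotonic <- function(target, feature, cutpoints, min_bins = 2) {
  cutpoints <- sort(cutpoints)
  repeat {
    tab <- woe_iv_table(target, feature, cutpoints)
    if (is_monotonic(tab$woe) || nrow(tab) <= min_bins) break
    # diff(woe)[i] is the gap across cutpoints[i]; merge the closest pair
    gap <- abs(diff(tab$woe))
    cutpoints <- cutpoints[-which.min(gap)]
  }
  list(cutpoints = cutpoints, table = tab)
}

# Simulated data in which the event rate rises with the feature
set.seed(123)
feature <- rnorm(2000)
target <- rbinom(2000, 1, plogis(-1 + 0.8 * feature))

res <- enforce_monotonic(target, feature, cutpoints = seq(-2, 2, by = 0.5))
print(res$table)
sum(res$table$iv)  # total Information Value of the binning
```

Each bin's IV contribution is non-negative by construction, because the difference of the two distributions and the log of their ratio always share the same sign.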

Examples

if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)

# Run optimal binning with sketch
result <- optimal_binning_numerical_sketch(target, feature)

# View results
print(result)
} # }