
This function performs optimal binning for categorical variables using a sketch-based approach, combining a Count-Min Sketch for approximate frequency estimation with Weight of Evidence (WoE) and Information Value (IV) methods.

Usage

optimal_binning_categorical_sketch(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  sketch_width = 2000L,
  sketch_depth = 5L
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A character vector of categorical feature values.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum frequency for a category to be considered a separate bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins before merging (default: 20).

bin_separator

String used to separate category names when merging bins (default: "%;%").

convergence_threshold

Threshold for convergence in optimization (default: 1e-6).

max_iterations

Maximum number of iterations for optimization (default: 1000).

sketch_width

Width of the Count-Min Sketch (default: 2000).

sketch_depth

Depth of the Count-Min Sketch (default: 5).

Value

A list containing:

  • id: Numeric identifiers for each bin

  • bin: A character vector of bin labels

  • woe: A numeric vector of Weight of Evidence values for each bin

  • iv: A numeric vector of Information Value for each bin

  • count: An integer vector of total counts for each bin

  • count_pos: An integer vector of positive target counts for each bin

  • count_neg: An integer vector of negative target counts for each bin

  • event_rate: A numeric vector of event rates for each bin

  • converged: A logical value indicating whether the algorithm converged

  • iterations: An integer indicating the number of iterations run

  • total_iv: The total Information Value of the binning

Details

The algorithm uses a Count-Min Sketch data structure to efficiently approximate frequency counts for categorical variables, making it suitable for very large datasets or streaming scenarios. The sketch-based approach allows processing data in a single pass with sublinear memory usage.
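The Count-Min Sketch underlying this design is straightforward to illustrate. The following is a minimal Python sketch of the data structure, using the function's default `sketch_width = 2000` and `sketch_depth = 5`; the hashing scheme here is only illustrative and is not the package's implementation:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counter: `depth` hash rows of `width` counters.
    Estimates never undercount; overestimates are bounded with high probability."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive one hash per row by salting the item with the row index.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        # Single pass: each observed category touches one counter per row.
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # The minimum across rows is the least-collided (most accurate) counter.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))
```

Because each row can only over-count (hash collisions add, never subtract), taking the minimum across rows yields a conservative estimate, which is why the frequencies feeding the pre-binning step are upper bounds rather than exact counts.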

Statistical Background

Weight of Evidence (WoE) measures the predictive power of a categorical level:

$$WoE_i = \ln\left(\frac{P(X_i | Y = 1)}{P(X_i | Y = 0)}\right) = \ln\left(\frac{n_{i+}/n_+}{n_{i-}/n_-}\right)$$

Where:

  • \(n_{i+}\) is the number of events (Y=1) in bin i

  • \(n_{i-}\) is the number of non-events (Y=0) in bin i

  • \(n_+\) is the total number of events

  • \(n_-\) is the total number of non-events

Information Value (IV) measures the predictive power of the entire variable:

$$IV_i = \left(\frac{n_{i+}}{n_+} - \frac{n_{i-}}{n_-}\right) \times WoE_i$$ $$IV_{total} = \sum_{i=1}^{n} IV_i$$
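Given the per-bin counts defined above, WoE and IV reduce to a few lines. The hypothetical Python snippet below mirrors the two formulas directly (without the Laplace smoothing the implementation applies, described under Statistical Improvements):

```python
import math

def woe_iv(count_pos, count_neg):
    """Per-bin WoE and IV from event/non-event counts (no smoothing)."""
    n_pos, n_neg = sum(count_pos), sum(count_neg)
    woe, iv = [], []
    for p, n in zip(count_pos, count_neg):
        dist_pos = p / n_pos   # n_{i+} / n_+
        dist_neg = n / n_neg   # n_{i-} / n_-
        w = math.log(dist_pos / dist_neg)
        woe.append(w)
        iv.append((dist_pos - dist_neg) * w)
    return woe, iv, sum(iv)
```

Note that each IV term is non-negative, since the difference of distributions and its log-ratio always share the same sign, so total IV only grows as bins become more separated.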

Algorithm Steps

The algorithm performs the following steps:

  1. Input validation and preprocessing

  2. Building frequency sketches for the data using Count-Min Sketch

  3. Initial pre-binning based on frequency estimates ("heavy hitters")

  4. Enforcing minimum bin size (bin_cutoff)

  5. Calculating initial Weight of Evidence (WoE) and Information Value (IV)

  6. Enforcing monotonicity of WoE across bins

  7. Optimizing the number of bins through iterative merging using statistical divergence
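Steps 5–7 can be pictured with a simplified greedy merge: repeatedly combine the adjacent pair of bins whose merge costs the least total IV until the bin-count constraint is met. The Python snippet below illustrates that idea only; the actual implementation also alternates with a divergence criterion and enforces WoE monotonicity:

```python
import math

def iv_of(bins):
    """Total IV of a list of (pos, neg) count pairs, with Laplace smoothing."""
    n_pos = sum(p for p, _ in bins)
    n_neg = sum(n for _, n in bins)
    total = 0.0
    for p, n in bins:
        dp = (p + 0.5) / (n_pos + 0.5 * len(bins))  # smoothed event share
        dn = (n + 0.5) / (n_neg + 0.5 * len(bins))  # smoothed non-event share
        total += (dp - dn) * math.log(dp / dn)
    return total

def merge_to_max_bins(bins, max_bins=5):
    """Greedily merge the adjacent pair whose merge loses the least IV."""
    bins = list(bins)
    while len(bins) > max_bins:
        best_i, best_loss = 0, float("inf")
        for i in range(len(bins) - 1):
            merged = (bins[:i]
                      + [(bins[i][0] + bins[i + 1][0],
                          bins[i][1] + bins[i + 1][1])]
                      + bins[i + 2:])
            loss = iv_of(bins) - iv_of(merged)
            if loss < best_loss:
                best_i, best_loss = i, loss
        bins = (bins[:best_i]
                + [(bins[best_i][0] + bins[best_i + 1][0],
                    bins[best_i][1] + bins[best_i + 1][1])]
                + bins[best_i + 2:])
    return bins
```

Merging adjacent (WoE-ordered) bins rather than arbitrary pairs is what preserves monotonicity once it has been established in step 6.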

Statistical Improvements

The implementation uses several statistical enhancements:

  • Laplace smoothing for WoE and IV calculation to handle rare events

  • Jensen-Shannon divergence for measuring statistical similarity between bins

  • Adaptive merging strategy that alternates between IV loss and statistical divergence

  • Conservative frequency estimation using the Count-Min Sketch properties
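The Jensen-Shannon divergence used to compare bins can be sketched as follows: represent each bin by its (event, non-event) distribution, so statistically similar bins yield values near 0 and maximally different bins approach \(\ln 2\). This is an illustrative Python version, not the package's implementation:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Unlike raw KL divergence, the JS divergence is symmetric and finite even when one bin has a zero count, which makes it a robust merge criterion for sparse categories.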

Because sketches are approximate, counts may differ slightly from those of the exact algorithm; the trade-off is significantly improved memory efficiency and speed on large datasets.

References

  • Cormode, G., & Muthukrishnan, S. (2005). An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1), 58-75.

  • Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

  • Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV) implementations for predictive modeling in credit risk assessment. Journal of Credit Risk.

Examples

if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)

# Run optimal binning with sketch
result <- optimal_binning_categorical_sketch(target, feature)

# View results
print(result)
} # }