Performs optimal binning for numerical variables using a Branch and Bound approach. This method transforms continuous features into discrete bins by maximizing the statistical relationship with a binary target variable while maintaining interpretability constraints. The algorithm optimizes Weight of Evidence (WoE) and Information Value (IV) metrics commonly used in risk modeling, credit scoring, and statistical analysis.

Usage

optimal_binning_numerical_bb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  is_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

An integer binary vector (0 or 1) representing the target variable.

feature

A numeric vector of feature values to be binned.

min_bins

Minimum number of bins to generate (default: 3).

max_bins

Maximum number of bins to generate (default: 5).

bin_cutoff

Minimum fraction of total observations required in each bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins generated before optimization (default: 20).

is_monotonic

Logical value indicating whether to enforce monotonicity in WoE (default: TRUE).

convergence_threshold

Convergence threshold for total Information Value (IV) change (default: 1e-6).

max_iterations

Maximum number of iterations allowed for the optimization process (default: 1000).

Value

A list containing the components below (a short example of assembling them into a data frame follows the list):

id

Numeric identifiers for each bin (1-based).

bin

Character vector with the intervals of each bin (e.g., (-Inf; 0], (0; +Inf)).

woe

Numeric vector with the Weight of Evidence values for each bin.

iv

Numeric vector with the Information Value contribution for each bin.

count

Integer vector with the total number of observations in each bin.

count_pos

Integer vector with the number of positive observations in each bin.

count_neg

Integer vector with the number of negative observations in each bin.

cutpoints

Numeric vector of cut points between bins (excluding infinity).

converged

Logical value indicating whether the algorithm converged.

iterations

Number of iterations executed by the optimization algorithm.

total_iv

The total Information Value of the binning solution.
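
The bin-level fields (id, bin, count, count_pos, count_neg, woe, iv) are parallel vectors of equal length, so they can be combined into a single table for inspection. The snippet below is only a small illustration; it assumes result holds a list returned by optimal_binning_numerical_bb().

# Assemble the per-bin fields into one data frame
# (assumes `result` is the list returned by optimal_binning_numerical_bb())
bin_table <- data.frame(
  id        = result$id,
  bin       = result$bin,
  count     = result$count,
  count_pos = result$count_pos,
  count_neg = result$count_neg,
  woe       = result$woe,
  iv        = result$iv
)
print(bin_table)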

Details

Algorithm Overview

The implementation follows a five-phase approach (a brief R sketch of the pre-binning step follows the list):

  1. Input Validation: Ensures data integrity and parameter validity.

  2. Pre-Binning:

    • Creates initial bins using quantile-based division

    • Handles special cases for limited unique values

    • Uses binary search for efficient observation assignment

  3. Statistical Stabilization:

    • Merges bins with frequencies below the specified threshold

    • Ensures each bin has sufficient observations for reliable statistics

  4. Monotonicity Enforcement (optional):

    • Ensures WoE values follow a consistent trend (increasing or decreasing)

    • Improves interpretability and aligns with business expectations

    • Selects optimal monotonicity direction based on IV preservation

  5. Branch and Bound Optimization:

    • Iteratively merges bins with minimal IV contribution

    • Continues until reaching the target number of bins or convergence

    • Preserves predictive power while reducing complexity
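
The pre-binning phase can be pictured with a few lines of base R. The sketch below is an illustration of the idea only, not the package's internal implementation: candidate cut points come from quantiles, and findInterval() (a binary search) assigns each observation to a pre-bin.

# Illustrative quantile-based pre-binning (not the internal implementation)
set.seed(42)
feature <- rnorm(1000)
max_n_prebins <- 20

# Interior quantiles serve as candidate cut points; unique() guards against
# duplicated cuts when the feature has few distinct values
probs <- seq(0, 1, length.out = max_n_prebins + 1)
cuts  <- unique(quantile(feature, probs = probs[-c(1, length(probs))]))

# findInterval() uses binary search to assign each observation to a pre-bin
prebin <- findInterval(feature, sort(cuts)) + 1L
table(prebin)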

Mathematical Foundation

The algorithm optimizes two key metrics, computed with Laplace smoothing for numerical stability (a short R illustration follows the list):

  1. Weight of Evidence (WoE) for bin \(i\): $$WoE_i = \ln\left(\frac{p_i/P}{n_i/N}\right)$$

    Where:

    • \(p_i\): Number of positive cases in bin \(i\)

    • \(P\): Total number of positive cases

    • \(n_i\): Number of negative cases in bin \(i\)

    • \(N\): Total number of negative cases

  2. Information Value (IV) for bin \(i\): $$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$

    The total Information Value is the sum across all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$

  3. Smoothing: The implementation uses Laplace smoothing to handle zero counts: $$\frac{p_i + \alpha}{P + k\alpha}, \frac{n_i + \alpha}{N + k\alpha}$$

    Where:

    • \(\alpha\): Small constant (0.5 in this implementation)

    • \(k\): Number of bins
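
As an illustration of these formulas, the short function below computes smoothed WoE and IV from per-bin counts of positives and negatives. It is a minimal sketch mirroring the equations above, not the package's internal code; the helper name woe_iv() is ours.

# Smoothed WoE / IV from per-bin counts (alpha = 0.5, as described above)
woe_iv <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)                      # number of bins
  P <- sum(count_pos)                         # total positives
  N <- sum(count_neg)                         # total negatives
  p <- (count_pos + alpha) / (P + k * alpha)  # smoothed share of positives
  n <- (count_neg + alpha) / (N + k * alpha)  # smoothed share of negatives
  woe <- log(p / n)
  iv  <- (p - n) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

# Example with three bins
woe_iv(count_pos = c(30, 50, 20), count_neg = c(70, 50, 80))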

Branch and Bound Strategy

The core optimization uses a greedy iterative approach:

  1. Start with more bins than needed (from pre-binning)

  2. Identify the bin with the smallest IV contribution

  3. Merge this bin with an adjacent bin

  4. Recompute WoE and IV values

  5. If monotonicity is required, enforce it

  6. Repeat until target number of bins is reached or convergence

This approach reduces model complexity while limiting the loss of information, as the sketch below illustrates.
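
The loop below sketches this merge strategy on raw per-bin counts. It is a simplified illustration of the steps listed above (monotonicity enforcement and the bin_cutoff check are omitted), not the package's internal code; the helper name greedy_merge() is ours.

# Greedy merging of the bin with the smallest IV contribution
greedy_merge <- function(count_pos, count_neg, max_bins = 5, alpha = 0.5) {
  repeat {
    k <- length(count_pos)
    p <- (count_pos + alpha) / (sum(count_pos) + k * alpha)
    n <- (count_neg + alpha) / (sum(count_neg) + k * alpha)
    iv <- (p - n) * log(p / n)            # per-bin IV contribution
    if (k <= max_bins) break
    i <- which.min(iv)                    # bin contributing the least IV
    j <- if (i == k) i - 1L else i + 1L   # adjacent bin to merge into
    count_pos[j] <- count_pos[j] + count_pos[i]
    count_neg[j] <- count_neg[j] + count_neg[i]
    count_pos <- count_pos[-i]
    count_neg <- count_neg[-i]
  }
  list(count_pos = count_pos, count_neg = count_neg,
       iv = iv, total_iv = sum(iv))
}

greedy_merge(count_pos = c(5, 30, 50, 20, 40, 10, 25),
             count_neg = c(20, 70, 50, 80, 30, 15, 60),
             max_bins = 4)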

References

Belson, W. A. (1959). Matching and prediction on the principle of biological classification. Journal of the Royal Statistical Society: Series C (Applied Statistics), 8(2), 65-75.

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.

Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization Techniques: A Recent Survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.

Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning, 194-202.

Bertsimas, D., & Dunn, J. (2017). Optimal classification trees. Machine Learning, 106(7), 1039-1082.

Examples

if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
n <- 10000
feature <- rnorm(n)
# Create target with logistic relationship
target <- rbinom(n, 1, plogis(0.5 * feature))

# Apply optimal binning
result <- optimal_binning_numerical_bb(target, feature, min_bins = 3, max_bins = 5)
print(result)

# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv

# Example with custom parameters
result2 <- optimal_binning_numerical_bb(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 8,
  bin_cutoff = 0.02,
  is_monotonic = TRUE
)
} # }