
Implements an advanced binning algorithm for numerical variables inspired by K-means clustering principles. This method transforms continuous features into optimal discrete bins that maximize predictive power while respecting interpretability constraints. The algorithm is particularly valuable for risk modeling, credit scoring, and feature engineering in classification tasks.

Usage

optimal_binning_numerical_kmb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  enforce_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A numeric vector of feature values to be binned.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum frequency fraction for each bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins before optimization (default: 20).

enforce_monotonic

Whether to enforce monotonicity in WoE values (default: TRUE).

convergence_threshold

Convergence threshold for the algorithm (default: 1e-6).

max_iterations

Maximum number of iterations allowed (default: 1000).

Value

A list containing:

id

Numeric identifiers for each bin (1-based).

bin

Character vector of bin intervals.

woe

Numeric vector of Weight of Evidence (WoE) values for each bin.

iv

Numeric vector of Information Value (IV) contribution for each bin.

count

Integer vector of total observations in each bin.

count_pos

Integer vector of positive target observations in each bin.

count_neg

Integer vector of negative target observations in each bin.

centroids

Numeric vector of bin centroids (mean of feature values in each bin).

cutpoints

Numeric vector of cut points between bins (excluding infinity bounds).

converged

Logical indicating if the algorithm converged.

iterations

Integer number of iterations performed by the algorithm.

total_iv

Total Information Value of the binning solution.

Details

Algorithm Overview

The K-means Binning (KMB) algorithm transforms a continuous feature into discrete bins through several coordinated steps (a minimal R sketch of this flow follows the list):

  1. Initialization: Creates initial bins using a clustering-inspired approach, placing bin boundaries either at approximately equal widths or at quantiles of the distribution.

  2. Data Assignment: Assigns observations to appropriate bins and calculates statistics including positive/negative counts and event rates.

  3. Bin Optimization:

    • Merges low-frequency bins to ensure statistical stability

    • Enforces monotonicity in Weight of Evidence (WoE) values (optional)

    • Adjusts the number of bins to fall within specified bounds

  4. Metrics Calculation: Computes WoE and Information Value (IV) for each bin
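
The following is a minimal R sketch of steps 1-3 under simplifying assumptions (quantile-based pre-bins and a greedy merge of under-populated bins). It only illustrates the flow described above and is not the package's internal implementation; the function and variable names (kmb_flow_sketch, stats, and so on) are purely illustrative.

# Illustrative sketch of the KMB flow (not the package's internal code).
# Assumes a binary 0/1 target and a numeric feature with no missing values.
kmb_flow_sketch <- function(target, feature, max_n_prebins = 20, bin_cutoff = 0.05) {
  # Step 1 -- Initialization: pre-bin boundaries at quantiles of the feature
  breaks <- unique(quantile(feature,
                            probs = seq(0, 1, length.out = max_n_prebins + 1),
                            names = FALSE))
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf

  # Step 2 -- Data assignment: map observations to bins and tally counts
  bin_id <- cut(feature, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  stats <- data.frame(
    count     = as.vector(tapply(target, bin_id, length)),
    count_pos = as.vector(tapply(target, bin_id, sum))
  )
  stats$count_neg <- stats$count - stats$count_pos

  # Step 3 -- Bin optimization (simplified): merge any bin whose share of
  # observations falls below bin_cutoff into an adjacent bin
  while (any(stats$count / sum(stats$count) < bin_cutoff) && nrow(stats) > 2) {
    i <- which.min(stats$count)
    j <- if (i == 1) 2 else i - 1
    stats[j, ] <- stats[j, ] + stats[i, ]
    stats <- stats[-i, , drop = FALSE]
  }

  # Step 4 (WoE/IV) follows the formulas in the next section
  stats
}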

Mathematical Foundation

The algorithm optimizes two key metrics from information theory (a short R translation follows the definitions):

  1. Weight of Evidence (WoE) for bin \(i\): $$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$

    Where:

    • \(p_i\): Number of positive cases in bin \(i\)

    • \(P\): Total number of positive cases

    • \(n_i\): Number of negative cases in bin \(i\)

    • \(N\): Total number of negative cases

    • \(\alpha\): Smoothing factor (0.5 in this implementation)

    • \(k\): Number of bins

  2. Information Value (IV) for bin \(i\): $$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$

    The total Information Value is the sum across all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
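
As a concrete illustration, these two formulas translate directly into R. The helper below (an illustrative name, not part of the package) uses the same 0.5 smoothing factor:

# WoE and IV per bin from positive/negative counts, with alpha = 0.5 smoothing
woe_iv_sketch <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)   # number of bins
  P <- sum(count_pos)      # total positives
  N <- sum(count_neg)      # total negatives
  woe <- log(((count_pos + alpha) / (P + k * alpha)) /
             ((count_neg + alpha) / (N + k * alpha)))
  iv <- (count_pos / P - count_neg / N) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

# Example: three bins with increasing event rates
woe_iv_sketch(count_pos = c(30, 50, 120), count_neg = c(170, 150, 80))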

K-means Connection

The algorithm draws inspiration from K-means clustering in several ways:

  • Initial bin boundaries are positioned similarly to how K-means initializes centroids

  • Data points are assigned to the nearest bin, resembling the assignment step in K-means

  • Bin statistics (like centroids) are updated based on the assigned observations

  • The algorithm iteratively refines bins to optimize a global objective

While traditional K-means minimizes within-cluster variance, KMB optimizes predictive power through Information Value while respecting constraints on bin sizes and monotonicity.
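
For intuition only, a one-dimensional stats::kmeans fit can be used to seed candidate boundaries in the same spirit as the initialization described above. This analogy is illustrative and does not reproduce the function's exact procedure:

# Candidate cut points from 1-D k-means centroids (illustration of the analogy)
set.seed(42)
x  <- rnorm(1000)
km <- kmeans(x, centers = 10)
centers <- sort(km$centers[, 1])
# Midpoints between consecutive centroids act as candidate bin boundaries
candidate_cuts <- head(centers, -1) + diff(centers) / 2
candidate_cuts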

Handling Edge Cases

The implementation includes special handling for the following (an optional caller-side pre-check is sketched after the list):

  • Features with very few unique values

  • Missing values (NaN)

  • Extreme outliers and infinity values

  • Empty or near-empty bins

  • Imbalanced target distributions
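
These situations are handled inside the function; if you prefer to make the cleaning explicit on the calling side, a simple pre-check such as the one below can be applied first. It is optional and illustrative, and assumes feature and target are the vectors you intend to pass to the function:

# Optional caller-side pre-check: drop observations with NA/NaN/Inf features
keep <- is.finite(feature)
feature_clean <- feature[keep]
target_clean  <- target[keep]

# Guard against degenerate inputs before binning
stopifnot(all(target_clean %in% c(0L, 1L)), length(unique(feature_clean)) >= 2)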

References

Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.

Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027.

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.

García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734-750.

Zhu, X., Zhu, Y., Wang, H., & Zeng, Y. (2020). Credit risk evaluation model with optimal discretization and feature selection. Mathematics, 8(4), 638.

Examples

if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)

# Basic usage
result <- optimal_binning_numerical_kmb(target, feature)
print(result)

# Custom parameters
result_custom <- optimal_binning_numerical_kmb(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 8,
  bin_cutoff = 0.03,
  enforce_monotonic = TRUE
)

# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
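
# Apply the fitted binning to new data (illustrative, not a package helper):
# bin ids are 1-based and cutpoints exclude the +/-Inf bounds, so
# findInterval() + 1 recovers the bin index for each new observation.
new_feature <- rnorm(100)
bin_index <- findInterval(new_feature, vec = result$cutpoints) + 1
new_woe <- result$woe[bin_index]
head(new_woe)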
} # }