
Implements an advanced binning algorithm for numerical variables inspired by K-means clustering principles. This method transforms continuous features into optimal discrete bins that maximize predictive power while respecting interpretability constraints. The algorithm is particularly valuable for risk modeling, credit scoring, and feature engineering in classification tasks.

Usage

optimal_binning_numerical_kmb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  enforce_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A numeric vector of feature values to be binned.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum frequency fraction for each bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins before optimization (default: 20).

enforce_monotonic

Whether to enforce monotonicity in WoE values (default: TRUE).

convergence_threshold

Convergence threshold for the algorithm (default: 1e-6).

max_iterations

Maximum number of iterations allowed (default: 1000).

Value

A list containing:

id

Numeric identifiers for each bin (1-based).

bin

Character vector of bin intervals.

woe

Numeric vector of Weight of Evidence (WoE) values for each bin.

iv

Numeric vector of Information Value (IV) contribution for each bin.

count

Integer vector of total observations in each bin.

count_pos

Integer vector of positive target observations in each bin.

count_neg

Integer vector of negative target observations in each bin.

centroids

Numeric vector of bin centroids (mean of feature values in each bin).

cutpoints

Numeric vector of cut points between bins (excluding infinity bounds).

converged

Logical indicating if the algorithm converged.

iterations

Integer number of iterations performed by the algorithm.

total_iv

Total Information Value of the binning solution.

Details

Algorithm Overview

The K-means Binning (KMB) algorithm transforms a continuous feature into discrete bins through several coordinated steps (a minimal R sketch of this flow follows the list):

  1. Initialization: Creates initial bins using a clustering-inspired approach, placing bin boundaries either at approximately equal widths or at quantiles of the distribution.

  2. Data Assignment: Assigns observations to appropriate bins and calculates statistics including positive/negative counts and event rates.

  3. Bin Optimization:

    • Merges low-frequency bins to ensure statistical stability

    • Enforces monotonicity in Weight of Evidence (WoE) values (optional)

    • Adjusts the number of bins to fall within specified bounds

  4. Metrics Calculation: Computes WoE and Information Value (IV) for each bin
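
The following is a minimal R sketch of steps 1-3 under simplifying assumptions (quantile-based pre-bins and a greedy merge of under-populated bins). It only illustrates the flow described above and is not the package's internal implementation; the function and variable names (kmb_flow_sketch, stats, and so on) are purely illustrative.

# Illustrative sketch of the KMB flow (not the package's internal code).
# Assumes a binary 0/1 target and a numeric feature with no missing values.
kmb_flow_sketch <- function(target, feature, max_n_prebins = 20, bin_cutoff = 0.05) {
  # Step 1 -- Initialization: pre-bin boundaries at quantiles of the feature
  breaks <- unique(quantile(feature,
                            probs = seq(0, 1, length.out = max_n_prebins + 1),
                            names = FALSE))
  breaks[1] <- -Inf
  breaks[length(breaks)] <- Inf

  # Step 2 -- Data assignment: map observations to bins and tally counts
  bin_id <- cut(feature, breaks = breaks, labels = FALSE, include.lowest = TRUE)
  stats <- data.frame(
    count     = as.vector(tapply(target, bin_id, length)),
    count_pos = as.vector(tapply(target, bin_id, sum))
  )
  stats$count_neg <- stats$count - stats$count_pos

  # Step 3 -- Bin optimization (simplified): merge any bin whose share of
  # observations falls below bin_cutoff into an adjacent bin
  while (any(stats$count / sum(stats$count) < bin_cutoff) && nrow(stats) > 2) {
    i <- which.min(stats$count)
    j <- if (i == 1) 2 else i - 1
    stats[j, ] <- stats[j, ] + stats[i, ]
    stats <- stats[-i, , drop = FALSE]
  }

  # Step 4 (WoE/IV) follows the formulas in the next section
  stats
}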

Mathematical Foundation

The algorithm optimizes two key metrics from information theory (a short R translation follows the definitions):

  1. Weight of Evidence (WoE) for bin \(i\): $$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$

    Where:

    • \(p_i\): Number of positive cases in bin \(i\)

    • \(P\): Total number of positive cases

    • \(n_i\): Number of negative cases in bin \(i\)

    • \(N\): Total number of negative cases

    • \(\alpha\): Smoothing factor (0.5 in this implementation)

    • \(k\): Number of bins

  2. Information Value (IV) for bin \(i\): $$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$

    The total Information Value is the sum across all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
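
As a concrete illustration, these two formulas translate directly into R. The helper below (an illustrative name, not part of the package) uses the same 0.5 smoothing factor:

# WoE and IV per bin from positive/negative counts, with alpha = 0.5 smoothing
woe_iv_sketch <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)   # number of bins
  P <- sum(count_pos)      # total positives
  N <- sum(count_neg)      # total negatives
  woe <- log(((count_pos + alpha) / (P + k * alpha)) /
             ((count_neg + alpha) / (N + k * alpha)))
  iv <- (count_pos / P - count_neg / N) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

# Example: three bins with increasing event rates
woe_iv_sketch(count_pos = c(30, 50, 120), count_neg = c(170, 150, 80))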

K-means Connection

The algorithm draws inspiration from K-means clustering in several ways:

  • Initial bin boundaries are positioned similarly to how K-means initializes centroids

  • Data points are assigned to the nearest bin, resembling the assignment step in K-means

  • Bin statistics (like centroids) are updated based on the assigned observations

  • The algorithm iteratively refines bins to optimize a global objective

While traditional K-means minimizes within-cluster variance, KMB optimizes predictive power through Information Value while respecting constraints on bin sizes and monotonicity.
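
For intuition only, a one-dimensional stats::kmeans fit can be used to seed candidate boundaries in the same spirit as the initialization described above. This analogy is illustrative and does not reproduce the function's exact procedure:

# Candidate cut points from 1-D k-means centroids (illustration of the analogy)
set.seed(42)
x  <- rnorm(1000)
km <- kmeans(x, centers = 10)
centers <- sort(km$centers[, 1])
# Midpoints between consecutive centroids act as candidate bin boundaries
candidate_cuts <- head(centers, -1) + diff(centers) / 2
candidate_cuts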

Handling Edge Cases

The implementation includes special handling for the following (an optional caller-side pre-check is sketched after the list):

  • Features with very few unique values

  • Missing values (NaN)

  • Extreme outliers and infinity values

  • Empty or near-empty bins

  • Imbalanced target distributions
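
These situations are handled inside the function; if you prefer to make the cleaning explicit on the calling side, a simple pre-check such as the one below can be applied first. It is optional and illustrative, and assumes feature and target are the vectors you intend to pass to the function:

# Optional caller-side pre-check: drop observations with NA/NaN/Inf features
keep <- is.finite(feature)
feature_clean <- feature[keep]
target_clean  <- target[keep]

# Guard against degenerate inputs before binning
stopifnot(all(target_clean %in% c(0L, 1L)), length(unique(feature_clean)) >= 2)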

References

Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.

Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027.

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.

García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734-750.

Zhu, X., Zhu, Y., Wang, H., & Zeng, Y. (2020). Credit risk evaluation model with optimal discretization and feature selection. Mathematics, 8(4), 638.

Examples

if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)

# Basic usage
result <- optimal_binning_numerical_kmb(target, feature)
print(result)

# Custom parameters
result_custom <- optimal_binning_numerical_kmb(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 8,
  bin_cutoff = 0.03,
  enforce_monotonic = TRUE
)

# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
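
# Apply the fitted binning to new data (illustrative, not a package helper):
# bin ids are 1-based and cutpoints exclude the +/-Inf bounds, so
# findInterval() + 1 recovers the bin index for each new observation.
new_feature <- rnorm(100)
bin_index <- findInterval(new_feature, vec = result$cutpoints) + 1
new_woe <- result$woe[bin_index]
head(new_woe)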
} # }