Optimal Binning for Numerical Variables using K-means Binning (KMB)
optimal_binning_numerical_kmb.Rd
Implements an advanced binning algorithm for numerical variables inspired by K-means clustering principles. This method transforms continuous features into optimal discrete bins that maximize predictive power while respecting interpretability constraints. The algorithm is particularly valuable for risk modeling, credit scoring, and feature engineering in classification tasks.
Usage
optimal_binning_numerical_kmb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
enforce_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- enforce_monotonic
Whether to enforce monotonicity in WoE values (default: TRUE).
- convergence_threshold
Convergence threshold for the algorithm (default: 1e-6).
- max_iterations
Maximum number of iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector of bin intervals.
- woe
Numeric vector of Weight of Evidence (WoE) values for each bin.
- iv
Numeric vector of Information Value (IV) contribution for each bin.
- count
Integer vector of total observations in each bin.
- count_pos
Integer vector of positive target observations in each bin.
- count_neg
Integer vector of negative target observations in each bin.
- centroids
Numeric vector of bin centroids (mean of feature values in each bin).
- cutpoints
Numeric vector of cut points between bins (excluding infinity bounds).
- converged
Logical indicating if the algorithm converged.
- iterations
Integer number of iterations performed by the algorithm.
- total_iv
Total Information Value of the binning solution.
Details
Algorithm Overview
The K-means Binning (KMB) algorithm transforms a continuous feature into discrete bins through several coordinated steps:
- Initialization: Creates initial bins using a clustering-inspired approach, placing boundaries either at approximately equal-width intervals or at quantiles of the feature distribution.
- Data Assignment: Assigns observations to their bins and computes per-bin statistics, including positive/negative counts and event rates.
- Bin Optimization (a simplified sketch of the pre-binning and merging steps follows this list):
  - Merges low-frequency bins to ensure statistical stability
  - Enforces monotonicity in Weight of Evidence (WoE) values (optional)
  - Adjusts the number of bins to fall within the specified bounds
- Metrics Calculation: Computes WoE and Information Value (IV) for each bin
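For intuition, the quantile pre-binning and low-frequency merge steps can be approximated in plain R. The helper below is a simplified sketch, not the package's internal implementation; prebin_sketch() is a hypothetical name, and its merge rule (always folding an undersized bin into its left neighbour) is an illustrative simplification.

# Illustrative sketch of pre-binning and low-frequency merging
# (prebin_sketch() is not part of the package API)
prebin_sketch <- function(x, max_n_prebins = 20, bin_cutoff = 0.05) {
  # Place initial boundaries at quantiles of the feature distribution
  cuts <- unique(quantile(x, probs = seq(0, 1, length.out = max_n_prebins + 1)))
  repeat {
    bins <- cut(x, breaks = cuts, include.lowest = TRUE)
    freq <- prop.table(table(bins))
    small <- which(freq < bin_cutoff)
    if (length(small) == 0 || length(cuts) <= 3) break
    # Drop the boundary to the left of the first undersized bin, merging it
    # with its left neighbour (or with the right neighbour for the first bin)
    drop_idx <- if (small[1] == 1) 2 else small[1]
    cuts <- cuts[-drop_idx]
  }
  cuts
}

set.seed(42)
x <- rnorm(1000)
cuts <- prebin_sketch(x)
table(cut(x, breaks = cuts, include.lowest = TRUE))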
Mathematical Foundation
The algorithm optimizes two key metrics from information theory:
Weight of Evidence (WoE) for bin \(i\): $$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$
Where:
- \(p_i\): Number of positive cases in bin \(i\)
- \(P\): Total number of positive cases
- \(n_i\): Number of negative cases in bin \(i\)
- \(N\): Total number of negative cases
- \(\alpha\): Smoothing factor (0.5 in this implementation)
- \(k\): Number of bins
Information Value (IV) for bin \(i\): $$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$
The total Information Value is the sum across all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
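The smoothed WoE and per-bin IV can be reproduced directly from bin counts. The short helper below mirrors the formulas above; woe_iv() is a hypothetical name used only for illustration and is not exported by the package.

# Compute smoothed WoE and IV from per-bin positive/negative counts
# (illustrative helper, not part of the package API)
woe_iv <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)                    # number of bins
  P <- sum(count_pos); N <- sum(count_neg)  # total positives and negatives
  dist_pos <- (count_pos + alpha) / (P + k * alpha)
  dist_neg <- (count_neg + alpha) / (N + k * alpha)
  woe <- log(dist_pos / dist_neg)
  iv  <- (count_pos / P - count_neg / N) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

woe_iv(count_pos = c(30, 45, 80), count_neg = c(120, 100, 60))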
K-means Connection
The algorithm draws inspiration from K-means clustering in several ways:
- Initial bin boundaries are positioned similarly to how K-means initializes centroids
- Data points are assigned to the nearest bin, resembling the assignment step in K-means
- Bin statistics (such as centroids) are updated based on the assigned observations
- The algorithm iteratively refines the bins to optimize a global objective
While traditional K-means minimizes within-cluster variance, KMB optimizes predictive power through Information Value while respecting constraints on bin sizes and monotonicity.
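To make the analogy concrete, one-dimensional k-means centers can be used to seed bin boundaries by taking midpoints between consecutive sorted centers. The snippet below is only a conceptual sketch of this seeding idea; the package's actual initialization may differ.

# Seed bin boundaries from 1-D k-means centers (conceptual sketch only)
set.seed(123)
x <- rnorm(1000)
k <- 5                                                     # desired number of bins
centers <- sort(kmeans(x, centers = k, nstart = 10)$centers[, 1])
cutpoints <- (head(centers, -1) + tail(centers, -1)) / 2   # midpoints between centers
bins <- cut(x, breaks = c(-Inf, cutpoints, Inf))
table(bins)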
References
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.
García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734-750.
Zhu, X., Zhu, Y., Wang, H., & Zeng, Y. (2020). Credit risk evaluation model with optimal discretization and feature selection. Mathematics, 8(4), 638.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_kmb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_kmb(
target = target,
feature = feature,
min_bins = 2,
max_bins = 8,
bin_cutoff = 0.03,
enforce_monotonic = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
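# Apply the learned binning to new data (a sketch; assumes the returned
# cutpoints are the interior bin boundaries, so findInterval() yields 1-based ids)
new_feature <- rnorm(100)
new_bin_id <- findInterval(new_feature, result$cutpoints) + 1
new_woe <- result$woe[new_bin_id]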
} # }