Optimal Binning for Numerical Variables using Equal-Width Binning
optimal_binning_numerical_ewb.Rd
Performs optimal binning for numerical variables using equal-width intervals as a starting point, followed by a series of statistical optimization steps. This method balances predictive power and interpretability by creating statistically stable bins with a strong relationship to the target variable. The algorithm is particularly useful for risk modeling, credit scoring, and feature engineering in classification tasks.
Usage
optimal_binning_numerical_ewb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
is_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer binary vector (0 or 1) representing the target variable.
- feature
Numeric vector with the values of the feature to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum fraction of observations each bin must contain (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- is_monotonic
Logical indicating whether to enforce monotonicity in WoE (default: TRUE).
- convergence_threshold
Convergence threshold for the optimization process (default: 1e-6).
- max_iterations
Maximum number of iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based indexing).
- bin
Character vector with the interval specification of each bin (e.g., "(-Inf;0.5]").
- woe
Numeric vector with the Weight of Evidence values for each bin.
- iv
Numeric vector with the Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the number of positive observations in each bin.
- count_neg
Integer vector with the number of negative observations in each bin.
- cutpoints
Numeric vector with the cut points between bins (excluding infinity).
- converged
Logical value indicating whether the algorithm converged.
- iterations
Number of iterations performed by the algorithm.
- total_iv
Total Information Value of the binning solution.
Details
Algorithm Overview
The implementation follows a multi-stage approach:
1. Pre-processing: validation of inputs, handling of missing values, and special treatment of features with few unique values.
2. Equal-width binning: division of the feature range into intervals of equal width and initial assignment of observations to bins (see the sketch after this list).
3. Statistical optimization: merging of rare bins whose frequency falls below bin_cutoff, optional enforcement of WoE monotonicity, and further merging to satisfy the max_bins constraint.
4. Metric calculation: computation of Weight of Evidence (WoE) and Information Value (IV) for each bin.
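As a rough illustration of the equal-width pre-binning stage, the base R sketch below divides a feature's range into equally wide intervals and assigns observations to them. It is a simplified stand-in, not the function's internal implementation; feature and n_prebins are example placeholders for the feature argument and max_n_prebins.

# Illustrative sketch of equal-width pre-binning (not the package's internal code)
set.seed(42)
feature <- rnorm(1000)
n_prebins <- 20

# Divide the observed range into intervals of equal width
breaks <- seq(min(feature), max(feature), length.out = n_prebins + 1)
breaks[1] <- -Inf                 # open the first bin to the left
breaks[length(breaks)] <- Inf     # open the last bin to the right

# Initial assignment of observations to pre-bins (right-closed intervals)
pre_bin <- cut(feature, breaks = breaks, right = TRUE)
table(pre_bin)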
Mathematical Foundation
The algorithm uses two key metrics from information theory:
Weight of Evidence (WoE) for bin \(i\): $$WoE_i = \ln\left(\frac{p_i/P}{n_i/N}\right)$$
where:
- \(p_i\): number of positive cases in bin \(i\)
- \(P\): total number of positive cases
- \(n_i\): number of negative cases in bin \(i\)
- \(N\): total number of negative cases
Information Value (IV) for bin \(i\): $$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$
The total Information Value is the sum across all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
Laplace Smoothing: To handle zero counts, the algorithm employs Laplace smoothing: $$\frac{p_i + \alpha}{P + k\alpha}, \frac{n_i + \alpha}{N + k\alpha}$$
where:
- \(\alpha\): smoothing factor (0.5 in this implementation)
- \(k\): number of bins
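For illustration, the formulas above can be reproduced from per-bin counts in a few lines of R. The counts below are made-up example values, and alpha = 0.5 matches the smoothing factor stated above; this is a didactic sketch, not the function's internal code.

# Didactic computation of WoE and IV from bin counts with Laplace smoothing
count_pos <- c(30, 55, 80, 40)     # p_i: positives per bin (example values)
count_neg <- c(170, 145, 120, 60)  # n_i: negatives per bin (example values)

alpha <- 0.5
k <- length(count_pos)             # number of bins
P <- sum(count_pos)                # total positives
N <- sum(count_neg)                # total negatives

# Smoothed class distributions per bin
dist_pos <- (count_pos + alpha) / (P + k * alpha)
dist_neg <- (count_neg + alpha) / (N + k * alpha)

woe <- log(dist_pos / dist_neg)
iv  <- (dist_pos - dist_neg) * woe
total_iv <- sum(iv)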
Monotonicity Enforcement
When is_monotonic = TRUE, the algorithm ensures that WoE values either consistently increase or consistently decrease across bins. This property is desirable for:
- Interpretability: monotonic relationships are easier to explain
- Robustness: reduces overfitting and improves stability
- Business logic: aligns with domain knowledge expectations
The algorithm determines the preferred monotonicity direction (increasing or decreasing) based on the initial bins and proceeds to merge bins that violate this pattern while minimizing information loss.
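A minimal sketch of the kind of monotonicity check described above is shown below; the merge step is only indicated in a comment, and this illustrates the idea rather than the package's actual merging logic.

# Simplified illustration of a WoE monotonicity check (not the package's merging logic)
is_monotone <- function(woe) {
  d <- diff(woe)
  all(d >= 0) || all(d <= 0)
}

woe <- c(-0.8, -0.2, 0.1, -0.05, 0.6)
is_monotone(woe)   # FALSE: bin 4 breaks the increasing pattern

# A merge-based fix would combine the violating bin with a neighbour
# (recomputing counts and WoE) and repeat until is_monotone() returns TRUE.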
References
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and Unsupervised Discretization of Continuous Features. Proceedings of the Twelfth International Conference on Machine Learning, 194-202.
García, S., Luengo, J., Sáez, J. A., López, V., & Herrera, F. (2013). A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), 734-750.
Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization Techniques: A Recent Survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Zeng, Y. (2014). Univariate feature selection and binner. arXiv preprint arXiv:1410.5420.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_ewb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_ewb(
target = target,
feature = feature,
min_bins = 2,
max_bins = 8,
bin_cutoff = 0.03,
is_monotonic = TRUE
)
# Extract cutpoints for use in prediction
cutpoints <- result$cutpoints
# Calculate total information value
total_iv <- result$total_iv
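# Illustrative addition (not part of the original example): use the returned
# cutpoints to assign new observations to the same bins.
# left.open = TRUE mirrors the right-closed intervals "(a;b]".
new_feature <- rnorm(100)
new_bin_id <- findInterval(new_feature, cutpoints, left.open = TRUE) + 1L
table(new_bin_id)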
} # }