Optimal Binning for Numerical Variables using Isotonic Regression
optimal_binning_numerical_ir.Rd
Implements an advanced binning algorithm for numerical variables using isotonic regression to ensure monotonicity in bin event rates. This method is particularly valuable for risk modeling, credit scoring, and other applications where monotonic relationships between features and target variables are expected or preferred.
Usage
optimal_binning_numerical_ir(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  auto_monotonicity = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
Binary integer vector (0 or 1) representing the target variable.
- feature
Numeric vector of values to be binned.
- min_bins
Minimum number of bins to generate (default: 3).
- max_bins
Maximum number of bins allowed (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- auto_monotonicity
Automatically determine monotonicity direction (default: TRUE).
- convergence_threshold
Convergence threshold for optimization (default: 1e-6).
- max_iterations
Maximum number of iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with the bin intervals.
- woe
Numeric vector with Weight of Evidence values for each bin.
- iv
Numeric vector with Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the positive class counts in each bin.
- count_neg
Integer vector with the negative class counts in each bin.
- cutpoints
Numeric vector with the bin cutpoints (excluding ±Inf).
- converged
Logical value indicating whether the algorithm converged.
- iterations
Integer with the number of optimization iterations performed.
- total_iv
Total Information Value of the binning solution.
- monotone_increasing
Logical indicating whether the bin event rates are monotonically increasing (TRUE) or decreasing (FALSE).
Details
Algorithm Overview
The algorithm transforms a continuous feature into discrete bins that maximize its predictive relationship with a binary target (as measured by Information Value) while enforcing monotonicity constraints. It operates through five phases:
1. Pre-Binning: Initial segmentation based on quantiles or unique feature values (a minimal sketch follows this list)
2. Frequency Stabilization: Merging of low-frequency bins to ensure statistical reliability
3. Monotonicity Enforcement: Application of isotonic regression via Pool Adjacent Violators (PAV)
4. Bin Optimization: Adjustments to meet the min_bins and max_bins constraints
5. Information Value Calculation: Computation of WoE and IV metrics for each bin
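As an illustration of the pre-binning phase, here is a minimal quantile-based sketch in R. The helper pre_bin_edges is hypothetical, and the tie handling and open outer edges are assumptions for illustration, not the package's internals.

# Quantile-based pre-binning sketch; `pre_bin_edges` is a hypothetical
# helper, not part of the package.
pre_bin_edges <- function(feature, max_n_prebins = 20L) {
  probs <- seq(0, 1, length.out = max_n_prebins + 1L)
  edges <- unique(quantile(feature, probs = probs, na.rm = TRUE, names = FALSE))
  # Open the outer edges so every value falls into some bin
  edges[1] <- -Inf
  edges[length(edges)] <- Inf
  edges
}
set.seed(42)
x <- rnorm(500)
table(cut(x, breaks = pre_bin_edges(x)))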
Mathematical Foundation
The core mathematical concepts employed in this algorithm are:
1. Isotonic Regression
Isotonic regression solves the following optimization problem:
$$\min_{\mu} \sum_{i=1}^{n} w_i (y_i - \mu_i)^2$$
Subject to: $$\mu_1 \leq \mu_2 \leq \ldots \leq \mu_n$$ (for increasing monotonicity)
Where:
\(y_i\) is the original event rate in bin \(i\)
\(w_i\) is the weight (observation count) of bin \(i\)
\(\mu_i\) is the isotonic (monotone) estimate for bin \(i\)
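Base R's stats::isoreg solves the unweighted increasing case. A minimal weighted PAV sketch is shown below; the function pava is a hypothetical illustration, not the package's internal routine.

# Weighted pool-adjacent-violators sketch (increasing direction).
pava <- function(y, w = rep(1, length(y))) {
  val <- numeric(0); wt <- numeric(0); len <- integer(0)
  for (i in seq_along(y)) {
    val <- c(val, y[i]); wt <- c(wt, w[i]); len <- c(len, 1L)
    k <- length(val)
    # Merge adjacent blocks while the increasing constraint is violated
    while (k > 1 && val[k - 1] > val[k]) {
      w_new <- wt[k - 1] + wt[k]
      val[k - 1] <- (wt[k - 1] * val[k - 1] + wt[k] * val[k]) / w_new
      wt[k - 1] <- w_new
      len[k - 1] <- len[k - 1] + len[k]
      val <- val[-k]; wt <- wt[-k]; len <- len[-k]
      k <- k - 1
    }
  }
  rep(val, times = len)  # expand block means back to the original length
}
# Event rates 0.10, 0.25, 0.20, 0.40 with bin counts as weights:
pava(c(0.10, 0.25, 0.20, 0.40), w = c(100, 80, 120, 90))
# [1] 0.10 0.22 0.22 0.40

For a decreasing fit, the same routine can be applied to -y, negating the result afterwards.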
2. Weight of Evidence (WoE)
For each bin \(i\), the Weight of Evidence is defined as:
$$WoE_i = \ln\left(\frac{p_i/P}{n_i/N}\right)$$
Where:
\(p_i\): Number of positive cases in bin \(i\)
\(P\): Total number of positive cases
\(n_i\): Number of negative cases in bin \(i\)
\(N\): Total number of negative cases
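The per-bin Information Value contribution (returned as iv, with total_iv as their sum) follows from the same quantities:
$$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \cdot WoE_i, \qquad IV = \sum_{i} IV_i$$
A minimal sketch of both computations from per-bin counts; the helper woe_iv is hypothetical, not the package's internal routine:

# WoE/IV from per-bin counts; illustrative only.
woe_iv <- function(count_pos, count_neg) {
  dist_pos <- count_pos / sum(count_pos)  # p_i / P
  dist_neg <- count_neg / sum(count_neg)  # n_i / N
  woe <- log(dist_pos / dist_neg)
  iv  <- (dist_pos - dist_neg) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}
woe_iv(count_pos = c(10, 40, 80), count_neg = c(90, 60, 20))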
Key Features
Automatic Monotonicity Direction: Determines the optimal direction of monotonicity (increasing or decreasing) from the data (one plausible rule is sketched after this list)
Robust Handling of Edge Cases: Special processing for few unique values, missing data, etc.
Optimal Information Preservation: Merges bins to minimize information loss while meeting constraints
Statistical Reliability: Ensures each bin has sufficient observations for stable estimates
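One plausible way to determine the direction automatically (an assumption for illustration, not necessarily the rule used internally) is to follow the sign of the rank correlation between feature and target:

# Plausible direction rule: sign of the Spearman rank correlation.
# An assumption for illustration, not the package's documented behavior.
pick_direction <- function(feature, target) {
  rho <- suppressWarnings(cor(feature, target, method = "spearman"))
  if (is.na(rho) || rho >= 0) "increasing" else "decreasing"
}
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-1.5 * x))  # event rate falls as x rises
pick_direction(x, y)  # "decreasing" for data like this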
References
Barlow, R. E., & Brunk, H. D. (1972). The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337), 140-147.
Robertson, T., Wright, F. T., & Dykstra, R. L. (1988). Order restricted statistical inference. Wiley.
de Leeuw, J., Hornik, K., & Mair, P. (2009). Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. Journal of Statistical Software, 32(5), 1-24.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. Society for Industrial and Applied Mathematics.
Belkin, M., Hsu, D., & Mitra, P. (2018). Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate. Advances in Neural Information Processing Systems.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
n <- 1000
target <- sample(0:1, n, replace = TRUE)
feature <- rnorm(n)
# Basic usage
result <- optimal_binning_numerical_ir(target, feature)
print(result)
# Custom settings
result_custom <- optimal_binning_numerical_ir(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 6,
  bin_cutoff = 0.03,
  auto_monotonicity = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
is_increasing <- result$monotone_increasing
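# Reapply the learned cutpoints to new data. This sketch assumes
# right-closed intervals (low, high]; that convention is an assumption,
# not a documented guarantee.
new_feature <- rnorm(50)
bin_id <- findInterval(new_feature, result$cutpoints, left.open = TRUE) + 1L
new_woe <- result$woe[bin_id]  # WoE-encoded feature for downstream models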
} # }