Optimal Binning for Numerical Variables using Monotonic Risk Binning with Likelihood Ratio Pre-binning (MRBLP)
This function implements an optimal binning algorithm for numerical variables using Monotonic Risk Binning with Likelihood Ratio Pre-binning (MRBLP). It discretizes a continuous feature into bins whose Weight of Evidence (WoE) is monotonic across bins, while aiming to retain as much predictive power (Information Value) as possible.
Usage
optimal_binning_numerical_mrblp(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
convergence_threshold = 1e-06,
max_iterations = 1000L,
laplace_smoothing = 0.5
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A numeric vector of the continuous feature to be binned.
- min_bins
Integer. The minimum number of bins to create (default: 3).
- max_bins
Integer. The maximum number of bins to create (default: 5).
- bin_cutoff
Numeric. The minimum proportion of observations in each bin (default: 0.05).
- max_n_prebins
Integer. The maximum number of pre-bins to create during the initial binning step (default: 20).
- convergence_threshold
Numeric. The threshold for convergence in the monotonic binning step (default: 1e-6).
- max_iterations
Integer. The maximum number of iterations for the monotonic binning step (default: 1000).
- laplace_smoothing
Numeric. Smoothing parameter for WoE calculation (default: 0.5).
Value
A list containing the following elements:
- id
Bin identifiers (1-based)
- bin
A character vector of bin ranges
- woe
A numeric vector of Weight of Evidence (WoE) values for each bin
- iv
A numeric vector of Information Value (IV) for each bin
- count
An integer vector of the total count of observations in each bin
- count_pos
An integer vector of the count of positive observations in each bin
- count_neg
An integer vector of the count of negative observations in each bin
- event_rate
A numeric vector with the proportion of positive cases in each bin
- cutpoints
A numeric vector of cutpoints used to create the bins
- total_iv
The total Information Value of all bins combined
- converged
A logical value indicating whether the algorithm converged
- iterations
An integer value indicating the number of iterations run
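A common follow-up is mapping new feature values to the WoE of their bin. The sketch below uses a hypothetical `result` list with made-up values rather than real package output, and it assumes that `cutpoints` holds only the interior boundaries (one fewer than the number of bins) and that bins are closed on the right. Neither assumption is confirmed by this page, so check the `bin` labels of an actual result before relying on it.

```r
# Hypothetical result with the documented fields; values are illustrative only
result <- list(
  cutpoints = c(-0.5, 0.3, 1.1),
  woe       = c(-0.62, -0.15, 0.21, 0.74)
)

new_values <- c(-2, 0, 0.3, 5)

# findInterval() counts how many cutpoints lie below each value;
# left.open = TRUE makes the intervals right-closed: (a, b]
bin_index  <- findInterval(new_values, result$cutpoints, left.open = TRUE) + 1L
woe_values <- result$woe[bin_index]
```

If the actual bins turn out to be left-closed, dropping `left.open = TRUE` gives the matching `[a, b)` convention.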
Details
Mathematical Framework:
Weight of Evidence (WoE): For a bin \(i\) with Laplace smoothing \(\alpha\):
$$WoE_i = \ln\left(\frac{n_{1i} + \alpha}{n_{1} + m\alpha} \cdot \frac{n_{0} + m\alpha}{n_{0i} + \alpha}\right)$$
Where:
\(n_{1i}\) is the count of positive cases in bin \(i\)
\(n_{0i}\) is the count of negative cases in bin \(i\)
\(n_{1}\) is the total count of positive cases
\(n_{0}\) is the total count of negative cases
\(m\) is the number of bins
\(\alpha\) is the Laplace smoothing parameter
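The formula above can be sketched directly in R. The helper name `smoothed_woe` and the counts are hypothetical and chosen only for illustration; this is not the package's internal code.

```r
# Smoothed WoE per bin, following the formula above
# pos = n_{1i}, neg = n_{0i}, alpha = Laplace smoothing parameter
smoothed_woe <- function(pos, neg, alpha = 0.5) {
  m <- length(pos)                                     # number of bins
  dist_pos <- (pos + alpha) / (sum(pos) + m * alpha)   # smoothed share of positives
  dist_neg <- (neg + alpha) / (sum(neg) + m * alpha)   # smoothed share of negatives
  log(dist_pos / dist_neg)
}

# Hypothetical counts for four bins
count_pos <- c(30, 50, 80, 140)
count_neg <- c(270, 250, 220, 160)
woe <- smoothed_woe(count_pos, count_neg, alpha = 0.5)

# Smoothing keeps WoE finite even when a bin has no positives or no negatives
woe_sparse <- smoothed_woe(c(0, 10), c(10, 0), alpha = 0.5)
```

With `alpha = 0`, the second call would return `-Inf` and `Inf`, which is the case the smoothing term is meant to avoid.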
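Information Value (IV): Summarizes predictive power across all bins: $$IV = \sum_{i} \left(P(X_i \mid Y=1) - P(X_i \mid Y=0)\right) \times WoE_i$$ where \(P(X_i \mid Y=y)\) is the proportion of observations with target value \(y\) that fall in bin \(i\).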
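A minimal sketch of the IV calculation follows. It assumes the per-bin proportions use the same Laplace-smoothed counts as the WoE formula; this page does not state whether the package smooths the proportions in the IV term, so treat that as an assumption. The function name `information_value` is hypothetical.

```r
# Per-bin and total IV from per-bin counts
# Assumption: smoothed proportions are used in both the difference and the WoE
information_value <- function(pos, neg, alpha = 0.5) {
  m <- length(pos)
  dist_pos <- (pos + alpha) / (sum(pos) + m * alpha)
  dist_neg <- (neg + alpha) / (sum(neg) + m * alpha)
  woe    <- log(dist_pos / dist_neg)
  iv_bin <- (dist_pos - dist_neg) * woe   # each term is non-negative
  list(woe = woe, iv = iv_bin, total_iv = sum(iv_bin))
}

# Hypothetical counts for four bins
res <- information_value(c(30, 50, 80, 140), c(270, 250, 220, 160))

# A feature whose bins carry no signal has zero IV
res_flat <- information_value(c(10, 20), c(10, 20))
```

Each term has the form \((a - b)\ln(a/b)\), which is never negative, so the per-bin `iv` values can be summed into `total_iv` without cancellation.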
Algorithm Steps:
1. Pre-binning: Up to max_n_prebins initial bins are created using equal-frequency binning.
2. Merge Small Bins: Bins whose share of observations falls below bin_cutoff are merged.
3. Enforce Monotonicity: Bins that violate monotonicity in WoE are merged.
4. Adjust Bin Count: Bins are merged/split to respect min_bins and max_bins.
5. Calculate Metrics: Final WoE and IV values are computed with Laplace smoothing.
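The pre-binning and monotonicity steps can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper names are hypothetical, the monotonic direction is passed in by the caller rather than detected, violating bins are merged with their right-hand neighbor, and the small-bin merge and the min_bins/max_bins adjustment are omitted.

```r
# Smoothed WoE, following the formula in the Mathematical Framework
smoothed_woe <- function(pos, neg, alpha = 0.5) {
  m <- length(pos)
  log(((pos + alpha) / (sum(pos) + m * alpha)) /
      ((neg + alpha) / (sum(neg) + m * alpha)))
}

# Pre-binning sketch: equal-frequency bins from quantile breakpoints
prebin_equal_freq <- function(feature, target, max_n_prebins = 20) {
  probs  <- seq(0, 1, length.out = max_n_prebins + 1)
  breaks <- unique(quantile(feature, probs = probs, names = FALSE))
  bin_id <- cut(feature, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  bins   <- factor(bin_id, levels = seq_len(length(breaks) - 1))
  list(pos   = as.numeric(table(bins[target == 1])),
       neg   = as.numeric(table(bins[target == 0])),
       upper = breaks[-1])               # upper edge of each pre-bin
}

# Monotonicity sketch: merge the first violating bin with its right neighbor
# until WoE is monotone or only two bins remain
merge_until_monotone <- function(pos, neg, upper, increasing = TRUE, alpha = 0.5) {
  repeat {
    d   <- diff(smoothed_woe(pos, neg, alpha))
    bad <- if (increasing) which(d < 0) else which(d > 0)
    if (length(bad) == 0 || length(pos) <= 2) break
    i <- bad[1]
    pos[i]   <- pos[i] + pos[i + 1]
    neg[i]   <- neg[i] + neg[i + 1]
    upper[i] <- upper[i + 1]
    pos   <- pos[-(i + 1)]
    neg   <- neg[-(i + 1)]
    upper <- upper[-(i + 1)]
  }
  list(pos = pos, neg = neg, upper = upper)
}

# Simulated data, similar in spirit to the Examples section
set.seed(1)
feature <- rnorm(2000)
target  <- rbinom(2000, 1, plogis(-1 + 0.8 * feature))

pre    <- prebin_equal_freq(feature, target)
merged <- merge_until_monotone(pre$pos, pre$neg, pre$upper)
```

Each pass of the loop removes one bin, so the merge always terminates; observation counts are conserved because merged bins simply add their positives and negatives.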
References
Belcastro, L., Marozzo, F., Talia, D., & Trunfio, P. (2020). "Big Data Analytics on Clouds." In Handbook of Big Data Technologies (pp. 101-142). Springer, Cham.
Zeng, Y. (2014). "Optimal Binning for Scoring Modeling." Computational Economics, 44(1), 137-149.
Good, I.J. (1952). "Rational Decisions." Journal of the Royal Statistical Society, Series B, 14, 107-114. (Origin of Laplace smoothing/additive smoothing)
Examples
if (FALSE) { # \dontrun{
# Generate sample data
set.seed(42)
n <- 10000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 + 0.5 * feature))
# Run optimal binning
result <- optimal_binning_numerical_mrblp(target, feature)
# View binning results
print(result)
# Plot Weight of Evidence against bins
plot(result$woe, type = "b", xlab = "Bin", ylab = "WoE",
     main = "Weight of Evidence by Bin")
abline(h = 0, lty = 2)
} # }