Optimal Binning for Numerical Variables using Monotonic Risk Binning with Likelihood Ratio Pre-binning (MRBLP) — optimal_binning_numerical

Bins a numerical variable using Monotonic Risk Binning with Likelihood Ratio Pre-binning (MRBLP). The continuous feature is converted into discrete bins whose Weight of Evidence (WoE) is monotonic across bins, while retaining as much predictive power (Information Value) as the bin constraints allow.

Usage

optimal_binning_numerical_mrblp(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  laplace_smoothing = 0.5
)

Arguments

target: An integer vector of binary target values (0 or 1).
feature: A numeric vector of the continuous feature to be binned.
min_bins: Integer. The minimum number of bins to create (default: 3).
max_bins: Integer. The maximum number of bins to create (default: 5).
bin_cutoff: Numeric. The minimum proportion of observations in each bin; bins below this share are merged (default: 0.05).
max_n_prebins: Integer. The maximum number of pre-bins to create during the initial binning step (default: 20).
convergence_threshold: Numeric. The threshold for convergence in the monotonic binning step (default: 1e-6).
max_iterations: Integer. The maximum number of iterations for the monotonic binning step (default: 1000).
laplace_smoothing: Numeric. Smoothing parameter for WoE calculation (default: 0.5).

Value

A list containing the following elements:

id: Bin identifiers (1-based)
bin: A character vector of bin ranges
woe: A numeric vector of Weight of Evidence (WoE) values for each bin
iv: A numeric vector of each bin's contribution to the Information Value (IV)
count: An integer vector of the total count of observations in each bin
count_pos: An integer vector of the count of positive observations in each bin
count_neg: An integer vector of the count of negative observations in each bin
event_rate: A numeric vector with the proportion of positive cases in each bin
cutpoints: A numeric vector of cutpoints used to create the bins
total_iv: The total Information Value of all bins combined
converged: A logical value indicating whether the algorithm converged
iterations: An integer value indicating the number of iterations run
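
The returned cutpoints and woe vectors can be used to score new observations. The following is a minimal sketch, not part of the package: the `result` list is a hypothetical stand-in with invented values, it assumes `cutpoints` holds only the interior boundaries (so there is one more bin than cutpoints), and it assumes bins are closed on the left. Check the `bin` labels of a real result to confirm which side is closed before relying on this.

```r
# Hypothetical result carrying two of the documented fields (values invented for illustration)
result <- list(
  cutpoints = c(-0.8, 0.1, 0.9),
  woe       = c(-0.70, -0.15, 0.25, 0.80)
)

new_feature <- c(-2.0, -0.5, 0.1, 0.5, 3.0)

# findInterval() returns 0 for values below the first cutpoint,
# so adding 1 yields 1-based bin ids.
# Assumption: bins are left-closed, [a, b); the package may use (a, b] instead.
bin_id  <- findInterval(new_feature, result$cutpoints) + 1L

# Look up the WoE of the bin each new observation falls into
new_woe <- result$woe[bin_id]
```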

Details

Mathematical Framework:

Weight of Evidence (WoE): For bin $i$ with Laplace smoothing parameter $\alpha$:

$$WoE_i = \ln\left(\frac{n_{1i} + \alpha}{n_{1} + m\alpha} \cdot \frac{n_{0} + m\alpha}{n_{0i} + \alpha}\right)$$

where:

$n_{1i}$ is the count of positive cases in bin $i$
$n_{0i}$ is the count of negative cases in bin $i$
$n_{1}$ is the total count of positive cases
$n_{0}$ is the total count of negative cases
$m$ is the number of bins
$\alpha$ is the Laplace smoothing parameter
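
The smoothed WoE formula can be evaluated directly from per-bin counts. The helper below is an illustrative sketch that follows the formula as written; `smoothed_woe` and the example counts are hypothetical and are not taken from the package source.

```r
# Smoothed WoE per bin from positive/negative counts.
# Illustrative helper only; not the package's internal code.
smoothed_woe <- function(count_pos, count_neg, alpha = 0.5) {
  m <- length(count_pos)                                          # number of bins
  dist_pos <- (count_pos + alpha) / (sum(count_pos) + m * alpha)  # smoothed share of positives
  dist_neg <- (count_neg + alpha) / (sum(count_neg) + m * alpha)  # smoothed share of negatives
  log(dist_pos / dist_neg)
}

# Invented counts for three bins
count_pos <- c(10, 30, 60)
count_neg <- c(90, 70, 40)
woe <- smoothed_woe(count_pos, count_neg)

# With alpha > 0, a bin with zero positives still gets a finite WoE
woe_sparse <- smoothed_woe(c(0, 40, 60), c(50, 30, 20))
```

The second call shows why smoothing is used: without it, an empty class in a bin would give a WoE of negative or positive infinity.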

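Information Value (IV): Summarizes predictive power across all $m$ bins:

$$IV = \sum_{i=1}^{m} \left(P(X \in B_i \mid Y=1) - P(X \in B_i \mid Y=0)\right) \times WoE_i$$

where $B_i$ denotes bin $i$, so the two probabilities are the shares of positive and negative cases that fall in bin $i$.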
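
To make the IV sum concrete, the sketch below computes each bin's contribution and the total from per-bin counts. It assumes the two class-conditional shares use the same Laplace smoothing as the WoE formula; the documentation does not state this explicitly, and `bin_iv` is a hypothetical helper rather than a package function.

```r
# Per-bin IV contributions from positive/negative counts.
# Assumption: the shares in the difference term are smoothed like the WoE.
bin_iv <- function(count_pos, count_neg, alpha = 0.5) {
  m <- length(count_pos)
  dist_pos <- (count_pos + alpha) / (sum(count_pos) + m * alpha)
  dist_neg <- (count_neg + alpha) / (sum(count_neg) + m * alpha)
  (dist_pos - dist_neg) * log(dist_pos / dist_neg)   # (share difference) x WoE
}

iv       <- bin_iv(c(10, 30, 60), c(90, 70, 40))  # invented counts for three bins
total_iv <- sum(iv)                               # overall Information Value

# Bins where both classes have identical shares contribute nothing
iv_flat <- bin_iv(c(5, 5), c(10, 10))
```

Each contribution is non-negative because the share difference and the log ratio always have the same sign.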

Algorithm Steps:

Pre-binning: Initial bins are created using equal-frequency binning.
Merge Small Bins: Bins with frequency below the threshold are merged.
Enforce Monotonicity: Bins that violate monotonicity in WoE are merged.
Adjust Bin Count: Bins are merged or split so that the final number of bins lies between min_bins and max_bins.
Calculate Metrics: Final WoE and IV values are computed with Laplace smoothing.
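
The steps above can be sketched in plain R. This is an illustrative simplification under stated assumptions, not the package's implementation: `sketch_monotonic_binning` is a hypothetical name, bins are taken as left-closed, the monotonic direction is inferred from the first and last bin, the closest adjacent WoE pair is merged when there are too many bins, and the splitting needed to honor min_bins is omitted.

```r
# Illustrative sketch of the listed steps; NOT the package's source code.
sketch_monotonic_binning <- function(target, feature, max_n_prebins = 20,
                                     bin_cutoff = 0.05, max_bins = 5, alpha = 0.5) {
  # Step 1: equal-frequency pre-bins from interior quantiles
  probs <- seq(0, 1, length.out = max_n_prebins + 1)
  cuts  <- unique(quantile(feature, probs[-c(1, length(probs))], names = FALSE))

  # Helper: per-bin counts, smoothed WoE, and IV terms for a set of cutpoints
  stats <- function(cuts) {
    k   <- length(cuts) + 1L
    bin <- findInterval(feature, cuts) + 1L        # bin ids 1..k, left-closed bins
    pos <- tabulate(bin[target == 1], nbins = k)
    neg <- tabulate(bin[target == 0], nbins = k)
    dp  <- (pos + alpha) / (sum(pos) + k * alpha)
    dn  <- (neg + alpha) / (sum(neg) + k * alpha)
    list(n = pos + neg, woe = log(dp / dn), iv = (dp - dn) * log(dp / dn))
  }

  # Step 2: merge bins holding less than bin_cutoff of the observations.
  # Dropping cutpoint j merges bin j with bin j + 1.
  repeat {
    s     <- stats(cuts)
    small <- which(s$n / length(feature) < bin_cutoff)
    if (length(small) == 0 || length(cuts) <= 1) break
    i    <- small[1]
    drop <- if (i <= length(cuts)) i else i - 1L   # last bin merges to the left
    cuts <- cuts[-drop]
  }

  # Steps 3 and 4: merge until WoE is monotonic and at most max_bins remain
  repeat {
    s   <- stats(cuts)
    d   <- diff(s$woe)                             # d[j] compares bins j and j + 1
    direction <- if (sum(d) >= 0) 1 else -1        # sum(d) = last WoE - first WoE
    bad <- which(direction * d < 0)
    if (length(bad) > 0) {
      cuts <- cuts[-bad[1]]                        # merge the first violating pair
    } else if (length(cuts) + 1L > max_bins) {
      cuts <- cuts[-which.min(abs(d))]             # merge the most similar neighbours
    } else break
  }

  # Step 5: final metrics
  s <- stats(cuts)
  list(cutpoints = cuts, woe = s$woe, iv = s$iv, count = s$n, total_iv = sum(s$iv))
}

set.seed(42)
x   <- rnorm(5000)
y   <- rbinom(5000, 1, plogis(0.5 + x))
res <- sketch_monotonic_binning(y, x)
```

Every iteration of either loop removes one cutpoint, so both loops terminate. Because monotonicity is rechecked after each merge, the final WoE sequence is monotonic whenever the second loop exits.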

References

Belcastro, L., Marozzo, F., Talia, D., & Trunfio, P. (2020). "Big Data Analytics on Clouds." In Handbook of Big Data Technologies (pp. 101-142). Springer, Cham.
Zeng, Y. (2014). "Optimal Binning for Scoring Modeling." Computational Economics, 44(1), 137-149.
Good, I.J. (1952). "Rational Decisions." Journal of the Royal Statistical Society, Series B, 14, 107-114. (Origin of Laplace smoothing/additive smoothing)

Examples

if (FALSE) { # \dontrun{
# Generate sample data
set.seed(42)
n <- 10000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 + 0.5 * feature))

# Run optimal binning
result <- optimal_binning_numerical_mrblp(target, feature)

# View binning results
print(result)

# Plot Weight of Evidence against bins
plot(result$woe, type = "b", xlab = "Bin", ylab = "WoE",
     main = "Weight of Evidence by Bin")
abline(h = 0, lty = 2)
} # }