
This function bins a numerical variable using Unsupervised Binning based on Standard Deviation (UBSD). Initial bins are derived from the mean and standard deviation of the feature and are then refined using Weight of Evidence (WoE) and Information Value (IV) criteria. The resulting bins are intended to be interpretable and to have monotonic WoE values.

Usage

optimal_binning_numerical_ubsd(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  laplace_smoothing = 0.5
)

Arguments

target

A numeric vector of binary target values (should contain exactly two unique values: 0 and 1).

feature

A numeric vector of feature values to be binned.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum frequency of observations in each bin, expressed as a proportion of all observations (default: 0.05). Bins below this threshold are merged.

max_n_prebins

Maximum number of pre-bins for initial standard deviation-based discretization (default: 20).

convergence_threshold

Threshold for convergence of the total IV (default: 1e-6).

max_iterations

Maximum number of iterations for the algorithm (default: 1000).

laplace_smoothing

Laplace smoothing parameter \(\alpha\) used in the WoE calculation (default: 0.5).

Value

A list containing the following elements:

id

A numeric vector of bin identifiers (1-based).

bin

A character vector of bin names.

woe

A numeric vector of Weight of Evidence values for each bin.

iv

A numeric vector of Information Value contributions, one per bin.

count

An integer vector of the total count of observations in each bin.

count_pos

An integer vector of the count of positive observations in each bin.

count_neg

An integer vector of the count of negative observations in each bin.

event_rate

A numeric vector of the proportion of positive cases in each bin.

cutpoints

A numeric vector of cut points used to generate the bins.

total_iv

A numeric value of the total Information Value.

converged

A logical value indicating whether the algorithm converged.

iterations

An integer value indicating the number of iterations run.
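As a rough illustration of how the returned elements might be used, the sketch below maps new feature values to bins and looks up their WoE. The `result` list here is a mock with made-up values, not real output. It assumes that `cutpoints` holds the interior bin boundaries in increasing order and that bins are right-closed; neither detail is confirmed by this page.

```r
# Mock result using the documented element names (values are made up).
result <- list(
  cutpoints = c(-0.5, 0.5),
  woe       = c(-0.40, 0.02, 0.45)
)

new_feature <- c(-2.1, -0.5, 0.0, 0.7, 3.2)

# Assumption: cutpoints are interior boundaries and intervals are
# right-closed, i.e. (-Inf, -0.5], (-0.5, 0.5], (0.5, Inf).
bin_id <- findInterval(new_feature, result$cutpoints, left.open = TRUE) + 1L

# Replace each value by the WoE of the bin it falls into.
new_woe <- result$woe[bin_id]
```

If the actual bins turn out to be left-closed, dropping `left.open = TRUE` would be the corresponding change.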

Details

Mathematical Framework:

Weight of Evidence (WoE): For bin \(i\) with Laplace smoothing parameter \(\alpha\): $$WoE_i = \ln\left(\frac{n_{1i} + \alpha}{n_{1} + m\alpha} \cdot \frac{n_{0} + m\alpha}{n_{0i} + \alpha}\right)$$ where:

  • \(n_{1i}\) is the count of positive cases in bin \(i\)

  • \(n_{0i}\) is the count of negative cases in bin \(i\)

  • \(n_{1}\) is the total count of positive cases

  • \(n_{0}\) is the total count of negative cases

  • \(m\) is the number of bins

  • \(\alpha\) is the Laplace smoothing parameter
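The smoothed WoE formula above can be sketched in base R from per-bin counts. The helper name `woe_smoothed` and the example counts are hypothetical and are not part of the package; the package's internal implementation may differ.

```r
# Hypothetical helper (not part of the package): smoothed WoE per bin.
# count_pos / count_neg are per-bin counts of positive and negative cases.
woe_smoothed <- function(count_pos, count_neg, alpha = 0.5) {
  m  <- length(count_pos)                        # number of bins
  n1 <- sum(count_pos)                           # total positives
  n0 <- sum(count_neg)                           # total negatives
  p1 <- (count_pos + alpha) / (n1 + m * alpha)   # smoothed share of positives
  p0 <- (count_neg + alpha) / (n0 + m * alpha)   # smoothed share of negatives
  log(p1 / p0)
}

# Made-up counts for three bins
count_pos <- c(20, 50, 130)
count_neg <- c(180, 150, 70)
woe <- woe_smoothed(count_pos, count_neg)
```

With \(\alpha > 0\), a bin that has no positive or no negative cases still yields a finite WoE, which is the usual motivation for the smoothing term.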

Information Value (IV): Summarizes predictive power across all \(m\) bins, where \(B_i\) denotes bin \(i\): $$IV = \sum_{i=1}^{m} \left(P(X \in B_i \mid Y=1) - P(X \in B_i \mid Y=0)\right) \times WoE_i$$
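A minimal sketch of the IV calculation in base R follows. The helper `iv_by_bin` and the counts are hypothetical. It assumes the two conditional probabilities are estimated with the same Laplace-smoothed proportions that appear in the WoE formula; the page does not state which estimate the package uses.

```r
# Hypothetical helper (not part of the package): per-bin IV contributions.
# Assumption: smoothed proportions are used for both the difference term
# and the WoE term.
iv_by_bin <- function(count_pos, count_neg, alpha = 0.5) {
  m   <- length(count_pos)
  p1  <- (count_pos + alpha) / (sum(count_pos) + m * alpha)
  p0  <- (count_neg + alpha) / (sum(count_neg) + m * alpha)
  woe <- log(p1 / p0)
  (p1 - p0) * woe
}

# Made-up counts for three bins
iv       <- iv_by_bin(c(20, 50, 130), c(180, 150, 70))
total_iv <- sum(iv)
```

Each contribution is non-negative because the difference term and the WoE term always share the same sign.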

Algorithm Steps:

  1. Initial Binning: Create bins using statistical properties of the data (mean and standard deviation)

  2. Merge Small Bins: Combine bins with frequency below the threshold to ensure statistical stability

  3. Calculate WoE/IV: Compute Weight of Evidence and Information Value with Laplace smoothing

  4. Enforce Monotonicity: Merge bins to ensure a monotonic relationship between feature and target

  5. Adjust Bin Count: Ensure the number of bins is within the specified range

  6. Validate Bins: Perform statistical checks on the final binning solution
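The first steps above can be sketched roughly in base R. This is only an illustration under stated assumptions: the page does not specify the exact rule for the initial bins, so cut points at the mean plus or minus whole multiples of the standard deviation are assumed here, and the merge rule (drop the boundary shared with a neighboring bin) is likewise a simplification rather than the package's implementation.

```r
set.seed(42)
feature <- rnorm(2000)
target  <- rbinom(2000, 1, plogis(0.8 * feature))

# Step 1 (assumed form): candidate cut points at mean + k * sd, k = -2..2,
# keeping only those that fall inside the observed range.
mu   <- mean(feature)
s    <- sd(feature)
cuts <- mu + (-2:2) * s
cuts <- cuts[cuts > min(feature) & cuts < max(feature)]
bin_id <- findInterval(feature, cuts) + 1L

# Step 2: merge bins whose relative frequency is below bin_cutoff by
# dropping the boundary they share with a neighbor.
bin_cutoff <- 0.05
repeat {
  freq  <- tabulate(bin_id, nbins = length(cuts) + 1L) / length(feature)
  small <- which(freq < bin_cutoff)
  if (length(small) == 0L || length(cuts) == 0L) break
  i <- small[1]
  # Merge with the right neighbor, or with the left one for the last bin.
  drop_idx <- if (i <= length(cuts)) i else i - 1L
  cuts   <- cuts[-drop_idx]
  bin_id <- findInterval(feature, cuts) + 1L
}

# Steps 3-4 (check only): event rate per bin and a monotonicity test.
event_rate   <- tapply(target, bin_id, mean)
d            <- diff(event_rate)
is_monotonic <- all(d >= 0) || all(d <= 0)
```

A full implementation would continue merging adjacent bins while `is_monotonic` is `FALSE` and would then adjust the bin count to lie between `min_bins` and `max_bins`.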

References

  • Thomas, L.C. (2009). "Consumer Credit Models: Pricing, Profit, and Portfolios." Oxford University Press.

  • Scott, D.W. (2015). "Multivariate Density Estimation: Theory, Practice, and Visualization." John Wiley & Sons.

  • Good, I.J. (1952). "Rational Decisions." Journal of the Royal Statistical Society, Series B, 14, 107-114.

  • Belcastro, L., Marozzo, F., Talia, D., & Trunfio, P. (2020). "Big Data Analytics." Handbook of Big Data Technologies, Springer.

Examples

if (FALSE) { # \dontrun{
# Generate sample data
set.seed(123)
n <- 10000
feature <- rnorm(n)
target <- rbinom(n, 1, plogis(0.5 * feature))

# Apply optimal binning
result <- optimal_binning_numerical_ubsd(target, feature, min_bins = 3, max_bins = 5)

# View binning results
print(result)

# Plot WoE against bins
barplot(result$woe, names.arg = result$bin, las = 2,
        main = "Weight of Evidence by Bin", ylab = "WoE")
abline(h = 0, lty = 2)
} # }