Optimal Binning for Numerical Variables using Unsupervised Binning with Standard Deviation
This function bins a numerical variable using Unsupervised Binning based on Standard Deviation (UBSD): initial bins are derived from the mean and standard deviation of the feature, then refined using Weight of Evidence (WoE) and Information Value (IV) computed against a binary target. The algorithm aims to produce interpretable bins with strong predictive power while enforcing monotonic WoE values across bins.
Usage
optimal_binning_numerical_ubsd(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
convergence_threshold = 1e-06,
max_iterations = 1000L,
laplace_smoothing = 0.5
)
Arguments
- target
A numeric vector of binary target values (should contain exactly two unique values: 0 and 1).
- feature
A numeric vector of feature values to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of observations required in each bin; bins below this frequency are merged (default: 0.05).
- max_n_prebins
Maximum number of pre-bins for initial standard deviation-based discretization (default: 20).
- convergence_threshold
Threshold for convergence of the total IV (default: 1e-6).
- max_iterations
Maximum number of iterations for the algorithm (default: 1000).
- laplace_smoothing
Smoothing parameter for WoE calculation (default: 0.5).
Value
A list containing the following elements:
- id
A numeric vector of bin identifiers (1-based).
- bin
A character vector of bin names.
- woe
A numeric vector of Weight of Evidence values for each bin.
- iv
A numeric vector of Information Value contributions, one per bin.
- count
An integer vector of the total count of observations in each bin.
- count_pos
An integer vector of the count of positive observations in each bin.
- count_neg
An integer vector of the count of negative observations in each bin.
- event_rate
A numeric vector of the proportion of positive cases in each bin.
- cutpoints
A numeric vector of cut points used to generate the bins.
- total_iv
A numeric value of the total Information Value.
- converged
A logical value indicating whether the algorithm converged.
- iterations
An integer value indicating the number of iterations run.
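As a rough illustration of how the `cutpoints` and `woe` elements might be used together, the sketch below maps new feature values to WoE values. The `result` list here is a hypothetical stand-in with made-up numbers, not real output of the function. It also assumes that there is one more bin than there are cut points and that bins are left-closed; the labels in `bin` should be checked to confirm the actual interval convention.

```r
# Hypothetical stand-in for two elements of the returned list (values are made up)
result <- list(
  cutpoints = c(-1, 0, 1),
  woe = c(-0.6, -0.2, 0.2, 0.6)
)

new_feature <- c(-2.3, -0.4, 0.1, 1.7)

# Assumption: bin k covers [cutpoints[k - 1], cutpoints[k]), with open outer bins.
# findInterval() counts how many cut points are <= x, so adding 1 gives a 1-based bin index.
bin_index <- findInterval(new_feature, result$cutpoints) + 1L

# Look up the WoE value of the bin each observation falls into
new_woe <- result$woe[bin_index]

print(data.frame(feature = new_feature, bin = bin_index, woe = new_woe))
```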
Details
Mathematical Framework:
Weight of Evidence (WoE): For a bin \(i\) with Laplace smoothing parameter \(\alpha\):
$$WoE_i = \ln\left(\frac{n_{1i} + \alpha}{n_{1} + m\alpha} \cdot \frac{n_{0} + m\alpha}{n_{0i} + \alpha}\right)$$
Where:
\(n_{1i}\) is the count of positive cases in bin \(i\)
\(n_{0i}\) is the count of negative cases in bin \(i\)
\(n_{1}\) is the total count of positive cases
\(n_{0}\) is the total count of negative cases
\(m\) is the number of bins
\(\alpha\) is the Laplace smoothing parameter
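The smoothed WoE formula can be checked by hand with a few lines of R. This is a minimal sketch of the formula as written above, not the package's internal code; `smoothed_woe` is a hypothetical helper and the counts are made up for illustration.

```r
# Hypothetical helper: smoothed WoE per bin from positive/negative counts
smoothed_woe <- function(count_pos, count_neg, alpha = 0.5) {
  stopifnot(length(count_pos) == length(count_neg), alpha >= 0)
  m  <- length(count_pos)   # number of bins
  n1 <- sum(count_pos)      # total positive cases
  n0 <- sum(count_neg)      # total negative cases
  dist_pos <- (count_pos + alpha) / (n1 + m * alpha)  # smoothed share of positives in each bin
  dist_neg <- (count_neg + alpha) / (n0 + m * alpha)  # smoothed share of negatives in each bin
  log(dist_pos / dist_neg)
}

# Illustrative counts for three bins (not real data)
count_pos <- c(20, 50, 80)
count_neg <- c(180, 150, 120)
woe <- smoothed_woe(count_pos, count_neg)
print(round(woe, 4))
```

With \(\alpha > 0\), a bin that contains no positive or no negative cases still yields a finite WoE, which is the practical reason for the smoothing term.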
Information Value (IV): Summarizes predictive power across all \(m\) bins:
$$IV = \sum_{i=1}^{m} \left(P(X \in \text{bin}_i \mid Y=1) - P(X \in \text{bin}_i \mid Y=0)\right) \times WoE_i$$
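A small sketch of the IV sum follows. It assumes the two conditional probabilities are estimated with the same Laplace-smoothed proportions that appear in the WoE formula; the source does not state whether the implementation smooths them, so treat that as an assumption. `bin_iv` is a hypothetical helper and the counts are illustrative.

```r
# Hypothetical helper: per-bin IV contributions from positive/negative counts
bin_iv <- function(count_pos, count_neg, alpha = 0.5) {
  stopifnot(length(count_pos) == length(count_neg))
  m <- length(count_pos)
  # Assumption: P(bin | Y = 1) and P(bin | Y = 0) use the smoothed proportions
  dist_pos <- (count_pos + alpha) / (sum(count_pos) + m * alpha)
  dist_neg <- (count_neg + alpha) / (sum(count_neg) + m * alpha)
  woe <- log(dist_pos / dist_neg)
  (dist_pos - dist_neg) * woe
}

# Illustrative counts for three bins (not real data)
iv <- bin_iv(c(20, 50, 80), c(180, 150, 120))
total_iv <- sum(iv)
print(round(iv, 4))
print(round(total_iv, 4))
```

Each term is non-negative because the difference in proportions and the log ratio always share the same sign, so the total IV can only grow as bins separate the two classes more strongly.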
Algorithm Steps:
Initial Binning: Create bins using statistical properties of the data (mean and standard deviation)
Merge Small Bins: Combine bins with frequency below the threshold to ensure statistical stability
Calculate WoE/IV: Compute Weight of Evidence and Information Value with Laplace smoothing
Enforce Monotonicity: Merge bins to ensure monotonic relationship between feature and target
Adjust Bin Count: Ensure the number of bins is within the specified range
Validate Bins: Perform statistical checks on the final binning solution
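The first two steps and the monotonicity check can be sketched in plain R. This is an illustration under stated assumptions, not the package's implementation: the rule of placing candidate cut points at the mean plus or minus whole multiples of the standard deviation, the choice of which neighbour to merge into, and the helper `assign_bins` are all assumptions made for the example.

```r
set.seed(42)
feature <- rnorm(1000)
target  <- rbinom(1000, 1, plogis(0.5 * feature))

# Step 1 (assumed rule): candidate cut points at mean + k * sd
cuts <- mean(feature) + (-2:2) * sd(feature)

# Hypothetical helper: 1-based bin index, bins are [cuts[k - 1], cuts[k])
assign_bins <- function(x, cuts) findInterval(x, cuts) + 1L

# Step 2: while any bin holds less than bin_cutoff of the data,
# merge the smallest such bin into a neighbour by removing one cut point
bin_cutoff <- 0.05
repeat {
  bin_id <- assign_bins(feature, cuts)
  freq   <- tabulate(bin_id, nbins = length(cuts) + 1L) / length(feature)
  small  <- which(freq < bin_cutoff)
  if (length(small) == 0L || length(cuts) == 0L) break
  i <- small[which.min(freq[small])]
  drop <- if (i == 1L) {
    1L                                   # first bin: merge with right neighbour
  } else if (i > length(cuts)) {
    length(cuts)                         # last bin: merge with left neighbour
  } else if (freq[i - 1L] <= freq[i + 1L]) {
    i - 1L                               # merge with the smaller left neighbour
  } else {
    i                                    # merge with the smaller right neighbour
  }
  cuts <- cuts[-drop]
}

# Monotonicity check on the event rate of the remaining bins
event_rate   <- tapply(target, bin_id, mean)
is_monotonic <- all(diff(event_rate) >= 0) || all(diff(event_rate) <= 0)

print(round(cuts, 3))
print(round(freq, 3))
print(is_monotonic)
```

For roughly normal data the two tail bins beyond two standard deviations each hold only a few percent of the observations, so with `bin_cutoff = 0.05` they are the first to be merged.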
Examples
if (FALSE) { # \dontrun{
  # Generate sample data
  set.seed(123)
  n <- 10000
  feature <- rnorm(n)
  target <- rbinom(n, 1, plogis(0.5 * feature))

  # Apply optimal binning
  result <- optimal_binning_numerical_ubsd(target, feature, min_bins = 3, max_bins = 5)

  # View binning results
  print(result)

  # Plot WoE against bins
  barplot(result$woe, names.arg = result$bin, las = 2,
          main = "Weight of Evidence by Bin", ylab = "WoE")
  abline(h = 0, lty = 2)
} # }