Optimal Binning for Numerical Variables using Local Density Binning (LDB)
optimal_binning_numerical_ldb.Rd
Implements the Local Density Binning (LDB) algorithm for optimal binning of numerical variables. This method adapts bin boundaries based on the local density structure of the data while maximizing the predictive relationship with a binary target variable. LDB is particularly effective for features with non-uniform distributions or multiple modes.
Usage
optimal_binning_numerical_ldb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
enforce_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
A binary integer vector (0 or 1) representing the target variable.
- feature
A numeric vector representing the feature to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- enforce_monotonic
Whether to enforce monotonic WoE across bins (default: TRUE).
- convergence_threshold
Convergence threshold for optimization (default: 1e-6).
- max_iterations
Maximum iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with bin intervals.
- woe
Numeric vector with Weight of Evidence values for each bin.
- iv
Numeric vector with Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the positive class count in each bin.
- count_neg
Integer vector with the negative class count in each bin.
- event_rate
Numeric vector with the event rate (proportion of positives) in each bin.
- cutpoints
Numeric vector with the interior bin boundaries (the infinite endpoints are excluded).
- converged
Logical indicating whether the algorithm converged.
- iterations
Integer count of iterations performed.
- total_iv
Numeric total Information Value of the binning solution.
- monotonicity
Character indicating the monotonicity direction ("increasing", "decreasing", or "none").
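The per-bin components (id, bin, woe, iv, count, count_pos, count_neg, event_rate) are parallel vectors with one entry per final bin, so they can be combined into a single summary table. A minimal sketch, assuming result is the list returned by the function:
# Assemble the per-bin components into one summary data frame
bin_summary <- data.frame(
  id         = result$id,
  bin        = result$bin,
  count      = result$count,
  count_pos  = result$count_pos,
  count_neg  = result$count_neg,
  event_rate = result$event_rate,
  woe        = result$woe,
  iv         = result$iv
)
bin_summary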
Details
Algorithm Overview
The Local Density Binning algorithm operates in several phases:
Density Analysis: Analyzes the local density structure of the feature to identify regions of high and low density, placing bin boundaries preferentially at density minima.
Initial Binning: Creates initial bins based on density minima and/or quantiles.
Statistical Optimization:
Merges bins whose frequency falls below the bin_cutoff threshold, for stability (see the sketch after this list)
Enforces monotonicity in Weight of Evidence (optional)
Adjusts the number of bins to satisfy the min_bins and max_bins constraints
Information Value Calculation: Computes the predictive metrics (WoE and IV) for each bin.
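The bin_cutoff merge in the Statistical Optimization phase can be pictured with a minimal R sketch (illustrative only; the merge_small_bins helper below is hypothetical, and the package's compiled implementation may choose which neighbour to absorb differently):
# Repeatedly merge the smallest bin into an adjacent bin until every bin
# holds at least a bin_cutoff fraction of the observations (illustrative).
merge_small_bins <- function(counts, bin_cutoff = 0.05) {
  n_total <- sum(counts)
  while (length(counts) > 1 && min(counts) / n_total < bin_cutoff) {
    i <- which.min(counts)
    j <- if (i == 1) 2 else i - 1      # pick an adjacent bin to absorb it
    counts[j] <- counts[j] + counts[i]
    counts <- counts[-i]
  }
  counts
}
merge_small_bins(c(12, 480, 35, 400, 73), bin_cutoff = 0.05)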
Mathematical Foundation
The algorithm employs several statistical concepts:
1. Kernel Density Estimation
To identify the local density structure:
$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$$
Where:
\(K\) is a kernel function (Gaussian kernel in this implementation)
\(h\) is the bandwidth parameter (selected using Silverman's rule of thumb)
\(n\) is the number of observations
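A minimal R sketch of this step, using the base density() estimator with a Gaussian kernel and Silverman's rule-of-thumb bandwidth (bw = "nrd0"), and taking interior local minima of the estimate as candidate cutpoints (illustrative; the package performs this internally):
# Gaussian KDE with Silverman's rule-of-thumb bandwidth
set.seed(42)
x <- c(rnorm(500, mean = -2), rnorm(500, mean = 2))   # bimodal feature
d <- density(x, kernel = "gaussian", bw = "nrd0")
# Interior local minima of the estimated density are natural cut candidates
local_min <- which(diff(sign(diff(d$y))) == 2) + 1
candidate_cutpoints <- d$x[local_min]
candidate_cutpoints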
2. Weight of Evidence (WoE)
For assessing the predictive power of each bin:
$$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$
Where:
\(p_i\): Number of positive cases in bin \(i\)
\(P\): Total number of positive cases
\(n_i\): Number of negative cases in bin \(i\)
\(N\): Total number of negative cases
\(\alpha\): Smoothing factor (0.5 in this implementation)
\(k\): Number of bins
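A minimal sketch of the smoothed WoE computation, together with the per-bin Information Value contribution in its usual form, \(IV_i = (DistPos_i - DistNeg_i) \times WoE_i\), computed here on the smoothed distributions (the compiled implementation may differ in detail):
# Smoothed WoE and IV per bin, following the formula above (alpha = 0.5)
woe_iv <- function(pos, neg, alpha = 0.5) {
  k <- length(pos)                         # number of bins
  P <- sum(pos); N <- sum(neg)             # total positives / negatives
  dist_pos <- (pos + alpha) / (P + k * alpha)
  dist_neg <- (neg + alpha) / (N + k * alpha)
  woe <- log(dist_pos / dist_neg)
  iv  <- (dist_pos - dist_neg) * woe       # per-bin IV contribution
  list(woe = woe, iv = iv, total_iv = sum(iv))
}
woe_iv(pos = c(30, 55, 90), neg = c(120, 95, 60))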
Advantages of Local Density Binning
Respects Data Structure: Places bin boundaries at natural gaps in the distribution
Adapts to Multimodality: Handles features with multiple modes effectively
Maximizes Information: Optimizes binning for predictive power
Statistical Stability: Ensures sufficient observations in each bin
Interpretability: Produces monotonic WoE patterns when requested
References
Bin, Y., Liang, S., Chen, Z., Yang, S., & Zhang, L. (2019). Density-based supervised discretization for continuous feature. Knowledge-Based Systems, 166, 1-17.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the Twelfth International Conference on Machine Learning, 194-202.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_ldb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_ldb(
target = target,
feature = feature,
min_bins = 2,
max_bins = 8,
bin_cutoff = 0.03,
enforce_monotonic = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
monotonicity <- result$monotonicity
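# Illustrative follow-up (an assumption built only on the documented return
# values): use the interior cutpoints to map raw feature values to bin
# indices and replace them with each bin's WoE.
bin_index <- findInterval(feature, result$cutpoints) + 1
feature_woe <- result$woe[bin_index]
head(feature_woe)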
} # }