Optimal Binning for Numerical Variables using Local Polynomial Density Binning (LPDB)
optimal_binning_numerical_lpdb.Rd
Implements an advanced binning algorithm for numerical variables that combines local polynomial density estimation with information-theoretic optimization. This method adapts bin boundaries to the natural structure of the data while maximizing predictive power for a binary target variable. LPDB is particularly effective for complex distributions with multiple modes or regions of varying density.
Usage
optimal_binning_numerical_lpdb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  polynomial_degree = 3L,
  enforce_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
A binary integer vector (0 or 1) representing the target variable.
- feature
A numeric vector representing the feature to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- polynomial_degree
Degree of polynomial used for density estimation (default: 3).
- enforce_monotonic
Whether to enforce a monotonic relationship in WoE across bins (default: TRUE).
- convergence_threshold
Convergence threshold for optimization (default: 1e-6).
- max_iterations
Maximum iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with bin intervals.
- woe
Numeric vector with Weight of Evidence values for each bin.
- iv
Numeric vector with Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the positive class count in each bin.
- count_neg
Integer vector with the negative class count in each bin.
- event_rate
Numeric vector with the event rate (proportion of positives) in each bin.
- centroids
Numeric vector with the centroid (mean value) of each bin.
- cutpoints
Numeric vector with the bin boundaries (excluding infinities).
- converged
Logical indicating whether the algorithm converged.
- iterations
Integer count of iterations performed.
- total_iv
Numeric total Information Value of the binning solution.
- monotonicity
Character indicating monotonicity direction ("increasing", "decreasing", or "none").
Details
Algorithm Overview
The Local Polynomial Density Binning algorithm operates through several coordinated phases:
Density Analysis: Uses polynomial regression techniques to estimate the local density structure of the feature distribution, identifying natural groupings in the data.
Critical Point Detection: Locates important points in the density curve (minima, maxima, and inflection points) as potential bin boundaries.
Initial Binning: Creates preliminary bins based on these critical points, ensuring they respect the natural structure of the data.
Statistical Optimization:
- Merges bins whose frequency falls below the bin_cutoff threshold to ensure statistical reliability
- Enforces monotonicity in Weight of Evidence (optional)
- Adjusts the bin count to meet the min_bins and max_bins constraints
Information Value Calculation: Computes predictive metrics for the final binning solution.
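The frequency-merge step described above can be sketched as follows (shown in Python for concision; `merge_small_bins` is a hypothetical helper illustrating the idea, not the package's internal code):

```python
def merge_small_bins(counts, cutpoints, bin_cutoff=0.05):
    """Repeatedly merge the smallest bin into a neighbor until every
    bin holds at least a bin_cutoff fraction of all observations."""
    counts, cutpoints = list(counts), list(cutpoints)
    total = sum(counts)  # merging preserves the total count
    while len(counts) > 1 and min(counts) / total < bin_cutoff:
        i = counts.index(min(counts))
        j = i - 1 if i > 0 else i + 1          # neighbor to absorb into
        counts[min(i, j)] += counts[max(i, j)]
        del counts[max(i, j)]
        del cutpoints[min(i, j)]               # drop the boundary between them
    return counts, cutpoints
```

For example, with counts (2, 50, 48) and a 5% cutoff, the 2-observation bin is merged into its neighbor, leaving two bins of 52 and 48.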
Mathematical Foundation
The algorithm employs several advanced statistical concepts:
1. Local Polynomial Density Estimation
For density estimation at point \(x\):
$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$$
Where:
- \(K\) is a kernel function (a Gaussian kernel in this implementation)
- \(h\) is the bandwidth parameter (calculated using Silverman's rule)
- \(n\) is the number of observations
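The estimator above maps directly to code; a minimal sketch (in Python for concision, using the Gaussian kernel and the normal-reference form of Silverman's rule stated in the text; `kde` and `silverman_bandwidth` are illustrative names, not the package's API):

```python
import math

def silverman_bandwidth(xs):
    # Normal-reference form of Silverman's rule: h = 1.06 * sigma * n^(-1/5)
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def kde(x, xs, h=None):
    # f_h(x) = (1 / (n * h)) * sum_i K((x - x_i) / h), with Gaussian K
    n = len(xs)
    h = h if h is not None else silverman_bandwidth(xs)
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gauss((x - xi) / h) for xi in xs) / (n * h)
```

Evaluating `kde` on a grid of points yields the density curve whose critical points are analyzed in the next step.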
2. Critical Point Detection
The algorithm identifies key points in the density curve:
- Local Minima: natural boundaries between clusters (density valleys)
- Inflection Points: regions where the density curvature changes
- Local Maxima: centers of high-density regions
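On a grid of density values, valleys and peaks can be located by comparing each point with its neighbors; a minimal sketch (Python, inflection points omitted for brevity; `critical_points` is an illustrative name, not the package's internal function):

```python
def critical_points(grid, dens):
    """Classify interior grid points as local minima (candidate bin
    boundaries) or local maxima (cluster centers) of the density."""
    minima, maxima = [], []
    for i in range(1, len(dens) - 1):
        if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]:
            minima.append(grid[i])   # density valley
        elif dens[i] > dens[i - 1] and dens[i] > dens[i + 1]:
            maxima.append(grid[i])   # density peak
    return minima, maxima
```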
3. Weight of Evidence (WoE) Calculation
For bin \(i\), with Laplace smoothing:
$$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$
Where:
- \(p_i\): number of positive cases in bin \(i\)
- \(P\): total number of positive cases
- \(n_i\): number of negative cases in bin \(i\)
- \(N\): total number of negative cases
- \(\alpha\): smoothing factor (0.5 in this implementation)
- \(k\): number of bins
Advantages
- Adaptive to Data Structure: places bin boundaries at natural density transitions
- Handles Complex Distributions: effective for multimodal or skewed features
- Information Preservation: optimizes binning for maximum predictive power
- Statistical Stability: ensures sufficient observations in each bin
- Interpretability: supports monotonic relationships between feature and target
References
Fan, J., & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall.
Loader, C. (1999). Local Regression and Likelihood. Springer-Verlag.
Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_lpdb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_lpdb(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 8,
  bin_cutoff = 0.03,
  polynomial_degree = 5,
  enforce_monotonic = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
} # }