Optimal Binning for Numerical Variables using Local Polynomial Density Binning (LPDB)
optimal_binning_numerical_lpdb.Rd
Implements an advanced binning algorithm for numerical variables that combines local polynomial density estimation with information-theoretic optimization. This method adapts bin boundaries to the natural structure of the data while maximizing predictive power for a binary target variable. LPDB is particularly effective for complex distributions with multiple modes or regions of varying density.
Usage
optimal_binning_numerical_lpdb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  polynomial_degree = 3L,
  enforce_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
A binary integer vector (0 or 1) representing the target variable.
- feature
A numeric vector representing the feature to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- polynomial_degree
Degree of polynomial used for density estimation (default: 3).
- enforce_monotonic
Whether to enforce a monotonic relationship in WoE across bins (default: TRUE).
- convergence_threshold
Convergence threshold for optimization (default: 1e-6).
- max_iterations
Maximum iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with bin intervals.
- woe
Numeric vector with Weight of Evidence values for each bin.
- iv
Numeric vector with Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the positive class count in each bin.
- count_neg
Integer vector with the negative class count in each bin.
- event_rate
Numeric vector with the event rate (proportion of positives) in each bin.
- centroids
Numeric vector with the centroid (mean value) of each bin.
- cutpoints
Numeric vector with the bin boundaries (excluding infinities).
- converged
Logical indicating whether the algorithm converged.
- iterations
Integer count of iterations performed.
- total_iv
Numeric total Information Value of the binning solution.
- monotonicity
Character indicating monotonicity direction ("increasing", "decreasing", or "none").
Details
Algorithm Overview
The Local Polynomial Density Binning algorithm operates through several coordinated phases:
Density Analysis: Uses polynomial regression techniques to estimate the local density structure of the feature distribution, identifying natural groupings in the data.
Critical Point Detection: Locates important points in the density curve (minima, maxima, and inflection points) as potential bin boundaries.
Initial Binning: Creates preliminary bins based on these critical points, ensuring they respect the natural structure of the data.
Statistical Optimization:
- Merges bins whose frequency falls below the bin_cutoff threshold to ensure statistical reliability
- Enforces monotonicity in Weight of Evidence (optional)
- Adjusts the bin count to meet the min_bins and max_bins constraints
Information Value Calculation: Computes predictive metrics for the final binning solution.
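The frequency-merge step described above can be sketched as follows (shown in Python for concision; `merge_small_bins` is a hypothetical helper illustrating the idea, not the package's internal code):

```python
def merge_small_bins(counts, cutpoints, bin_cutoff=0.05):
    """Repeatedly merge the smallest bin into a neighbor until every
    bin holds at least a bin_cutoff fraction of all observations."""
    counts, cutpoints = list(counts), list(cutpoints)
    total = sum(counts)  # merging preserves the total count
    while len(counts) > 1 and min(counts) / total < bin_cutoff:
        i = counts.index(min(counts))
        j = i - 1 if i > 0 else i + 1          # neighbor to absorb into
        counts[min(i, j)] += counts[max(i, j)]
        del counts[max(i, j)]
        del cutpoints[min(i, j)]               # drop the boundary between them
    return counts, cutpoints
```

For example, with counts (2, 50, 48) and a 5% cutoff, the 2-observation bin is merged into its neighbor, leaving two bins of 52 and 48.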
Mathematical Foundation
The algorithm employs several advanced statistical concepts:
1. Local Polynomial Density Estimation
For density estimation at point \(x\):
$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$$
Where:
- \(K\) is a kernel function (a Gaussian kernel in this implementation)
- \(h\) is the bandwidth parameter (calculated using Silverman's rule)
- \(n\) is the number of observations
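The estimator above maps directly to code; a minimal sketch (in Python for concision, using the Gaussian kernel and the normal-reference form of Silverman's rule stated in the text; `kde` and `silverman_bandwidth` are illustrative names, not the package's API):

```python
import math

def silverman_bandwidth(xs):
    # Normal-reference form of Silverman's rule: h = 1.06 * sigma * n^(-1/5)
    n = len(xs)
    mean = sum(xs) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return 1.06 * sigma * n ** (-0.2)

def kde(x, xs, h=None):
    # f_h(x) = (1 / (n * h)) * sum_i K((x - x_i) / h), with Gaussian K
    n = len(xs)
    h = h if h is not None else silverman_bandwidth(xs)
    gauss = lambda u: math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    return sum(gauss((x - xi) / h) for xi in xs) / (n * h)
```

Evaluating `kde` on a grid of points yields the density curve whose critical points are analyzed in the next step.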
2. Critical Point Detection
The algorithm identifies key points in the density curve:
- Local Minima: natural boundaries between clusters (density valleys)
- Inflection Points: regions where the density curvature changes
- Local Maxima: centers of high-density regions
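On a grid of density values, valleys and peaks can be located by comparing each point with its neighbors; a minimal sketch (Python, inflection points omitted for brevity; `critical_points` is an illustrative name, not the package's internal function):

```python
def critical_points(grid, dens):
    """Classify interior grid points as local minima (candidate bin
    boundaries) or local maxima (cluster centers) of the density."""
    minima, maxima = [], []
    for i in range(1, len(dens) - 1):
        if dens[i] < dens[i - 1] and dens[i] < dens[i + 1]:
            minima.append(grid[i])   # density valley
        elif dens[i] > dens[i - 1] and dens[i] > dens[i + 1]:
            maxima.append(grid[i])   # density peak
    return minima, maxima
```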
3. Weight of Evidence (WoE) Calculation
For bin \(i\), with Laplace smoothing:
$$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$
Where:
- \(p_i\): number of positive cases in bin \(i\)
- \(P\): total number of positive cases
- \(n_i\): number of negative cases in bin \(i\)
- \(N\): total number of negative cases
- \(\alpha\): smoothing factor (0.5 in this implementation)
- \(k\): number of bins
Advantages
- Adaptive to Data Structure: places bin boundaries at natural density transitions
- Handles Complex Distributions: effective for multimodal or skewed features
- Information Preservation: optimizes binning for maximum predictive power
- Statistical Stability: ensures sufficient observations in each bin
- Interpretability: supports monotonic relationships between feature and target
References
Fan, J., & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall.
Loader, C. (1999). Local Regression and Likelihood. Springer-Verlag.
Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_lpdb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_lpdb(
  target = target,
  feature = feature,
  min_bins = 2,
  max_bins = 8,
  bin_cutoff = 0.03,
  polynomial_degree = 5,
  enforce_monotonic = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
} # }