
Bins a numerical variable using Local Polynomial Density Binning (LPDB), which combines local polynomial density estimation with information-theoretic optimization. The method places bin boundaries according to the density structure of the data while maximizing predictive power for a binary target variable. LPDB is particularly effective for complex distributions with multiple modes or regions of varying density.

Usage

optimal_binning_numerical_lpdb(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  polynomial_degree = 3L,
  enforce_monotonic = TRUE,
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)

Arguments

target

A binary integer vector (0 or 1) representing the target variable.

feature

A numeric vector representing the feature to be binned.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum fraction of all observations that each bin must contain; bins below this frequency are merged (default: 0.05).

max_n_prebins

Maximum number of pre-bins before optimization (default: 20).

polynomial_degree

Degree of polynomial used for density estimation (default: 3).

enforce_monotonic

Logical; whether to enforce a monotonic relationship in WoE across bins (default: TRUE).

convergence_threshold

Convergence threshold for optimization (default: 1e-6).

max_iterations

Maximum iterations allowed (default: 1000).

Value

A list containing:

id

Numeric identifiers for each bin (1-based).

bin

Character vector with bin intervals.

woe

Numeric vector with Weight of Evidence values for each bin.

iv

Numeric vector with Information Value contribution for each bin.

count

Integer vector with the total number of observations in each bin.

count_pos

Integer vector with the positive class count in each bin.

count_neg

Integer vector with the negative class count in each bin.

event_rate

Numeric vector with the event rate (proportion of positives) in each bin.

centroids

Numeric vector with the centroid (mean value) of each bin.

cutpoints

Numeric vector with the bin boundaries (excluding infinities).

converged

Logical indicating whether the algorithm converged.

iterations

Integer count of iterations performed.

total_iv

Numeric total Information Value of the binning solution.

monotonicity

Character string indicating the direction of monotonicity in WoE ("increasing", "decreasing", or "none").
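A common follow-up is to apply the fitted bins to new data. The sketch below assumes that `cutpoints` holds the k - 1 interior boundaries and `woe` holds k values, as described above. The numbers are made up for illustration, and treating each bin as closed on the left is an assumption, so check the interval labels in `bin` before relying on it.

```r
# Hypothetical values standing in for result$cutpoints and result$woe.
cutpoints <- c(-0.5, 0.3, 1.2)
woe <- c(-0.40, -0.10, 0.15, 0.55)

new_feature <- c(-2.0, -0.5, 0.0, 0.9, 3.1)

# findInterval() returns 0 for values below the first cutpoint, so add 1
# to obtain 1-based bin ids. Intervals are [a, b) by default; pass
# left.open = TRUE if the bins are of the form (a, b] instead.
bin_id <- findInterval(new_feature, cutpoints) + 1L

# Replace each observation with the WoE of its bin.
new_woe <- woe[bin_id]
data.frame(new_feature, bin_id, new_woe)
```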

Details

Algorithm Overview

The Local Polynomial Density Binning algorithm runs in five phases:

  1. Density Analysis: Uses polynomial regression techniques to estimate the local density structure of the feature distribution, identifying natural groupings in the data.

  2. Critical Point Detection: Locates important points in the density curve (minima, maxima, and inflection points) as potential bin boundaries.

  3. Initial Binning: Creates preliminary bins based on these critical points, ensuring they respect the natural structure of the data.

  4. Statistical Optimization:

    • Merges bins whose frequency falls below bin_cutoff to ensure statistical reliability

    • Enforces monotonicity in Weight of Evidence (when enforce_monotonic = TRUE)

    • Adjusts the bin count to satisfy min_bins and max_bins

  5. Information Value Calculation: Computes predictive metrics for the final binning solution
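The low-frequency merge in phase 4 can be sketched as follows. This is an illustration under stated assumptions, not the package's internal code: the helper `merge_rare_bins()` is hypothetical, and merging the smallest bin into its smaller neighbor is one plausible rule among several.

```r
# Hypothetical helper: repeatedly merge the smallest bin into an adjacent bin
# until every bin holds at least `bin_cutoff` of the observations or only
# `min_bins` bins remain.
merge_rare_bins <- function(count_pos, count_neg, bin_cutoff = 0.05, min_bins = 3L) {
  total <- sum(count_pos + count_neg)
  while (length(count_pos) > min_bins) {
    counts <- count_pos + count_neg
    i <- which.min(counts)
    if (counts[i] / total >= bin_cutoff) break  # all bins are large enough
    # Choose the neighbor: the only one at an edge, otherwise the smaller one.
    if (i == 1L) {
      j <- 2L
    } else if (i == length(counts)) {
      j <- i - 1L
    } else {
      j <- if (counts[i - 1L] <= counts[i + 1L]) i - 1L else i + 1L
    }
    # Fold bin i into bin j, then drop bin i.
    count_pos[j] <- count_pos[j] + count_pos[i]
    count_neg[j] <- count_neg[j] + count_neg[i]
    count_pos <- count_pos[-i]
    count_neg <- count_neg[-i]
  }
  list(count_pos = count_pos, count_neg = count_neg)
}

# Five pre-bins over 1000 observations; bins 1 and 3 fall below the 5% cutoff.
merged <- merge_rare_bins(
  count_pos = c(5, 40, 2, 60, 30),
  count_neg = c(15, 160, 8, 240, 440)
)
merged$count_pos + merged$count_neg
```

Merging only adjacent bins keeps the intervals contiguous, which matters because the final bins are described by ordered cutpoints.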

Mathematical Foundation

The algorithm rests on four components:

1. Local Polynomial Density Estimation

The density at point \(x\) is estimated with the kernel estimator:

$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$$

Where:

  • \(K\) is a kernel function (Gaussian kernel in this implementation)

  • \(h\) is the bandwidth parameter (calculated using Silverman's rule)

  • \(n\) is the number of observations
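The estimator above can be written directly in base R. The sketch below is illustrative rather than the package's own code: the helper `kde_gaussian()` is hypothetical, and it uses `stats::bw.nrd0()`, base R's version of Silverman's rule of thumb, which may differ in detail from the bandwidth the package computes.

```r
# Hypothetical helper: Gaussian kernel density estimate at the points `x_eval`.
kde_gaussian <- function(x_eval, x, h = stats::bw.nrd0(x)) {
  n <- length(x)
  # For each evaluation point, sum the kernel over all observations
  # and scale by 1 / (n * h), exactly as in the formula.
  vapply(
    x_eval,
    function(x0) sum(stats::dnorm((x0 - x) / h)) / (n * h),
    numeric(1)
  )
}

set.seed(42)
x <- c(rnorm(300, mean = -2), rnorm(300, mean = 2))  # bimodal sample
grid <- seq(min(x) - 3, max(x) + 3, length.out = 512)
f_hat <- kde_gaussian(grid, x)

# Sanity check: a density should integrate to roughly 1 over a wide grid.
area <- sum(f_hat) * diff(grid)[1]
area
```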

2. Critical Point Detection

The algorithm identifies key points in the density curve:

  • Local Minima: Natural boundaries between clusters (density valleys)

  • Inflection Points: Regions where density curvature changes

  • Local Maxima: Centers of high-density regions
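One simple way to find these points on a gridded density is to look for sign changes in its first and second differences. The sketch below assumes that approach; the helper `find_critical_points()` is hypothetical, and the package may detect critical points differently (for example, from fitted polynomial derivatives).

```r
# Hypothetical helper: locate extrema and inflection points of a curve `f`
# sampled on an evenly spaced `grid`.
find_critical_points <- function(grid, f) {
  d1 <- diff(f)    # first differences approximate the slope
  d2 <- diff(d1)   # second differences approximate the curvature
  s1 <- sign(d1)
  s2 <- sign(d2)
  # Slope goes from negative to positive at a minimum,
  # and from positive to negative at a maximum.
  minima <- which(s1[-length(s1)] < 0 & s1[-1] > 0) + 1
  maxima <- which(s1[-length(s1)] > 0 & s1[-1] < 0) + 1
  # Curvature changes sign at an inflection point.
  inflections <- which(s2[-length(s2)] * s2[-1] < 0) + 1
  list(
    minima = grid[minima],
    maxima = grid[maxima],
    inflections = grid[inflections]
  )
}

set.seed(42)
x <- c(rnorm(300, mean = -2), rnorm(300, mean = 2))  # bimodal sample
d <- density(x, bw = "nrd0", kernel = "gaussian", n = 512)
cp <- find_critical_points(d$x, d$y)

# The valley between the two modes is a candidate bin boundary.
cp$minima
```

Finite differences are sensitive to small wiggles in the estimated density, so a practical implementation would likely filter candidates, for instance by capping them at `max_n_prebins`.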

3. Weight of Evidence (WoE) Calculation

For bin \(i\), with Laplace smoothing:

$$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$

Where:

  • \(p_i\): Number of positive cases in bin \(i\)

  • \(P\): Total number of positive cases

  • \(n_i\): Number of negative cases in bin \(i\)

  • \(N\): Total number of negative cases

  • \(\alpha\): Smoothing factor (0.5 in this implementation)

  • \(k\): Number of bins
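The smoothed WoE formula translates into a few lines of R. The helper name `woe_smoothed()` and the example counts are hypothetical; only the formula and the default \(\alpha = 0.5\) come from the description above.

```r
# Hypothetical helper: Laplace-smoothed Weight of Evidence per bin.
woe_smoothed <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)
  # Smoothed share of all positives (negatives) that fall in each bin.
  dist_pos <- (count_pos + alpha) / (sum(count_pos) + k * alpha)
  dist_neg <- (count_neg + alpha) / (sum(count_neg) + k * alpha)
  log(dist_pos / dist_neg)
}

count_pos <- c(10, 30, 60)
count_neg <- c(90, 70, 40)
woe <- woe_smoothed(count_pos, count_neg)
woe

# Smoothing keeps WoE finite even when a bin has no positive cases.
woe_smoothed(c(0, 5), c(5, 5))
```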

4. Information Value (IV)

Each bin contributes \(IV_i\), and the sum over all \(k\) bins measures the overall predictive power:

$$IV_i = \left(\frac{p_i}{P} - \frac{n_i}{N}\right) \times WoE_i$$

$$IV_{total} = \sum_{i=1}^{k} IV_i$$
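As a worked example, the two formulas can be evaluated on a small table of counts. The counts are made up for illustration, and pairing the smoothed WoE with unsmoothed bin shares follows the formulas as written here; the package's internal arithmetic may differ.

```r
# Hypothetical per-bin counts of positives and negatives.
count_pos <- c(10, 30, 60)
count_neg <- c(90, 70, 40)
alpha <- 0.5
k <- length(count_pos)

# Laplace-smoothed WoE per bin.
woe <- log(((count_pos + alpha) / (sum(count_pos) + k * alpha)) /
           ((count_neg + alpha) / (sum(count_neg) + k * alpha)))

# Per-bin IV: difference in class shares, weighted by WoE.
iv <- (count_pos / sum(count_pos) - count_neg / sum(count_neg)) * woe
total_iv <- sum(iv)

data.frame(bin = seq_len(k), woe = round(woe, 4), iv = round(iv, 4))
total_iv
```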

Advantages

  • Adaptive to Data Structure: Places bin boundaries at natural density transitions

  • Handles Complex Distributions: Effective for multimodal or skewed features

  • Information Preservation: Optimizes binning for maximum predictive power

  • Statistical Stability: Ensures sufficient observations in each bin

  • Interpretability: Supports monotonic relationships between feature and target

References

Fan, J., & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. Chapman and Hall.

Loader, C. (1999). Local Regression and Likelihood. Springer-Verlag.

Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. Chapman and Hall.

Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.

Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.

Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Examples

if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)

# Basic usage
result <- optimal_binning_numerical_lpdb(target, feature)
print(result)

# Custom parameters
result_custom <- optimal_binning_numerical_lpdb(
  target = target,
  feature = feature,
  min_bins = 2L,
  max_bins = 8L,
  bin_cutoff = 0.03,
  polynomial_degree = 5L,
  enforce_monotonic = TRUE
)

# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
} # }