Optimal Binning for Numerical Variables using Local Density Binning (LDB)
optimal_binning_numerical_ldb.Rd
Implements the Local Density Binning (LDB) algorithm for optimal binning of numerical variables. This method adapts bin boundaries based on the local density structure of the data while maximizing the predictive relationship with a binary target variable. LDB is particularly effective for features with non-uniform distributions or multiple modes.
Usage
optimal_binning_numerical_ldb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
enforce_monotonic = TRUE,
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
A binary integer vector (0 or 1) representing the target variable.
- feature
A numeric vector representing the feature to be binned.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency fraction for each bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- enforce_monotonic
Whether to enforce monotonic WoE across bins (default: TRUE).
- convergence_threshold
Convergence threshold for optimization (default: 1e-6).
- max_iterations
Maximum iterations allowed (default: 1000).
Value
A list containing:
- id
Numeric identifiers for each bin (1-based).
- bin
Character vector with bin intervals.
- woe
Numeric vector with Weight of Evidence values for each bin.
- iv
Numeric vector with Information Value contribution for each bin.
- count
Integer vector with the total number of observations in each bin.
- count_pos
Integer vector with the positive class count in each bin.
- count_neg
Integer vector with the negative class count in each bin.
- event_rate
Numeric vector with the event rate (proportion of positives) in each bin.
- cutpoints
Numeric vector with the interior bin boundaries (the infinite endpoints are excluded).
- converged
Logical indicating whether the algorithm converged.
- iterations
Integer count of iterations performed.
- total_iv
Numeric total Information Value of the binning solution.
- monotonicity
Character indicating the monotonicity direction ("increasing", "decreasing", or "none").
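The per-bin components (id, bin, woe, iv, count, count_pos, count_neg, event_rate) are parallel vectors with one entry per final bin, so they can be combined into a single summary table. A minimal sketch, assuming result is the list returned by the function:
# Assemble the per-bin components into one summary data frame
bin_summary <- data.frame(
  id         = result$id,
  bin        = result$bin,
  count      = result$count,
  count_pos  = result$count_pos,
  count_neg  = result$count_neg,
  event_rate = result$event_rate,
  woe        = result$woe,
  iv         = result$iv
)
bin_summary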
Details
Algorithm Overview
The Local Density Binning algorithm operates in several phases:
Density Analysis: Analyzes the local density structure of the feature to identify regions of high and low density, placing bin boundaries preferentially at density minima.
Initial Binning: Creates initial bins based on density minima and/or quantiles.
Statistical Optimization:
Merges bins whose frequency falls below the bin_cutoff threshold, for stability (see the sketch after this list)
Enforces monotonicity in Weight of Evidence (optional)
Adjusts the number of bins to satisfy the min_bins and max_bins constraints
Information Value Calculation: Computes the predictive metrics (WoE and IV) for each bin.
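The bin_cutoff merge in the Statistical Optimization phase can be pictured with a minimal R sketch (illustrative only; the merge_small_bins helper below is hypothetical, and the package's compiled implementation may choose which neighbour to absorb differently):
# Repeatedly merge the smallest bin into an adjacent bin until every bin
# holds at least a bin_cutoff fraction of the observations (illustrative).
merge_small_bins <- function(counts, bin_cutoff = 0.05) {
  n_total <- sum(counts)
  while (length(counts) > 1 && min(counts) / n_total < bin_cutoff) {
    i <- which.min(counts)
    j <- if (i == 1) 2 else i - 1      # pick an adjacent bin to absorb it
    counts[j] <- counts[j] + counts[i]
    counts <- counts[-i]
  }
  counts
}
merge_small_bins(c(12, 480, 35, 400, 73), bin_cutoff = 0.05)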
Mathematical Foundation
The algorithm employs several statistical concepts:
1. Kernel Density Estimation
To identify the local density structure:
$$f_h(x) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$$
Where:
\(K\) is a kernel function (Gaussian kernel in this implementation)
\(h\) is the bandwidth parameter (selected using Silverman's rule of thumb)
\(n\) is the number of observations
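A minimal R sketch of this step, using the base density() estimator with a Gaussian kernel and Silverman's rule-of-thumb bandwidth (bw = "nrd0"), and taking interior local minima of the estimate as candidate cutpoints (illustrative; the package performs this internally):
# Gaussian KDE with Silverman's rule-of-thumb bandwidth
set.seed(42)
x <- c(rnorm(500, mean = -2), rnorm(500, mean = 2))   # bimodal feature
d <- density(x, kernel = "gaussian", bw = "nrd0")
# Interior local minima of the estimated density are natural cut candidates
local_min <- which(diff(sign(diff(d$y))) == 2) + 1
candidate_cutpoints <- d$x[local_min]
candidate_cutpoints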
2. Weight of Evidence (WoE)
For assessing the predictive power of each bin:
$$WoE_i = \ln\left(\frac{(p_i + \alpha) / (P + k\alpha)}{(n_i + \alpha) / (N + k\alpha)}\right)$$
Where:
\(p_i\): Number of positive cases in bin \(i\)
\(P\): Total number of positive cases
\(n_i\): Number of negative cases in bin \(i\)
\(N\): Total number of negative cases
\(\alpha\): Smoothing factor (0.5 in this implementation)
\(k\): Number of bins
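A minimal sketch of the smoothed WoE computation, together with the per-bin Information Value contribution in its usual form, \(IV_i = (DistPos_i - DistNeg_i) \times WoE_i\), computed here on the smoothed distributions (the compiled implementation may differ in detail):
# Smoothed WoE and IV per bin, following the formula above (alpha = 0.5)
woe_iv <- function(pos, neg, alpha = 0.5) {
  k <- length(pos)                         # number of bins
  P <- sum(pos); N <- sum(neg)             # total positives / negatives
  dist_pos <- (pos + alpha) / (P + k * alpha)
  dist_neg <- (neg + alpha) / (N + k * alpha)
  woe <- log(dist_pos / dist_neg)
  iv  <- (dist_pos - dist_neg) * woe       # per-bin IV contribution
  list(woe = woe, iv = iv, total_iv = sum(iv))
}
woe_iv(pos = c(30, 55, 90), neg = c(120, 95, 60))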
Advantages of Local Density Binning
Respects Data Structure: Places bin boundaries at natural gaps in the distribution
Adapts to Multimodality: Handles features with multiple modes effectively
Maximizes Information: Optimizes binning for predictive power
Statistical Stability: Ensures sufficient observations in each bin
Interpretability: Produces monotonic WoE patterns when requested
References
Bin, Y., Liang, S., Chen, Z., Yang, S., & Zhang, L. (2019). Density-based supervised discretization for continuous feature. Knowledge-Based Systems, 166, 1-17.
Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
Silverman, B. W. (1986). Density Estimation for Statistics and Data Analysis. Chapman and Hall/CRC.
Dougherty, J., Kohavi, R., & Sahami, M. (1995). Supervised and unsupervised discretization of continuous features. Proceedings of the Twelfth International Conference on Machine Learning, 194-202.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L. C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Examples
if (FALSE) { # \dontrun{
# Generate synthetic data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- rnorm(1000)
# Basic usage
result <- optimal_binning_numerical_ldb(target, feature)
print(result)
# Custom parameters
result_custom <- optimal_binning_numerical_ldb(
target = target,
feature = feature,
min_bins = 2,
max_bins = 8,
bin_cutoff = 0.03,
enforce_monotonic = TRUE
)
# Access specific components
bins <- result$bin
woe_values <- result$woe
total_iv <- result$total_iv
monotonicity <- result$monotonicity
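# Illustrative follow-up (an assumption built only on the documented return
# values): use the interior cutpoints to map raw feature values to bin
# indices and replace them with each bin's WoE.
bin_index <- findInterval(feature, result$cutpoints) + 1
feature_woe <- result$woe[bin_index]
head(feature_woe)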
} # }