Optimal Binning for Categorical Variables using Information Value Dynamic Programming
optimal_binning_categorical_ivb.Rd
Implements optimal binning for categorical variables using a dynamic programming approach to maximize Information Value (IV). The algorithm finds the globally optimal binning solution within the constraints of minimum and maximum bin counts.
Usage
optimal_binning_categorical_ivb(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector or factor containing the categorical values of the explanatory variable.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum relative frequency for a category to remain a separate bin; rarer categories are merged (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20).
- bin_separator
Separator for merged category names (default: "%;%").
- convergence_threshold
Convergence threshold for IV (default: 1e-6).
- max_iterations
Maximum number of iterations in the search for the optimal solution (default: 1000).
Value
A list containing:
id: Numeric vector of bin identifiers.
bin: Character vector with the names of the formed bins.
woe: Numeric vector with the Weight of Evidence (WoE) of each bin.
iv: Numeric vector with the Information Value (IV) of each bin.
count: Numeric vector with the total number of observations in each bin.
count_pos: Numeric vector with the count of positive cases (target = 1) in each bin.
count_neg: Numeric vector with the count of negative cases (target = 0) in each bin.
total_iv: Total Information Value of the binning.
converged: Logical value indicating whether the algorithm converged.
iterations: Number of iterations performed.
Details
This implementation uses dynamic programming to find the optimal set of bins that maximizes the total Information Value. The algorithm guarantees a global optimum within the constraints of minimum and maximum bin counts.
The mathematical formulation of the dynamic programming algorithm is:
$$DP[i][k] = \max_{j<i} \{DP[j][k-1] + IV(j+1,i)\}$$
where:
\(DP[i][k]\) is the maximum IV achievable using k bins for the first i categories
\(IV(j+1,i)\) is the IV of a bin containing categories from index j+1 to i
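The recurrence can be sketched directly in R. The code below is illustrative only (bin_iv and dp_binning are not the package's internals) and assumes the categories arrive pre-sorted by event rate, with per-category positive and negative counts:
# Illustrative DP over n pre-sorted categories. bin_iv() transcribes the
# unsmoothed IV of one bin covering categories lo..hi; the package itself
# applies smoothing (see below).
bin_iv <- function(lo, hi, pos, neg, N_pos, N_neg) {
  p <- sum(pos[lo:hi]) / N_pos        # share of all positives in the bin
  q <- sum(neg[lo:hi]) / N_neg        # share of all negatives in the bin
  if (p == 0 || q == 0) return(0)     # guard against log(0) in this sketch
  (p - q) * log(p / q)
}
dp_binning <- function(pos, neg, max_bins) {
  n <- length(pos)
  stopifnot(n >= max_bins)            # sketch assumes enough categories
  N_pos <- sum(pos); N_neg <- sum(neg)
  DP <- matrix(-Inf, nrow = n, ncol = max_bins)
  for (i in seq_len(n))               # base case: a single bin over 1..i
    DP[i, 1] <- bin_iv(1, i, pos, neg, N_pos, N_neg)
  for (k in seq_len(max_bins)[-1]) {  # DP[i,k] = max_j DP[j,k-1] + IV(j+1,i)
    for (i in k:n) {
      for (j in (k - 1):(i - 1)) {
        cand <- DP[j, k - 1] + bin_iv(j + 1, i, pos, neg, N_pos, N_neg)
        if (cand > DP[i, k]) DP[i, k] <- cand
      }
    }
  }
  DP                                  # DP[n, k]: best total IV with k bins
}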
The Weight of Evidence (WoE) for a bin is defined as:
$$WoE_i = \ln\left(\frac{n^+_i/N^+}{n^-_i/N^-}\right)$$
where:
\(n^+_i\) is the number of positive cases in bin i
\(n^-_i\) is the number of negative cases in bin i
\(N^+\) is the total number of positive cases
\(N^-\) is the total number of negative cases
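As a direct transcription of this formula (without the smoothing discussed below), WoE can be computed from raw bin counts:
# Unsmoothed WoE per bin, computed straight from the definition above
woe <- function(count_pos, count_neg) {
  log((count_pos / sum(count_pos)) / (count_neg / sum(count_neg)))
}
woe(count_pos = c(30, 50, 20), count_neg = c(300, 400, 200))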
The Information Value (IV) is:
$$IV = \sum_{i=1}^{n} (p_i - q_i) \times WoE_i$$
where:
\(p_i = n^+_i/N^+\) is the proportion of positive cases in bin i
\(q_i = n^-_i/N^-\) is the proportion of negative cases in bin i
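Continuing the toy counts above, each bin's IV contribution and the total follow directly:
# Per-bin IV terms (p_i - q_i) * WoE_i and their sum
iv_terms <- function(count_pos, count_neg) {
  p <- count_pos / sum(count_pos)
  q <- count_neg / sum(count_neg)
  (p - q) * log(p / q)
}
sum(iv_terms(c(30, 50, 20), c(300, 400, 200)))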
This algorithm employs Bayesian smoothing for improved stability with small sample sizes or rare categories. The smoothing applies pseudo-counts based on the overall class prevalence.
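The exact pseudo-count scheme is internal to the package; one common variant consistent with this description adds alpha pseudo-observations per bin, split according to the overall prevalence:
# Prevalence-based pseudo-counts (illustrative scheme; the package's exact
# smoothing constants may differ)
smoothed_woe <- function(count_pos, count_neg, alpha = 0.5) {
  prev <- sum(count_pos) / (sum(count_pos) + sum(count_neg))
  sp <- count_pos + alpha * prev        # smoothed positive counts
  sn <- count_neg + alpha * (1 - prev)  # smoothed negative counts
  log((sp / sum(sp)) / (sn / sum(sn)))
}
smoothed_woe(c(0, 50, 20), c(300, 400, 200))  # finite even with a zero count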
The algorithm includes these main steps:
Preprocess data and calculate category statistics
Merge rare categories based on frequency threshold
Sort categories by event rate for monotonicity
Run dynamic programming to find optimal bin boundaries
Apply post-processing to ensure monotonicity of WoE
Calculate final WoE and IV values for each bin
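The first three steps can be sketched as follows (prebin and RARE_POOL are illustrative names, not the package's internals):
# Per-category stats, pooling of rare categories, and event-rate sorting
prebin <- function(target, feature, bin_cutoff = 0.05) {
  stats <- aggregate(target, by = list(category = feature),
                     FUN = function(y) c(n = length(y), pos = sum(y)))
  n   <- stats$x[, "n"]
  pos <- stats$x[, "pos"]
  bin <- as.character(stats$category)
  bin[n / sum(n) < bin_cutoff] <- "RARE_POOL"   # merge rare categories
  out <- aggregate(cbind(n, pos), by = list(bin = bin), FUN = sum)
  out$event_rate <- out$pos / out$n
  out[order(out$event_rate), ]                  # sorted for contiguous bins
}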
Advantages over greedy approaches:
Guaranteed global optimum (within bin count constraints)
Better handling of complex patterns in the data
More stable results with small sample sizes
References
Beltrami, M., Mach, M., & Dall'Aglio, M. (2021). Monotonic Optimal Binning Algorithm for Credit Risk Modeling. Risks, 9(3), 58.
Siddiqi, N. (2006). Credit risk scorecards: developing and implementing intelligent credit scoring (Vol. 3). John Wiley & Sons.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Lin, X., Wang, G., & Zhang, T. (2022). Efficient monotonic binning for predictive modeling in high-dimensional spaces. Knowledge-Based Systems, 235, 107629.
Fisher, W. D. (1958). On grouping for maximum homogeneity. Journal of the American Statistical Association, 53(284), 789-798.
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.
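Examples
A minimal call on simulated data (illustrative values; the returned fields follow the Value section above):
# 1000 observations over five categories with distinct event rates
set.seed(123)
feature <- sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE)
rates   <- c(A = 0.05, B = 0.10, C = 0.20, D = 0.35, E = 0.50)
target  <- rbinom(1000, 1, rates[feature])
result <- optimal_binning_categorical_ivb(target, feature,
                                          min_bins = 2L, max_bins = 4L)
result$bin       # merged category names, joined by "%;%"
result$woe       # Weight of Evidence per bin
result$total_iv  # total Information Value of the binning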