Optimal Binning for Categorical Variables using Similarity-Based Logistic Partitioning (SBLP)
optimal_binning_categorical_sblp.Rd
This function implements an advanced optimal binning algorithm for categorical variables using a Similarity-Based Logistic Partitioning (SBLP) approach. It groups the categories of a predictor into bins that maximize the Information Value (IV) while maintaining monotonicity with respect to target rates. The algorithm is designed to handle edge cases such as rare categories, missing values, and numerical instability.
Usage
optimal_binning_categorical_sblp(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  bin_separator = "%;%",
  alpha = 0.5
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector with the categories of the explanatory variable.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency proportion for a category to be kept as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before the partitioning process (default: 20).
- convergence_threshold
Threshold for algorithm convergence (default: 1e-6).
- max_iterations
Maximum number of iterations of the algorithm (default: 1000).
- bin_separator
Separator used to concatenate category names within bins (default: "%;%").
- alpha
Laplace smoothing parameter for WoE/IV calculation (default: 0.5).
Value
A list containing:
- id: Numeric vector of bin identifiers.
- bin: Character vector of bin names (concatenated categories).
- woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
- iv: Numeric vector of Information Value (IV) contributions for each bin.
- count: Integer vector of total observation counts per bin.
- count_pos: Integer vector of positive-case counts (target = 1) per bin.
- count_neg: Integer vector of negative-case counts (target = 0) per bin.
- rate: Numeric vector of event rates per bin.
- total_iv: Total Information Value (IV) of the binning solution.
- converged: Logical value indicating whether the algorithm converged.
- iterations: Integer number of iterations executed.
Details
The SBLP algorithm operates in several phases:
Preprocessing Phase (see the sketch after this list):
- Computes initial counts and target rates for each category
- Handles missing values by creating a special "MISSING" category
- Identifies and merges rare categories (frequency < bin_cutoff) based on similarity in target rates
- Ensures the number of pre-bins does not exceed max_n_prebins by merging similar categories
- Sorts categories by target rate for monotonic processing
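A minimal base-R sketch of the rare-category merge step. The helper name merge_rare_categories and its exact merge rule are illustrative assumptions, not the package's internal code:

# Hypothetical illustration: absorb each rare category (frequency below
# bin_cutoff) into a frequent category with the closest target rate.
# Assumes at least one category clears bin_cutoff.
merge_rare_categories <- function(feature, target, bin_cutoff = 0.05) {
  counts <- table(feature)                      # observations per category
  rates  <- tapply(target, feature, mean)       # event rate per category
  rare   <- names(counts)[counts / length(feature) < bin_cutoff]
  common <- setdiff(names(counts), rare)
  mapping <- setNames(names(counts), names(counts))
  for (cat in rare) {
    # choose the frequent category whose event rate is most similar
    mapping[cat] <- common[which.min(abs(rates[common] - rates[cat]))]
  }
  unname(mapping[feature])                      # remapped feature vector
}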
Optimal Binning Phase (see the sketch after this list):
- Uses dynamic programming to find the optimal partitioning that maximizes total IV
- The DP recurrence relation is: \(DP[i,j] = \max_{s<i}(DP[s,j-1] + IV(\{s+1,...,i\}))\)
- Implements the DP table efficiently with O(nk) space complexity
- Ensures at least min_bins and at most max_bins through bin splitting/merging
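A compact sketch of the recurrence above, assuming the categories are already sorted by event rate and that n >= k. iv_of_range is a hypothetical helper returning the IV of a single candidate bin; neither name is part of the package API:

# DP[i, j] = max over s < i of DP[s, j-1] + IV({s+1, ..., i})
optimal_partition <- function(pos, neg, k, iv_of_range) {
  n   <- length(pos)                        # number of sorted categories
  dp  <- matrix(-Inf, nrow = n, ncol = k)   # dp[i, j]: best IV, i cats, j bins
  cut <- matrix(0L, nrow = n, ncol = k)     # back-pointers for reconstruction
  for (i in seq_len(n)) dp[i, 1] <- iv_of_range(1, i, pos, neg)
  for (j in seq_len(k)[-1]) {               # j = 2, ..., k
    for (i in j:n) {
      for (s in (j - 1):(i - 1)) {
        cand <- dp[s, j - 1] + iv_of_range(s + 1, i, pos, neg)
        if (cand > dp[i, j]) { dp[i, j] <- cand; cut[i, j] <- s }
      }
    }
  }
  list(total_iv = dp[n, k], cut = cut)
}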
Refinement Phase (see the sketch after this list):
- Checks for monotonicity in WoE/target rates
- Adjusts the binning if needed to ensure monotonicity while preserving min_bins
- Uses a two-attempt approach: first merges adjacent non-monotonic bins, then, if still needed, falls back to force-sorting all categories by rate
- Iteratively refines the solution until convergence or max_iterations
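One way to implement the first of the two attempts, shown as a hedged sketch: merge adjacent rate-inverted bins until the event rates are non-decreasing or min_bins is reached. enforce_monotonic is an illustrative name, not the package's internal function:

enforce_monotonic <- function(count_pos, count_neg, min_bins = 3L) {
  repeat {
    rate <- count_pos / (count_pos + count_neg)
    bad  <- which(diff(rate) < 0)[1]        # first monotonicity violation
    if (is.na(bad) || length(count_pos) <= min_bins) break
    keep <- seq_len(bad - 1)                # bins before the violating pair
    count_pos <- c(count_pos[keep], count_pos[bad] + count_pos[bad + 1],
                   count_pos[-seq_len(bad + 1)])
    count_neg <- c(count_neg[keep], count_neg[bad] + count_neg[bad + 1],
                   count_neg[-seq_len(bad + 1)])
  }
  list(count_pos = count_pos, count_neg = count_neg)
}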
Output Generation Phase:
- Applies Laplace smoothing to handle potential zero counts
- Calculates final WoE and IV values with enhanced numerical stability
- Builds the output with comprehensive bin statistics
Mathematical Formulation:
With Laplace smoothing, Weight of Evidence (WoE) is calculated as: $$WoE_i = \ln\left(\frac{P(Bin_i|Y=1)}{P(Bin_i|Y=0)}\right) = \ln\left(\frac{(n_{1i} + \alpha)/(n_1 + \alpha k)}{(n_{0i} + \alpha)/(n_0 + \alpha k)}\right)$$
Information Value (IV) for a bin is calculated as: $$IV_i = (P(Bin_i|Y=1) - P(Bin_i|Y=0)) \times WoE_i = \left(\frac{n_{1i} + \alpha}{n_1 + \alpha k} - \frac{n_{0i} + \alpha}{n_0 + \alpha k}\right) \times WoE_i$$
Total Information Value is the sum of IV over all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
Where:
\(n_{1i}\) is the number of events (target=1) in bin i
\(n_{0i}\) is the number of non-events (target=0) in bin i
\(n_1\) is the total number of events
\(n_0\) is the total number of non-events
\(\alpha\) is the Laplace smoothing parameter
\(k\) is the number of bins
The algorithm aims to maximize \(IV_{total}\) while ensuring a monotonic relationship between bins and target rates.
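The smoothed formulas above can be verified directly in R; the counts below are made-up illustrative values, not output of the function:

count_pos <- c(30, 80, 140)                   # n_1i per bin
count_neg <- c(170, 220, 360)                 # n_0i per bin
alpha <- 0.5
k  <- length(count_pos)
n1 <- sum(count_pos)                          # total events
n0 <- sum(count_neg)                          # total non-events
p1 <- (count_pos + alpha) / (n1 + alpha * k)  # P(Bin_i | Y = 1), smoothed
p0 <- (count_neg + alpha) / (n0 + alpha * k)  # P(Bin_i | Y = 0), smoothed
woe <- log(p1 / p0)                           # WoE_i
iv  <- (p1 - p0) * woe                        # IV_i
sum(iv)                                       # IV_total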
Edge Cases Handled:
- Empty or all-NA input data
- Single-category features
- Highly imbalanced target distributions
- Features with many unique categories
- Categories with zero observations in one target class (see the smoothing illustration below)
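For the last edge case, the effect of the alpha smoothing can be checked directly: without smoothing, a bin with zero events has infinite WoE, while the smoothed estimate stays finite (illustrative values):

count_pos <- c(0, 50); count_neg <- c(40, 60); k <- 2; alpha <- 0.5
woe_raw <- log((count_pos / sum(count_pos)) /
               (count_neg / sum(count_neg)))
woe_smooth <- log(((count_pos + alpha) / (sum(count_pos) + alpha * k)) /
                  ((count_neg + alpha) / (sum(count_neg) + alpha * k)))
woe_raw                                       # first element is -Inf
woe_smooth                                    # both values finite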
References
Beltratti, A., Margarita, S., & Terna, P. (1996). Neural Networks for Economic and Financial Modelling. International Thomson Computer Press.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley.
Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Hand, D.J., & Adams, N.M. (2014). Data Mining for Credit Scoring: State of the Science. Research Handbook on Computational Methods for Credit Scoring, 69-114.
Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization Techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
Lin, K., Mandel, M., Dmitriev, P., & Murray, G. (2019). Dynamic Optimization for Predictive Binning. Proceedings of the 13th ACM Conference on Recommender Systems, 242-250.
Examples
if (FALSE) { # \dontrun{
# Basic usage with default parameters
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:10], 1000, replace = TRUE,
                  prob = c(0.3, 0.1, 0.1, 0.05, 0.05, 0.1, 0.1, 0.05, 0.05, 0.1))
result <- optimal_binning_categorical_sblp(target, feature)
print(result)
# Handling missing values
feature_with_na <- feature
feature_with_na[sample(1:1000, 50)] <- NA
result_na <- optimal_binning_categorical_sblp(target, feature_with_na)
# Custom parameters for more refined binning
result2 <- optimal_binning_categorical_sblp(
  target = target,
  feature = feature,
  min_bins = 4,
  max_bins = 8,
  bin_cutoff = 0.03,
  max_n_prebins = 15,
  alpha = 0.1
)
# Handling imbalanced data
imbalanced_target <- sample(0:1, 1000, replace = TRUE, prob = c(0.9, 0.1))
result3 <- optimal_binning_categorical_sblp(imbalanced_target, feature)
} # }