Optimal Binning for Categorical Variables using Similarity-Based Logistic Partitioning (SBLP)
optimal_binning_categorical_sblp.Rd
This function implements an advanced optimal binning algorithm for categorical variables using a Similarity-Based Logistic Partitioning (SBLP) approach. It groups the categories of a predictor into bins that maximize the Information Value (IV) while maintaining monotonicity with respect to target rates. The algorithm is designed to handle edge cases such as rare categories, missing values, and numerical instability.
Usage
optimal_binning_categorical_sblp(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  bin_separator = "%;%",
  alpha = 0.5
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector with the categories of the explanatory variable.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum frequency proportion for a category to be kept as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before the partitioning process (default: 20).
- convergence_threshold
Threshold for algorithm convergence (default: 1e-6).
- max_iterations
Maximum number of iterations of the algorithm (default: 1000).
- bin_separator
Separator used to concatenate category names within bins (default: "%;%").
- alpha
Laplace smoothing parameter for WoE/IV calculation (default: 0.5).
Value
A list containing:
- id: Numeric vector of bin identifiers.
- bin: Character vector of bin names (concatenated categories).
- woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
- iv: Numeric vector of Information Value (IV) contributions for each bin.
- count: Integer vector of total observation counts per bin.
- count_pos: Integer vector of positive-case counts (target = 1) per bin.
- count_neg: Integer vector of negative-case counts (target = 0) per bin.
- rate: Numeric vector of event rates per bin.
- total_iv: Total Information Value (IV) of the binning solution.
- converged: Logical value indicating whether the algorithm converged.
- iterations: Integer number of iterations executed.
Details
The SBLP algorithm operates in several phases:
Preprocessing Phase (see the sketch after this list):
- Computes initial counts and target rates for each category
- Handles missing values by creating a special "MISSING" category
- Identifies and merges rare categories (frequency < bin_cutoff) based on similarity in target rates
- Ensures the number of pre-bins does not exceed max_n_prebins by merging similar categories
- Sorts categories by target rate for monotonic processing
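A minimal base-R sketch of the rare-category merge step. The helper name merge_rare_categories and its exact merge rule are illustrative assumptions, not the package's internal code:

# Hypothetical illustration: absorb each rare category (frequency below
# bin_cutoff) into a frequent category with the closest target rate.
# Assumes at least one category clears bin_cutoff.
merge_rare_categories <- function(feature, target, bin_cutoff = 0.05) {
  counts <- table(feature)                      # observations per category
  rates  <- tapply(target, feature, mean)       # event rate per category
  rare   <- names(counts)[counts / length(feature) < bin_cutoff]
  common <- setdiff(names(counts), rare)
  mapping <- setNames(names(counts), names(counts))
  for (cat in rare) {
    # choose the frequent category whose event rate is most similar
    mapping[cat] <- common[which.min(abs(rates[common] - rates[cat]))]
  }
  unname(mapping[feature])                      # remapped feature vector
}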
Optimal Binning Phase (see the sketch after this list):
- Uses dynamic programming to find the optimal partitioning that maximizes total IV
- The DP recurrence relation is: \(DP[i,j] = \max_{s<i}(DP[s,j-1] + IV(\{s+1,...,i\}))\)
- Implements the DP table efficiently with O(nk) space complexity
- Ensures at least min_bins and at most max_bins through bin splitting/merging
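A compact sketch of the recurrence above, assuming the categories are already sorted by event rate and that n >= k. iv_of_range is a hypothetical helper returning the IV of a single candidate bin; neither name is part of the package API:

# DP[i, j] = max over s < i of DP[s, j-1] + IV({s+1, ..., i})
optimal_partition <- function(pos, neg, k, iv_of_range) {
  n   <- length(pos)                        # number of sorted categories
  dp  <- matrix(-Inf, nrow = n, ncol = k)   # dp[i, j]: best IV, i cats, j bins
  cut <- matrix(0L, nrow = n, ncol = k)     # back-pointers for reconstruction
  for (i in seq_len(n)) dp[i, 1] <- iv_of_range(1, i, pos, neg)
  for (j in seq_len(k)[-1]) {               # j = 2, ..., k
    for (i in j:n) {
      for (s in (j - 1):(i - 1)) {
        cand <- dp[s, j - 1] + iv_of_range(s + 1, i, pos, neg)
        if (cand > dp[i, j]) { dp[i, j] <- cand; cut[i, j] <- s }
      }
    }
  }
  list(total_iv = dp[n, k], cut = cut)
}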
Refinement Phase (see the sketch after this list):
- Checks for monotonicity in WoE/target rates
- Adjusts the binning if needed to ensure monotonicity while preserving min_bins
- Uses a two-attempt approach: first merges adjacent non-monotonic bins, then, if still needed, falls back to force-sorting all categories by rate
- Iteratively refines the solution until convergence or max_iterations
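One way to implement the first of the two attempts, shown as a hedged sketch: merge adjacent rate-inverted bins until the event rates are non-decreasing or min_bins is reached. enforce_monotonic is an illustrative name, not the package's internal function:

enforce_monotonic <- function(count_pos, count_neg, min_bins = 3L) {
  repeat {
    rate <- count_pos / (count_pos + count_neg)
    bad  <- which(diff(rate) < 0)[1]        # first monotonicity violation
    if (is.na(bad) || length(count_pos) <= min_bins) break
    keep <- seq_len(bad - 1)                # bins before the violating pair
    count_pos <- c(count_pos[keep], count_pos[bad] + count_pos[bad + 1],
                   count_pos[-seq_len(bad + 1)])
    count_neg <- c(count_neg[keep], count_neg[bad] + count_neg[bad + 1],
                   count_neg[-seq_len(bad + 1)])
  }
  list(count_pos = count_pos, count_neg = count_neg)
}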
Output Generation Phase:
- Applies Laplace smoothing to handle potential zero counts
- Calculates final WoE and IV values with enhanced numerical stability
- Builds the output with comprehensive bin statistics
Mathematical Formulation:
With Laplace smoothing, Weight of Evidence (WoE) is calculated as: $$WoE_i = \ln\left(\frac{P(Bin_i|Y=1)}{P(Bin_i|Y=0)}\right) = \ln\left(\frac{(n_{1i} + \alpha)/(n_1 + \alpha k)}{(n_{0i} + \alpha)/(n_0 + \alpha k)}\right)$$
Information Value (IV) for a bin is calculated as: $$IV_i = (P(Bin_i|Y=1) - P(Bin_i|Y=0)) \times WoE_i = \left(\frac{n_{1i} + \alpha}{n_1 + \alpha k} - \frac{n_{0i} + \alpha}{n_0 + \alpha k}\right) \times WoE_i$$
Total Information Value is the sum of IV over all bins: $$IV_{total} = \sum_{i=1}^{k} IV_i$$
Where:
\(n_{1i}\) is the number of events (target=1) in bin i
\(n_{0i}\) is the number of non-events (target=0) in bin i
\(n_1\) is the total number of events
\(n_0\) is the total number of non-events
\(\alpha\) is the Laplace smoothing parameter
\(k\) is the number of bins
The algorithm aims to maximize \(IV_{total}\) while ensuring a monotonic relationship between bins and target rates.
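The smoothed formulas above can be verified directly in R; the counts below are made-up illustrative values, not output of the function:

count_pos <- c(30, 80, 140)                   # n_1i per bin
count_neg <- c(170, 220, 360)                 # n_0i per bin
alpha <- 0.5
k  <- length(count_pos)
n1 <- sum(count_pos)                          # total events
n0 <- sum(count_neg)                          # total non-events
p1 <- (count_pos + alpha) / (n1 + alpha * k)  # P(Bin_i | Y = 1), smoothed
p0 <- (count_neg + alpha) / (n0 + alpha * k)  # P(Bin_i | Y = 0), smoothed
woe <- log(p1 / p0)                           # WoE_i
iv  <- (p1 - p0) * woe                        # IV_i
sum(iv)                                       # IV_total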
Edge Cases Handled:
- Empty or all-NA input data
- Single-category features
- Highly imbalanced target distributions
- Features with many unique categories
- Categories with zero observations in one target class (see the smoothing illustration below)
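For the last edge case, the effect of the alpha smoothing can be checked directly: without smoothing, a bin with zero events has infinite WoE, while the smoothed estimate stays finite (illustrative values):

count_pos <- c(0, 50); count_neg <- c(40, 60); k <- 2; alpha <- 0.5
woe_raw <- log((count_pos / sum(count_pos)) /
               (count_neg / sum(count_neg)))
woe_smooth <- log(((count_pos + alpha) / (sum(count_pos) + alpha * k)) /
                  ((count_neg + alpha) / (sum(count_neg) + alpha * k)))
woe_raw                                       # first element is -Inf
woe_smooth                                    # both values finite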
References
Beltratti, A., Margarita, S., & Terna, P. (1996). Neural Networks for Economic and Financial Modelling. International Thomson Computer Press.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley.
Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Hand, D.J., & Adams, N.M. (2014). Data Mining for Credit Scoring: State of the Science. Research Handbook on Computational Methods for Credit Scoring, 69-114.
Kotsiantis, S., & Kanellopoulos, D. (2006). Discretization Techniques: A recent survey. GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
Lin, K., Mandel, M., Dmitriev, P., & Murray, G. (2019). Dynamic Optimization for Predictive Binning. Proceedings of the 13th ACM Conference on Recommender Systems, 242-250.
Examples
if (FALSE) { # \dontrun{
# Basic usage with default parameters
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:10], 1000, replace = TRUE,
                  prob = c(0.3, 0.1, 0.1, 0.05, 0.05, 0.1, 0.1, 0.05, 0.05, 0.1))
result <- optimal_binning_categorical_sblp(target, feature)
print(result)
# Handling missing values
feature_with_na <- feature
feature_with_na[sample(1:1000, 50)] <- NA
result_na <- optimal_binning_categorical_sblp(target, feature_with_na)
# Custom parameters for more refined binning
result2 <- optimal_binning_categorical_sblp(
  target = target,
  feature = feature,
  min_bins = 4,
  max_bins = 8,
  bin_cutoff = 0.03,
  max_n_prebins = 15,
  alpha = 0.1
)
# Handling imbalanced data
imbalanced_target <- sample(0:1, 1000, replace = TRUE, prob = c(0.9, 0.1))
result3 <- optimal_binning_categorical_sblp(imbalanced_target, feature)
} # }