Optimal Binning for Categorical Variables using Monotonic Optimal Binning (MOB)
optimal_binning_categorical_mob.Rd
Performs optimal binning for categorical variables using the Monotonic Optimal Binning (MOB) approach with enhanced statistical robustness. The implementation includes Bayesian smoothing for stability with small samples, adaptive monotonicity enforcement, and similarity-based bin merging.
Usage
optimal_binning_categorical_mob(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of observations in a bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins (default: 20).
- bin_separator
Separator used for merging category names (default: "%;%").
- convergence_threshold
Convergence threshold for the algorithm (default: 1e-6).
- max_iterations
Maximum number of iterations for the algorithm (default: 1000).
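For a high-cardinality feature, bin_cutoff and max_n_prebins govern how aggressively rare levels are absorbed before optimization. A hedged sketch of such a call on simulated data (the parameter values here are illustrative, not recommendations):

# Sketch: tuning for a high-cardinality feature (simulated data)
set.seed(42)
feat <- sample(paste0("cat_", 1:50), 2000, replace = TRUE, prob = runif(50))
tgt  <- rbinom(2000, 1, 0.3)
res  <- optimal_binning_categorical_mob(
  target = tgt, feature = feat,
  bin_cutoff = 0.02,      # levels under 2% get merged by similarity
  max_n_prebins = 15L     # tighter cap on the initial pre-bins
)
res$bin                   # merged categories joined by "%;%"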
Value
A list containing the following elements:
id: Numeric vector of bin identifiers.
bin: Character vector of bin names (merged categories).
woe: Numeric vector of Weight of Evidence (WoE) values for each bin.
iv: Numeric vector of Information Value (IV) for each bin.
count: Integer vector of total counts for each bin.
count_pos: Integer vector of positive target counts for each bin.
count_neg: Integer vector of negative target counts for each bin.
total_iv: Total Information Value of the binning.
converged: Logical value indicating whether the algorithm converged.
iterations: Integer value indicating the number of iterations run.
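The returned fields line up by bin, so they can be assembled into a scoring lookup table. A minimal sketch using only the documented fields:

# Sketch: inspecting the returned list (field names as documented above)
set.seed(1)
result <- optimal_binning_categorical_mob(
  target  = sample(0:1, 500, replace = TRUE),
  feature = sample(LETTERS[1:6], 500, replace = TRUE)
)
woe_map <- data.frame(
  bin        = result$bin,
  woe        = result$woe,
  iv         = result$iv,
  count      = result$count,
  event_rate = result$count_pos / result$count
)
woe_map[order(-woe_map$iv), ]   # bins ranked by information value
cat("Total IV:", result$total_iv,
    "| converged:", result$converged,
    "in", result$iterations, "iterations\n")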
Details
This enhanced version of the Monotonic Optimal Binning (MOB) algorithm implements several key improvements over traditional approaches:
Mathematical Framework:
The Weight of Evidence (WoE) with Bayesian smoothing is calculated as:
$$WoE_i = \ln\left(\frac{p_i^*}{q_i^*}\right)$$
where:
\(p_i^* = \frac{n_i^+ + \alpha \cdot \pi}{N^+ + \alpha}\) is the smoothed proportion of events in bin i
\(q_i^* = \frac{n_i^- + \alpha \cdot (1-\pi)}{N^- + \alpha}\) is the smoothed proportion of non-events in bin i
\(\pi = \frac{N^+}{N^+ + N^-}\) is the overall event rate
\(\alpha\) is the prior strength parameter (default: 0.5)
\(n_i^+\) is the count of events in bin i
\(n_i^-\) is the count of non-events in bin i
\(N^+\) is the total number of events
\(N^-\) is the total number of non-events
The Information Value (IV) for each bin is calculated as:
$$IV_i = (p_i^* - q_i^*) \times WoE_i$$
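These formulas can be reproduced directly in R. The sketch below mirrors the definitions above with the stated default alpha = 0.5; it is illustrative, not the package's internal code.

# Sketch: Bayesian-smoothed WoE/IV per the formulas above (alpha = 0.5)
smoothed_woe_iv <- function(n_pos, n_neg, alpha = 0.5) {
  N_pos   <- sum(n_pos)                       # N^+
  N_neg   <- sum(n_neg)                       # N^-
  pi_rate <- N_pos / (N_pos + N_neg)          # overall event rate
  p_star  <- (n_pos + alpha * pi_rate)       / (N_pos + alpha)
  q_star  <- (n_neg + alpha * (1 - pi_rate)) / (N_neg + alpha)
  woe <- log(p_star / q_star)
  iv  <- (p_star - q_star) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}
# Example: three bins with event / non-event counts
smoothed_woe_iv(n_pos = c(30, 50, 20), n_neg = c(120, 80, 40))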
Algorithm Phases:
1. Initialization: Calculate statistics for each category with Bayesian smoothing.
2. Pre-binning: Create initial bins sorted by WoE.
3. Rare Category Handling: Merge categories with frequency below bin_cutoff using a similarity-based approach.
4. Monotonicity Enforcement: Ensure monotonic WoE across bins using adaptive thresholds and severity-based prioritization (a simplified sketch follows this list).
5. Bin Optimization: Reduce the number of bins to at most max_bins while maintaining monotonicity.
6. Solution Tracking: Keep track of the best solution found during optimization.
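The merging heuristics are internal to the implementation, but the core of the monotonicity phase can be pictured as repeatedly merging an adjacent pair that breaks the WoE ordering and recomputing. The following is a simplified sketch that ignores the adaptive thresholds and severity prioritization described above; it is not the package's internal implementation.

# Simplified sketch: merge the first adjacent pair that violates a
# nondecreasing WoE ordering, recompute, and repeat until monotone
enforce_monotone <- function(bins, N_pos, N_neg, alpha = 0.5) {
  recompute <- function(b) {
    pi_rate <- N_pos / (N_pos + N_neg)
    p <- (b$n_pos + alpha * pi_rate)       / (N_pos + alpha)
    q <- (b$n_neg + alpha * (1 - pi_rate)) / (N_neg + alpha)
    b$woe <- log(p / q)
    b
  }
  bins <- lapply(bins, recompute)
  repeat {
    woe <- vapply(bins, function(b) b$woe, numeric(1))
    i <- which(diff(woe) < 0)[1]
    if (is.na(i)) return(bins)              # already monotone
    merged <- recompute(list(
      labels = c(bins[[i]]$labels, bins[[i + 1]]$labels),
      n_pos  = bins[[i]]$n_pos + bins[[i + 1]]$n_pos,
      n_neg  = bins[[i]]$n_neg + bins[[i + 1]]$n_neg
    ))
    bins <- c(bins[seq_len(i - 1)], list(merged), bins[-seq_len(i + 1)])
  }
}
bins <- list(
  list(labels = "A", n_pos = 10, n_neg = 90),
  list(labels = "B", n_pos = 40, n_neg = 60),
  list(labels = "C", n_pos = 25, n_neg = 75)  # WoE dips after bin B
)
str(enforce_monotone(bins, N_pos = 75, N_neg = 225))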
Key Features:
- Bayesian smoothing for robust WoE estimation with small samples
- Similarity-based bin merging rather than only adjacent bins (sketched below)
- Adaptive monotonicity enforcement with violation-severity prioritization
- Best-solution tracking to ensure optimal results
- Efficient uniqueness handling for categories
- Comprehensive edge-case handling
- Strict enforcement of the max_bins parameter
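One way to picture the similarity-based merging named above: a rare category is folded into the existing bin whose event rate is closest, rather than into whichever bin happens to be adjacent. A minimal sketch under that assumption (the package's actual similarity criterion may differ):

# Sketch: fold a rare category into the bin with the nearest event rate
# (hypothetical criterion for illustration; not the package's internals)
merge_rare <- function(bins, rare) {
  rate <- function(b) b$n_pos / (b$n_pos + b$n_neg)
  j <- which.min(abs(vapply(bins, rate, numeric(1)) - rate(rare)))
  bins[[j]]$labels <- c(bins[[j]]$labels, rare$labels)
  bins[[j]]$n_pos  <- bins[[j]]$n_pos + rare$n_pos
  bins[[j]]$n_neg  <- bins[[j]]$n_neg + rare$n_neg
  bins
}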
References
Bellotti, T., & Crook, J. (2009). Credit scoring with macroeconomic variables using survival analysis. Journal of the Operational Research Society, 60(12), 1699-1707.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. arXiv preprint arXiv:1711.05095.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4), 1360-1383.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and its Applications. SIAM.
Examples
if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
target <- sample(0:1, 1000, replace = TRUE)
feature <- sample(LETTERS[1:5], 1000, replace = TRUE)
# Run optimal binning
result <- optimal_binning_categorical_mob(target, feature)
# View results
print(result)
# Force exactly 2 bins
result2 <- optimal_binning_categorical_mob(
  target, feature,
  min_bins = 2,
  max_bins = 2
)
} # }