
Optimal Binning for Categorical Variables using Monotonic Binning Algorithm
Source: R/obc_mba.R
ob_categorical_mba.Rd

Performs supervised discretization of categorical variables using the Monotonic Binning Algorithm (MBA), which enforces strict Weight of Evidence (WoE) monotonicity while optimizing Information Value (IV) through intelligent bin-merging strategies. This implementation includes Bayesian smoothing for numerical stability and adaptive thresholding for robust monotonicity enforcement.
Usage
ob_categorical_mba(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000
)

Arguments
- feature
A character vector or factor representing the categorical predictor variable to be binned. Missing values are automatically converted to the category "NA".
- target
An integer vector of binary outcomes (0/1) corresponding to each observation in feature. Missing values are not permitted.
- min_bins
Integer. Minimum number of bins to produce. Must be >= 2. The algorithm will not merge below this threshold. Defaults to 3.
- max_bins
Integer. Maximum number of bins to produce. Must be >= min_bins. The algorithm reduces bins until this constraint is met. Defaults to 5.
- bin_cutoff
Numeric. Minimum proportion of total observations required for a category to remain separate. Categories below this threshold are pre-merged with similar categories. Must be in (0, 1). Defaults to 0.05.
- max_n_prebins
Integer. Maximum number of initial bins before the main optimization phase. Controls computational complexity. Must be >= max_bins. Defaults to 20.
- bin_separator
Character string used to concatenate category names when multiple categories are merged into a single bin. Defaults to "%;%".
- convergence_threshold
Numeric. Convergence tolerance based on the Information Value change between iterations. The algorithm stops when \(|\Delta IV| <\) convergence_threshold. Must be > 0. Defaults to 1e-6.
- max_iterations
Integer. Maximum number of optimization iterations. Prevents infinite loops. Must be > 0. Defaults to 1000.
Value
A list containing the binning results with the following components:
- id
Integer vector of bin identifiers (1-indexed)
- bin
Character vector of bin labels (merged category names)
- woe
Numeric vector of Weight of Evidence values per bin
- iv
Numeric vector of Information Value contribution per bin
- count
Integer vector of total observations per bin
- count_pos
Integer vector of positive cases (target = 1) per bin
- count_neg
Integer vector of negative cases (target = 0) per bin
- total_iv
Numeric total Information Value of the binning solution
- converged
Logical indicating algorithm convergence
- iterations
Integer count of optimization iterations performed
Details
The Monotonic Binning Algorithm (MBA) implements a sophisticated approach to categorical binning that guarantees strict Weight of Evidence monotonicity through intelligent violation detection and repair mechanisms.
Algorithm Workflow:
1. Input validation and preprocessing
2. Initial bin creation (one category per bin)
3. Pre-binning limitation to max_n_prebins
4. Rare-category merging (frequencies < bin_cutoff; see the sketch after this list)
5. Bayesian-smoothed WoE calculation
6. Strict monotonicity enforcement with adaptive thresholds
7. IV-optimized bin merging to meet the max_bins constraint
8. Final consistency verification
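As a concrete illustration of the rare-category step, the sketch below pre-merges each rare category into the non-rare category with the closest event rate. This is illustrative R only: the helper pre_merge_rare is hypothetical, and "similar" is assumed here to mean nearest event rate, which the package's internal pre-merging may implement differently.

pre_merge_rare <- function(feature, target, bin_cutoff = 0.05) {
  feature <- as.character(feature)
  freq <- table(feature) / length(feature)
  rare <- names(freq)[freq < bin_cutoff]
  keep <- setdiff(names(freq), rare)
  rate <- tapply(target, feature, mean)  # event rate per category
  for (r in rare) {
    # reassign the rare category to the non-rare one with the closest rate
    nearest <- keep[which.min(abs(rate[keep] - rate[r]))]
    feature[feature == r] <- nearest
  }
  feature
}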
Monotonicity Enforcement:
MBA enforces strict monotonicity through an iterative repair process:
1. Sort bins by current WoE values
2. Calculate the adaptive threshold: \(\tau = \min(\epsilon, 0.01\bar{\Delta})\)
3. Identify violations: \(\mathrm{sign}(WoE_i - WoE_{i-1}) \neq \mathrm{sign}(WoE_{i+1} - WoE_i)\)
4. Rank violations by severity: \(s_i = |WoE_i - WoE_{i-1}| + |WoE_{i+1} - WoE_i|\) (steps 3-4 are sketched after this list)
5. Repair the most severe violation by merging adjacent bins
6. Repeat until no violations remain or min_bins is reached
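The scan in steps 3-4 reduces to a few lines of R. This sketch is illustrative only (the exported function performs it internally); it assumes woe holds the current bin WoE values in bin order:

find_woe_violations <- function(woe) {
  d <- diff(woe)  # successive WoE differences
  # an interior bin violates monotonicity when the differences on
  # either side of it have opposite signs
  viol <- which(sign(d[-length(d)]) != sign(d[-1])) + 1
  # severity: s_i = |WoE_i - WoE_{i-1}| + |WoE_{i+1} - WoE_i|
  data.frame(bin = viol, severity = abs(d[viol - 1]) + abs(d[viol]))
}
find_woe_violations(c(-1.2, 0.4, -0.1, 0.9))  # flags bins 2 and 3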
Bayesian Smoothing:
To ensure numerical stability and prevent overfitting, MBA applies Bayesian smoothing to WoE and IV calculations:
$$p'_i = \frac{n_{i,pos} + \alpha_p}{N_{pos} + \alpha_{total}}$$ $$n'_i = \frac{n_{i,neg} + \alpha_n}{N_{neg} + \alpha_{total}}$$
where priors are proportional to overall prevalence:
$$\alpha_p = \alpha_{total} \times \frac{N_{pos}}{N_{pos} + N_{neg}}$$ $$\alpha_n = \alpha_{total} - \alpha_p$$
with \(\alpha_{total} = 1.0\) as the prior strength parameter.
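These formulas map directly onto a short R function. The sketch below (illustrative, not the package's internal code) shows how the smoothing keeps WoE finite even for a bin with zero events:

smoothed_woe <- function(n_pos, n_neg, alpha_total = 1) {
  N_pos <- sum(n_pos); N_neg <- sum(n_neg)
  alpha_p <- alpha_total * N_pos / (N_pos + N_neg)  # prior proportional to prevalence
  alpha_n <- alpha_total - alpha_p
  p <- (n_pos + alpha_p) / (N_pos + alpha_total)    # smoothed p'_i
  q <- (n_neg + alpha_n) / (N_neg + alpha_total)    # smoothed n'_i
  log(p / q)
}
smoothed_woe(n_pos = c(0, 5, 20), n_neg = c(40, 60, 50))  # finite even with 0 events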
Intelligent Bin Merging:
When reducing bins to meet the max_bins constraint, MBA employs an IV-loss minimization strategy:
$$\Delta IV_{i,j} = IV_i + IV_j - IV_{merged}(i,j)$$
The pair with minimum \(\Delta IV\) is merged to preserve maximum predictive information.
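A minimal sketch of this criterion, using the standard unsmoothed per-bin IV contribution \((p_i - q_i)\log(p_i/q_i)\) for brevity (the package additionally applies the Bayesian smoothing described above):

iv_contrib <- function(n_pos, n_neg, N_pos, N_neg) {
  p <- n_pos / N_pos  # bin's share of events
  q <- n_neg / N_neg  # bin's share of non-events
  (p - q) * log(p / q)
}
iv_merge_loss <- function(i, n_pos, n_neg) {
  N_pos <- sum(n_pos); N_neg <- sum(n_neg)
  iv_contrib(n_pos[i], n_neg[i], N_pos, N_neg) +
    iv_contrib(n_pos[i + 1], n_neg[i + 1], N_pos, N_neg) -
    iv_contrib(n_pos[i] + n_pos[i + 1], n_neg[i] + n_neg[i + 1], N_pos, N_neg)
}
# merge the adjacent pair that loses the least IV
n_pos <- c(5, 9, 13, 3); n_neg <- c(76, 113, 155, 26)
which.min(sapply(1:3, iv_merge_loss, n_pos = n_pos, n_neg = n_neg))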
Computational Complexity:
Time: \(O(k^2 \cdot m)\) where \(k\) = bins, \(m\) = iterations
Space: \(O(k^2)\) for IV loss cache
Cache hit rate typically > 75% for \(k > 10\)
Key Features:
Guaranteed monotonicity: Strict enforcement with adaptive thresholds
Bayesian regularization: Robust to sparse bins and class imbalance
Intelligent merging: Preserves maximum information during reduction
Adaptive thresholds: Context-aware violation detection
Consistency verification: Final integrity checks
References
Mironchyk, P., & Tchistiakov, V. (2017). Monotone optimal binning algorithm for credit risk modeling. SSRN Electronic Journal. doi:10.2139/ssrn.2978774
Siddiqi, N. (2017). Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards (2nd ed.). Wiley.
Good, I. J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press.
Zeng, G. (2014). A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences, 8(65), 3229-3242.
See also
ob_categorical_jedi for joint entropy-driven optimization,
ob_categorical_dp for dynamic programming approach,
ob_categorical_cm for ChiMerge-based binning
Examples
# \donttest{
# Example 1: Basic monotonic binning with guaranteed WoE ordering
set.seed(42)
n_obs <- 1500
# Simulate risk ratings with natural monotonic relationship
ratings <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")
default_probs <- c(0.01, 0.02, 0.05, 0.10, 0.20, 0.35, 0.50)
cat_feature <- sample(ratings, n_obs,
  replace = TRUE,
  prob = c(0.05, 0.10, 0.20, 0.25, 0.20, 0.15, 0.05)
)
bin_target <- sapply(cat_feature, function(x) {
  rbinom(1, 1, default_probs[which(ratings == x)])
})
# Apply MBA algorithm
result_mba <- ob_categorical_mba(
  cat_feature,
  bin_target,
  min_bins = 3,
  max_bins = 5
)
# Display results with guaranteed monotonic WoE
print(data.frame(
  Bin = result_mba$bin,
  WoE = round(result_mba$woe, 3),
  IV = round(result_mba$iv, 4),
  Count = result_mba$count,
  EventRate = round(result_mba$count_pos / result_mba$count, 3)
))
#>            Bin    WoE     IV Count EventRate
#> 1 A%;%AA%;%AAA -1.305 0.3529   486     0.053
#> 2          BBB -0.700 0.1015   394     0.094
#> 3           BB  0.211 0.0098   307     0.205
#> 4            B  1.045 0.2117   217     0.373
#> 5          CCC  1.725 0.2845    96     0.542
cat("\nMonotonicity check (WoE differences):\n")
#>
#> Monotonicity check (WoE differences):
woe_diffs <- diff(result_mba$woe)
cat(" Differences:", paste(round(woe_diffs, 4), collapse = ", "), "\n")
#> Differences: 0.6051, 0.9113, 0.8342, 0.6795
cat(" All positive (increasing):", all(woe_diffs >= -1e-10), "\n")
#> All positive (increasing): TRUE
cat(" Total IV:", round(result_mba$total_iv, 4), "\n")
#> Total IV: 0.9604
cat(" Converged:", result_mba$converged, "\n")
#> Converged: TRUE
# Example 2: Comparison with non-monotonic methods
set.seed(123)
n_obs_comp <- 2000
sectors <- c("Tech", "Health", "Finance", "Manufacturing", "Retail")
cat_feature_comp <- sample(sectors, n_obs_comp, replace = TRUE)
bin_target_comp <- rbinom(n_obs_comp, 1, 0.15)
# MBA (strictly monotonic)
result_mba_comp <- ob_categorical_mba(
  cat_feature_comp, bin_target_comp,
  min_bins = 3, max_bins = 4
)
# Standard binning (may not be monotonic)
result_std_comp <- ob_categorical_cm(
  cat_feature_comp, bin_target_comp,
  min_bins = 3, max_bins = 4
)
cat("\nMonotonicity comparison:\n")
#>
#> Monotonicity comparison:
cat(
  " MBA WoE differences:",
  paste(round(diff(result_mba_comp$woe), 4), collapse = ", "), "\n"
)
#> MBA WoE differences: 0.1479, 0.0667, 0.0646
cat(" MBA monotonic:", all(diff(result_mba_comp$woe) >= -1e-10), "\n")
#> MBA monotonic: TRUE
cat(
  " Std WoE differences:",
  paste(round(diff(result_std_comp$woe), 4), collapse = ", "), "\n"
)
#> Std WoE differences: 0.1431, 0.0695, 0.0645
cat(" Std monotonic:", all(diff(result_std_comp$woe) >= -1e-10), "\n")
#> Std monotonic: TRUE
# Example 3: Bayesian smoothing with sparse data
set.seed(789)
n_obs_sparse <- 400
# Small sample with rare categories
categories <- c("A", "B", "C", "D", "E", "F")
cat_probs <- c(0.30, 0.25, 0.20, 0.15, 0.07, 0.03)
cat_feature_sparse <- sample(categories, n_obs_sparse,
  replace = TRUE,
  prob = cat_probs
)
bin_target_sparse <- rbinom(n_obs_sparse, 1, 0.08) # 8% event rate
result_mba_sparse <- ob_categorical_mba(
  cat_feature_sparse,
  bin_target_sparse,
  min_bins = 2,
  max_bins = 4,
  bin_cutoff = 0.02
)
cat("\nBayesian smoothing (sparse data):\n")
#>
#> Bayesian smoothing (sparse data):
cat(" Sample size:", n_obs_sparse, "\n")
#> Sample size: 400
cat(" Events:", sum(bin_target_sparse), "\n")
#> Events: 29
cat(" Final bins:", length(result_mba_sparse$bin), "\n\n")
#> Final bins: 4
#>
# Show how smoothing prevents extreme WoE values
for (i in seq_along(result_mba_sparse$bin)) {
  cat(sprintf(
    " Bin %d: events=%d/%d, WoE=%.3f (smoothed)\n",
    i,
    result_mba_sparse$count_pos[i],
    result_mba_sparse$count[i],
    result_mba_sparse$woe[i]
  ))
}
#> Bin 1: events=4/81, WoE=-0.421 (smoothed)
#> Bin 2: events=9/122, WoE=0.003 (smoothed)
#> Bin 3: events=13/168, WoE=0.054 (smoothed)
#> Bin 4: events=3/29, WoE=0.368 (smoothed)
# Example 4: High cardinality with pre-binning
set.seed(456)
n_obs_hc <- 3000
# Simulate ZIP codes (high cardinality)
zips <- paste0("ZIP_", sprintf("%04d", 1:50))
cat_feature_hc <- sample(zips, n_obs_hc, replace = TRUE)
bin_target_hc <- rbinom(n_obs_hc, 1, 0.12)
result_mba_hc <- ob_categorical_mba(
  cat_feature_hc,
  bin_target_hc,
  min_bins = 4,
  max_bins = 6,
  max_n_prebins = 20,
  bin_cutoff = 0.01
)
cat("\nHigh cardinality performance:\n")
#>
#> High cardinality performance:
cat(" Original categories:", length(unique(cat_feature_hc)), "\n")
#> Original categories: 50
cat(" Final bins:", length(result_mba_hc$bin), "\n")
#> Final bins: 6
cat(
  " Largest merged bin contains:",
  max(sapply(strsplit(result_mba_hc$bin, "%;%"), length)), "categories\n"
)
#> Largest merged bin contains: 19 categories
# Verify monotonicity in high-cardinality case
woe_monotonic <- all(diff(result_mba_hc$woe) >= -1e-10)
cat(" WoE monotonic:", woe_monotonic, "\n")
#> WoE monotonic: TRUE
# Example 5: Convergence behavior
set.seed(321)
n_obs_conv <- 1000
business_sizes <- c("Micro", "Small", "Medium", "Large", "Enterprise")
cat_feature_conv <- sample(business_sizes, n_obs_conv, replace = TRUE)
bin_target_conv <- rbinom(n_obs_conv, 1, 0.18)
# Test different convergence thresholds
thresholds <- c(1e-3, 1e-6, 1e-9)
for (thresh in thresholds) {
  result_conv <- ob_categorical_mba(
    cat_feature_conv,
    bin_target_conv,
    min_bins = 2,
    max_bins = 4,
    convergence_threshold = thresh,
    max_iterations = 50
  )
  cat(sprintf("\nThreshold %.0e:\n", thresh))
  cat(" Final bins:", length(result_conv$bin), "\n")
  cat(" Total IV:", round(result_conv$total_iv, 4), "\n")
  cat(" Converged:", result_conv$converged, "\n")
  cat(" Iterations:", result_conv$iterations, "\n")
  # Check monotonicity preservation
  monotonic <- all(diff(result_conv$woe) >= -1e-10)
  cat(" Monotonic:", monotonic, "\n")
}
#>
#> Threshold 1e-03:
#> Final bins: 4
#> Total IV: 0.0404
#> Converged: TRUE
#> Iterations: 1
#> Monotonic: TRUE
#>
#> Threshold 1e-06:
#> Final bins: 4
#> Total IV: 0.0404
#> Converged: FALSE
#> Iterations: 1
#> Monotonic: TRUE
#>
#> Threshold 1e-09:
#> Final bins: 4
#> Total IV: 0.0404
#> Converged: FALSE
#> Iterations: 1
#> Monotonic: TRUE
# Example 6: Missing value handling
set.seed(555)
cat_feature_na <- cat_feature
na_indices <- sample(n_obs, 75) # 5% missing
cat_feature_na[na_indices] <- NA
result_mba_na <- ob_categorical_mba(
  cat_feature_na,
  bin_target,
  min_bins = 3,
  max_bins = 5
)
# Locate NA bin
na_bin_idx <- grep("NA", result_mba_na$bin)
if (length(na_bin_idx) > 0) {
  cat("\nMissing value treatment:\n")
  cat(" NA bin:", result_mba_na$bin[na_bin_idx], "\n")
  cat(" NA count:", result_mba_na$count[na_bin_idx], "\n")
  cat(
    " NA event rate:",
    round(result_mba_na$count_pos[na_bin_idx] /
      result_mba_na$count[na_bin_idx], 3), "\n"
  )
  cat(" NA WoE:", round(result_mba_na$woe[na_bin_idx], 3), "\n")
  cat(
    " Monotonicity preserved:",
    all(diff(result_mba_na$woe) >= -1e-10), "\n"
  )
}
#>
#> Missing value treatment:
#> NA bin: NA%;%BB
#> NA count: 367
#> NA event rate: 0.202
#> NA WoE: 0.189
#> Monotonicity preserved: TRUE
# }