
Optimal Binning for Categorical Variables using a User-Defined Technique (UDT)
Source: R/obc_udt.R
ob_categorical_udt.Rd
This function performs optimal binning for categorical variables using a User-Defined Technique (UDT) that combines frequency-based grouping with statistical similarity measures to create meaningful bins for predictive modeling.
Usage
ob_categorical_udt(
  feature,
  target,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  bin_separator = "%;%",
  convergence_threshold = 1e-06,
  max_iterations = 1000L
)
Arguments
- feature
A character vector or factor representing the categorical predictor variable. Missing values (NA) will be converted to the string "NA" and treated as a separate category.
- target
An integer vector containing binary outcome values (0 or 1). Must be the same length as feature. Cannot contain missing values.
- min_bins
Integer. Minimum number of bins to create. Must be at least 1. Default is 3.
- max_bins
Integer. Maximum number of bins to create. Must be greater than or equal to min_bins. Default is 5.
- bin_cutoff
Numeric. Minimum relative frequency threshold for individual categories. Categories with frequency below this proportion will be merged into a collective "rare" bin before optimization. Value must be between 0 and 1. Default is 0.05 (5%).
- max_n_prebins
Integer. Upper limit on initial bins after frequency filtering. Controls computational complexity in early stages. Default is 20.
- bin_separator
Character string used to separate category names when multiple categories are combined into a single bin. Default is "%;%".
- convergence_threshold
Numeric. Threshold for determining algorithm convergence based on relative changes in total Information Value. Default is 1e-6.
- max_iterations
Integer. Maximum number of iterations permitted for the optimization routine. Default is 1000.
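The rare-category pooling controlled by bin_cutoff can be illustrated with a minimal base R sketch. The helper name pool_rare is hypothetical and not part of the package, and labelling the pooled bin by joining the rare category names with bin_separator is an assumption; the package's internal behaviour may differ.

```r
# Hypothetical helper (not part of the package): pool categories whose
# relative frequency is below `cutoff` into one combined category.
pool_rare <- function(feature, cutoff = 0.05, sep = "%;%") {
  x <- as.character(feature)
  x[is.na(x)] <- "NA"                 # missing values become the category "NA"
  freq <- table(x) / length(x)        # relative frequency of each category
  rare <- names(freq)[freq < cutoff]  # categories below the threshold
  if (length(rare) > 1) {
    # Assumption: the pooled label joins the rare names with the separator
    x[x %in% rare] <- paste(rare, collapse = sep)
  }
  x
}

feature <- c(rep("A", 50), rep("B", 45), rep("C", 3), rep("D", 2))
table(pool_rare(feature))
```

Here C (3%) and D (2%) fall below the 5% cutoff and end up in one pooled category, while A and B are left untouched.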
Value
A list containing the results of the optimal binning procedure:
- id
Numeric vector of bin identifiers (1 to n_bins)
- bin
Character vector of bin labels, which are combinations of original categories separated by bin_separator
- woe
Numeric vector of Weight of Evidence values for each bin
- iv
Numeric vector of Information Values for each bin
- count
Integer vector of total observations in each bin
- count_pos
Integer vector of positive outcomes in each bin
- count_neg
Integer vector of negative outcomes in each bin
- event_rate
Numeric vector of the observed event rate in each bin
- total_iv
Numeric scalar. Total Information Value across all bins
- converged
Logical. Whether the algorithm converged
- iterations
Integer. Number of iterations executed
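Because each bin label is a set of original categories joined by the separator, the result can be turned into a category-to-WoE lookup. The sketch below uses a hand-built list that only mimics the documented bin and woe elements; its values are illustrative and not actual function output.

```r
# Hand-built stand-in for the documented return value (illustrative values).
res <- list(
  bin = c("C%;%D", "B", "A"),
  woe = c(-0.8, 0.1, 0.4)
)

# Split each label on the literal separator; fixed = TRUE avoids regex parsing.
cats <- strsplit(res$bin, "%;%", fixed = TRUE)

# Named vector: original category -> WoE of the bin that contains it.
woe_lookup <- setNames(rep(res$woe, lengths(cats)), unlist(cats))

# Encode new observations by looking up their category.
new_feature <- c("A", "D", "B", "C")
unname(woe_lookup[new_feature])
```

Categories that do not appear in any bin label would map to NA with this approach, so unseen levels need separate handling.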
Details
The UDT algorithm follows these steps:
Initialization: Each unique category is initially placed in its own bin.
Frequency Filtering: Categories below the bin_cutoff frequency threshold are grouped into a single "rare" bin.
Iterative Optimization: Bins are progressively merged based on statistical similarity (measured by Jensen-Shannon divergence) until the desired number of bins (max_bins) is achieved.
Monotonicity Enforcement: Final bins are sorted by Weight of Evidence to ensure consistent trends.
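The merging step can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: each bin is compared through the Bernoulli distribution given by its event rate, every pair of bins is a merge candidate, and the names jsd_bernoulli and merge_most_similar are hypothetical.

```r
# Jensen-Shannon divergence between two Bernoulli distributions with event
# rates p and q (natural log). Assumption: bins are compared by event rate.
jsd_bernoulli <- function(p, q) {
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  P <- c(p, 1 - p)
  Q <- c(q, 1 - q)
  M <- (P + Q) / 2
  0.5 * (kl(P, M) + kl(Q, M))
}

# Hypothetical helper: repeatedly merge the two most similar bins
# (smallest divergence) until at most `max_bins` remain.
merge_most_similar <- function(bins, max_bins, sep = "%;%") {
  while (nrow(bins) > max_bins) {
    rate <- bins$count_pos / bins$count
    best <- c(1, 2)
    best_d <- Inf
    for (i in 1:(nrow(bins) - 1)) {
      for (j in (i + 1):nrow(bins)) {
        d <- jsd_bernoulli(rate[i], rate[j])
        if (d < best_d) {
          best_d <- d
          best <- c(i, j)
        }
      }
    }
    i <- best[1]
    j <- best[2]
    # Fold bin j into bin i, then drop bin j
    bins$bin[i] <- paste(bins$bin[i], bins$bin[j], sep = sep)
    bins$count[i] <- bins$count[i] + bins$count[j]
    bins$count_pos[i] <- bins$count_pos[i] + bins$count_pos[j]
    bins <- bins[-j, ]
  }
  bins
}

bins <- data.frame(
  bin = c("A", "B", "C", "D"),
  count = c(100, 100, 100, 100),
  count_pos = c(10, 12, 50, 80),
  stringsAsFactors = FALSE
)
merged <- merge_most_similar(bins, max_bins = 3)
merged$bin
```

In this toy input, A and B have the closest event rates (0.10 and 0.12), so they are the pair that gets merged.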
Key characteristics of this implementation:
Flexible Framework: Designed as a customizable foundation for categorical binning approaches.
Statistical Rigor: Uses information-theoretic measures to guide bin combination decisions.
Robust Estimation: Implements Laplace smoothing to ensure stable WoE/IV calculations even with sparse data.
Efficiency Focus: Employs targeted merging strategies to minimize computational overhead.
Mathematical foundations:
Laplace-smoothed probability estimates: $$p_{smoothed} = \frac{count + \alpha}{total + 2\alpha}$$
Weight of Evidence calculation: $$WoE = \ln\left(\frac{p_{pos,smoothed}}{p_{neg,smoothed}}\right)$$
Information Value computation: $$IV = (p_{pos,smoothed} - p_{neg,smoothed}) \times WoE$$
Jensen-Shannon divergence between bins: $$JSD(P||Q) = \frac{1}{2}[KL(P||M) + KL(Q||M)]$$ where \(M = \frac{1}{2}(P+Q)\) and \(KL\) denotes Kullback-Leibler divergence.
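The WoE and IV formulas above can be written out in a few lines of base R. In this sketch, "count" is read as a bin's count of positives (or negatives) and "total" as the overall count of that class; that reading, the value alpha = 0.5, and the helper name woe_iv are assumptions, since the smoothing constant is not stated here.

```r
# Hypothetical helper: Laplace-smoothed WoE and IV per bin.
# Assumption: alpha = 0.5; the package's actual constant is not documented here.
woe_iv <- function(count_pos, count_neg, alpha = 0.5) {
  # Smoothed share of all positives / negatives that falls in each bin
  p_pos <- (count_pos + alpha) / (sum(count_pos) + 2 * alpha)
  p_neg <- (count_neg + alpha) / (sum(count_neg) + 2 * alpha)
  woe <- log(p_pos / p_neg)
  iv <- (p_pos - p_neg) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}

out <- woe_iv(count_pos = c(10, 40, 50), count_neg = c(60, 30, 10))
out$woe       # negative where negatives dominate, positive otherwise
out$total_iv  # sum of the per-bin IV contributions
```

Each per-bin IV term is non-negative because the difference of the smoothed shares and the WoE always have the same sign, and the smoothing keeps WoE finite even when a bin has zero positives or zero negatives.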
Note
Target variable must contain both 0 and 1 values.
For datasets with 1 or 2 unique categories, no optimization occurs beyond basic WoE/IV calculation.
The algorithm does not perform bin splitting; it only merges existing bins to respect max_bins.
Rare category pooling improves stability of WoE estimates for infrequent values.
Examples
# Generate sample data with skewed category distribution
set.seed(789)
n <- 3000
# Skewed category pool: three dominant levels plus a long tail (D to T)
categories <- c(
  rep("X1", 1200), rep("X2", 800), rep("X3", 400),
  sample(LETTERS[4:20], 600, replace = TRUE)
)
feature <- sample(categories, n, replace = TRUE)
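# Target probabilities by category group: 0.7 for the X levels, 0.3 for the tail.
# Note: the tail is drawn from LETTERS[4:20] (D to T), so no category matches
# "[A-C]" and the 0.5 branch is never used with this data.
probs <- ifelse(grepl("X", feature), 0.7,
  ifelse(grepl("[A-C]", feature), 0.5, 0.3)
)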
target <- rbinom(n, 1, prob = probs)
# Perform user-defined technique binning
result <- ob_categorical_udt(feature, target)
print(result[c("bin", "woe", "iv", "count")])
#> $bin
#> [1] "K%;%S%;%H%;%F%;%O%;%T%;%I%;%G%;%M%;%P%;%L%;%E%;%Q%;%R"
#> [2] "N%;%J%;%D"
#> [3] "X3"
#> [4] "X2"
#> [5] "X1"
#>
#> $woe
#> [1] -1.3704280 -1.2269348 0.2610244 0.3379009 0.3845215
#>
#> $iv
#> [1] 0.322320167 0.037475936 0.009463915 0.029143017 0.055138785
#>
#> $count
#> [1] 513 73 431 802 1181
#>
# Adjust parameters for finer control
result_custom <- ob_categorical_udt(
  feature = feature,
  target = target,
  min_bins = 2,
  max_bins = 7,
  bin_cutoff = 0.03
)
# Handling missing values
feature_with_na <- feature
feature_with_na[sample(length(feature_with_na), 150)] <- NA
result_na <- ob_categorical_udt(feature_with_na, target)