This function performs optimal binning for categorical variables using the Similarity-Based Logistic Partitioning (SBLP) algorithm. The approach combines a logistic ordering of categories (sorting by event rate) with dynamic programming to find the partition that maximizes Information Value (IV).
Usage
ob_categorical_sblp(
feature,
target,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_separator = "%;%",
alpha = 0.5
)
Arguments
- feature
A character vector or factor representing the categorical predictor variable. Missing values (NA) will be converted to the string "NA" and treated as a separate category.
- target
An integer vector containing binary outcome values (0 or 1). Must be the same length as
feature. Cannot contain missing values.
- min_bins
Integer. Minimum number of bins to create. Must be at least 2. Default is 3.
- max_bins
Integer. Maximum number of bins to create. Must be greater than or equal to
min_bins. Default is 5.
- bin_cutoff
Numeric. Minimum relative frequency threshold for individual categories. Categories with frequency below this proportion will be merged with similar categories before the main optimization. Value must be between 0 and 1. Default is 0.05 (5%).
- max_n_prebins
Integer. Maximum number of initial bins/groups allowed before the dynamic programming optimization. If the number of unique categories exceeds this, similar adjacent categories are pre-merged. Default is 20.
- convergence_threshold
Numeric. Threshold for determining algorithm convergence based on changes in total Information Value. Default is 1e-6.
- max_iterations
Integer. Maximum number of iterations for the optimization process. Default is 1000.
- bin_separator
Character string used to separate category names when multiple categories are merged into a single bin. Default is "%;%".
- alpha
Numeric. Laplace smoothing parameter added to counts to avoid division by zero and stabilize WoE calculations for sparse data. Must be non-negative. Default is 0.5.
Value
A list containing the results of the optimal binning procedure:
- id
Numeric vector of bin identifiers (1 to n_bins).
- bin
Character vector of bin labels, which are combinations of original categories separated by bin_separator.
- woe
Numeric vector of Weight of Evidence values for each bin.
- iv
Numeric vector of Information Value contributions for each bin.
- count
Integer vector of total observations in each bin.
- count_pos
Integer vector of positive outcomes (target = 1) in each bin.
- count_neg
Integer vector of negative outcomes (target = 0) in each bin.
- rate
Numeric vector of the observed event rate in each bin.
- total_iv
Numeric scalar. Total Information Value across all bins.
- converged
Logical. Whether the algorithm converged.
- iterations
Integer. Number of iterations performed.
Details
The SBLP algorithm follows these steps:
1. Preprocessing: Handling of missing values and calculation of initial statistics.
2. Rare Category Consolidation: Categories with frequency below bin_cutoff are merged with statistically similar categories based on their target rates.
3. Sorting: Unique categories (or merged groups) are sorted by their empirical event rate (probability of target = 1).
4. Dynamic Programming: An optimal partitioning algorithm (similar to Jenks Natural Breaks, but maximizing IV rather than minimizing within-class variance) is applied to the sorted sequence to determine the cutpoints that maximize total IV.
5. Refinement: Post-processing ensures constraints such as monotonicity and minimum bin size are met.
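Steps 3 and 4 can be illustrated with a small standalone sketch. This is not the package's internal code: the variable names, the fixed bin count K, and the simplified smoothing inside block_iv are all illustrative assumptions.

# Sketch of sorting + dynamic programming over sorted categories (steps 3-4)
set.seed(42)
feature <- sample(letters[1:6], 500, replace = TRUE)
target  <- rbinom(500, 1, ifelse(feature %in% c("a", "b"), 0.7, 0.3))

# Step 3: per-category counts, ordered by empirical event rate
pos <- tapply(target, feature, sum)
tot <- tapply(target, feature, length)
neg <- tot - pos
ord <- order(pos / tot)
pos <- as.numeric(pos[ord])
neg <- as.numeric(neg[ord])

# IV contribution of a contiguous block [i, j] of sorted categories
# (smoothing here is simplified relative to the formulas in Details)
alpha <- 0.5
P <- sum(pos); N <- sum(neg)
block_iv <- function(i, j) {
  dp <- (sum(pos[i:j]) + alpha) / (P + alpha)
  dn <- (sum(neg[i:j]) + alpha) / (N + alpha)
  (dp - dn) * log(dp / dn)
}

# Step 4: DP over the sorted sequence; best[k, j] is the maximal total IV
# achievable by splitting categories 1..j into k contiguous bins
K <- 3
n <- length(pos)
best <- matrix(-Inf, K, n)
best[1, ] <- sapply(seq_len(n), function(j) block_iv(1, j))
for (k in 2:K) {
  for (j in k:n) {
    for (s in (k - 1):(j - 1)) {
      val <- best[k - 1, s] + block_iv(s + 1, j)
      if (val > best[k, j]) best[k, j] <- val
    }
  }
}
best[K, n]  # maximal total IV for K bins, given the sorted order

Because the categories are first sorted by event rate, the DP only has to consider contiguous blocks, which is what makes the exhaustive search tractable.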
A key feature of this implementation is the use of Laplace Smoothing
(controlled by the alpha parameter) to prevent infinite WoE values
and stabilize estimates for categories with small counts.
Mathematical definitions with smoothing:
The smoothed event rate \(p_i\) for a bin is calculated as: $$p_i = \frac{n_{pos} + \alpha}{n_{total} + 2\alpha}$$
The Weight of Evidence (WoE) is computed using smoothed proportions: $$WoE_i = \ln\left(\frac{p_i^{(1)}}{p_i^{(0)}}\right)$$
where \(p_i^{(1)}\) and \(p_i^{(0)}\) are the smoothed distributions of positive and negative classes across bins.
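The formulas above can be evaluated directly for a given set of per-bin counts. The following sketch assumes one common smoothing convention (alpha added per bin, n_bins * alpha added to each class total); the package's internal computation may differ in detail.

# Smoothed WoE and IV from per-bin counts (illustrative values)
alpha     <- 0.5
count_pos <- c(30, 50, 120)
count_neg <- c(170, 150, 80)
n_bins    <- length(count_pos)

# Smoothed class distributions across bins: p_i^(1) and p_i^(0)
dist_pos <- (count_pos + alpha) / (sum(count_pos) + n_bins * alpha)
dist_neg <- (count_neg + alpha) / (sum(count_neg) + n_bins * alpha)

woe      <- log(dist_pos / dist_neg)     # WoE_i
iv       <- (dist_pos - dist_neg) * woe  # per-bin IV contribution
total_iv <- sum(iv)

With alpha > 0 both distributions are strictly positive, so woe is always finite even for bins with zero positives or zero negatives.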
Note
The target variable must contain both 0 and 1 values.
Unlike heuristic methods, this algorithm uses dynamic programming, which guarantees an optimal partition given the sorted order of categories.
Monotonicity is generally induced by the sorting step, but it is also explicitly checked and corrected in the final output.
Examples
# Generate sample data
set.seed(123)
n <- 1000
feature <- sample(letters[1:8], n, replace = TRUE)
# Create a relationship where 'a' and 'b' have high probability
target <- rbinom(n, 1, prob = ifelse(feature %in% c("a", "b"), 0.8, 0.2))
# Perform optimal binning
result <- ob_categorical_sblp(feature, target)
print(result[c("bin", "woe", "iv", "count")])
#> $bin
#> [1] "f" "d%;%e" "c%;%g%;%h" "a" "b"
#>
#> $woe
#> [1] -1.1094145 -0.8567210 -0.6674771 1.8184450 2.2121155
#>
#> $iv
#> [1] 0.1173476 0.1453568 0.1523956 0.4263670 0.5956975
#>
#> $count
#> [1] 120 235 389 128 128
#>
# Using a higher smoothing parameter (alpha)
result_smooth <- ob_categorical_sblp(
feature = feature,
target = target,
alpha = 1.0
)
# Handling missing values
feature_with_na <- feature
feature_with_na[sample(length(feature_with_na), 50)] <- NA
result_na <- ob_categorical_sblp(feature_with_na, target)
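# NAs were recoded to the literal category "NA" and merged into one of the
# bins, so "NA" appears somewhere in the bin labels (which bin depends on
# the data)
result_na$bin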
