Optimal Binning for Categorical Variables using a User-Defined Technique (UDT)
optimal_binning_categorical_udt.Rd
This function performs binning for categorical variables using a user-defined technique (UDT). The algorithm creates bins with optimal predictive power (measured by Information Value) while maintaining monotonicity of Weight of Evidence and avoiding the creation of artificial categories. The implementation adds statistical robustness through Laplace smoothing of the Weight of Evidence estimates and uses Jensen-Shannon divergence to decide which bins to merge.
Usage
optimal_binning_categorical_udt(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector representing the categories of the explanatory variable.
- min_bins
Minimum number of desired bins (default: 3).
- max_bins
Maximum number of desired bins (default: 5).
- bin_cutoff
Minimum proportion of observations to consider a category as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before the main binning step (default: 20).
- bin_separator
String used to separate names of categories grouped in the same bin (default: "%;%").
- convergence_threshold
Threshold for the stopping criterion based on IV convergence (default: 1e-6).
- max_iterations
Maximum number of iterations in the optimization process (default: 1000).
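For illustration, a minimal call on simulated data might look as follows (the data-generating code below is illustrative and not part of the package):

# Simulated categorical predictor with five levels
set.seed(42)
categories <- c("A", "B", "C", "D", "E")
feature <- sample(categories, 1000, replace = TRUE)
# Event probability varies by category
prob <- c(A = 0.05, B = 0.10, C = 0.20, D = 0.35, E = 0.50)[feature]
target <- rbinom(1000, 1, prob)

result <- optimal_binning_categorical_udt(
  target, feature,
  min_bins = 2L,
  max_bins = 4L
)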
Value
A list containing:
id: Numeric identifiers for each bin.
bin: String vector with bin names representing grouped categories.
woe: Numeric vector with Weight of Evidence values for each bin.
iv: Numeric vector with Information Value for each bin.
count: Integer vector with the total count of observations in each bin.
count_pos: Integer vector with the count of positive cases (target=1) in each bin.
count_neg: Integer vector with the count of negative cases (target=0) in each bin.
event_rate: Numeric vector with the proportion of positive cases in each bin.
converged: Logical value indicating if the algorithm converged.
iterations: Integer value indicating the number of optimization iterations executed.
total_iv: The total Information Value of the binning solution.
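Continuing the simulated example above, the returned components can be inspected directly:

# Bin-level summary
data.frame(
  bin        = result$bin,
  woe        = result$woe,
  iv         = result$iv,
  count      = result$count,
  event_rate = result$event_rate
)

# Overall diagnostics
result$total_iv    # total Information Value of the solution
result$converged   # whether the optimization converged
result$iterations  # number of iterations executed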
Details
Statistical Methodology
The UDT algorithm optimizes the binning using the statistical concepts of Weight of Evidence (WoE) and Information Value (IV), applying Laplace smoothing for robustness:
Weight of Evidence measures the predictive power of a bin: $$WoE_i = \ln\left(\frac{(n_{i+} + \alpha)/(n_+ + 2\alpha)}{(n_{i-} + \alpha)/(n_- + 2\alpha)}\right)$$
Where:
\(n_{i+}\) is the number of positive cases (target=1) in bin i
\(n_{i-}\) is the number of negative cases (target=0) in bin i
\(n_+\) is the total number of positive cases
\(n_-\) is the total number of negative cases
\(\alpha\) is the Laplace smoothing parameter (default: 0.5)
Information Value measures the overall predictive power: $$IV_i = \left(\frac{n_{i+}}{n_+} - \frac{n_{i-}}{n_-}\right) \times WoE_i$$ $$IV_{total} = \sum_{i=1}^{k} |IV_i|$$
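As an illustration of these formulas (not the package's internal code), the smoothed WoE and the IV can be computed for a small set of bin counts, assuming the default smoothing parameter alpha = 0.5:

# Example bin counts (positives and negatives per bin)
count_pos <- c(20, 50, 130)
count_neg <- c(180, 250, 370)
alpha <- 0.5  # Laplace smoothing parameter

n_pos <- sum(count_pos)
n_neg <- sum(count_neg)

# Smoothed Weight of Evidence per bin
woe <- log(((count_pos + alpha) / (n_pos + 2 * alpha)) /
           ((count_neg + alpha) / (n_neg + 2 * alpha)))

# Information Value per bin and in total
iv <- (count_pos / n_pos - count_neg / n_neg) * woe
total_iv <- sum(abs(iv))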
Algorithm Steps
1. Validate the inputs and create initial bins (one bin per unique category).
2. Apply special handling for variables with only 1-2 unique levels.
3. Merge low-frequency categories that fall below the bin_cutoff threshold (see the sketch after this list).
4. Calculate WoE and IV for each bin using Laplace smoothing.
5. Iteratively merge similar bins, as measured by Jensen-Shannon divergence, until the constraints are satisfied.
6. Ensure WoE monotonicity across bins for better interpretability.
The process continues until convergence or until max_iterations is reached.
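The low-frequency merge in step 3 can be sketched as follows; this is a simplified standalone illustration, and the package may instead fold rare categories into their most similar neighbouring bin:

bin_cutoff <- 0.05
feature <- c(rep("A", 500), rep("B", 400), rep("C", 60),
             rep("D", 25), rep("E", 15))
# Relative frequency of each category
freq <- prop.table(table(feature))
rare <- names(freq)[freq < bin_cutoff]  # here: "D" and "E"
# Group the rare categories into a single pre-bin, joining their
# names with the default bin_separator "%;%"
feature_prebinned <- ifelse(feature %in% rare,
                            paste(rare, collapse = "%;%"),
                            feature)
table(feature_prebinned)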
The algorithm uses Jensen-Shannon divergence to measure statistical similarity between bins: $$JS(P||Q) = \frac{1}{2}KL(P||M) + \frac{1}{2}KL(Q||M)$$
Where:
\(KL\) is the Kullback-Leibler divergence
\(M = \frac{1}{2}(P+Q)\) is the midpoint distribution
\(P\) and \(Q\) are the event rate distributions of two bins
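A self-contained sketch of the JS-based merging is shown below. It treats each bin's event rate as a Bernoulli distribution, merges the most similar adjacent pair of bins (after ordering bins by event rate, an assumption made here to keep WoE monotone), and stops once max_bins is reached; the package's actual implementation may differ in its merging order and stopping rules.

# Bins represented by their positive/negative counts (illustrative data)
bins <- data.frame(
  bin       = c("A", "B", "C", "D", "E", "F"),
  count_pos = c(5, 12, 30, 33, 60, 80),
  count_neg = c(95, 120, 170, 160, 140, 120),
  stringsAsFactors = FALSE
)
max_bins <- 4

# Jensen-Shannon divergence between two Bernoulli (event-rate) distributions
js_div <- function(p, q) {
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  P <- c(p, 1 - p); Q <- c(q, 1 - q); M <- (P + Q) / 2
  0.5 * kl(P, M) + 0.5 * kl(Q, M)
}

# Order bins by event rate so adjacent merges keep WoE monotone (assumption)
event_rate <- function(b) b$count_pos / (b$count_pos + b$count_neg)
bins <- bins[order(event_rate(bins)), ]

while (nrow(bins) > max_bins) {
  rates <- event_rate(bins)
  # Divergence between each pair of adjacent bins
  d <- sapply(seq_len(nrow(bins) - 1),
              function(i) js_div(rates[i], rates[i + 1]))
  i <- which.min(d)
  # Merge bins i and i + 1, joining their labels with the separator
  bins$bin[i]       <- paste(bins$bin[i], bins$bin[i + 1], sep = "%;%")
  bins$count_pos[i] <- bins$count_pos[i] + bins$count_pos[i + 1]
  bins$count_neg[i] <- bins$count_neg[i] + bins$count_neg[i + 1]
  bins <- bins[-(i + 1), ]
}
bins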
References
Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV): A novel implementation for predictive modeling in credit scoring. Expert Systems with Applications, 183, 115351.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.