Optimal Binning for Categorical Variables using a User-Defined Technique (UDT)
optimal_binning_categorical_udt.Rd
This function performs binning for categorical variables using a user-defined technique (UDT). The algorithm creates bins with optimal predictive power (measured by Information Value) while maintaining monotonicity of Weight of Evidence and avoiding the creation of artificial categories. The implementation adds statistical robustness through Laplace smoothing of the Weight of Evidence estimates and uses Jensen-Shannon divergence to decide which bins to merge.
Usage
optimal_binning_categorical_udt(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer binary vector (0 or 1) representing the response variable.
- feature
Character vector representing the categories of the explanatory variable.
- min_bins
Minimum number of desired bins (default: 3).
- max_bins
Maximum number of desired bins (default: 5).
- bin_cutoff
Minimum proportion of observations to consider a category as a separate bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before the main binning step (default: 20).
- bin_separator
String used to separate names of categories grouped in the same bin (default: "%;%").
- convergence_threshold
Threshold for the stopping criterion based on IV convergence (default: 1e-6).
- max_iterations
Maximum number of iterations in the optimization process (default: 1000).
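For illustration, a minimal call on simulated data might look as follows (the data-generating code below is illustrative and not part of the package):

# Simulated categorical predictor with five levels
set.seed(42)
categories <- c("A", "B", "C", "D", "E")
feature <- sample(categories, 1000, replace = TRUE)
# Event probability varies by category
prob <- c(A = 0.05, B = 0.10, C = 0.20, D = 0.35, E = 0.50)[feature]
target <- rbinom(1000, 1, prob)

result <- optimal_binning_categorical_udt(
  target, feature,
  min_bins = 2L,
  max_bins = 4L
)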
Value
A list containing:
id: Numeric identifiers for each bin.
bin: String vector with bin names representing grouped categories.
woe: Numeric vector with Weight of Evidence values for each bin.
iv: Numeric vector with Information Value for each bin.
count: Integer vector with the total count of observations in each bin.
count_pos: Integer vector with the count of positive cases (target=1) in each bin.
count_neg: Integer vector with the count of negative cases (target=0) in each bin.
event_rate: Numeric vector with the proportion of positive cases in each bin.
converged: Logical value indicating if the algorithm converged.
iterations: Integer value indicating the number of optimization iterations executed.
total_iv: The total Information Value of the binning solution.
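Continuing the simulated example above, the returned components can be inspected directly:

# Bin-level summary
data.frame(
  bin        = result$bin,
  woe        = result$woe,
  iv         = result$iv,
  count      = result$count,
  event_rate = result$event_rate
)

# Overall diagnostics
result$total_iv    # total Information Value of the solution
result$converged   # whether the optimization converged
result$iterations  # number of iterations executed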
Details
Statistical Methodology
The UDT algorithm optimizes the binning using the statistical concepts of Weight of Evidence (WoE) and Information Value (IV), applying Laplace smoothing for robustness:
Weight of Evidence measures the predictive power of a bin: $$WoE_i = \ln\left(\frac{(n_{i+} + \alpha)/(n_+ + 2\alpha)}{(n_{i-} + \alpha)/(n_- + 2\alpha)}\right)$$
Where:
\(n_{i+}\) is the number of positive cases (target=1) in bin i
\(n_{i-}\) is the number of negative cases (target=0) in bin i
\(n_+\) is the total number of positive cases
\(n_-\) is the total number of negative cases
\(\alpha\) is the Laplace smoothing parameter (default: 0.5)
Information Value measures the overall predictive power: $$IV_i = \left(\frac{n_{i+}}{n_+} - \frac{n_{i-}}{n_-}\right) \times WoE_i$$ $$IV_{total} = \sum_{i=1}^{k} |IV_i|$$
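As an illustration of these formulas (not the package's internal code), the smoothed WoE and the IV can be computed for a small set of bin counts, assuming the default smoothing parameter alpha = 0.5:

# Example bin counts (positives and negatives per bin)
count_pos <- c(20, 50, 130)
count_neg <- c(180, 250, 370)
alpha <- 0.5  # Laplace smoothing parameter

n_pos <- sum(count_pos)
n_neg <- sum(count_neg)

# Smoothed Weight of Evidence per bin
woe <- log(((count_pos + alpha) / (n_pos + 2 * alpha)) /
           ((count_neg + alpha) / (n_neg + 2 * alpha)))

# Information Value per bin and in total
iv <- (count_pos / n_pos - count_neg / n_neg) * woe
total_iv <- sum(abs(iv))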
Algorithm Steps
1. Validate the inputs and create initial bins (one bin per unique category).
2. Apply special handling for variables with only 1-2 unique levels.
3. Merge low-frequency categories that fall below the bin_cutoff threshold (see the sketch after this list).
4. Calculate WoE and IV for each bin using Laplace smoothing.
5. Iteratively merge similar bins, as measured by Jensen-Shannon divergence, until the constraints are satisfied.
6. Ensure WoE monotonicity across bins for better interpretability.
The process continues until convergence or until max_iterations is reached.
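The low-frequency merge in step 3 can be sketched as follows; this is a simplified standalone illustration, and the package may instead fold rare categories into their most similar neighbouring bin:

bin_cutoff <- 0.05
feature <- c(rep("A", 500), rep("B", 400), rep("C", 60),
             rep("D", 25), rep("E", 15))
# Relative frequency of each category
freq <- prop.table(table(feature))
rare <- names(freq)[freq < bin_cutoff]  # here: "D" and "E"
# Group the rare categories into a single pre-bin, joining their
# names with the default bin_separator "%;%"
feature_prebinned <- ifelse(feature %in% rare,
                            paste(rare, collapse = "%;%"),
                            feature)
table(feature_prebinned)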
The algorithm uses Jensen-Shannon divergence to measure statistical similarity between bins: $$JS(P||Q) = \frac{1}{2}KL(P||M) + \frac{1}{2}KL(Q||M)$$
Where:
\(KL\) is the Kullback-Leibler divergence
\(M = \frac{1}{2}(P+Q)\) is the midpoint distribution
\(P\) and \(Q\) are the event rate distributions of two bins
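A self-contained sketch of the JS-based merging is shown below. It treats each bin's event rate as a Bernoulli distribution, merges the most similar adjacent pair of bins (after ordering bins by event rate, an assumption made here to keep WoE monotone), and stops once max_bins is reached; the package's actual implementation may differ in its merging order and stopping rules.

# Bins represented by their positive/negative counts (illustrative data)
bins <- data.frame(
  bin       = c("A", "B", "C", "D", "E", "F"),
  count_pos = c(5, 12, 30, 33, 60, 80),
  count_neg = c(95, 120, 170, 160, 140, 120),
  stringsAsFactors = FALSE
)
max_bins <- 4

# Jensen-Shannon divergence between two Bernoulli (event-rate) distributions
js_div <- function(p, q) {
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  P <- c(p, 1 - p); Q <- c(q, 1 - q); M <- (P + Q) / 2
  0.5 * kl(P, M) + 0.5 * kl(Q, M)
}

# Order bins by event rate so adjacent merges keep WoE monotone (assumption)
event_rate <- function(b) b$count_pos / (b$count_pos + b$count_neg)
bins <- bins[order(event_rate(bins)), ]

while (nrow(bins) > max_bins) {
  rates <- event_rate(bins)
  # Divergence between each pair of adjacent bins
  d <- sapply(seq_len(nrow(bins) - 1),
              function(i) js_div(rates[i], rates[i + 1]))
  i <- which.min(d)
  # Merge bins i and i + 1, joining their labels with the separator
  bins$bin[i]       <- paste(bins$bin[i], bins$bin[i + 1], sep = "%;%")
  bins$count_pos[i] <- bins$count_pos[i] + bins$count_pos[i + 1]
  bins$count_neg[i] <- bins$count_neg[i] + bins$count_neg[i + 1]
  bins <- bins[-(i + 1), ]
}
bins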
References
Beltrán, C., et al. (2022). Weight of Evidence (WoE) and Information Value (IV): A novel implementation for predictive modeling in credit scoring. Expert Systems with Applications, 183, 115351.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.