Optimal Binning for Categorical Variables with Multinomial Target using JEDI-MWoE
optimal_binning_categorical_jedi_mwoe.Rd
Implements an optimized categorical binning algorithm that extends the JEDI (Joint Entropy Discretization and Integration) framework to handle multinomial response variables using M-WOE (Multinomial Weight of Evidence). This implementation provides a robust solution for categorical feature discretization in multinomial classification problems while maintaining monotonic relationships and optimizing information value.
Usage
optimal_binning_categorical_jedi_mwoe(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000L
)
Arguments
- target
Integer vector of class labels (0 to n_classes-1). Must be consecutive integers starting from 0.
- feature
Character vector of categorical values to be binned. Must have the same length as target.
- min_bins
Minimum number of bins in the output (default: 3). Automatically adjusted downward if the number of unique categories is less than min_bins. Value must be >= 1.
- max_bins
Maximum number of bins allowed in the output (default: 5). Must be >= min_bins. Algorithm will merge bins if necessary to meet this constraint.
- bin_cutoff
Minimum relative frequency threshold for individual bins (default: 0.05). Categories with frequency below this threshold will be candidates for merging. Value must be between 0 and 1.
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20). Controls initial complexity before optimization phase. Must be >= min_bins.
- bin_separator
String separator used when combining category names (default: "%;%"). Used to create readable bin labels.
- convergence_threshold
Convergence threshold for Information Value change (default: 1e-6). Algorithm stops when IV change is below this value.
- max_iterations
Maximum number of optimization iterations (default: 1000). Prevents infinite loops in edge cases.
Value
A list containing:
id: Numeric identifiers for each bin.
bin: Character vector of bin names (concatenated categories).
woe: Numeric matrix (n_bins × n_classes) of M-WOE values for each class.
iv: Numeric matrix (n_bins × n_classes) of IV contributions for each class.
count: Integer vector of total observation counts per bin.
class_counts: Integer matrix (n_bins × n_classes) of counts per class per bin.
class_rates: Numeric matrix (n_bins × n_classes) of class rates per bin.
converged: Logical indicating whether algorithm converged.
iterations: Integer count of optimization iterations performed.
n_classes: Integer indicating number of classes detected.
total_iv: Numeric vector of total IV per class.
Details
The algorithm implements an information-theoretic binning strategy that extends the traditional binary WOE to multiple classes.
Mathematical Framework
M-WOE Calculation (with Laplace smoothing): For each bin i and class k: $$\text{M-WOE}_{i,k} = \ln\left(\frac{P(X = x_i \mid Y = k)}{P(X = x_i \mid Y \neq k)}\right)$$ $$= \ln\left(\frac{(n_{k,i} + \alpha)/(N_k + 2\alpha)}{\left(\sum_{j \neq k} n_{j,i} + \alpha\right)/\left(\sum_{j \neq k} N_j + 2\alpha\right)}\right)$$
where:
\(n_{k,i}\) is the count of class k in bin i
\(N_k\) is the total count of class k
\(\alpha\) is the Laplace smoothing parameter (default: 0.5)
The denominator represents the proportion in all other classes combined
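For concreteness, the smoothed M-WOE of a single bin can be reproduced in R as below; mwoe_bin and its arguments are illustrative names, not part of the package API.

# Sketch: smoothed M-WOE for one bin, mirroring the formula above (alpha = 0.5)
mwoe_bin <- function(class_counts_bin, class_totals, k, alpha = 0.5) {
  n_ki   <- class_counts_bin[k]        # n_{k,i}: count of class k in bin i
  N_k    <- class_totals[k]            # N_k: total count of class k
  n_rest <- sum(class_counts_bin[-k])  # counts of all other classes in bin i
  N_rest <- sum(class_totals[-k])      # totals of all other classes
  log(((n_ki + alpha) / (N_k + 2 * alpha)) /
      ((n_rest + alpha) / (N_rest + 2 * alpha)))
}
mwoe_bin(c(10, 3, 2), c(50, 30, 20), k = 1)  # M-WOE of class 1 for this bin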
Information Value: For each class k: $$IV_k = \sum_{i=1}^{n} \left(P(X = x_i \mid Y = k) - P(X = x_i \mid Y \neq k)\right) \times \text{M-WOE}_{i,k}$$
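The per-class IV can be sketched the same way (smoothing omitted for brevity; class_counts is an illustrative n_bins x n_classes count matrix, not a package object):

iv_class <- function(class_counts, k) {
  p <- class_counts[, k] / sum(class_counts[, k])    # P(X = x_i | Y = k) per bin
  rest <- rowSums(class_counts[, -k, drop = FALSE])
  q <- rest / sum(rest)                              # P(X = x_i | Y != k) per bin
  sum((p - q) * log(p / q))                          # sum of per-bin contributions
}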
Jensen-Shannon Divergence: For measuring statistical similarity between bins: $$JS(P||Q) = \frac{1}{2}KL(P||M) + \frac{1}{2}KL(Q||M)$$
where:
\(KL\) is the Kullback-Leibler divergence
\(M = \frac{1}{2}(P+Q)\) is the midpoint distribution
\(P\) and \(Q\) are the class distributions of two bins
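A minimal R sketch of this similarity measure, assuming p and q are the class-count vectors of two bins (illustrative only):

js_divergence <- function(p, q) {
  p <- p / sum(p); q <- q / sum(q)  # normalize counts to distributions
  m <- (p + q) / 2                  # midpoint distribution M
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))  # KL(a||b)
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}
js_divergence(c(10, 3, 2), c(8, 4, 3))  # small value: similar bins, merge candidates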
Optimization Objective: $$\text{maximize} \sum_{k=1}^{K} IV_k$$ subject to:
Monotonicity constraints for each class
Minimum bin size constraints
Number of bins constraints
Algorithm Phases
1. Initial Binning: Creates individual bins for unique categories
2. Low Frequency Treatment: Merges rare categories based on bin_cutoff
3. Monotonicity Optimization: Iteratively merges bins while maintaining monotonicity (a simplified sketch of this merge loop follows the list)
4. Final Adjustment: Ensures constraints on the number of bins are met
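A simplified picture of phases 2-4 is a greedy loop that repeatedly fuses the most statistically similar bins. The sketch below reuses js_divergence from above; greedy_merge and bin_stats (an n_bins x n_classes count matrix) are illustrative names, not the package internals, and the real algorithm additionally enforces monotonicity and the IV convergence test.

greedy_merge <- function(bin_stats, max_bins) {
  while (nrow(bin_stats) > max_bins) {
    # JS divergence between each adjacent pair of bins
    js <- sapply(seq_len(nrow(bin_stats) - 1), function(i)
      js_divergence(bin_stats[i, ], bin_stats[i + 1, ]))
    i <- which.min(js)                                     # most similar pair
    bin_stats[i, ] <- bin_stats[i, ] + bin_stats[i + 1, ]  # merge counts
    bin_stats <- bin_stats[-(i + 1), , drop = FALSE]
  }
  bin_stats
}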
Note
Performance Considerations:
Time complexity: O(n_classes * n_samples * log(n_samples))
Space complexity: O(n_classes * n_bins)
For large datasets, the initial binning phase may be memory-intensive
Edge Cases:
Single category: Returns the original category as a single bin
All samples in one class: Produces a degenerate case with a warning
Missing values: Treated as a special category "MISSING"
References
Beltrami, M. et al. (2021). JEDI: Joint Entropy Discretization and Integration. arXiv preprint arXiv:2101.03228.
Thomas, L.C. (2009). Consumer Credit Models: Pricing, Profit and Portfolios. Oxford University Press.
Good, I.J. (1950). Probability and the Weighing of Evidence. Charles Griffin & Company.
Kullback, S. (1959). Information Theory and Statistics. John Wiley & Sons.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
See also
optimal_binning_categorical_jedi for binary classification
woe_transformation for applying WOE transformation
Examples
# Basic usage with 3 classes
feature <- c("A", "B", "A", "C", "B", "D", "A")
target <- c(0L, 1L, 2L, 1L, 0L, 2L, 1L)  # integer labels 0..2
result <- optimal_binning_categorical_jedi_mwoe(target, feature)
# With custom parameters
result <- optimal_binning_categorical_jedi_mwoe(
target = target,
feature = feature,
min_bins = 2,
max_bins = 4,
bin_cutoff = 0.1,
max_n_prebins = 15,
convergence_threshold = 1e-8
)
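# Inspect the returned structure (fields documented in the Value section)
result$bin       # bin labels: categories joined by bin_separator
result$woe       # n_bins x n_classes matrix of M-WOE values
result$total_iv  # total information value per class
result$converged # TRUE if the IV change fell below convergence_threshold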