
Optimal Binning for Categorical Variables using JEDI Algorithm
Source: R/obc_jedi.R
ob_categorical_jedi.Rd

Performs supervised discretization of categorical variables using the Joint Entropy-Driven Information Maximization (JEDI) algorithm. This advanced method combines information-theoretic optimization with intelligent bin-merging strategies, employing Bayesian smoothing for numerical stability and adaptive monotonicity enforcement to produce robust, interpretable binning solutions.
Usage
ob_categorical_jedi(
feature,
target,
min_bins = 3,
max_bins = 5,
bin_cutoff = 0.05,
max_n_prebins = 20,
bin_separator = "%;%",
convergence_threshold = 1e-06,
max_iterations = 1000
)

Arguments
- feature
A character vector or factor representing the categorical predictor variable to be binned. Missing values are automatically converted to the category "NA".
- target
An integer vector of binary outcomes (0/1) corresponding to each observation in feature. Missing values are not permitted.
- min_bins
Integer. Minimum number of bins to produce. Must be >= 2. The algorithm will not merge below this threshold. Defaults to 3.
- max_bins
Integer. Maximum number of bins to produce. Must be >= min_bins. The algorithm iteratively merges until this constraint is satisfied. Defaults to 5.
- bin_cutoff
Numeric. Minimum proportion of total observations required for a category to remain separate during initialization. Categories below this threshold are pre-merged into an "Others" bin. Must be in (0, 1). Defaults to 0.05.
- max_n_prebins
Integer. Maximum number of initial bins before the main optimization phase. Controls computational complexity for high-cardinality features. Must be >= min_bins. Defaults to 20.
- bin_separator
Character string used to concatenate category names when multiple categories are merged into a single bin. Defaults to "%;%".
- convergence_threshold
Numeric. Convergence tolerance based on the change in Information Value between iterations. The algorithm stops when \(|\Delta IV| <\) convergence_threshold. Must be > 0. Defaults to 1e-6.
- max_iterations
Integer. Maximum number of optimization iterations. Prevents infinite loops in edge cases. Must be > 0. Defaults to 1000.
Value
A list containing the binning results with the following components:
- id: Integer vector of bin identifiers (1-indexed)
- bin: Character vector of bin labels (merged category names)
- woe: Numeric vector of Weight of Evidence values per bin
- iv: Numeric vector of Information Value contribution per bin
- count: Integer vector of total observations per bin
- count_pos: Integer vector of positive cases (target = 1) per bin
- count_neg: Integer vector of negative cases (target = 0) per bin
- total_iv: Numeric total Information Value of the binning solution
- converged: Logical indicating algorithm convergence
- iterations: Integer count of optimization iterations performed
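For scoring new observations, the returned bin labels and woe vector can be expanded into a category-level lookup. A minimal sketch (not part of the package), assuming bin labels join categories with bin_separator as documented above:

# Build a category -> WoE lookup from a binning result (illustrative).
# Assumes `result` is the list returned by ob_categorical_jedi() and that
# bin labels were joined with the default separator "%;%".
woe_lookup <- function(result, sep = "%;%") {
  cats <- strsplit(result$bin, sep, fixed = TRUE)
  data.frame(
    category = unlist(cats),
    woe = rep(result$woe, lengths(cats)),
    stringsAsFactors = FALSE
  )
}
# Usage: map <- woe_lookup(res); scores <- map$woe[match(new_x, map$category)]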
Details
The JEDI (Joint Entropy-Driven Information Maximization) algorithm represents a sophisticated approach to categorical binning that jointly optimizes Information Value while maintaining monotonic Weight of Evidence constraints through intelligent violation detection and repair strategies.
Algorithm Workflow:
Input validation and preprocessing
Initial bin creation (one category per bin)
Rare category merging (frequencies < bin_cutoff; see the sketch after this list)
WoE-based monotonic sorting
Pre-bin limitation via minimal IV-loss merging
Main optimization loop:
Monotonicity violation detection (peaks and valleys)
Violation severity quantification
Intelligent merge selection (minimize IV loss)
Convergence monitoring
Best solution tracking
Final constraint satisfaction (max_bins enforcement)
Bayesian-smoothed metric computation
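As a concrete illustration of the rare-category step, a minimal sketch in base R (the package's compiled internals may differ in detail):

# Rare-category pre-merge (workflow step 3, illustrative only):
# categories whose observation share falls below bin_cutoff are
# pooled into an "Others" bin before optimization.
x <- c("A", "A", "A", "B", "B", "C", "C", "C", "C", "D")
bin_cutoff <- 0.15
freq <- table(x) / length(x)
rare <- names(freq)[freq < bin_cutoff]   # here: "D" (share 0.10)
x_merged <- ifelse(x %in% rare, "Others", x)
table(x_merged)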
Joint Entropy-Driven Optimization:
Unlike greedy algorithms that optimize locally, JEDI considers the global impact of each merge on total Information Value:
$$IV_{total} = \sum_{i=1}^{k} (p_i - n_i) \times \ln\left(\frac{p_i}{n_i}\right)$$
For each potential merge of bins \(j\) and \(j+1\), JEDI evaluates:
$$\Delta IV_{j,j+1} = IV_{current} - IV_{merged}(j,j+1)$$
The pair with minimum \(\Delta IV\) (least information loss) is selected.
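A minimal sketch of this criterion with hypothetical helper names, using the raw IV formula above (counts are assumed positive here; in the package itself they are Bayesian-smoothed as described below):

iv_total <- function(pos, neg) {
  # pos, neg: per-bin counts; p_i, n_i are shares of positives/negatives
  p <- pos / sum(pos)
  n <- neg / sum(neg)
  sum((p - n) * log(p / n))
}
delta_iv <- function(pos, neg, j) {
  # IV lost by merging adjacent bins j and j + 1
  pos2 <- pos; neg2 <- neg
  pos2[j] <- pos2[j] + pos2[j + 1]
  neg2[j] <- neg2[j] + neg2[j + 1]
  iv_total(pos, neg) - iv_total(pos2[-(j + 1)], neg2[-(j + 1)])
}
# The adjacent pair with the smallest delta_iv() is merged first.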
Violation Detection and Repair:
JEDI identifies two types of monotonicity violations:
Peaks: \(WoE_i > WoE_{i-1}\) and \(WoE_i > WoE_{i+1}\)
Valleys: \(WoE_i < WoE_{i-1}\) and \(WoE_i < WoE_{i+1}\)
For each violation, severity is quantified as:
$$severity_i = \max\{|WoE_i - WoE_{i-1}|, |WoE_i - WoE_{i+1}|\}$$
The algorithm prioritizes fixing the most severe violation first, evaluating both forward merge \((i, i+1)\) and backward merge \((i-1, i)\) to select the option that minimizes information loss.
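A minimal sketch of the peak/valley scan and the severity score above (illustrative, not the package's internal code):

woe <- c(-1.2, -0.3, 0.4, 0.1, 0.9)   # bin 3 is a peak, bin 4 a valley
k <- length(woe)
for (i in 2:(k - 1)) {
  peak   <- woe[i] > woe[i - 1] && woe[i] > woe[i + 1]
  valley <- woe[i] < woe[i - 1] && woe[i] < woe[i + 1]
  if (peak || valley) {
    severity <- max(abs(woe[i] - woe[i - 1]), abs(woe[i] - woe[i + 1]))
    cat(sprintf("bin %d: %s, severity = %.2f\n",
                i, if (peak) "peak" else "valley", severity))
  }
}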
Bayesian Smoothing:
To ensure numerical stability with sparse bins, JEDI applies Bayesian smoothing:
$$p'_i = \frac{n_{i,pos} + \alpha_p}{N_{pos} + \alpha_{total}}$$ $$n'_i = \frac{n_{i,neg} + \alpha_n}{N_{neg} + \alpha_{total}}$$
where prior pseudocounts are proportional to overall prevalence:
$$\alpha_p = \alpha_{total} \times \frac{N_{pos}}{N_{pos} + N_{neg}}$$ $$\alpha_n = \alpha_{total} - \alpha_p$$
with \(\alpha_{total} = 1.0\) as the prior strength parameter.
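The smoothing can be reproduced directly from these formulas; a minimal sketch:

smoothed_woe <- function(pos, neg, alpha_total = 1.0) {
  # Prevalence-proportional pseudocounts keep WoE finite for zero-count bins
  N_pos <- sum(pos); N_neg <- sum(neg)
  a_p <- alpha_total * N_pos / (N_pos + N_neg)
  a_n <- alpha_total - a_p
  p <- (pos + a_p) / (N_pos + alpha_total)
  n <- (neg + a_n) / (N_neg + alpha_total)
  log(p / n)
}
# First bin has zero events, yet its WoE stays finite:
smoothed_woe(pos = c(0, 3, 16), neg = c(80, 120, 181))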
Adaptive Monotonicity Threshold:
Rather than using a fixed threshold, JEDI computes a context-aware tolerance:
$$\bar{\Delta} = \frac{1}{k-1}\sum_{i=1}^{k-1}|WoE_{i+1} - WoE_i|$$ $$\tau = \min(\epsilon, 0.01\bar{\Delta})$$
This adaptive approach prevents over-merging when natural WoE gaps are small.
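A direct transcription of the adaptive tolerance (the exact value of \(\epsilon\) used internally is an assumption here):

adaptive_tau <- function(woe, eps = 1e-6) {
  mean_gap <- mean(abs(diff(woe)))   # average adjacent WoE gap (Delta-bar)
  min(eps, 0.01 * mean_gap)          # tau = min(epsilon, 0.01 * Delta-bar)
}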
Computational Complexity:
Time: \(O(k^2 \cdot m)\) where \(k\) = bins, \(m\) = iterations
Space: \(O(k^2)\) for IV cache
Cache hit rate typically > 70% for \(k > 10\)
Key Innovations:
Joint optimization: Global IV consideration (vs. local greedy)
Smart violation repair: Severity-based prioritization
Bidirectional merge evaluation: Forward vs. backward analysis
Best solution tracking: Retains optimal intermediate states
Adaptive thresholds: Context-aware monotonicity tolerance
Comparison with Related Methods:
| Method | Optimization | Monotonicity | Speed |
|--------|--------------|--------------|-------|
| JEDI | Joint/Global | Adaptive | Medium |
| IVB | DP (Exact) | Enforced | Slow |
| GMB | Greedy/Local | Enforced | Fast |
| ChiMerge | Statistical | Optional | Fast |
References
Cover, T. M., & Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience. doi:10.1002/047174882X
Kullback, S. (1959). Information Theory and Statistics. Wiley.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulation and solution approach. Expert Systems with Applications, 158, 113508. doi:10.1016/j.eswa.2020.113508
Good, I. J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press.
Zeng, G. (2014). A necessary condition for a good binning algorithm in credit scoring. Applied Mathematical Sciences, 8(65), 3229-3242.
See also
ob_categorical_ivb for Information Value DP optimization,
ob_categorical_gmb for greedy merge binning,
ob_categorical_dp for general dynamic programming,
ob_categorical_cm for ChiMerge-based binning
Examples
# \donttest{
# Example 1: Basic JEDI optimization
set.seed(42)
n_obs <- 1500
# Simulate employment types with risk gradient
employment <- c(
"Permanent", "Contract", "Temporary", "SelfEmployed",
"Unemployed", "Student", "Retired"
)
risk_rates <- c(0.03, 0.08, 0.15, 0.12, 0.35, 0.25, 0.10)
cat_feature <- sample(employment, n_obs,
replace = TRUE,
prob = c(0.35, 0.20, 0.15, 0.12, 0.08, 0.06, 0.04)
)
bin_target <- sapply(cat_feature, function(x) {
rbinom(1, 1, risk_rates[which(employment == x)])
})
# Apply JEDI algorithm
result_jedi <- ob_categorical_jedi(
cat_feature,
bin_target,
min_bins = 3,
max_bins = 5
)
# Display results
print(data.frame(
Bin = result_jedi$bin,
WoE = round(result_jedi$woe, 3),
IV = round(result_jedi$iv, 4),
Count = result_jedi$count,
EventRate = round(result_jedi$count_pos / result_jedi$count, 3)
))
#> Bin WoE IV Count EventRate
#> 1 Permanent -1.841 0.6226 544 0.020
#> 2 Retired%;%SelfEmployed%;%Contract -0.026 0.0002 540 0.113
#> 3 Temporary 0.376 0.0218 200 0.160
#> 4 Student 1.050 0.1183 110 0.273
#> 5 Unemployed 1.489 0.2595 106 0.368
cat("\nTotal IV (jointly optimized):", round(result_jedi$total_iv, 4), "\n")
#>
#> Total IV (jointly optimized): 1.0224
cat("Converged:", result_jedi$converged, "\n")
#> Converged: TRUE
cat("Iterations:", result_jedi$iterations, "\n")
#> Iterations: 0
# Example 2: Method comparison (JEDI vs alternatives)
set.seed(123)
n_obs_comp <- 2000
departments <- c(
"Sales", "IT", "HR", "Finance", "Operations",
"Marketing", "Legal", "R&D"
)
cat_feature_comp <- sample(departments, n_obs_comp, replace = TRUE)
bin_target_comp <- rbinom(n_obs_comp, 1, 0.12)
# JEDI (joint optimization)
result_jedi_comp <- ob_categorical_jedi(
cat_feature_comp, bin_target_comp,
min_bins = 3, max_bins = 4
)
# IVB (exact DP)
result_ivb_comp <- ob_categorical_ivb(
cat_feature_comp, bin_target_comp,
min_bins = 3, max_bins = 4
)
# GMB (greedy)
result_gmb_comp <- ob_categorical_gmb(
cat_feature_comp, bin_target_comp,
min_bins = 3, max_bins = 4
)
cat("\nMethod comparison (Total IV):\n")
#>
#> Method comparison (Total IV):
cat(
" JEDI:", round(result_jedi_comp$total_iv, 4),
"- converged:", result_jedi_comp$converged, "\n"
)
#> JEDI: 0.0427 - converged: TRUE
cat(
" IVB:", round(result_ivb_comp$total_iv, 4),
"- converged:", result_ivb_comp$converged, "\n"
)
#> IVB: 0.0427 - converged: TRUE
cat(
" GMB:", round(result_gmb_comp$total_iv, 4),
"- converged:", result_gmb_comp$converged, "\n"
)
#> GMB: 0.0436 - converged: TRUE
# Example 3: Bayesian smoothing with sparse data
set.seed(789)
n_obs_sparse <- 400
# Small sample with rare events
categories <- c("A", "B", "C", "D", "E", "F", "G")
cat_probs <- c(0.25, 0.20, 0.18, 0.15, 0.12, 0.07, 0.03)
cat_feature_sparse <- sample(categories, n_obs_sparse,
replace = TRUE,
prob = cat_probs
)
bin_target_sparse <- rbinom(n_obs_sparse, 1, 0.05) # 5% event rate
result_jedi_sparse <- ob_categorical_jedi(
cat_feature_sparse,
bin_target_sparse,
min_bins = 2,
max_bins = 4,
bin_cutoff = 0.02
)
cat("\nBayesian smoothing (sparse data):\n")
#>
#> Bayesian smoothing (sparse data):
cat(" Sample size:", n_obs_sparse, "\n")
#> Sample size: 400
cat(" Total events:", sum(bin_target_sparse), "\n")
#> Total events: 19
cat(" Event rate:", round(mean(bin_target_sparse), 4), "\n")
#> Event rate: 0.0475
cat(" Bins created:", length(result_jedi_sparse$bin), "\n\n")
#> Bins created: 4
#>
# Show how smoothing prevents extreme WoE values
for (i in seq_along(result_jedi_sparse$bin)) {
cat(sprintf(
" Bin %d: events=%d/%d, WoE=%.3f (smoothed)\n",
i,
result_jedi_sparse$count_pos[i],
result_jedi_sparse$count[i],
result_jedi_sparse$woe[i]
))
}
#> Bin 1: events=1/78, WoE=-1.353 (smoothed)
#> Bin 2: events=3/111, WoE=-0.606 (smoothed)
#> Bin 3: events=8/151, WoE=0.090 (smoothed)
#> Bin 4: events=7/60, WoE=0.944 (smoothed)
# Example 4: Violation detection and repair
set.seed(456)
n_obs_viol <- 1200
# Create feature with non-monotonic risk pattern
risk_categories <- c(
"VeryLow", "Low", "MediumHigh", "Medium", # Intentional non-monotonic
"High", "VeryHigh"
)
actual_risks <- c(0.02, 0.05, 0.20, 0.12, 0.25, 0.40) # MediumHigh > Medium
cat_feature_viol <- sample(risk_categories, n_obs_viol, replace = TRUE)
bin_target_viol <- sapply(cat_feature_viol, function(x) {
rbinom(1, 1, actual_risks[which(risk_categories == x)])
})
result_jedi_viol <- ob_categorical_jedi(
cat_feature_viol,
bin_target_viol,
min_bins = 3,
max_bins = 5,
max_iterations = 50
)
cat("\nViolation detection and repair:\n")
#>
#> Violation detection and repair:
cat(" Original categories:", length(unique(cat_feature_viol)), "\n")
#> Original categories: 6
cat(" Final bins:", length(result_jedi_viol$bin), "\n")
#> Final bins: 5
cat(" Iterations to convergence:", result_jedi_viol$iterations, "\n")
#> Iterations to convergence: 0
cat(" Monotonicity achieved:", result_jedi_viol$converged, "\n\n")
#> Monotonicity achieved: TRUE
#>
# Check final WoE monotonicity
woe_diffs <- diff(result_jedi_viol$woe)
cat(
" WoE differences between bins:",
paste(round(woe_diffs, 3), collapse = ", "), "\n"
)
#> WoE differences between bins: 0.721, 0.344, 0.978, 0.964
cat(" All positive (monotonic):", all(woe_diffs >= -1e-6), "\n")
#> All positive (monotonic): TRUE
# Example 5: High cardinality performance
set.seed(321)
n_obs_hc <- 3000
# Simulate product categories (high cardinality)
products <- paste0("Product_", sprintf("%03d", 1:50))
cat_feature_hc <- sample(products, n_obs_hc, replace = TRUE)
bin_target_hc <- rbinom(n_obs_hc, 1, 0.08)
# Measure JEDI performance
time_jedi_hc <- system.time({
result_jedi_hc <- ob_categorical_jedi(
cat_feature_hc,
bin_target_hc,
min_bins = 4,
max_bins = 7,
max_n_prebins = 20,
bin_cutoff = 0.02
)
})
cat("\nHigh cardinality performance:\n")
#>
#> High cardinality performance:
cat(" Original categories:", length(unique(cat_feature_hc)), "\n")
#> Original categories: 50
cat(" Final bins:", length(result_jedi_hc$bin), "\n")
#> Final bins: 7
cat(" Execution time:", round(time_jedi_hc[3], 3), "seconds\n")
#> Execution time: 0.001 seconds
cat(" Total IV:", round(result_jedi_hc$total_iv, 4), "\n")
#> Total IV: 0.1435
cat(" Converged:", result_jedi_hc$converged, "\n")
#> Converged: TRUE
# Show merged categories
for (i in seq_along(result_jedi_hc$bin)) {
n_merged <- length(strsplit(result_jedi_hc$bin[i], "%;%")[[1]])
if (n_merged > 1) {
cat(sprintf(" Bin %d: %d categories merged\n", i, n_merged))
}
}
#> Bin 1: 4 categories merged
#> Bin 2: 6 categories merged
#> Bin 3: 26 categories merged
#> Bin 4: 5 categories merged
#> Bin 5: 3 categories merged
#> Bin 6: 3 categories merged
#> Bin 7: 3 categories merged
# Example 6: Convergence behavior
set.seed(555)
n_obs_conv <- 1000
education_levels <- c(
"Elementary", "HighSchool", "Vocational",
"Bachelor", "Master", "PhD"
)
cat_feature_conv <- sample(education_levels, n_obs_conv,
replace = TRUE,
prob = c(0.10, 0.30, 0.20, 0.25, 0.12, 0.03)
)
bin_target_conv <- rbinom(n_obs_conv, 1, 0.15)
# Test different convergence thresholds
thresholds <- c(1e-3, 1e-6, 1e-9)
for (thresh in thresholds) {
result_conv <- ob_categorical_jedi(
cat_feature_conv,
bin_target_conv,
min_bins = 2,
max_bins = 4,
convergence_threshold = thresh,
max_iterations = 100
)
cat(sprintf("\nThreshold %.0e:\n", thresh))
cat(" Final bins:", length(result_conv$bin), "\n")
cat(" Total IV:", round(result_conv$total_iv, 4), "\n")
cat(" Converged:", result_conv$converged, "\n")
cat(" Iterations:", result_conv$iterations, "\n")
}
#>
#> Threshold 1e-03:
#> Final bins: 4
#> Total IV: 0.0206
#> Converged: TRUE
#> Iterations: 0
#>
#> Threshold 1e-06:
#> Final bins: 4
#> Total IV: 0.0206
#> Converged: TRUE
#> Iterations: 0
#>
#> Threshold 1e-09:
#> Final bins: 4
#> Total IV: 0.0206
#> Converged: TRUE
#> Iterations: 0
# Example 7: Missing value handling
set.seed(999)
cat_feature_na <- cat_feature
na_indices <- sample(n_obs, 75) # 5% missing
cat_feature_na[na_indices] <- NA
result_jedi_na <- ob_categorical_jedi(
cat_feature_na,
bin_target,
min_bins = 3,
max_bins = 5
)
# Locate NA bin
na_bin_idx <- grep("NA", result_jedi_na$bin)
if (length(na_bin_idx) > 0) {
cat("\nMissing value treatment:\n")
cat(" NA bin:", result_jedi_na$bin[na_bin_idx], "\n")
cat(" NA count:", result_jedi_na$count[na_bin_idx], "\n")
cat(
" NA event rate:",
round(result_jedi_na$count_pos[na_bin_idx] /
result_jedi_na$count[na_bin_idx], 3), "\n"
)
cat(" NA WoE:", round(result_jedi_na$woe[na_bin_idx], 3), "\n")
cat(" NA IV contribution:", round(result_jedi_na$iv[na_bin_idx], 4), "\n")
}
#>
#> Missing value treatment:
#> NA bin: Retired%;%SelfEmployed%;%NA%;%Contract
#> NA count: 587
#> NA event rate: 0.112
#> NA WoE: -0.031
#> NA IV contribution: 4e-04
# }