
Performs optimal binning for categorical variables using a dynamic programming approach with linear constraints. This algorithm finds the optimal grouping of categories that maximizes the Information Value (IV) while respecting constraints on the number of bins and monotonicity.

Usage

optimal_binning_categorical_dp(
  target,
  feature,
  min_bins = 3L,
  max_bins = 5L,
  bin_cutoff = 0.05,
  max_n_prebins = 20L,
  convergence_threshold = 1e-06,
  max_iterations = 1000L,
  bin_separator = "%;%",
  monotonic_trend = "auto"
)

Arguments

target

An integer vector of binary target values (0 or 1).

feature

A character vector of categorical feature values.

min_bins

Minimum number of bins (default: 3).

max_bins

Maximum number of bins (default: 5).

bin_cutoff

Minimum proportion of total observations for a bin (default: 0.05).

max_n_prebins

Maximum number of pre-bins before merging (default: 20).

convergence_threshold

Convergence threshold for the dynamic programming algorithm (default: 1e-6).

max_iterations

Maximum number of iterations for the dynamic programming algorithm (default: 1000).

bin_separator

Separator for concatenating category names in bins (default: "%;%").

monotonic_trend

Force monotonic trend ('auto', 'ascending', 'descending', 'none'). Default: 'auto'.

Value

A data frame containing binning information with the following columns:

id

Bin identifier (integer)

bin

Bin name containing the concatenated categories

woe

Weight of Evidence value for the bin

iv

Information Value contribution of the bin

count

Total number of observations in the bin

count_pos

Number of positive events (target=1) in the bin

count_neg

Number of negative events (target=0) in the bin

event_rate

Rate of positive events in the bin (count_pos/count)

total_iv

Total Information Value across all bins

converged

Logical indicating whether the algorithm converged

iterations

Number of iterations performed

execution_time_ms

Execution time in milliseconds
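
A common use of the returned data frame is to map raw feature values to their bin's WoE. The sketch below uses a mock result with the documented `bin`, `woe`, and `id` columns (the values are illustrative, not actual output of the function) and splits `bin` on the default `"%;%"` separator:

```r
# Mock of the documented output structure (WoE values are illustrative)
result <- data.frame(
  id  = 1:2,
  bin = c("A%;%B", "C%;%D%;%E"),
  woe = c(-0.35, 0.21),
  stringsAsFactors = FALSE
)

# Expand each bin into one row per category (bin_separator = "%;%")
woe_lookup <- do.call(rbind, lapply(seq_len(nrow(result)), function(i) {
  data.frame(category = strsplit(result$bin[i], "%;%", fixed = TRUE)[[1]],
             woe = result$woe[i], stringsAsFactors = FALSE)
}))

# Replace raw categories by their bin's WoE
feature <- c("A", "C", "B", "E")
feature_woe <- woe_lookup$woe[match(feature, woe_lookup$category)]
```

Unseen categories would produce `NA` here; how they should be handled (e.g. assigning the WoE of the largest bin) is a modeling decision outside this function.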

Details

The algorithm uses dynamic programming to find the optimal binning solution that maximizes the total Information Value (IV) while respecting constraints. The process follows these steps:

  1. Preprocess the data to count occurrences and merge rare categories

  2. Sort categories based on their event rates (ratio of positive events to total events)

  3. Use dynamic programming to find the optimal partitioning of categories into bins

  4. Apply monotonicity constraints if specified

  5. Calculate WoE and IV metrics for each final bin
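
The first two steps above can be sketched in plain R. Merging rare categories into a single "OTHER" group is one common convention; the package's actual merging strategy is not documented here, so treat this as illustrative:

```r
set.seed(42)
feature <- sample(c("A", "B", "C", "D", "rare"), 500,
                  replace = TRUE, prob = c(0.3, 0.3, 0.2, 0.18, 0.02))
target <- rbinom(500, 1, 0.4)

bin_cutoff <- 0.05

# Step 1: count occurrences and merge categories below bin_cutoff
# (note: the merged "OTHER" group may itself stay below the cutoff)
freq <- table(feature) / length(feature)
rare <- names(freq)[freq < bin_cutoff]
feature2 <- ifelse(feature %in% rare, "OTHER", feature)

# Step 2: sort categories by event rate (positives / total)
event_rate <- tapply(target, feature2, mean)
sorted_categories <- names(sort(event_rate))
```

Sorting by event rate is what lets the dynamic program consider only contiguous groupings in the next step.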

The Weight of Evidence (WoE) for each bin is calculated as:

$$WoE = \ln\left(\frac{\text{Distribution of Events}}{\text{Distribution of Non-Events}}\right)$$

The Information Value (IV) contribution for each bin is:

$$IV = (\text{Distribution of Events} - \text{Distribution of Non-Events}) \times WoE$$

Total IV is the sum of IV values across all bins and measures the overall predictive power.
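
For concreteness, both formulas can be computed directly from per-bin counts. The counts below are made up for illustration; note that real implementations typically add a small smoothing constant to avoid `log(0)` when a bin has no events or non-events (the package's exact smoothing is not documented here):

```r
count_pos <- c(30, 50, 20)   # events (target = 1) per bin, illustrative
count_neg <- c(120, 60, 20)  # non-events (target = 0) per bin, illustrative

dist_pos <- count_pos / sum(count_pos)  # distribution of events
dist_neg <- count_neg / sum(count_neg)  # distribution of non-events

woe <- log(dist_pos / dist_neg)         # WoE per bin
iv  <- (dist_pos - dist_neg) * woe      # IV contribution per bin
total_iv <- sum(iv)                     # overall predictive power
```

Each `iv` term is non-negative (the sign of `dist_pos - dist_neg` always matches the sign of `woe`), so adding bins can only increase total IV, which is why the bin-count constraints matter.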

This implementation is based on the methodology described in:

  • Navas-Palencia, G. (2022). "OptBinning: Mathematical Optimization for Optimal Binning". Journal of Open Source Software, 7(74), 4101.

  • Siddiqi, N. (2017). "Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards". John Wiley & Sons, 2nd Edition.

  • Thomas, L.C., Edelman, D.B., & Crook, J.N. (2017). "Credit Scoring and Its Applications". SIAM, 2nd Edition.

  • Kotsiantis, S.B., & Kanellopoulos, D. (2006). "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.

The dynamic programming algorithm optimally partitions the sorted categories into bins that maximize the total information value. The recurrence relation used is:

$$DP[i][k] = \max_{j<i} \{DP[j][k-1] + IV(\text{bin from } j \text{ to } i)\}$$

where \(DP[i][k]\) represents the maximum total IV achievable using the first \(i\) categories partitioned into \(k\) bins.
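
The recurrence can be sketched directly in R over per-category counts already sorted by event rate. This is an illustrative implementation for a fixed number of bins `k`, not the package's actual code (which also handles the min/max bin range and the other constraints):

```r
# Per-category counts, assumed already sorted by event rate (illustrative)
pos <- c(5, 10, 20, 40, 25)   # events per category
neg <- c(45, 40, 30, 20, 10)  # non-events per category
k   <- 3                      # number of bins

tp <- sum(pos); tn <- sum(neg)

# IV of a candidate bin spanning categories j..i (-Inf if a class is empty)
bin_iv <- function(j, i) {
  dp <- sum(pos[j:i]) / tp; dn <- sum(neg[j:i]) / tn
  if (dp == 0 || dn == 0) return(-Inf)
  (dp - dn) * log(dp / dn)
}

n <- length(pos)
DP <- matrix(-Inf, n, k)
for (i in 1:n) DP[i, 1] <- bin_iv(1, i)          # base case: one bin
if (k > 1) for (b in 2:k) for (i in b:n)
  for (j in (b - 1):(i - 1)) {                   # DP[i][b] = max over j < i
    cand <- DP[j, b - 1] + bin_iv(j + 1, i)
    if (cand > DP[i, b]) DP[i, b] <- cand
  }
best_iv <- DP[n, k]  # maximum total IV using all n categories in k bins
```

Recovering the actual bin boundaries would additionally require storing the argmax `j` at each cell and backtracking from `DP[n, k]`.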

Examples

# Create sample data
set.seed(123)
n <- 1000
target <- sample(0:1, n, replace = TRUE)
feature <- sample(c("A", "B", "C", "D", "E"), n, replace = TRUE)

# Perform optimal binning
result <- optimal_binning_categorical_dp(target, feature, min_bins = 2, max_bins = 4)

# View results
print(result)