Optimal Binning for Categorical Variables using Dynamic Programming
optimal_binning_categorical_dp.Rd
Performs optimal binning for categorical variables using a dynamic programming approach with linear constraints. This algorithm finds the optimal grouping of categories that maximizes the Information Value (IV) while respecting constraints on the number of bins and monotonicity.
Usage
optimal_binning_categorical_dp(
target,
feature,
min_bins = 3L,
max_bins = 5L,
bin_cutoff = 0.05,
max_n_prebins = 20L,
convergence_threshold = 1e-06,
max_iterations = 1000L,
bin_separator = "%;%",
monotonic_trend = "auto"
)
Arguments
- target
An integer vector of binary target values (0 or 1).
- feature
A character vector of categorical feature values.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 5).
- bin_cutoff
Minimum proportion of total observations for a bin (default: 0.05).
- max_n_prebins
Maximum number of pre-bins before merging (default: 20).
- convergence_threshold
Convergence threshold for the dynamic programming algorithm (default: 1e-6).
- max_iterations
Maximum number of iterations for the dynamic programming algorithm (default: 1000).
- bin_separator
Separator for concatenating category names in bins (default: "%;%").
- monotonic_trend
Monotonic trend to enforce on bin event rates: 'auto', 'ascending', 'descending', or 'none' (default: 'auto').
Value
A data frame containing binning information with the following columns:
- id
Bin identifier (integer)
- bin
Bin name containing the concatenated categories
- woe
Weight of Evidence value for the bin
- iv
Information Value contribution of the bin
- count
Total number of observations in the bin
- count_pos
Number of positive events (target=1) in the bin
- count_neg
Number of negative events (target=0) in the bin
- event_rate
Rate of positive events in the bin (count_pos/count)
- total_iv
Total Information Value across all bins
- converged
Logical indicating whether the algorithm converged
- iterations
Number of iterations performed
- execution_time_ms
Execution time in milliseconds
Details
The algorithm uses dynamic programming to find the optimal binning solution that maximizes the total Information Value (IV) while respecting constraints. The process follows these steps:
1. Preprocess the data to count occurrences and merge rare categories.
2. Sort categories by their event rates (ratio of positive events to total events); see the sketch after this list.
3. Use dynamic programming to find the optimal partitioning of the sorted categories into bins.
4. Apply monotonicity constraints if specified.
5. Calculate WoE and IV metrics for each final bin.
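As a concrete illustration of steps 1 and 2, the following is a minimal R sketch under the documented definitions; it is not the package's internal code, and the helper name prebin_categories is hypothetical:

# Illustrative sketch only: count each category, flag rare levels
# (below the bin_cutoff proportion), and sort by event rate.
prebin_categories <- function(target, feature, bin_cutoff = 0.05) {
  counts <- table(feature)
  pos    <- tapply(target, feature, sum)
  stats  <- data.frame(
    category  = names(counts),
    count     = as.integer(counts),
    count_pos = as.integer(pos[names(counts)]),
    stringsAsFactors = FALSE
  )
  stats$count_neg  <- stats$count - stats$count_pos
  stats$event_rate <- stats$count_pos / stats$count
  stats$rare       <- stats$count / length(target) < bin_cutoff
  stats[order(stats$event_rate), ]  # sorted so bins can be contiguous runs
}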
The Weight of Evidence (WoE) for each bin is calculated as:
$$WoE = \ln\left(\frac{\text{Distribution of Events}}{\text{Distribution of Non-Events}}\right)$$
The Information Value (IV) contribution for each bin is:
$$IV = (\text{Distribution of Events} - \text{Distribution of Non-Events}) \times WoE$$
Here the distribution of events (non-events) is the bin's share of all positive (negative) cases in the dataset. Total IV is the sum of the IV contributions across all bins and measures the overall predictive power of the binned variable.
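For example, with three bins the two formulas reduce to a few lines of R (a worked illustration only, using made-up counts):

# Worked example of the WoE and IV formulas with illustrative counts
count_pos <- c(120, 250, 130)            # target = 1 per bin
count_neg <- c(280, 150, 70)             # target = 0 per bin
dist_pos  <- count_pos / sum(count_pos)  # distribution of events
dist_neg  <- count_neg / sum(count_neg)  # distribution of non-events
woe       <- log(dist_pos / dist_neg)
iv        <- (dist_pos - dist_neg) * woe
total_iv  <- sum(iv)                     # overall predictive power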
This implementation is based on the methodology described in:
Navas-Palencia, G. (2022). "OptBinning: Mathematical Optimization for Optimal Binning". Journal of Open Source Software, 7(74), 4101.
Siddiqi, N. (2017). "Intelligent Credit Scoring: Building and Implementing Better Credit Risk Scorecards". John Wiley & Sons, 2nd Edition.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2017). "Credit Scoring and Its Applications". SIAM, 2nd Edition.
Kotsiantis, S.B., & Kanellopoulos, D. (2006). "Discretization Techniques: A recent survey". GESTS International Transactions on Computer Science and Engineering, 32(1), 47-58.
The dynamic programming algorithm optimally partitions the sorted categories into bins that maximize the total information value. The recurrence relation used is:
$$DP[i][k] = \max_{j<i} \{DP[j][k-1] + IV(\text{bin from } j \text{ to } i)\}$$
where \(DP[i][k]\) denotes the maximum total IV achievable when the first \(i\) sorted categories are partitioned into \(k\) bins.
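A compact R sketch of this recurrence, under the definitions above, could look like the following; it is an illustrative implementation, not the package's code, and the function name and the bin_iv helper are assumptions:

# Illustrative dynamic program: partition n sorted categories into
# at most max_bins bins so that the summed IV is maximized.
optimal_partition_iv <- function(count_pos, count_neg, max_bins) {
  n <- length(count_pos)
  tot_pos <- sum(count_pos)
  tot_neg <- sum(count_neg)
  # IV of a single bin formed from sorted categories a..b
  bin_iv <- function(a, b) {
    dp <- sum(count_pos[a:b]) / tot_pos  # distribution of events
    dn <- sum(count_neg[a:b]) / tot_neg  # distribution of non-events
    if (dp == 0 || dn == 0) return(-Inf) # guard against one-sided bins
    (dp - dn) * log(dp / dn)
  }
  # DP[i + 1, k + 1] = best total IV using the first i categories in k bins
  DP <- matrix(-Inf, nrow = n + 1, ncol = max_bins + 1)
  DP[1, 1] <- 0
  for (i in 1:n) {
    for (k in 1:max_bins) {
      for (j in 0:(i - 1)) {
        cand <- DP[j + 1, k] + bin_iv(j + 1, i)
        if (cand > DP[i + 1, k + 1]) DP[i + 1, k + 1] <- cand
      }
    }
  }
  max(DP[n + 1, 2:(max_bins + 1)])  # best total IV over 1..max_bins bins
}

In the actual function the search is further constrained by min_bins, bin_cutoff, and the monotonic_trend setting.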
Examples
if (FALSE) { # \dontrun{
# Create sample data
set.seed(123)
n <- 1000
target <- sample(0:1, n, replace = TRUE)
feature <- sample(c("A", "B", "C", "D", "E"), n, replace = TRUE)
# Perform optimal binning
result <- optimal_binning_categorical_dp(target, feature, min_bins = 2, max_bins = 4)
# View results
print(result)
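# The lines below are an illustrative extension (not part of the original
# example): map each bin's WoE back onto the raw feature by splitting the
# concatenated bin names on the default bin_separator "%;%".
cats <- strsplit(result$bin, "%;%", fixed = TRUE)
woe_map <- setNames(rep(result$woe, lengths(cats)), unlist(cats))
feature_woe <- unname(woe_map[feature])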
} # }