Optimal Binning and Weight of Evidence Calculation
obwoe.Rd
This function implements a comprehensive suite of state-of-the-art algorithms for optimal binning and Weight of Evidence (WoE) calculation for both numerical and categorical variables. It maximizes predictive power while preserving interpretability through monotonic constraints, information-theoretic optimization, and statistical validation. Primarily designed for credit risk modeling, classification problems, and predictive analytics applications.
Usage
obwoe(
dt,
target,
features = NULL,
min_bins = 3L,
max_bins = 4L,
method = "jedi",
positive = "bad|1",
preprocess = TRUE,
progress = TRUE,
trace = FALSE,
outputall = TRUE,
control = list()
)
Arguments
- dt
A data.table containing the dataset.
- target
The name of the target variable column (must be binary: 0/1).
- features
Vector of feature names to process. If NULL, all features except the target will be processed.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 4).
- method
The binning method to use. Either "auto" or one of the methods listed in the tables in the Details section (default: "jedi").
- positive
Character string specifying which category should be considered as positive. Must be either "bad|1" or "good|1".
- preprocess
Logical. Whether to preprocess the data before binning (default: TRUE).
- progress
Logical. Whether to display a progress bar. Default is TRUE.
- trace
Logical. Whether to write error logs while candidate methods are being tested (default: FALSE).
- outputall
Logical. If TRUE, returns only the optimal binning gains table. If FALSE, returns a list with data, gains table, and reports (default: TRUE).
- control
A list of additional control parameters (see the usage sketch after this list):
cat_cutoff: Minimum frequency for a category (default: 0.05)
bin_cutoff: Minimum frequency for a bin (default: 0.05)
min_bads: Minimum proportion of bad cases in a bin (default: 0.05)
pvalue_threshold: P-value threshold for statistical tests (default: 0.05)
max_n_prebins: Maximum number of pre-bins (default: 20)
monotonicity_direction: Direction of monotonicity for some algorithms ("increase" or "decrease")
lambda: Regularization parameter for some algorithms (default: 0.1)
min_bin_size: Minimum bin size as a proportion of total observations (default: 0.05)
min_iv_gain: Minimum IV gain for bin splitting for some algorithms (default: 0.01)
max_depth: Maximum depth for tree-based algorithms (default: 10)
num_miss_value: Value to replace missing numeric values (default: -999.0)
char_miss_value: Value to replace missing categorical values (default: "N/A")
outlier_method: Method for outlier detection ("iqr", "zscore", or "grubbs")
outlier_process: Whether to process outliers (default: FALSE)
iqr_k: IQR multiplier for outlier detection (default: 1.5)
zscore_threshold: Z-score threshold for outlier detection (default: 3)
grubbs_alpha: Significance level for Grubbs' test (default: 0.05)
n_threads: Number of threads for parallel processing (default: 1)
is_monotonic: Whether to enforce monotonicity in binning (default: TRUE)
population_size: Population size for genetic algorithm (default: 50)
max_generations: Maximum number of generations for genetic algorithm (default: 100)
mutation_rate: Mutation rate for genetic algorithm (default: 0.1)
initial_temperature: Initial temperature for simulated annealing (default: 1)
cooling_rate: Cooling rate for simulated annealing (default: 0.995)
max_iterations: Maximum number of iterations for iterative algorithms (default: 1000)
include_upper_bound: Include the upper bound in numeric bins (default: TRUE)
bin_separator: Separator used when labelling merged categories in categorical bins (default: "%;%")
laplace_smoothing: Smoothing parameter for WoE calculation (default: 0.5)
sketch_k: Parameter controlling the accuracy of sketch-based algorithms (default: 200)
sketch_width: Width parameter for sketch-based algorithms (default: 2000)
sketch_depth: Depth parameter for sketch-based algorithms (default: 5)
polynomial_degree: Degree of polynomial for LPDB algorithm (default: 3)
auto_monotonicity: Auto-detect monotonicity direction (default: TRUE)
monotonic_trend: Monotonicity direction for DP algorithm (default: "auto")
use_chi2_algorithm: Whether to use enhanced Chi2 algorithm (default: FALSE)
chi_merge_threshold: Threshold for chi-merge algorithm (default: 0.05)
force_monotonic_direction: Force direction in MBLP (0=auto, 1=increasing, -1=decreasing)
monotonicity_direction: Monotonicity for UDT ("none", "increasing", "decreasing", "auto")
divergence_method: Divergence measure for DMIV ("he", "kl", "tr", "klj", "sc", "js", "l1", "l2", "ln")
bin_method: Method for WoE calculation in DMIV ("woe", "woe1")
adaptive_cooling: Whether to use adaptive cooling in SAB (default: TRUE)
enforce_monotonic: Whether to enforce monotonicity in various algorithms (default: TRUE)
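A brief usage sketch combining a few of these control options (using the germancredit data from the scorecard package, as in the Examples section; any entry omitted keeps the default listed above):

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Illustrative control settings; all other entries keep their defaults
result_ctrl <- obwoe(dt,
  target = "creditability", method = "jedi",
  min_bins = 3, max_bins = 5, positive = "bad|1",
  control = list(
    bin_cutoff = 0.10,        # require at least 10% of observations per bin
    max_n_prebins = 30,       # allow more candidate pre-bins
    outlier_process = TRUE,   # treat outliers before binning ...
    outlier_method = "iqr",   # ... using the IQR rule
    iqr_k = 3,                # with a wider fence than the default 1.5
    laplace_smoothing = 0.5   # smoothing applied in the WoE calculation
  )
)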
Value
Depending on the value of outputall. If outputall = TRUE (the default): a data.table containing the optimal binning gains table (woebin). If outputall = FALSE: a list containing:
- data
The original dataset with added WoE columns
- woebin
Information about the bins created, including:
feature: Name of the feature
bin: Bin label or range
count: Number of observations in the bin
count_distr: Proportion of observations in the bin
good: Number of good cases (target = 0) in the bin
bad: Number of bad cases (target = 1) in the bin
good_rate: Proportion of good cases in the bin
bad_rate: Proportion of bad cases in the bin
woe: Weight of Evidence for the bin
iv: Information Value contribution of the bin
- report_best_model
Report on the best model tested for each feature, including:
feature: Name of the feature
method: Best method selected for the feature
iv_total: Total Information Value achieved
n_bins: Number of bins created
runtime: Execution time for binning the feature
- report_preprocess
Preprocessing report for each feature, including:
feature: Name of the feature
type: Data type of the feature
missing_count: Number of missing values
outlier_count: Number of outliers detected
unique_count: Number of unique values
mean_before: Mean value before preprocessing
mean_after: Mean value after preprocessing
sd_before: Standard deviation before preprocessing
sd_after: Standard deviation after preprocessing
Details
Categorical Variable Algorithms
Algorithm | Abbreviation | Theoretical Foundation | Key Features |
ChiMerge | CM | Statistical Tests | Uses chi-square tests to merge. |
Dynamic Programming with Local Constraints | DPLC | Mathematical Programming | Maximizes IV with global constraints. |
Fisher's Exact Test Binning | FETB | Statistical Tests | Uses Fisher's exact test for statistical merging. |
Greedy Merge Binning | GMB | Iterative Optimization | Iteratively merges bins to optimize. |
Information Value Binning | IVB | Information Theory | Dynamic programming for IV optimization. |
Joint Entropy-Driven Information | JEDI | Information Theory | Adaptive merging with entropy. |
Monotonic Binning Algorithm | MBA | Information Theory | Combines WoE/IV with monotonicity constraints. |
Mixed Integer Linear Programming | MILP | Mathematical Programming | Mathematical optimization for binning. |
Monotonic Optimal Binning | MOB | Iterative Optimization | Specialized for monotonicity. |
Simulated Annealing Binning | SAB | Metaheuristic Optimization | Simulated annealing for global optimization. |
Similarity-Based Logistic Partitioning | SBLP | Distance-Based Methods | Similarity measures for optimal binning. |
Sliding Window Binning | SWB | Iterative Optimization | Sliding window approach for binning. |
User-Defined Technique | UDT | Hybrid Methods | Flexible hybrid approach. |
JEDI Multinomial WoE | JEDI_MWOE | Information Theory | Extension of JEDI for multinomial targets. |
Numerical Variable Algorithms
Algorithm | Abbreviation | Theoretical Foundation | Key Features |
Branch and Bound | BB | Mathematical Programming | Efficient search in solution space. |
ChiMerge | CM | Statistical Methods | Chi-square-based merging. |
Dynamic Programming with Local Constraints | DPLC | Mathematical Programming | Constrained optimization. |
Equal-Width Binning | EWB | Simple Discretization | Equal-width intervals for binning. |
Fisher's Exact Test Binning | FETB | Statistical Tests | Fisher's test for statistical merging. |
Joint Entropy-Driven Interval | JEDI | Information Theory | Entropy optimization with merging. |
K-means Binning | KMB | Clustering | K-means inspired clustering. |
Local Density Binning | LDB | Density Estimation | Adapts to local density structure. |
Local Polynomial Density Binning | LPDB | Density Estimation | Polynomial density estimation. |
Monotonic Binning with Linear Programming | MBLP | Mathematical Programming | Linear programming with monotonicity. |
Minimum Description Length Principle | MDLP | Information Theory | MDL criterion with monotonicity. |
Monotonic Optimal Binning | MOB | Iterative Optimization | Specialized monotonicity. |
Monotonic Risk Binning with LR Pre-binning | MRBLP | Hybrid Methods | Likelihood ratio pre-binning. |
Optimal Supervised Learning Partitioning | OSLP | Supervised Learning | Specialized supervised approach. |
Unsupervised Binning with Standard Deviation | UBSD | Statistical Methods | Standard deviation-based binning. |
Unsupervised Decision Tree | UDT | Decision Trees | Decision tree inspired binning. |
Isotonic Regression | IR | Statistical Methods | Pool Adjacent Violators algorithm (PAVA). |
Fast MDLP with Monotonicity | FAST_MDLPM | Information Theory | Optimized MDL implementation. |
JEDI Multinomial WoE | JEDI_MWOE | Information Theory | Multinomial extension of JEDI. |
Sketch-based Binning | SKETCH | Approximate Computing | KLL sketch for efficient computation. |
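For large numerical datasets, the sketch-based method in the last row above can be paired with the sketch_* control parameters. A minimal sketch, assuming the lowercase abbreviation "sketch" is accepted as the method name (as with the other methods) and using the germancredit data from the Examples section:

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Numeric features only; the sketch parameter values here are illustrative
num_feats <- setdiff(names(dt)[sapply(dt, is.numeric)], "creditability")
res_sketch <- obwoe(dt,
  target = "creditability", features = num_feats,
  method = "sketch", min_bins = 3, max_bins = 5, positive = "bad|1",
  control = list(sketch_k = 200, sketch_width = 2000, sketch_depth = 5)
)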
Mathematical Framework
Weight of Evidence (WoE)
The Weight of Evidence measures the predictive power of a bin and is defined as:
$$WoE_i = \ln\left(\frac{P(X_i|Y=1)}{P(X_i|Y=0)}\right)$$
Where \(P(X_i|Y=1)\) is the proportion of positive events in bin i relative to all positive events, and \(P(X_i|Y=0)\) is the proportion of negative events in bin i relative to all negative events.
With Bayesian smoothing applied (used in many implementations):
$$WoE_i = \ln\left(\frac{n_{1i} + \alpha\pi}{n_1 + m\alpha} \cdot \frac{n_0 + m\alpha}{n_{0i} + \alpha(1-\pi)}\right)$$
Where:
\(n_{1i}\) is the count of positive cases in bin i
\(n_{0i}\) is the count of negative cases in bin i
\(n_1\) is the total count of positive cases
\(n_0\) is the total count of negative cases
\(\pi\) is the overall positive rate
\(\alpha\) is the smoothing parameter (typically 0.5)
\(m\) is the number of bins
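As an illustration, a minimal R sketch of the smoothed WoE calculation above for a toy binning (the counts are made up for demonstration and are not package output):

# Toy per-bin counts
n1i <- c(40, 35, 25)      # positive cases in each bin
n0i <- c(200, 320, 380)   # negative cases in each bin
n1 <- sum(n1i); n0 <- sum(n0i)
m <- length(n1i)          # number of bins
alpha <- 0.5              # smoothing parameter
pi_hat <- n1 / (n1 + n0)  # overall positive rate
# Smoothed WoE per bin, following the formula above
woe <- log(((n1i + alpha * pi_hat) / (n1 + m * alpha)) /
           ((n0i + alpha * (1 - pi_hat)) / (n0 + m * alpha)))
woe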
Information Value (IV)
The Information Value quantifies the predictive power of a variable:
$$IV_i = (P(X_i|Y=1) - P(X_i|Y=0)) \times WoE_i$$
The total Information Value is the sum across all bins:
$$IV_{total} = \sum_{i=1}^{n} IV_i$$
IV can be interpreted as follows:
IV < 0.02: Not predictive
0.02 <= IV < 0.1: Weak predictive power
0.1 <= IV < 0.3: Medium predictive power
0.3 <= IV < 0.5: Strong predictive power
IV >= 0.5: Suspicious (possible overfitting)
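Continuing the toy counts from the WoE sketch above, the total IV and its rule-of-thumb label can be computed directly from the unsmoothed event distributions (a sketch, not package output):

n1i <- c(40, 35, 25); n0i <- c(200, 320, 380)
dist1 <- n1i / sum(n1i)   # P(X_i | Y = 1)
dist0 <- n0i / sum(n0i)   # P(X_i | Y = 0)
woe <- log(dist1 / dist0)
iv_total <- sum((dist1 - dist0) * woe)
iv_total
# Label the predictive power using the thresholds above
cut(iv_total, breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf), right = FALSE,
    labels = c("not predictive", "weak", "medium", "strong", "suspicious"))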
Method Selection
When method = "auto", the function tests multiple algorithms and selects the one that produces the highest total Information Value while respecting the specified constraints. The selection process considers:
Total Information Value (IV)
Monotonicity of WoE values
Number of bins created
Bin frequency distribution
Statistical stability
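When a manual comparison is preferred over method = "auto", a similar selection can be approximated by running candidate methods separately and comparing total IV per feature. A sketch, assuming the default call returns the gains table described in the Value section (a data.table with feature and iv columns):

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Run two candidate methods and total the per-bin IV by feature
wb_jedi <- obwoe(dt, target = "creditability", method = "jedi",
                 min_bins = 3, max_bins = 5, positive = "bad|1")
wb_mob <- obwoe(dt, target = "creditability", method = "mob",
                min_bins = 3, max_bins = 5, positive = "bad|1")
iv_jedi <- wb_jedi[, .(iv_jedi = sum(iv)), by = feature]
iv_mob <- wb_mob[, .(iv_mob = sum(iv)), by = feature]
comparison <- merge(iv_jedi, iv_mob, by = "feature")
comparison[, best := ifelse(iv_jedi >= iv_mob, "jedi", "mob")]
comparison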
References
Beltrami, M., Mach, M., & Dall'Aglio, M. (2021). Monotonic Optimal Binning Algorithm for Credit Risk Modeling. Risks, 9(3), 58.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and Its Applications. SIAM.
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Zeng, Y. (2014). Univariate feature selection and binner. arXiv preprint arXiv:1410.5420.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone Optimal Binning Algorithm for Credit Risk Modeling. Working Paper.
Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In AAAI'92.
Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In TAI'95.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027.
Barlow, R. E., & Brunk, H. D. (1972). The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337), 140-147.
Fisher, R. A. (1922). On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87-94.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to Linear Optimization. Athena Scientific.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The annals of applied statistics, 2(4), 1360-1383.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671-680.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Examples
if (FALSE) { # \dontrun{
# Example 1: Using the German Credit Data
library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Process all features with the JEDI method
result <- obwoe(dt,
target = "creditability", method = "jedi",
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View WoE binning information
print(result)
# Process only numeric features, testing the JEDI and CM methods, and get detailed output
numeric_features <- names(dt)[sapply(dt, is.numeric)]
numeric_features <- setdiff(numeric_features, "creditability")
result_detailed <- obwoe(dt,
target = "creditability", features = numeric_features,
method = c("jedi", "cm"), preprocess = TRUE, outputall = FALSE,
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View WoE-transformed data
head(result_detailed$data)
# View preprocessing report
print(result_detailed$report_preprocess)
# View best model report
print(result_detailed$report_best_model)
# Process only categorical features with the UDT method
categoric_features <- names(dt)[sapply(dt, function(i) !is.numeric(i))]
categoric_features <- setdiff(categoric_features, "creditability")
result_cat <- obwoe(dt,
target = "creditability", features = categoric_features,
method = "udt", preprocess = TRUE,
min_bins = 3, max_bins = 4, positive = "bad|1"
)
# View binning information for categorical features
print(result_cat)
# Example 2: Automatic method selection
result_auto <- obwoe(dt,
target = "creditability",
method = "auto", # Tries multiple methods and selects the best
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View which methods were selected for each feature
print(result_auto$report_best_model)
# Example 3: Using specialized algorithms
# For numerical features with complex distributions
result_lpdb <- obwoe(dt,
target = "creditability",
features = numeric_features[1:3],
method = "jedi", # Local Polynomial Density Binning
min_bins = 3, max_bins = 5, positive = "bad|1",
control = list(polynomial_degree = 3)
)
# For categorical features with many levels
result_jedi <- obwoe(dt,
target = "creditability",
features = categoric_features[1:3],
method = "jedi", # Joint Entropy-Driven Information
min_bins = 3, max_bins = 5, positive = "bad|1"
)
} # }