Optimal Binning and Weight of Evidence Calculation
obwoe.Rd
This function implements a comprehensive suite of state-of-the-art algorithms for optimal binning and Weight of Evidence (WoE) calculation for both numerical and categorical variables. It maximizes predictive power while preserving interpretability through monotonic constraints, information-theoretic optimization, and statistical validation. Primarily designed for credit risk modeling, classification problems, and predictive analytics applications.
Usage
obwoe(
dt,
target,
features = NULL,
min_bins = 3L,
max_bins = 4L,
method = "jedi",
positive = "bad|1",
preprocess = TRUE,
progress = TRUE,
trace = FALSE,
outputall = TRUE,
control = list()
)
Arguments
- dt
A data.table containing the dataset.
- target
The name of the target variable column (must be binary: 0/1).
- features
Vector of feature names to process. If NULL, all features except the target will be processed.
- min_bins
Minimum number of bins (default: 3).
- max_bins
Maximum number of bins (default: 4).
- method
The binning method to use. Either "auto" or one of the methods listed in the tables in the Details section (default: "jedi").
- positive
Character string specifying which category should be considered as positive. Must be either "bad|1" or "good|1".
- preprocess
Logical. Whether to preprocess the data before binning (default: TRUE).
- progress
Logical. Whether to display a progress bar. Default is TRUE.
- trace
Logical. Whether to write error logs while candidate methods are being tested (default: FALSE).
- outputall
Logical. If TRUE, returns only the optimal binning gains table. If FALSE, returns a list with data, gains table, and reports (default: TRUE).
- control
A list of additional control parameters (see the usage sketch after this list):
cat_cutoff: Minimum frequency for a category (default: 0.05)
bin_cutoff: Minimum frequency for a bin (default: 0.05)
min_bads: Minimum proportion of bad cases in a bin (default: 0.05)
pvalue_threshold: P-value threshold for statistical tests (default: 0.05)
max_n_prebins: Maximum number of pre-bins (default: 20)
monotonicity_direction: Direction of monotonicity for some algorithms ("increase" or "decrease")
lambda: Regularization parameter for some algorithms (default: 0.1)
min_bin_size: Minimum bin size as a proportion of total observations (default: 0.05)
min_iv_gain: Minimum IV gain for bin splitting for some algorithms (default: 0.01)
max_depth: Maximum depth for tree-based algorithms (default: 10)
num_miss_value: Value to replace missing numeric values (default: -999.0)
char_miss_value: Value to replace missing categorical values (default: "N/A")
outlier_method: Method for outlier detection ("iqr", "zscore", or "grubbs")
outlier_process: Whether to process outliers (default: FALSE)
iqr_k: IQR multiplier for outlier detection (default: 1.5)
zscore_threshold: Z-score threshold for outlier detection (default: 3)
grubbs_alpha: Significance level for Grubbs' test (default: 0.05)
n_threads: Number of threads for parallel processing (default: 1)
is_monotonic: Whether to enforce monotonicity in binning (default: TRUE)
population_size: Population size for genetic algorithm (default: 50)
max_generations: Maximum number of generations for genetic algorithm (default: 100)
mutation_rate: Mutation rate for genetic algorithm (default: 0.1)
initial_temperature: Initial temperature for simulated annealing (default: 1)
cooling_rate: Cooling rate for simulated annealing (default: 0.995)
max_iterations: Maximum number of iterations for iterative algorithms (default: 1000)
include_upper_bound: Include the upper bound in numeric bins (default: TRUE)
bin_separator: Separator used when labelling merged categories in categorical bins (default: "%;%")
laplace_smoothing: Smoothing parameter for WoE calculation (default: 0.5)
sketch_k: Parameter controlling the accuracy of sketch-based algorithms (default: 200)
sketch_width: Width parameter for sketch-based algorithms (default: 2000)
sketch_depth: Depth parameter for sketch-based algorithms (default: 5)
polynomial_degree: Degree of polynomial for LPDB algorithm (default: 3)
auto_monotonicity: Auto-detect monotonicity direction (default: TRUE)
monotonic_trend: Monotonicity direction for DP algorithm (default: "auto")
use_chi2_algorithm: Whether to use enhanced Chi2 algorithm (default: FALSE)
chi_merge_threshold: Threshold for chi-merge algorithm (default: 0.05)
force_monotonic_direction: Force direction in MBLP (0=auto, 1=increasing, -1=decreasing)
monotonicity_direction: Monotonicity for UDT ("none", "increasing", "decreasing", "auto")
divergence_method: Divergence measure for DMIV ("he", "kl", "tr", "klj", "sc", "js", "l1", "l2", "ln")
bin_method: Method for WoE calculation in DMIV ("woe", "woe1")
adaptive_cooling: Whether to use adaptive cooling in SAB (default: TRUE)
enforce_monotonic: Whether to enforce monotonicity in various algorithms (default: TRUE)
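A brief usage sketch combining a few of these control options (using the germancredit data from the scorecard package, as in the Examples section; any entry omitted keeps the default listed above):

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Illustrative control settings; all other entries keep their defaults
result_ctrl <- obwoe(dt,
  target = "creditability", method = "jedi",
  min_bins = 3, max_bins = 5, positive = "bad|1",
  control = list(
    bin_cutoff = 0.10,        # require at least 10% of observations per bin
    max_n_prebins = 30,       # allow more candidate pre-bins
    outlier_process = TRUE,   # treat outliers before binning ...
    outlier_method = "iqr",   # ... using the IQR rule
    iqr_k = 3,                # with a wider fence than the default 1.5
    laplace_smoothing = 0.5   # smoothing applied in the WoE calculation
  )
)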
Value
Depending on the value of outputall. If outputall = TRUE (the default): a data.table containing the optimal binning gains table (woebin). If outputall = FALSE: a list containing:
- data
The original dataset with added WoE columns
- woebin
Information about the bins created, including:
feature: Name of the feature
bin: Bin label or range
count: Number of observations in the bin
count_distr: Proportion of observations in the bin
good: Number of good cases (target = 0) in the bin
bad: Number of bad cases (target = 1) in the bin
good_rate: Proportion of good cases in the bin
bad_rate: Proportion of bad cases in the bin
woe: Weight of Evidence for the bin
iv: Information Value contribution of the bin
- report_best_model
Report on the best model tested for each feature, including:
feature: Name of the feature
method: Best method selected for the feature
iv_total: Total Information Value achieved
n_bins: Number of bins created
runtime: Execution time for binning the feature
- report_preprocess
Preprocessing report for each feature, including:
feature: Name of the feature
type: Data type of the feature
missing_count: Number of missing values
outlier_count: Number of outliers detected
unique_count: Number of unique values
mean_before: Mean value before preprocessing
mean_after: Mean value after preprocessing
sd_before: Standard deviation before preprocessing
sd_after: Standard deviation after preprocessing
Details
Categorical Variable Algorithms
Algorithm | Abbreviation | Theoretical Foundation | Key Features |
ChiMerge | CM | Statistical Tests | Uses chi-square tests to merge. |
Dynamic Programming with Local Constraints | DPLC | Mathematical Programming | Maximizes IV with global constraints. |
Fisher's Exact Test Binning | FETB | Statistical Tests | Uses Fisher's exact test for statistical merging. |
Greedy Merge Binning | GMB | Iterative Optimization | Iteratively merges bins to optimize. |
Information Value Binning | IVB | Information Theory | Dynamic programming for IV optimization. |
Joint Entropy-Driven Information | JEDI | Information Theory | Adaptive merging with entropy. |
Monotonic Binning Algorithm | MBA | Information Theory | Combines WoE/IV with monotonicity constraints. |
Mixed Integer Linear Programming | MILP | Mathematical Programming | Mathematical optimization for binning. |
Monotonic Optimal Binning | MOB | Iterative Optimization | Specialized for monotonicity. |
Simulated Annealing Binning | SAB | Metaheuristic Optimization | Simulated annealing for global optimization. |
Similarity-Based Logistic Partitioning | SBLP | Distance-Based Methods | Similarity measures for optimal binning. |
Sliding Window Binning | SWB | Iterative Optimization | Sliding window approach for binning. |
User-Defined Technique | UDT | Hybrid Methods | Flexible hybrid approach. |
JEDI Multinomial WoE | JEDI_MWOE | Information Theory | Extension of JEDI for multinomial targets. |
Numerical Variable Algorithms
Algorithm | Abbreviation | Theoretical Foundation | Key Features |
Branch and Bound | BB | Mathematical Programming | Efficient search in solution space. |
ChiMerge | CM | Statistical Methods | Chi-square-based merging. |
Dynamic Programming with Local Constraints | DPLC | Mathematical Programming | Constrained optimization. |
Equal-Width Binning | EWB | Simple Discretization | Equal-width intervals for binning. |
Fisher's Exact Test Binning | FETB | Statistical Tests | Fisher's test for statistical merging. |
Joint Entropy-Driven Interval | JEDI | Information Theory | Entropy optimization with merging. |
K-means Binning | KMB | Clustering | K-means inspired clustering. |
Local Density Binning | LDB | Density Estimation | Adapts to local density structure. |
Local Polynomial Density Binning | LPDB | Density Estimation | Polynomial density estimation. |
Monotonic Binning with Linear Programming | MBLP | Mathematical Programming | Linear programming with monotonicity. |
Minimum Description Length Principle | MDLP | Information Theory | MDL criterion with monotonicity. |
Monotonic Optimal Binning | MOB | Iterative Optimization | Specialized monotonicity. |
Monotonic Risk Binning with LR Pre-binning | MRBLP | Hybrid Methods | Likelihood ratio pre-binning. |
Optimal Supervised Learning Partitioning | OSLP | Supervised Learning | Specialized supervised approach. |
Unsupervised Binning with Standard Deviation | UBSD | Statistical Methods | Standard deviation-based binning. |
Unsupervised Decision Tree | UDT | Decision Trees | Decision tree inspired binning. |
Isotonic Regression | IR | Statistical Methods | Pool Adjacent Violators algorithm (PAVA). |
Fast MDLP with Monotonicity | FAST_MDLPM | Information Theory | Optimized MDL implementation. |
JEDI Multinomial WoE | JEDI_MWOE | Information Theory | Multinomial extension of JEDI. |
Sketch-based Binning | SKETCH | Approximate Computing | KLL sketch for efficient computation. |
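For large numerical datasets, the sketch-based method in the last row above can be paired with the sketch_* control parameters. A minimal sketch, assuming the lowercase abbreviation "sketch" is accepted as the method name (as with the other methods) and using the germancredit data from the Examples section:

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Numeric features only; the sketch parameter values here are illustrative
num_feats <- setdiff(names(dt)[sapply(dt, is.numeric)], "creditability")
res_sketch <- obwoe(dt,
  target = "creditability", features = num_feats,
  method = "sketch", min_bins = 3, max_bins = 5, positive = "bad|1",
  control = list(sketch_k = 200, sketch_width = 2000, sketch_depth = 5)
)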
Mathematical Framework
Weight of Evidence (WoE)
The Weight of Evidence measures the predictive power of a bin and is defined as:
$$WoE_i = \ln\left(\frac{P(X_i|Y=1)}{P(X_i|Y=0)}\right)$$
Where \(P(X_i|Y=1)\) is the proportion of positive events in bin i relative to all positive events, and \(P(X_i|Y=0)\) is the proportion of negative events in bin i relative to all negative events.
With Bayesian smoothing applied (used in many implementations):
$$WoE_i = \ln\left(\frac{n_{1i} + \alpha\pi}{n_1 + m\alpha} \cdot \frac{n_0 + m\alpha}{n_{0i} + \alpha(1-\pi)}\right)$$
Where:
\(n_{1i}\) is the count of positive cases in bin i
\(n_{0i}\) is the count of negative cases in bin i
\(n_1\) is the total count of positive cases
\(n_0\) is the total count of negative cases
\(\pi\) is the overall positive rate
\(\alpha\) is the smoothing parameter (typically 0.5)
\(m\) is the number of bins
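As an illustration, a minimal R sketch of the smoothed WoE calculation above for a toy binning (the counts are made up for demonstration and are not package output):

# Toy per-bin counts
n1i <- c(40, 35, 25)      # positive cases in each bin
n0i <- c(200, 320, 380)   # negative cases in each bin
n1 <- sum(n1i); n0 <- sum(n0i)
m <- length(n1i)          # number of bins
alpha <- 0.5              # smoothing parameter
pi_hat <- n1 / (n1 + n0)  # overall positive rate
# Smoothed WoE per bin, following the formula above
woe <- log(((n1i + alpha * pi_hat) / (n1 + m * alpha)) /
           ((n0i + alpha * (1 - pi_hat)) / (n0 + m * alpha)))
woe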
Information Value (IV)
The Information Value quantifies the predictive power of a variable:
$$IV_i = (P(X_i|Y=1) - P(X_i|Y=0)) \times WoE_i$$
The total Information Value is the sum across all bins:
$$IV_{total} = \sum_{i=1}^{n} IV_i$$
IV can be interpreted as follows:
IV < 0.02: Not predictive
0.02 <= IV < 0.1: Weak predictive power
0.1 <= IV < 0.3: Medium predictive power
0.3 <= IV < 0.5: Strong predictive power
IV >= 0.5: Suspicious (possible overfitting)
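Continuing the toy counts from the WoE sketch above, the total IV and its rule-of-thumb label can be computed directly from the unsmoothed event distributions (a sketch, not package output):

n1i <- c(40, 35, 25); n0i <- c(200, 320, 380)
dist1 <- n1i / sum(n1i)   # P(X_i | Y = 1)
dist0 <- n0i / sum(n0i)   # P(X_i | Y = 0)
woe <- log(dist1 / dist0)
iv_total <- sum((dist1 - dist0) * woe)
iv_total
# Label the predictive power using the thresholds above
cut(iv_total, breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf), right = FALSE,
    labels = c("not predictive", "weak", "medium", "strong", "suspicious"))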
Method Selection
When method = "auto", the function tests multiple algorithms and selects the one that produces the highest total Information Value while respecting the specified constraints. The selection process considers:
Total Information Value (IV)
Monotonicity of WoE values
Number of bins created
Bin frequency distribution
Statistical stability
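When a manual comparison is preferred over method = "auto", a similar selection can be approximated by running candidate methods separately and comparing total IV per feature. A sketch, assuming the default call returns the gains table described in the Value section (a data.table with feature and iv columns):

library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Run two candidate methods and total the per-bin IV by feature
wb_jedi <- obwoe(dt, target = "creditability", method = "jedi",
                 min_bins = 3, max_bins = 5, positive = "bad|1")
wb_mob <- obwoe(dt, target = "creditability", method = "mob",
                min_bins = 3, max_bins = 5, positive = "bad|1")
iv_jedi <- wb_jedi[, .(iv_jedi = sum(iv)), by = feature]
iv_mob <- wb_mob[, .(iv_mob = sum(iv)), by = feature]
comparison <- merge(iv_jedi, iv_mob, by = "feature")
comparison[, best := ifelse(iv_jedi >= iv_mob, "jedi", "mob")]
comparison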
References
Beltrami, M., Mach, M., & Dall'Aglio, M. (2021). Monotonic Optimal Binning Algorithm for Credit Risk Modeling. Risks, 9(3), 58.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.
Thomas, L.C., Edelman, D.B., & Crook, J.N. (2002). Credit Scoring and Its Applications. SIAM.
Zeng, G. (2013). Metric Divergence Measures and Information Value in Credit Scoring. Journal of Mathematics, 2013, Article ID 848271, 10 pages.
Zeng, Y. (2014). Univariate feature selection and binner. arXiv preprint arXiv:1410.5420.
Mironchyk, P., & Tchistiakov, V. (2017). Monotone Optimal Binning Algorithm for Credit Risk Modeling. Working Paper.
Kerber, R. (1992). ChiMerge: Discretization of Numeric Attributes. In AAAI'92.
Liu, H. & Setiono, R. (1995). Chi2: Feature Selection and Discretization of Numeric Attributes. In TAI'95.
Fayyad, U., & Irani, K. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. Proceedings of the 13th International Joint Conference on Artificial Intelligence, 1022-1027.
Barlow, R. E., & Brunk, H. D. (1972). The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337), 140-147.
Fisher, R. A. (1922). On the interpretation of χ² from contingency tables, and the calculation of P. Journal of the Royal Statistical Society, 85, 87-94.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.
Bertsimas, D., & Tsitsiklis, J. N. (1997). Introduction to Linear Optimization. Athena Scientific.
Gelman, A., Jakulin, A., Pittau, M. G., & Su, Y. S. (2008). A weakly informative default prior distribution for logistic and other regression models. The annals of applied statistics, 2(4), 1360-1383.
Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by simulated annealing. Science, 220(4598), 671-680.
Navas-Palencia, G. (2020). Optimal binning: mathematical programming formulations for binary classification. arXiv preprint arXiv:2001.08025.
Examples
if (FALSE) { # \dontrun{
# Example 1: Using the German Credit Data
library(OptimalBinningWoE)
library(data.table)
library(scorecard)
data(germancredit, package = "scorecard")
dt <- as.data.table(germancredit)
# Process all features with the JEDI method
result <- obwoe(dt,
target = "creditability", method = "jedi",
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View WoE binning information
print(result)
# Process only numeric features, testing the JEDI and CM methods, and get detailed output
numeric_features <- names(dt)[sapply(dt, is.numeric)]
numeric_features <- setdiff(numeric_features, "creditability")
result_detailed <- obwoe(dt,
target = "creditability", features = numeric_features,
method = c("jedi", "cm"), preprocess = TRUE, outputall = FALSE,
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View WoE-transformed data
head(result_detailed$data)
# View preprocessing report
print(result_detailed$report_preprocess)
# View best model report
print(result_detailed$report_best_model)
# Process only categorical features with the UDT method
categoric_features <- names(dt)[sapply(dt, function(i) !is.numeric(i))]
categoric_features <- setdiff(categoric_features, "creditability")
result_cat <- obwoe(dt,
target = "creditability", features = categoric_features,
method = "udt", preprocess = TRUE,
min_bins = 3, max_bins = 4, positive = "bad|1"
)
# View binning information for categorical features
print(result_cat)
# Example 2: Automatic method selection
result_auto <- obwoe(dt,
target = "creditability",
method = "auto", # Tries multiple methods and selects the best
min_bins = 3, max_bins = 5, positive = "bad|1"
)
# View which methods were selected for each feature
print(result_auto$report_best_model)
# Example 3: Using specialized algorithms
# For numerical features with complex distributions
result_lpdb <- obwoe(dt,
target = "creditability",
features = numeric_features[1:3],
method = "jedi", # Local Polynomial Density Binning
min_bins = 3, max_bins = 5, positive = "bad|1",
control = list(polynomial_degree = 3)
)
# For categorical features with many levels
result_jedi <- obwoe(dt,
target = "creditability",
features = categoric_features[1:3],
method = "jedi", # Joint Entropy-Driven Information
min_bins = 3, max_bins = 5, positive = "bad|1"
)
} # }