
This function processes the results of optimal binning and generates a comprehensive gains table with evaluation metrics and characteristics for each bin, providing insight into the predictive power and information value of the binned feature in binary classification models.

Usage

OBGainsTable(binning_result)

Arguments

binning_result

A list containing the binning results, which must include a DataFrame with the following columns:

  • id: Numeric bin identifier.

  • bin: Bin label where feature values were grouped.

  • count: Total count of observations in the bin.

  • count_pos: Count of positive cases (target=1) in the bin.

  • count_neg: Count of negative cases (target=0) in the bin.
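For illustration, the per-bin data that `binning_result` must carry can be sketched as plain records (a Python stand-in with made-up numbers; the actual object is an R list holding a DataFrame with these columns):

```python
# Hypothetical example of the required columns; values are invented.
bins = [
    {"id": 1, "bin": "(-inf, 25]", "count": 400, "count_pos": 40,  "count_neg": 360},
    {"id": 2, "bin": "(25, 40]",   "count": 350, "count_pos": 70,  "count_neg": 280},
    {"id": 3, "bin": "(40, +inf)", "count": 250, "count_pos": 100, "count_neg": 150},
]

# Sanity check: each bin's count must equal count_pos + count_neg.
for b in bins:
    assert b["count"] == b["count_pos"] + b["count_neg"]
```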

Value

A DataFrame containing, for each bin, a detailed breakdown of metrics and characteristics. Columns include:

  • id: Numeric identifier of the bin.

  • bin: Label of the bin.

  • count: Total observations in the bin.

  • pos: Number of positive cases in the bin.

  • neg: Number of negative cases in the bin.

  • woe: Weight of Evidence (\(WoE_i = \ln\frac{P(X_i|Y=1)}{P(X_i|Y=0)}\)).

  • iv: Information Value contribution for the bin (\(IV_i = (P(X_i|Y=1) - P(X_i|Y=0)) \cdot WoE_i\)).

  • total_iv: Total IV across all bins.

  • cum_pos, cum_neg: Cumulative counts of positives and negatives up to the current bin.

  • pos_rate, neg_rate: Positive and negative rates within the bin.

  • pos_perc, neg_perc: Percentage of total positives/negatives represented by the bin.

  • count_perc, cum_count_perc: Percentage of total observations and cumulative percentages.

  • cum_pos_perc, cum_neg_perc: Cumulative percentages of positives and negatives relative to their totals.

  • cum_pos_perc_total, cum_neg_perc_total: Cumulative percentages of positives and negatives relative to total observations.

  • odds_pos: Odds of positives in the bin (\(\frac{pos}{neg}\)).

  • odds_ratio: Ratio of bin odds to total odds (\(OR_i = \frac{(P(Y=1|X_i)/P(Y=0|X_i))}{(P(Y=1)/P(Y=0))}\)).

  • lift: Lift of the bin (\(Lift_i = \frac{P(Y=1|X_i)}{P(Y=1)}\)).

  • ks: Kolmogorov-Smirnov statistic (\(KS_i = |F_1(i) - F_0(i)|\)).

  • gini_contribution: Contribution to the Gini index (\(Gini_i = P(X_i|Y=1)F_0(i) - P(X_i|Y=0)F_1(i)\)).

  • precision: Precision for the bin (\(Precision_i = \frac{TP}{TP + FP}\)).

  • recall: Cumulative recall up to the current bin (\(Recall_i = \frac{\sum_{j=1}^i TP_j}{\sum_{j=1}^n TP_j}\)).

  • f1_score: F1 Score (\(F1_i = 2 \cdot \frac{Precision_i \cdot Recall_i}{Precision_i + Recall_i}\)).

  • log_likelihood: Log-Likelihood (\(LL_i = n_{1i}\ln(p_i) + n_{0i}\ln(1-p_i)\)).

  • kl_divergence: Kullback-Leibler divergence (\(KL_i = p_i \ln\frac{p_i}{p} + (1-p_i)\ln\frac{1-p_i}{1-p}\)).

  • js_divergence: Jensen-Shannon divergence (\(JS_i = \frac{1}{2}KL(P||M) + \frac{1}{2}KL(Q||M)\)).
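Several of the columns above follow directly from the per-bin counts. The sketch below (Python, with the same invented counts as above; not the package's R implementation) computes `woe`, `iv`, `total_iv`, `ks`, `lift`, and `odds_ratio` from `count_pos`/`count_neg`:

```python
import math

pos = [40, 70, 100]          # count_pos per bin (hypothetical data)
neg = [360, 280, 150]        # count_neg per bin
tp, tn = sum(pos), sum(neg)  # total positives and negatives
n = tp + tn                  # total observations

cum_p = cum_n = 0
total_iv = 0.0
for p, q in zip(pos, neg):
    dist_p, dist_q = p / tp, q / tn        # P(X_i|Y=1), P(X_i|Y=0)
    woe = math.log(dist_p / dist_q)        # Weight of Evidence
    iv = (dist_p - dist_q) * woe           # per-bin IV contribution
    total_iv += iv
    cum_p += p
    cum_n += q
    ks = abs(cum_p / tp - cum_n / tn)      # |F_1(i) - F_0(i)|
    lift = (p / (p + q)) / (tp / n)        # bin positive rate vs overall rate
    odds_ratio = (p / q) / (tp / tn)       # bin odds vs total odds
    print(f"woe={woe:+.3f} iv={iv:.4f} ks={ks:.3f} "
          f"lift={lift:.2f} or={odds_ratio:.2f}")

print(f"total_iv={total_iv:.4f}")
```

Note that `ks` reaches its maximum at the bin where the cumulative class distributions diverge most, and returns to zero at the last bin.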

Details

This function organizes the bins and computes the metrics needed to evaluate the quality of an optimal binning applied to a binary classification problem, including measures of class separation, information gain, and performance lift that support model performance analysis.
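The divergence columns compare each bin's event rate \(p_i\) with the overall rate \(p\), treating both as Bernoulli distributions. A minimal sketch of the `kl_divergence` and `js_divergence` formulas (Python, with hypothetical rates; not the package's R code):

```python
import math

def kl(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

p_i, p = 0.40, 0.21          # hypothetical bin event rate vs overall rate
kl_i = kl(p_i, p)            # kl_divergence column for this bin
m = (p_i + p) / 2            # mixture distribution M = (P + Q) / 2
js_i = 0.5 * kl(p_i, m) + 0.5 * kl(p, m)   # js_divergence column
```

Unlike KL, the Jensen-Shannon divergence is symmetric and bounded by \(\ln 2\), which makes it easier to compare across bins.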

References

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

  • Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171-186.

  • Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

  • Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.