This function computes various statistical and performance metrics for a feature that has already been binned, considering a binary target (0/1). It is useful for evaluating the quality of bins generated by optimal binning methods. The calculated metrics include Weight of Evidence (WoE), Information Value (IV), accuracy rates, information divergences, Kolmogorov-Smirnov (KS), Lift, and others.

Usage

OBGainsTableFeature(binned_df, target, group_var = "bin")

Arguments

binned_df

A DataFrame containing the following columns, resulting from a binning process (e.g., using OBApplyWoENum or OBApplyWoECat):

  • feature: Original values of the variable.

  • bin: Bin label where the feature value was classified.

  • woe: Weight of Evidence associated with the bin.

  • idbin: Numeric bin identifier used to optimally order the bins.

target

A numeric binary vector (0 and 1) representing the target. Its length must equal the number of rows of binned_df.

group_var

A string indicating which variable to use for grouping data and calculating metrics. Options: "bin", "woe", or "idbin". Default: "bin".

Value

A DataFrame containing, for each group (bin) defined by group_var, the following columns:

  • group: Name or value of the group selected by group_var.

  • id: Numeric bin identifier, ordered.

  • count: Total count of observations in the group.

  • pos: Count of positive cases (target=1) in the group.

  • neg: Count of negative cases (target=0) in the group.

  • woe: Weight of Evidence for the group, calculated as \(WoE = \ln\frac{P(X \mid Y=1)}{P(X \mid Y=0)}\).

  • iv: Contribution of the group to the Information Value: \(IV = (P(X \mid Y=1) - P(X \mid Y=0)) \cdot WoE\).

  • total_iv: Total IV value, sum of the IV from all groups.

  • cum_pos, cum_neg: Cumulative counts of positive and negative cases up to the current group.

  • pos_rate, neg_rate: Positive and negative rates within the group.

  • pos_perc, neg_perc: Percentage of total positives/negatives represented by the group.

  • count_perc, cum_count_perc: Percentage of total observations and their cumulative percentage.

  • cum_pos_perc, cum_neg_perc: Cumulative percentage of positives/negatives relative to the total positives/negatives.

  • cum_pos_perc_total, cum_neg_perc_total: Cumulative percentage of positives/negatives relative to total observations.

  • odds_pos: Odds of positives in the group (\(\frac{pos}{neg}\)).

  • odds_ratio: Ratio of group odds to overall odds (\(odds_{group}/odds_{total}\)).

  • lift: \(\frac{P(Y=1|X_{group})}{P(Y=1)}\).

  • ks: Kolmogorov-Smirnov statistic at the group level: \(|F_1(i)-F_0(i)|\).

  • gini_contribution: Contribution of the bin to the Gini index, given by \(P(X \mid Y=1) \cdot F_0(i) - P(X \mid Y=0) \cdot F_1(i)\).

  • precision: Precision for the group (\(\frac{TP}{TP+FP}\)), where TP = pos, FP = neg, considering the bin as "positive".

  • recall: \(\frac{\sum_{j=1}^i TP_j}{\sum_{j=1}^n TP_j}\), cumulative percentage of true positives.

  • f1_score: \(2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\).

  • log_likelihood: Log-likelihood for the group, \(LL = n_{pos} \ln(p_i) + n_{neg} \ln(1 - p_i)\), with \(p_i = pos\_rate\).

  • kl_divergence: Kullback-Leibler divergence between the group distribution and the global distribution of positives.

  • js_divergence: Jensen-Shannon divergence between the group and global distributions, a symmetric and finite measure.
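
The core per-group formulas above can be sketched directly in base R. This is a minimal illustration with a toy binned_df and a simulated target, not the package implementation (which returns many more columns):

```r
# Minimal sketch (not OBGainsTableFeature itself): compute WoE, IV, KS,
# and lift per bin from a toy binned data frame and a 0/1 target.
set.seed(1)
binned_df <- data.frame(bin = rep(c("A", "B", "C"), each = 100))
target <- rbinom(300, 1, prob = rep(c(0.1, 0.3, 0.6), each = 100))

pos   <- tapply(target, binned_df$bin, sum)     # positives (target = 1) per bin
count <- tapply(target, binned_df$bin, length)  # observations per bin
neg   <- count - pos                            # negatives (target = 0) per bin

pos_perc <- pos / sum(pos)                      # P(X | Y = 1)
neg_perc <- neg / sum(neg)                      # P(X | Y = 0)

woe      <- log(pos_perc / neg_perc)            # WoE = ln(P(X|Y=1) / P(X|Y=0))
iv       <- (pos_perc - neg_perc) * woe         # per-bin IV contribution
total_iv <- sum(iv)                             # total Information Value

ks   <- abs(cumsum(pos_perc) - cumsum(neg_perc))   # |F1(i) - F0(i)| per bin
lift <- (pos / count) / (sum(pos) / sum(count))    # P(Y=1 | bin) / P(Y=1)
```

Note that the IV contributions are non-negative by construction, since pos_perc - neg_perc and the WoE always share the same sign.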

Details

The function organizes the bins defined by group_var and computes essential performance metrics for the applied binning in a binary classification model. These metrics help evaluate how well the bins separate positives from negatives and how much information each bin contributes to the model.
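
The divergence columns can be sketched in the same spirit. The snippet below is a hedged illustration, assuming (per the column descriptions above) that kl_divergence and js_divergence compare a group's positive/negative split with the global split; the hypothetical values of p and q are made up for the example:

```r
# Sketch (assumption: each bin's (pos_rate, neg_rate) pair is compared
# with the global (pos_rate, neg_rate) proportions).
kl_div <- function(p, q) sum(p * log(p / q))  # KL(p || q); requires p, q > 0

js_div <- function(p, q) {                    # Jensen-Shannon: symmetric, finite
  m <- (p + q) / 2                            # mixture distribution
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}

p <- c(0.60, 0.40)  # hypothetical bin-level (pos_rate, neg_rate)
q <- c(0.25, 0.75)  # hypothetical global (pos_rate, neg_rate)
kl_div(p, q)
js_div(p, q)
```

Unlike KL divergence, the JS divergence is symmetric in its arguments and bounded above by \(\ln 2\) (for natural logarithms), which makes it a convenient finite measure for comparing bins.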

References

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

  • Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171-186.

  • Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

  • Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

Examples

if (FALSE) { # \dontrun{
# Hypothetical example:
# Assume binned_df is the result of OBApplyWoENum(...) and target is a 0/1 vector.
# gains_table <- OBGainsTableFeature(binned_df, target, group_var = "idbin")
# print(gains_table)
} # }