This function computes various statistical and performance metrics for a feature that has already been binned, considering a binary target (0/1). It is useful for evaluating the quality of bins generated by optimal binning methods. The calculated metrics include Weight of Evidence (WoE), Information Value (IV), accuracy rates, information divergences, Kolmogorov-Smirnov (KS), Lift, and others.

Usage

OBGainsTableFeature(binned_df, target, group_var = "bin")

Arguments

binned_df

A DataFrame containing the following columns, resulting from a binning process (e.g., using OBApplyWoENum or OBApplyWoECat):

  • feature: Original values of the variable.

  • bin: Bin label where the feature value was classified.

  • woe: Weight of Evidence associated with the bin.

  • idbin: Numeric bin identifier used to optimally order the bins.

target

A numeric binary vector (0 and 1) representing the target. Its length must equal the number of rows of binned_df.

group_var

A string indicating which variable to use for grouping data and calculating metrics. Options: "bin", "woe", or "idbin". Default: "bin".

Value

A DataFrame containing, for each group (bin) defined by group_var, the following columns:

  • group: Name or value of the group selected by group_var.

  • id: Numeric bin identifier, ordered.

  • count: Total count of observations in the group.

  • pos: Count of positive cases (target=1) in the group.

  • neg: Count of negative cases (target=0) in the group.

  • woe: Weight of Evidence for the group, calculated as \(WoE = \ln\frac{P(X \mid Y=1)}{P(X \mid Y=0)}\).

  • iv: Contribution of the group to the Information Value: \(IV = (P(X \mid Y=1) - P(X \mid Y=0)) \cdot WoE\).

  • total_iv: Total IV value, sum of the IV from all groups.

  • cum_pos, cum_neg: Cumulative counts of positive and negative cases up to the current group.

  • pos_rate, neg_rate: Positive and negative rates within the group.

  • pos_perc, neg_perc: Percentage of total positives/negatives represented by the group.

  • count_perc, cum_count_perc: Percentage of total observations and their cumulative percentage.

  • cum_pos_perc, cum_neg_perc: Cumulative percentage of positives/negatives relative to the total positives/negatives.

  • cum_pos_perc_total, cum_neg_perc_total: Cumulative percentage of positives/negatives relative to total observations.

  • odds_pos: Odds of positives in the group (\(\frac{pos}{neg}\)).

  • odds_ratio: Ratio of group odds to overall odds (\(odds_{group}/odds_{total}\)).

  • lift: \(\frac{P(Y=1|X_{group})}{P(Y=1)}\).

  • ks: Kolmogorov-Smirnov statistic at the group level: \(|F_1(i)-F_0(i)|\).

  • gini_contribution: Contribution of the bin to the Gini index, given by \(P(X \mid Y=1) \cdot F_0(i) - P(X \mid Y=0) \cdot F_1(i)\).

  • precision: Precision for the group (\(\frac{TP}{TP+FP}\)), where TP = pos, FP = neg, considering the bin as "positive".

  • recall: \(\frac{\sum_{j=1}^i TP_j}{\sum_{j=1}^n TP_j}\), cumulative percentage of true positives.

  • f1_score: \(2 \cdot \frac{Precision \cdot Recall}{Precision + Recall}\).

  • log_likelihood: Log-likelihood for the group, \(LL = n_{pos} \ln(p_i) + n_{neg} \ln(1 - p_i)\), with \(p_i = pos\_rate\).

  • kl_divergence: Kullback-Leibler divergence between the group distribution and the global distribution of positives.

  • js_divergence: Jensen-Shannon divergence between the group and global distributions, a symmetric and finite measure.
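
The core per-group formulas above can be sketched directly in base R. This is a minimal illustration with a toy binned_df and a simulated target, not the package implementation (which returns many more columns):

```r
# Minimal sketch (not OBGainsTableFeature itself): compute WoE, IV, KS,
# and lift per bin from a toy binned data frame and a 0/1 target.
set.seed(1)
binned_df <- data.frame(bin = rep(c("A", "B", "C"), each = 100))
target <- rbinom(300, 1, prob = rep(c(0.1, 0.3, 0.6), each = 100))

pos   <- tapply(target, binned_df$bin, sum)     # positives (target = 1) per bin
count <- tapply(target, binned_df$bin, length)  # observations per bin
neg   <- count - pos                            # negatives (target = 0) per bin

pos_perc <- pos / sum(pos)                      # P(X | Y = 1)
neg_perc <- neg / sum(neg)                      # P(X | Y = 0)

woe      <- log(pos_perc / neg_perc)            # WoE = ln(P(X|Y=1) / P(X|Y=0))
iv       <- (pos_perc - neg_perc) * woe         # per-bin IV contribution
total_iv <- sum(iv)                             # total Information Value

ks   <- abs(cumsum(pos_perc) - cumsum(neg_perc))   # |F1(i) - F0(i)| per bin
lift <- (pos / count) / (sum(pos) / sum(count))    # P(Y=1 | bin) / P(Y=1)
```

Note that the IV contributions are non-negative by construction, since pos_perc - neg_perc and the WoE always share the same sign.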

Details

The function organizes the bins defined by group_var and computes essential performance metrics for the applied binning in a binary classification model. These metrics help evaluate how well the bins separate positives from negatives and how much information each bin contributes to the model.
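
The divergence columns can be sketched in the same spirit. The snippet below is a hedged illustration, assuming (per the column descriptions above) that kl_divergence and js_divergence compare a group's positive/negative split with the global split; the hypothetical values of p and q are made up for the example:

```r
# Sketch (assumption: each bin's (pos_rate, neg_rate) pair is compared
# with the global (pos_rate, neg_rate) proportions).
kl_div <- function(p, q) sum(p * log(p / q))  # KL(p || q); requires p, q > 0

js_div <- function(p, q) {                    # Jensen-Shannon: symmetric, finite
  m <- (p + q) / 2                            # mixture distribution
  0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
}

p <- c(0.60, 0.40)  # hypothetical bin-level (pos_rate, neg_rate)
q <- c(0.25, 0.75)  # hypothetical global (pos_rate, neg_rate)
kl_div(p, q)
js_div(p, q)
```

Unlike KL divergence, the JS divergence is symmetric in its arguments and bounded above by \(\ln 2\) (for natural logarithms), which makes it a convenient finite measure for comparing bins.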

References

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

  • Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171-186.

  • Kullback, S., & Leibler, R. A. (1951). On Information and Sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.

  • Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

Examples

if (FALSE) { # \dontrun{
# Hypothetical example:
# Assume binned_df is the result of OBApplyWoENum(...) and target is a 0/1 vector.
# gains_table <- OBGainsTableFeature(binned_df, target, group_var = "idbin")
# print(gains_table)
} # }