This function bins a numerical variable at user-supplied cutpoints, calculates the Weight of Evidence (WoE) and Information Value (IV) for each bin, and transforms the feature accordingly.

Usage

binning_numerical_cutpoints(feature, target, cutpoints)

Arguments

feature

A numeric vector representing the numerical feature to be binned.

target

An integer vector representing the binary target variable (0 or 1).

cutpoints

A numeric vector containing the cutpoints to define the bin boundaries.

Value

A list with two elements:

woefeature

A numeric vector representing the transformed feature with WoE values for each observation.

woebin

A data frame containing detailed statistics for each bin, including counts, WoE, and IV.

Details

Binning is a preprocessing step that groups continuous values of a numerical feature into a smaller number of bins. This function performs binning based on user-defined cutpoints, which allows you to define how the numerical feature should be split into intervals. The resulting bins are evaluated using the WoE and IV metrics, which are often used in predictive modeling, especially in credit risk modeling.
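As an illustration of how user-defined cutpoints split a feature into intervals, the sketch below uses base R's cut(); the interval labels shown are illustrative and need not match what binning_numerical_cutpoints itself produces.

```r
# Sketch: splitting a numeric feature at user-defined cutpoints with base R.
feature <- c(23, 45, 34, 25, 56, 48, 35, 29, 53, 41)
cutpoints <- c(30, 40, 50)
# Open-ended outer intervals so every value falls into some bin.
bins <- cut(feature, breaks = c(-Inf, cutpoints, Inf))
table(bins)
```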

The Weight of Evidence (WoE) is calculated as: $$\text{WoE} = \log\left(\frac{\text{Positive Rate}}{\text{Negative Rate}}\right)$$ where the Positive Rate is the share of all positive observations (target = 1) that fall in the bin, and the Negative Rate is the share of all negative observations (target = 0) that fall in the bin.
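The per-bin WoE calculation can be sketched in a few lines of base R. The helper below is an assumption for illustration only, not the package's internals; it takes the bin assignments and the binary target and returns one WoE value per bin.

```r
# Sketch of per-bin WoE (illustrative helper, not the package API):
# WoE = log(share of all positives in the bin / share of all negatives in the bin).
woe_by_bin <- function(bins, target) {
  tab <- table(bins, target)
  pos_rate <- tab[, "1"] / sum(tab[, "1"])  # distribution of positives across bins
  neg_rate <- tab[, "0"] / sum(tab[, "0"])  # distribution of negatives across bins
  log(pos_rate / neg_rate)
}
```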

The Information Value (IV) measures the predictive power of the numerical feature and is calculated as: $$IV = \sum (\text{Positive Rate} - \text{Negative Rate}) \times \text{WoE}$$

The IV metric provides insight into how well the binned feature predicts the target variable:

  • IV < 0.02: Not predictive

  • 0.02 <= IV < 0.1: Weak predictive power

  • 0.1 <= IV < 0.3: Medium predictive power

  • IV >= 0.3: Strong predictive power
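Summing the per-bin contributions gives the feature's total IV, which can then be read against the thresholds above. The helper below is a sketch under the same assumed definitions, not the package's implementation.

```r
# Sketch: total IV from the per-bin positive/negative rates
# (illustrative helper, not the package API).
iv_total <- function(bins, target) {
  tab <- table(bins, target)
  pos_rate <- tab[, "1"] / sum(tab[, "1"])
  neg_rate <- tab[, "0"] / sum(tab[, "0"])
  sum((pos_rate - neg_rate) * log(pos_rate / neg_rate))
}
```

Note that bins containing only one class produce infinite WoE; in practice such bins are merged or smoothed before computing IV.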

The WoE transformation replaces each raw value with the WoE of its bin, yielding a numeric feature on the log-odds scale that can be used directly in logistic regression and other predictive models, improving model interpretability and often performance.

References

  • Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons.

Author

Lopes, J. E.

Examples

if (FALSE) { # \dontrun{
# Example usage
feature <- c(23, 45, 34, 25, 56, 48, 35, 29, 53, 41)
target <- c(1, 0, 1, 1, 0, 0, 0, 1, 1, 0)
cutpoints <- c(30, 40, 50)
result <- binning_numerical_cutpoints(feature, target, cutpoints)
print(result$woefeature)  # WoE-transformed feature
print(result$woebin)      # WoE and IV statistics for each bin
} # }