Select Optimal Features Based on Weight of Evidence
This function selects optimal features from the result of an Optimal Binning and Weight of Evidence (WoE) analysis. It filters features based on their Information Value (IV), allowing for fine-tuned feature selection for predictive modeling.
Usage
OBSelectOptimalFeatures(
obresult,
target,
iv_threshold = 0.02,
min_features = 5L,
max_features = NULL
)
Arguments
- obresult
A list containing the result of the Optimal Binning and WoE analysis. Must include elements 'woedt' (a data.table with WoE transformed data) and 'bestsreport' (a data.table with feature performance metrics).
- target
Character. The name of the target variable in the dataset.
- iv_threshold
Numeric. The minimum Information Value threshold for feature selection. Features with IV below this threshold will be excluded. Default is 0.02.
- min_features
Integer. The minimum number of features to select, regardless of their IV. If fewer than this many features meet the IV threshold, additional features are still selected so that the minimum is met. Default is 5.
- max_features
Integer or NULL. The maximum number of features to select. If NULL (default), no maximum limit is applied.
Value
A list containing:
- data
A data.table with the selected WoE features and the target variable.
- selected_features
A character vector of the selected WoE feature names.
- feature_iv
A data.table with all features and their total IV.
- report
A data.table summarizing the feature selection process.
Details
The function performs the following steps:
1. Validates input parameters.
2. Extracts and sorts features by their Information Value.
3. Selects features based on the provided IV threshold.
4. Adjusts the selection to meet minimum and maximum feature count requirements.
5. Prepares a final dataset with selected WoE features and the target variable.
6. Generates a summary report of the selection process.
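The sorting, thresholding, and count-adjustment steps can be sketched in base R as follows. This is a minimal illustration, not the package's implementation: the helper `select_by_iv`, the input `iv_table`, and its column names `feature` and `iv` are all hypothetical, and the order in which the minimum and maximum limits are applied is an assumption.

```r
# Hypothetical sketch of the selection logic; NOT the package's implementation.
# `iv_table` is an assumed data.frame with columns `feature` and `iv`.
select_by_iv <- function(iv_table, iv_threshold = 0.02,
                         min_features = 5L, max_features = NULL) {
  stopifnot(
    is.data.frame(iv_table),
    all(c("feature", "iv") %in% names(iv_table))
  )
  # Sort features by IV, highest first
  ranked <- iv_table[order(iv_table$iv, decreasing = TRUE), , drop = FALSE]
  # Count the features that meet the IV threshold
  n_keep <- sum(ranked$iv >= iv_threshold)
  # Top up to the minimum, then cap at the maximum (assumed order of adjustments)
  n_keep <- max(n_keep, min(min_features, nrow(ranked)))
  if (!is.null(max_features)) n_keep <- min(n_keep, max_features)
  ranked$feature[seq_len(n_keep)]
}

# Illustrative data only
iv_table <- data.frame(
  feature = c("age", "income", "tenure", "region"),
  iv      = c(0.15, 0.01, 0.30, 0.005),
  stringsAsFactors = FALSE
)
selected <- select_by_iv(iv_table, iv_threshold = 0.02, min_features = 3L)
print(selected)
```

Only two of the four features pass the threshold here, so the sketch adds the next-highest-IV feature to reach `min_features = 3`.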
Mathematical Background:
Weight of Evidence (WoE) and Information Value (IV) are key concepts in predictive modeling, especially in credit scoring. They are derived from information theory and provide a way to measure the predictive power of an independent variable in relation to the dependent variable.
Let \(Y\) be a binary target variable and \(X\) be a predictor variable.
For a given bin \(i\) of \(X\):
$$P(X_i|Y=1) = \frac{\text{Number of events in bin i}}{\text{Total number of events}}$$
$$P(X_i|Y=0) = \frac{\text{Number of non-events in bin i}}{\text{Total number of non-events}}$$
The Weight of Evidence for bin \(i\) is defined as:
$$WoE_i = \ln\left(\frac{P(X_i|Y=1)}{P(X_i|Y=0)}\right)$$
The Information Value for the entire variable \(X\) is:
$$IV = \sum_{i} (P(X_i|Y=1) - P(X_i|Y=0)) \cdot WoE_i$$
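As a small worked example of the formulas above, the following base R snippet computes WoE per bin and the total IV from event and non-event counts. The counts are made up for illustration and the variable names are assumptions, not objects produced by the package.

```r
# Hypothetical counts for a predictor split into three bins (illustrative data only)
events     <- c(10, 30, 60)   # number of events (Y = 1) per bin
non_events <- c(50, 30, 20)   # number of non-events (Y = 0) per bin

# Bin-level distributions P(X_i | Y = 1) and P(X_i | Y = 0)
p_event     <- events / sum(events)
p_non_event <- non_events / sum(non_events)

# Weight of Evidence per bin and Information Value for the variable
woe <- log(p_event / p_non_event)
iv  <- sum((p_event - p_non_event) * woe)

print(round(woe, 4))
print(round(iv, 4))
```

Each term of the IV sum is non-negative, because the difference and the logarithm always share the same sign. A bin with zero events or zero non-events would produce an infinite WoE under these formulas, so such bins usually need merging or smoothing before the calculation.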
Interpretation of Information Value:
| IV Range | Predictive Power |
|----------|------------------|
| < 0.02   | Useless          |
| 0.02-0.1 | Weak             |
| 0.1-0.3  | Medium           |
| 0.3-0.5  | Strong           |
| > 0.5    | Suspicious       |
Note: An IV > 0.5 might indicate overfitting or data leakage and should be investigated.
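The interpretation bands can be applied to a vector of IV values with `cut()`. This is a sketch only: `iv_band` is a hypothetical helper that is not part of the package, and because the table does not say which band a boundary value such as exactly 0.1 or 0.5 belongs to, the left-closed intervals used here are an assumption.

```r
# Hypothetical helper: map IV values to the predictive-power bands listed above
iv_band <- function(iv) {
  cut(
    iv,
    breaks = c(-Inf, 0.02, 0.1, 0.3, 0.5, Inf),
    labels = c("Useless", "Weak", "Medium", "Strong", "Suspicious"),
    right  = FALSE  # intervals are [lower, upper); boundary handling is an assumption
  )
}

# Illustrative IV values, one per band
bands <- as.character(iv_band(c(0.01, 0.05, 0.2, 0.4, 0.8)))
print(bands)
```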
Examples
if (FALSE) { # \dontrun{
  # Assuming 'obwoe_result' is the output from an Optimal Binning and WoE analysis
  result <- OBSelectOptimalFeatures(
    obresult = obwoe_result,
    target = "target_variable",
    iv_threshold = 0.05,
    min_features = 10,
    max_features = 30
  )

  # Access the final dataset with selected WoE features
  final_dataset <- result$data

  # View the selected WoE feature names
  print(result$selected_features)

  # View the feature selection summary report
  print(result$report)
} # }