Preprocesses a numeric or categorical variable for optimal binning with handling of missing values and outliers
OBDataPreprocessor.Rd
This function preprocesses a numeric or categorical feature by replacing missing values and, optionally, treating outliers in numeric features. Outliers can be detected with the IQR rule, Z-scores, or Grubbs' test. The function can also return summary statistics computed before and after preprocessing.
Usage
OBDataPreprocessor(
target,
feature,
num_miss_value = -999,
char_miss_value = "N/A",
outlier_method = "iqr",
outlier_process = FALSE,
preprocess = as.character(c("both")),
iqr_k = 1.5,
zscore_threshold = 3,
grubbs_alpha = 0.05
)
Arguments
- target
Numeric vector representing the binary target variable, where 1 indicates a positive event (e.g., default) and 0 indicates a negative event (e.g., non-default).
- feature
Numeric or character vector representing the feature to be binned.
- num_miss_value
(Optional) Numeric value to replace missing values in numeric features. Default is -999.0.
- char_miss_value
(Optional) String value to replace missing values in categorical features. Default is "N/A".
- outlier_method
(Optional) Method to detect outliers. Choose from "iqr", "zscore", or "grubbs". Default is "iqr".
- outlier_process
(Optional) Logical flag indicating whether detected outliers should be processed. Default is FALSE.
- preprocess
(Optional) Character vector specifying what to return: "feature", "report", or "both". Default is "both".
- iqr_k
(Optional) The multiplier for the interquartile range (IQR) when using the IQR method to detect outliers. Default is 1.5.
- zscore_threshold
(Optional) The threshold for Z-score to detect outliers. Default is 3.0.
- grubbs_alpha
(Optional) The significance level for Grubbs' test to detect outliers. Default is 0.05.
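The three detection rules controlled by iqr_k, zscore_threshold, and grubbs_alpha can be sketched in base R as follows. This is a minimal illustration using the conventional textbook definitions (Tukey fences, standardized scores, two-sided Grubbs' test). The helper names iqr_outliers, zscore_outliers, and grubbs_outlier are hypothetical, and the package's internal implementation may differ, for example in the quantile definition it uses.

```r
# Illustrative base-R versions of the three rules; not the package's internals.

# IQR rule: flag values beyond k * IQR outside the first and third quartiles.
iqr_outliers <- function(x, k = 1.5) {
  q <- stats::quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Z-score rule: flag values whose absolute standardized score exceeds the threshold.
zscore_outliers <- function(x, threshold = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / stats::sd(x, na.rm = TRUE)
  abs(z) > threshold
}

# Two-sided Grubbs' test: is the single most extreme value an outlier at level alpha?
grubbs_outlier <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  g <- max(abs(x - mean(x))) / stats::sd(x)
  t_crit <- stats::qt(1 - alpha / (2 * n), df = n - 2)
  g_crit <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  g > g_crit
}

x <- c(10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 100)
which(iqr_outliers(x))                    # position of the value flagged by the IQR rule
which(zscore_outliers(x, threshold = 2.5))
grubbs_outlier(x)
```

Note that the Z-score rule depends on the mean and standard deviation, which the outlier itself inflates, so in small samples it is less sensitive than the IQR rule.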
Value
A list containing the following elements, depending on the preprocess argument:
- preprocess
A DataFrame containing the original and preprocessed feature values.
- report
A DataFrame summarizing the variable type, number of missing values, number of outliers (for numeric features), and statistics before and after preprocessing.
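A call using the signature shown under Usage might look like the sketch below. The simulated data are made up for illustration, and the package name OptimalBinningWoE is an assumption not stated on this page, so the call is guarded and only runs when that package exports the function.

```r
# Simulated inputs (illustrative only): a numeric feature with one extreme
# value and two missing values, plus a binary target.
set.seed(42)
n <- 200
feature <- c(rnorm(n - 3, mean = 50, sd = 10), 500, NA, NA)
target <- as.numeric(rbinom(n, size = 1, prob = 0.2))

# Assumption: the function is exported by a package named OptimalBinningWoE.
has_fn <- requireNamespace("OptimalBinningWoE", quietly = TRUE) &&
  "OBDataPreprocessor" %in% getNamespaceExports("OptimalBinningWoE")

if (has_fn) {
  res <- OptimalBinningWoE::OBDataPreprocessor(
    target = target,
    feature = feature,
    outlier_method = "iqr",
    outlier_process = TRUE,
    preprocess = "both"
  )
  str(res$preprocess)  # original and preprocessed feature values
  str(res$report)      # summary statistics before and after preprocessing
}
```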
Details
The function can handle both numeric and categorical features. For numeric features, it replaces missing values with num_miss_value and can apply outlier detection using different methods. For categorical features, it replaces missing values with char_miss_value. The function can return the preprocessed feature and/or a report with summary statistics.
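The missing-value step described above can be approximated in base R as shown here. The helper replace_missing is hypothetical and only mirrors the documented behavior for NA values. How the function itself treats other encodings of missingness, such as empty strings, is not stated on this page.

```r
# Hypothetical helper mirroring the documented replacement of missing values.
replace_missing <- function(feature, num_miss_value = -999, char_miss_value = "N/A") {
  if (is.numeric(feature)) {
    # Numeric features: NA becomes the numeric sentinel.
    feature[is.na(feature)] <- num_miss_value
  } else {
    # Categorical features: NA becomes the string sentinel.
    feature <- as.character(feature)
    feature[is.na(feature)] <- char_miss_value
  }
  feature
}

replace_missing(c(1.5, NA, 3))
replace_missing(c("a", NA, "b"))
```

Using a sentinel such as -999 keeps missing observations in the data so that a later binning step can place them in their own bin, provided the sentinel lies outside the feature's real range.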