step_obwoe() creates a specification of a recipe step that
discretizes predictor variables using one of 28 state-of-the-art optimal
binning algorithms and transforms them into Weight of Evidence (WoE) values.
This step fully integrates the OptimalBinningWoE package into the
tidymodels framework, supporting supervised discretization for both
binary and multinomial classification targets with extensive hyperparameter
tuning capabilities.
Usage
step_obwoe(
recipe,
...,
role = "predictor",
trained = FALSE,
outcome = NULL,
algorithm = "auto",
min_bins = 2L,
max_bins = 10L,
bin_cutoff = 0.05,
output = c("woe", "bin", "both"),
suffix_woe = "_woe",
suffix_bin = "_bin",
na_woe = 0,
control = list(),
binning_results = NULL,
skip = FALSE,
id = recipes::rand_id("obwoe")
)Arguments
- recipe
A recipe object. The step will be added to the sequence of operations for this recipe.
- ...
One or more selector functions to choose variables for this step. See
selectionsfor available selectors. Common choices includeall_predictors(),all_numeric_predictors(), orall_nominal_predictors(). Ensure the selected variables are compatible with the chosenalgorithm(e.g., do not apply"mdlp"to categorical data).- role
For variables created by this step, what role should they have?
Default is
"predictor".- trained
A logical indicating whether the step has been trained (fitted). This should not be set manually.
- outcome
A character string specifying the name of the binary or multinomial response variable. This argument is required as all binning algorithms are supervised. The outcome must exist in the training data provided to
prep(). The outcome should be encoded as a factor (standard for tidymodels classification) or as integers 0/1 for binary, 0/1/2/... for multinomial.- algorithm
Character string specifying the binning algorithm to use. Use
"auto"(default) for automatic selection based on target type:"jedi"for binary targets,"jedi_mwoe"for multinomial.Available algorithms are organized by supported feature types:
Universal (numerical and categorical):
"auto","jedi","jedi_mwoe","cm","dp","dmiv","fetb","mob","sketch","udt"Numerical only:
"bb","ewb","fast_mdlp","ir","kmb","ldb","lpdb","mblp","mdlp","mrblp","oslp","ubsd"Categorical only:
"gmb","ivb","mba","milp","sab","sblp","swb"This parameter is tunable with
tune().- min_bins
Integer specifying the minimum number of bins to create. Must be at least 2. Default is 2. This parameter is tunable with
tune().- max_bins
Integer specifying the maximum number of bins to create. Must be greater than or equal to
min_bins. Default is 10. This parameter is tunable withtune().- bin_cutoff
Numeric value between 0 and 1 (exclusive) specifying the minimum proportion of total observations that each bin must contain. Bins with fewer observations are merged with adjacent bins. This serves as a regularization mechanism to prevent overfitting and ensure statistical stability of WoE estimates. Default is 0.05 (5%). This parameter is tunable with
tune().- output
Character string specifying the transformation output format:
"woe"Replaces the original variable with WoE values (default). This is the standard choice for logistic regression scorecards.
"bin"Replaces the original variable with bin labels (character). Useful for tree-based models or exploratory analysis.
"both"Keeps the original column and adds two new columns with suffixes
_woeand_bin. Useful for model comparison or audit trails.
- suffix_woe
Character string suffix appended to create WoE column names when
output = "both". Default is"_woe".- suffix_bin
Character string suffix appended to create bin column names when
output = "both". Default is"_bin".- na_woe
Numeric value to assign to observations that cannot be mapped to a bin during
bake(). This includes missing values (NA) and unseen categories not present in the training data. Default is 0, which represents neutral evidence (neither good nor bad).- control
A named list of additional control parameters passed to
control.obwoe. These provide fine-grained control over algorithm behavior such as convergence thresholds and maximum pre-bins. Parameters specified directly instep_obwoe()(e.g.,bin_cutoff) take precedence over values in this list.- binning_results
Internal storage for fitted binning models after
prep(). Do not set manually.- skip
Logical. Should this step be skipped when
bake()is called on new data? Default isFALSE. Setting toTRUEis rarely needed for WoE transformations but may be useful in specialized workflows.- id
A unique character string to identify this step. If not provided, a random identifier is generated.
Details
Weight of Evidence Transformation
Weight of Evidence (WoE) is a supervised encoding technique that transforms categorical and continuous variables into a scale that measures the predictive strength of each value or bin relative to the target variable. For a bin \(i\), the WoE is defined as:
$$WoE_i = \ln\left(\frac{\text{Distribution of Events}_i}{\text{Distribution of Non-Events}_i}\right)$$
Positive WoE values indicate the bin has a higher proportion of events (e.g., defaults) than the overall population, while negative values indicate lower risk.
Algorithm Selection Strategy
The algorithm parameter provides access to 28 binning algorithms:
Use
algorithm = "auto"(default) for automatic selection:"jedi"for binary targets,"jedi_mwoe"for multinomial.Use
algorithm = "mob"(Monotonic Optimal Binning) when monotonic WoE trends are required for regulatory compliance (Basel/IFRS 9).Use
algorithm = "mdlp"for entropy-based discretization of numerical variables (requiresall_numeric_predictors()).Use
algorithm = "dp"(Dynamic Programming) for exact optimal solutions when computational cost is acceptable.
If an incompatible algorithm is applied to a variable (e.g., "mdlp"
on a factor), the step will issue a warning during prep() and skip
that variable, leaving it untransformed.
Handling New Data
During bake(), observations are mapped to bins learned during
prep():
Numerical variables: Values are assigned to bins based on the learned cutpoints using interval notation.
Categorical variables: Categories are matched to their corresponding bins. Categories not seen during training receive the
na_woevalue.Missing values: Always receive the
na_woevalue.
Tuning with tune
This step is fully compatible with the tune package. The following
parameters support tune():
algorithm: Seeobwoe_algorithm.min_bins: Seeobwoe_min_bins.max_bins: Seeobwoe_max_bins.bin_cutoff: Seeobwoe_bin_cutoff.
References
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. John Wiley & Sons. doi:10.1002/9781119201731
Thomas, L. C., Edelman, D. B., & Crook, J. N. (2002). Credit Scoring and Its Applications. SIAM Monographs on Mathematical Modeling and Computation. doi:10.1137/1.9780898718317
Navas-Palencia, G. (2020). Optimal Binning: Mathematical Programming Formulation and Solution Approach. Expert Systems with Applications, 158, 113508. doi:10.1016/j.eswa.2020.113508
See also
obwoe for the underlying binning engine,
control.obwoe for advanced control parameters,
obwoe_algorithm, obwoe_min_bins,
obwoe_max_bins, obwoe_bin_cutoff for tuning
parameter definitions,
recipe, prep,
bake for the tidymodels recipe framework.
Examples
# \donttest{
library(recipes)
#> Loading required package: dplyr
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
#>
#> Attaching package: ‘recipes’
#> The following object is masked from ‘package:stats’:
#>
#> step
# Simulated credit data
set.seed(123)
credit_data <- data.frame(
age = rnorm(500, 45, 12),
income = exp(rnorm(500, 10, 0.6)),
employment = sample(c("Employed", "Self-Employed", "Unemployed"),
500,
replace = TRUE, prob = c(0.7, 0.2, 0.1)
),
education = factor(c("HighSchool", "Bachelor", "Master", "PhD")[
sample(1:4, 500, replace = TRUE, prob = c(0.3, 0.4, 0.2, 0.1))
]),
default = factor(rbinom(500, 1, 0.15),
levels = c(0, 1),
labels = c("No", "Yes")
)
)
# Example 1: Basic usage with automatic algorithm selection
rec_basic <- recipe(default ~ ., data = credit_data) %>%
step_obwoe(all_predictors(), outcome = "default")
rec_prepped <- prep(rec_basic)
baked_data <- bake(rec_prepped, new_data = NULL)
head(baked_data)
#> # A tibble: 6 × 5
#> age income employment education default
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 0.0599 -0.115 0.00761 -0.0945 No
#> 2 0.0599 -0.115 0.00761 -0.0979 Yes
#> 3 0.0599 0.230 0.00761 -0.0945 No
#> 4 0.0599 0.230 0.00761 -0.0504 No
#> 5 0.0599 -0.310 0.00761 -0.0979 No
#> 6 -20.2 0 0.00761 -0.0979 No
# View binning details
tidy(rec_prepped, number = 1)
#> # A tibble: 15 × 5
#> terms bin woe iv id
#> <chr> <chr> <dbl> <dbl> <chr>
#> 1 age (-Inf;65.213228] 0.0599 0.00348 obwoe_sOPpR
#> 2 age (65.213228;+Inf] -20.2 1.17 obwoe_sOPpR
#> 3 income (-Inf;11272.824005] -0.310 0.0129 obwoe_sOPpR
#> 4 income (11272.824005;20785.031269] -0.115 0.00384 obwoe_sOPpR
#> 5 income (20785.031269;24008.757553] 0 0 obwoe_sOPpR
#> 6 income (24008.757553;32373.582575] 0 0 obwoe_sOPpR
#> 7 income (32373.582575;59462.758585] 0.230 0.0114 obwoe_sOPpR
#> 8 income (59462.758585;+Inf] 0.429 0.0107 obwoe_sOPpR
#> 9 employment Self-Employed -0.0873 0.00136 obwoe_sOPpR
#> 10 employment Employed 0.00761 0.0000409 obwoe_sOPpR
#> 11 employment Unemployed 0.0382 0.000164 obwoe_sOPpR
#> 12 education HighSchool -0.0979 0.00273 obwoe_sOPpR
#> 13 education Master -0.0945 0.00187 obwoe_sOPpR
#> 14 education Bachelor -0.0504 0.000965 obwoe_sOPpR
#> 15 education PhD 0.490 0.0298 obwoe_sOPpR
# Example 2: Numerical-only algorithm on numeric predictors
rec_mdlp <- recipe(default ~ age + income, data = credit_data) %>%
step_obwoe(all_numeric_predictors(),
outcome = "default",
algorithm = "mdlp",
min_bins = 3,
max_bins = 6
)
# Example 3: Output both bins and WoE
rec_both <- recipe(default ~ age, data = credit_data) %>%
step_obwoe(age,
outcome = "default",
output = "both"
)
baked_both <- bake(prep(rec_both), new_data = NULL)
names(baked_both)
#> [1] "age" "age_woe" "age_bin" "default"
# Contains: default, age, age_woe, age_bin
# Example 4: Custom control parameters
rec_custom <- recipe(default ~ ., data = credit_data) %>%
step_obwoe(all_predictors(),
outcome = "default",
algorithm = "mob",
bin_cutoff = 0.03,
control = list(
max_n_prebins = 30,
convergence_threshold = 1e-8
)
)
# Example 5: Tuning specification (for use with tune package)
# rec_tune <- recipe(default ~ ., data = credit_data) %>%
# step_obwoe(all_predictors(),
# outcome = "default",
# algorithm = tune(),
# min_bins = tune(),
# max_bins = tune())
# }
