
Optimal Binning for Numerical Variables using Unsupervised Binning with Standard Deviation
Source: R/obn_ubsd.R
Implements a hybrid binning algorithm that initializes bins using unsupervised statistical properties (the feature's mean and standard deviation) and refines them through supervised optimization using Weight of Evidence (WoE) and Information Value (IV).
Important Clarification: Despite "Unsupervised" in the name, this method is predominantly supervised. The unsupervised component is limited to the initial bin creation step (~1% of the algorithm). All subsequent refinement (merge, monotonicity enforcement, bin count adjustment) uses the target variable extensively.
The statistical initialization via \(\mu \pm k\sigma\) provides a data-driven starting point that may be advantageous for approximately normal distributions, but offers no guarantees for skewed or multimodal data.
Usage
ob_numerical_ubsd(
feature,
target,
min_bins = 3,
max_bins = 5,
bin_cutoff = 0.05,
max_n_prebins = 20,
convergence_threshold = 1e-06,
max_iterations = 1000,
laplace_smoothing = 0.5
)
Arguments
- feature
Numeric vector of feature values. Missing values (NA) and infinite values are not permitted and will trigger an error.
- target
Integer or numeric vector of binary target values (must contain only 0 and 1). Must have the same length as feature.
- min_bins
Minimum number of bins (default: 3). Must be at least 2.
- max_bins
Maximum number of bins (default: 5). Must be \(\ge\) min_bins.
- bin_cutoff
Minimum fraction of total observations per bin (default: 0.05). Must be in (0, 1).
- max_n_prebins
Maximum number of pre-bins before optimization (default: 20). Must be at least equal to min_bins.
- convergence_threshold
Convergence threshold for IV change (default: 1e-6).
- max_iterations
Maximum iterations for optimization (default: 1000).
- laplace_smoothing
Laplace smoothing parameter (default: 0.5). Must be non-negative.
Value
A list containing:
- id
Integer bin identifiers (1-based).
- bin
Character bin intervals in the format "[lower;upper)".
- woe
Numeric WoE values (monotonic after enforcement).
- iv
Numeric IV contributions per bin.
- count
Integer total observations per bin.
- count_pos
Integer positive class counts.
- count_neg
Integer negative class counts.
- event_rate
Numeric event rates per bin.
- cutpoints
Numeric bin boundaries (excluding \(\pm\infty\)).
- total_iv
Total Information Value.
- converged
Logical convergence flag.
- iterations
Integer iteration count.
Details
Algorithm Overview
UBSD executes in six phases:
Phase 1: Statistical Initialization (UNSUPERVISED)
Initial bin edges are created by combining two approaches:
Standard deviation-based cutpoints: $$\{\mu - 2\sigma, \mu - \sigma, \mu, \mu + \sigma, \mu + 2\sigma\}$$ where \(\mu\) is the sample mean and \(\sigma\) is the sample standard deviation (with Bessel correction: \(N-1\) divisor).
Equal-width cutpoints: $$\left\{x_{\min} + i \times \frac{x_{\max} - x_{\min}}{\text{max\_n\_prebins}}\right\}_{i=1}^{\text{max\_n\_prebins}-1}$$
The union of these two sets is taken, sorted, and limited to max_n_prebins
edges (plus \(-\infty\) and \(+\infty\) boundaries).
Rationale: For approximately normal distributions, \(\mu \pm k\sigma\) cutpoints align with natural quantiles:
- \(\mu - 2\sigma\) to \(\mu + 2\sigma\) captures ~95% of the data (68-95-99.7 rule)
- Equal-width cutpoints guarantee coverage of the entire range
Limitation: For skewed distributions (e.g., log-normal), \(\mu - 2\sigma\) may fall outside the data range, creating empty bins.
Special Case: If \(\sigma < \epsilon\) (feature is nearly constant), fallback to pure equal-width binning.
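For concreteness, a minimal base R sketch of this initialization (ubsd_init_edges is a hypothetical helper illustrating the description above, not the package's internal code; the rule for capping the edge count is an assumption):
# Hypothetical helper: sigma-based cutpoints merged with equal-width edges
ubsd_init_edges <- function(x, max_n_prebins = 20, eps = 1e-10) {
  mu <- mean(x)
  sigma <- sd(x)  # Bessel-corrected: N - 1 divisor
  # Nearly constant feature: fall back to pure equal-width binning
  sd_cuts <- if (sigma < eps) numeric(0) else mu + c(-2, -1, 0, 1, 2) * sigma
  # Equal-width interior edges: x_min + i * range / max_n_prebins
  ew_cuts <- min(x) + seq_len(max_n_prebins - 1) * diff(range(x)) / max_n_prebins
  edges <- sort(union(sd_cuts, ew_cuts))
  edges <- edges[edges > min(x) & edges < max(x)]  # keep in-range cutpoints only
  if (length(edges) > max_n_prebins) {
    # Assumed capping rule: thin the edge set evenly down to max_n_prebins
    edges <- edges[round(seq(1, length(edges), length.out = max_n_prebins))]
  }
  c(-Inf, edges, Inf)
}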
Phase 2: Observation Assignment
Each observation is assigned to a bin via linear search: $$\text{bin}(x_i) = \min\{j : x_i > \text{lower}_j \land x_i \le \text{upper}_j\}$$
Counts are accumulated: count, count_pos, count_neg.
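A sketch of this step in base R; findInterval() reproduces the (lower, upper] rule above via binary search rather than the linear scan described, but yields the same mapping:
# Assign each observation to a bin and accumulate counts
edges <- ubsd_init_edges(feature)  # from the Phase 1 sketch
bin_id <- findInterval(feature, edges, left.open = TRUE)
count     <- tabulate(bin_id, nbins = length(edges) - 1)
count_pos <- tabulate(bin_id[target == 1], nbins = length(edges) - 1)
count_neg <- count - count_pos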
Phase 3: Rare Bin Merging (SUPERVISED)
Bins with \(\text{count} < \text{bin\_cutoff} \times N\) are merged with adjacent bins. Merge direction is chosen to minimize IV loss:
$$\text{direction} = \arg\min_{d \in \{-1, +1\}} \left( \text{IV}_i + \text{IV}_{i+d} \right)$$
This is a supervised step (uses IV computed from target).
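A sketch of the direction choice, assuming per-bin IVs from Phase 4 and 1-based indexing (merge_direction is illustrative, not a package function):
# Fold a rare bin i into whichever neighbour gives the smaller combined IV
merge_direction <- function(iv, i) {
  left  <- if (i > 1) iv[i - 1] + iv[i] else Inf
  right <- if (i < length(iv)) iv[i] + iv[i + 1] else Inf
  if (left <= right) "left" else "right"
}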
Phase 4: WoE/IV Calculation (SUPERVISED)
Weight of Evidence with Laplace smoothing: $$\text{WoE}_i = \ln\left(\frac{n_i^{+} + \alpha}{n^{+} + k\alpha} \bigg/ \frac{n_i^{-} + \alpha}{n^{-} + k\alpha}\right)$$
Information Value: $$\text{IV}_i = \left(\frac{n_i^{+} + \alpha}{n^{+} + k\alpha} - \frac{n_i^{-} + \alpha}{n^{-} + k\alpha}\right) \times \text{WoE}_i$$
where \(n_i^{+}\) and \(n_i^{-}\) are the positive and negative counts in bin \(i\), \(n^{+}\) and \(n^{-}\) the totals, \(k\) the current number of bins, and \(\alpha\) the laplace_smoothing parameter.
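These formulas transcribe directly into R (woe_iv is an illustrative helper, not part of the package API):
# Smoothed WoE and per-bin IV, plus total IV
woe_iv <- function(count_pos, count_neg, alpha = 0.5) {
  k <- length(count_pos)
  p <- (count_pos + alpha) / (sum(count_pos) + k * alpha)  # smoothed share of positives
  q <- (count_neg + alpha) / (sum(count_neg) + k * alpha)  # smoothed share of negatives
  woe <- log(p / q)
  iv <- (p - q) * woe
  list(woe = woe, iv = iv, total_iv = sum(iv))
}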
Phase 5: Monotonicity Enforcement (SUPERVISED)
Direction is auto-detected via majority vote: $$\text{increasing} = \begin{cases} \text{TRUE} & \text{if } \sum_i \mathbb{1}_{\{\text{WoE}_i > \text{WoE}_{i-1}\}} \ge \sum_i \mathbb{1}_{\{\text{WoE}_i < \text{WoE}_{i-1}\}} \\ \text{FALSE} & \text{otherwise} \end{cases}$$
Violations are resolved via PAVA (Pool Adjacent Violators Algorithm).
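An illustrative pooling step using base R's isoreg(); note the package resolves violations by merging bins, so this sketch only shows the direction vote and PAVA applied to the WoE vector:
# Majority-vote direction detection, then isotonic pooling (PAVA)
monotone_woe <- function(woe) {
  increasing <- sum(diff(woe) > 0) >= sum(diff(woe) < 0)  # ties default to increasing
  if (increasing) isoreg(woe)$yf else -isoreg(-woe)$yf
}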
Phase 6: Bin Count Adjustment (SUPERVISED)
If \(k > \text{max\_bins}\), bins are merged to minimize IV loss: $$\text{merge\_idx} = \arg\min_{i=0}^{k-2} \left( \text{IV}_i + \text{IV}_{i+1} \right)$$
Convergence Criterion: $$|\text{IV}_{\text{total}}^{(t)} - \text{IV}_{\text{total}}^{(t-1)}| < \text{convergence\_threshold}$$
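A sketch of the merge choice and convergence test, using 1-based R indexing rather than the 0-based formula above:
# Pick the adjacent pair whose combined IV is smallest: merge bins i and i + 1
cheapest_merge <- function(iv) {
  which.min(iv[-length(iv)] + iv[-1])
}
# Stop once total IV changes by less than the threshold between iterations
iv_converged <- function(iv_now, iv_prev, threshold = 1e-6) {
  abs(iv_now - iv_prev) < threshold
}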
Comparison with Related Methods
| Method | Initialization | Truly Unsupervised? | Best For |
|---|---|---|---|
| UBSD | \(\mu \pm k\sigma\) + equal-width | No (~1% unsupervised) | Normal distributions |
| MOB/MRBLP | Equal-frequency | No (0% unsupervised) | General use |
| MDLP | Equal-frequency | No (0% unsupervised) | Information theory |
| Sketch | KLL sketch quantiles | No (0% unsupervised) | Streaming data |
When to Use UBSD
Use UBSD: If you have prior knowledge that the feature is approximately normally distributed and want bins aligned with standard deviations (e.g., for interpretability: "2 standard deviations below mean").
Avoid UBSD: For skewed distributions (use MDLP or MOB), for multimodal distributions (use LDB), or when you need provable optimality (use Sketch for quantile guarantees).
Alternative: For true unsupervised binning (no target), use cut() with a break count from nclass.Sturges() or nclass.FD() (Freedman-Diaconis); note the character shortcuts breaks = "Sturges" and "FD" are arguments of hist(), not cut().
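For example, in base R (nclass.Sturges() and nclass.FD() live in grDevices, which is attached by default):
# Target-free binning with rule-based break counts
x <- rnorm(1000)
table(cut(x, breaks = nclass.Sturges(x)))  # Sturges rule
table(cut(x, breaks = nclass.FD(x)))       # Freedman-Diaconis rule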
Computational Complexity
Identical to MOB/MRBLP: \(O(N + k^2 \times \text{max\_iterations})\)
References
Sturges, H. A. (1926). "The Choice of a Class Interval". Journal of the American Statistical Association, 21(153), 65-66.
Scott, D. W. (1979). "On optimal and data-based histograms". Biometrika, 66(3), 605-610.
Freedman, D., & Diaconis, P. (1981). "On the histogram as a density estimator: L2 theory". Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4), 453-476.
Thomas, L. C. (2009). Consumer Credit Models: Pricing, Profit, and Portfolios. Oxford University Press.
Zeng, G. (2014). "A Necessary Condition for a Good Binning Algorithm in Credit Scoring". Applied Mathematical Sciences, 8(65), 3229-3242.
Siddiqi, N. (2006). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley.
See also
ob_numerical_mdlp for information-theoretic binning,
ob_numerical_mob for pure supervised binning,
cut for true unsupervised binning.
Examples
# \donttest{
# Simulate normally distributed credit scores
set.seed(123)
n <- 5000
# Feature: Normally distributed FICO scores
feature <- rnorm(n, mean = 680, sd = 60)
# Target: Logistic relationship with score
prob_default <- 1 / (1 + exp((feature - 680) / 30))
target <- rbinom(n, 1, prob_default)
# Apply UBSD
result <- ob_numerical_ubsd(
feature = feature,
target = target,
min_bins = 3,
max_bins = 5
)
# Compare with MDLP (should be similar for normal data)
result_mdlp <- ob_numerical_mdlp(feature, target)
data.frame(
Method = c("UBSD", "MDLP"),
N_Bins = c(length(result$woe), length(result_mdlp$woe)),
Total_IV = c(result$total_iv, result_mdlp$total_iv)
)
#> Method N_Bins Total_IV
#> 1 UBSD 5 2.030524
#> 2 MDLP 3 1.065396
# }