Implements a supervised binning algorithm that uses Information Gain (entropy) to identify the most informative initial split points, followed by a bottom-up merging process that satisfies the constraints on minimum bin frequency, monotonicity, and maximum number of bins.

Although historically referred to as "Unsupervised Decision Trees" in some contexts, this method is strictly **supervised** (it uses the target variable) and operates **bottom-up** after an initial entropy-based selection of cutpoints. It is particularly effective when the relationship between feature and target is non-linear but highly informative in specific regions.

Usage

ob_numerical_udt(
  feature,
  target,
  min_bins = 3,
  max_bins = 5,
  bin_cutoff = 0.05,
  max_n_prebins = 20,
  laplace_smoothing = 0.5,
  monotonicity_direction = "none",
  convergence_threshold = 1e-06,
  max_iterations = 1000
)

Arguments

feature

Numeric vector of feature values. Missing values (NA) are handled by placing them in a separate bin. Infinite values are treated as valid numeric extremes or placed in the missing bin if they represent errors.

target

Integer vector of binary target values (must contain only 0 and 1). Must have the same length as feature.

min_bins

Minimum number of bins (default: 3). Must be at least 2.

max_bins

Maximum number of bins (default: 5). Must be greater than or equal to min_bins.

bin_cutoff

Minimum fraction of total observations per bin (default: 0.05). Bins below this threshold are merged based on Event Rate similarity.

max_n_prebins

Maximum number of pre-bins (default: 20). The algorithm will select the top max_n_prebins cutpoints with highest Information Gain. Performance Note: High values (>50) may significantly slow down processing for large datasets due to the O(N^2) nature of cutpoint selection.

laplace_smoothing

Laplace smoothing parameter (default: 0.5) for WoE calculation.

monotonicity_direction

String specifying monotonicity constraint:

  • "none" (default): No monotonicity enforcement.

  • "increasing": WoE must be non-decreasing.

  • "decreasing": WoE must be non-increasing.

  • "auto": Automatically determined by Pearson correlation.

convergence_threshold

Convergence threshold for IV optimization (default: 1e-6).

max_iterations

Maximum iterations for optimization loop (default: 1000).

Value

A list containing:

id

Integer bin identifiers (1-based).

bin

Character bin intervals "(lower;upper]".

woe

Numeric WoE values.

iv

Numeric IV contributions.

event_rate

Numeric event rates.

count

Integer total observations.

count_pos

Integer positive class counts.

count_neg

Integer negative class counts.

cutpoints

Numeric bin boundaries.

total_iv

Total Information Value.

gini

Gini index (2*AUC - 1) calculated on the binned data.

ks

Kolmogorov-Smirnov statistic calculated on the binned data.

converged

Logical convergence flag.

iterations

Integer iteration count.

Details

Algorithm Overview

The UDT algorithm executes in four phases:

Phase 1: Entropy-Based Pre-binning

The algorithm evaluates every possible cutpoint \(c\) (midpoints between sorted unique values) using Information Gain (IG): $$IG(S, c) = H(S) - \left( \frac{|S_L|}{|S|} H(S_L) + \frac{|S_R|}{|S|} H(S_R) \right)$$ where \(H(S) = -p \log_2 p - (1-p) \log_2 (1-p)\) is the binary entropy of the target within \(S\), and \(S_L\), \(S_R\) are the observations with feature values below and above \(c\), respectively.

The top max_n_prebins cutpoints with the highest IG are selected to form the initial bins. This ensures that the starting bins capture the most discriminative regions of the feature space.
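
As a sketch of this phase, the following base R code evaluates IG over the candidate cutpoints of given feature and target vectors; entropy, info_gain, and the hard-coded 20 (mirroring the max_n_prebins default) are illustrative, not part of the package API.

# Binary entropy of a 0/1 target vector
entropy <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -p * log2(p) - (1 - p) * log2(1 - p)
}

# Information Gain of splitting at cutpoint `cp`
info_gain <- function(feature, target, cp) {
  left  <- target[feature <= cp]
  right <- target[feature >  cp]
  entropy(target) -
    (length(left)  / length(target) * entropy(left) +
     length(right) / length(target) * entropy(right))
}

# Candidate cutpoints: midpoints between adjacent sorted unique values
u    <- sort(unique(feature))
cuts <- (head(u, -1) + tail(u, -1)) / 2
ig   <- vapply(cuts, function(cp) info_gain(feature, target, cp), numeric(1))
top  <- cuts[order(ig, decreasing = TRUE)][seq_len(min(20, length(cuts)))]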

Phase 2: Rare Bin Merging

Bins with frequency \(< \text{bin\_cutoff}\) are merged. The merge partner is chosen to minimize the difference in Event Rates: $$\text{merge\_idx} = \arg\min_{j \in \{i-1, i+1\}} |ER_i - ER_j|$$ This differs from IV-based merging and aims to preserve the local smoothness of the event-rate (risk) profile.
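
A sketch of this rule, assuming the bins are held in a data frame with count and count_pos columns (choose_merge_partner and the column names are illustrative):

# Pick the neighbour whose event rate is closest to that of rare bin i
choose_merge_partner <- function(bins, i) {
  er <- bins$count_pos / bins$count          # event rate per bin
  nb <- c(i - 1, i + 1)                      # candidate neighbours
  nb <- nb[nb >= 1 & nb <= nrow(bins)]       # drop out-of-range indices
  nb[which.min(abs(er[nb] - er[i]))]
}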

Phase 3: Monotonicity Enforcement

If requested, monotonicity is enforced by iteratively merging bins that violate the specified direction ("increasing", "decreasing", or "auto"). Auto-direction is determined by the sign of the Pearson correlation between feature and target.
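
The auto direction, one common Laplace-smoothed WoE convention, and a violation check can be sketched in base R; the exact smoothing used by the implementation may differ, and all function names here are illustrative.

# Sign of the Pearson correlation picks the enforced direction
auto_direction <- function(feature, target) {
  if (cor(feature, target, use = "complete.obs") >= 0) "increasing" else "decreasing"
}

# WoE per bin with Laplace smoothing (lambda = laplace_smoothing)
woe_smoothed <- function(count_pos, count_neg, lambda = 0.5) {
  k <- length(count_pos)                              # number of bins
  p <- (count_pos + lambda) / (sum(count_pos) + k * lambda)
  q <- (count_neg + lambda) / (sum(count_neg) + k * lambda)
  log(p / q)
}

# An "increasing" violation exists wherever WoE drops between neighbours
violates_increasing <- function(w) any(diff(w) < 0)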

Phase 4: Constraint Satisfaction

If the current number of bins \(k\) exceeds \(\text{max\_bins}\), adjacent bins are merged, choosing at each step the merge that minimizes the loss in total IV, until \(k \le \text{max\_bins}\).
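
The IV-loss criterion for merging adjacent bins i and i + 1 can be sketched as follows, reusing the smoothing convention from the previous phase (iv_contrib and iv_loss are illustrative helpers):

# IV contribution of each bin: (p - q) * WoE, with Laplace smoothing
iv_contrib <- function(count_pos, count_neg, lambda = 0.5) {
  k <- length(count_pos)
  p <- (count_pos + lambda) / (sum(count_pos) + k * lambda)
  q <- (count_neg + lambda) / (sum(count_neg) + k * lambda)
  (p - q) * log(p / q)
}

# Total IV lost by merging bins i and i + 1
iv_loss <- function(count_pos, count_neg, i) {
  before <- sum(iv_contrib(count_pos, count_neg))
  cp <- count_pos; cn <- count_neg
  cp[i] <- cp[i] + cp[i + 1]                 # pool the two bins
  cn[i] <- cn[i] + cn[i + 1]
  after <- sum(iv_contrib(cp[-(i + 1)], cn[-(i + 1)]))
  before - after
}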

Warning on Complexity

The pre-binning phase evaluates Information Gain at every candidate cutpoint (one per pair of adjacent unique values). For continuous features with many unique values (e.g., \(N > 10{,}000\)), this step can be computationally intensive (\(O(N^2)\)). Consider rounding the feature or using ob_numerical_sketch for very large datasets.
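
For example, rounding the feature before binning bounds the number of candidate cutpoints:

# Rounding to 4 significant digits caps the number of unique values
feature_small <- signif(feature, digits = 4)
length(unique(feature_small))   # usually far fewer than length(unique(feature))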

References

  • Quinlan, J. R. (1986). "Induction of Decision Trees". Machine Learning, 1(1), 81-106.

  • Fayyad, U. M., & Irani, K. B. (1992). "On the Handling of Continuous-Valued Attributes in Decision Tree Generation". Machine Learning, 8, 87-102.

  • Liu, H., et al. (2002). "Discretization: An Enabling Technique". Data Mining and Knowledge Discovery, 6(4), 393-423.

See also

ob_numerical_mdlp for a pure MDL-based approach; ob_numerical_sketch for fast approximation on very large datasets.

Author

Lopes, J. E.
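
Examples

A minimal usage sketch on simulated data; it assumes the package providing ob_numerical_udt is attached, and the simulated relationship is illustrative only.

set.seed(42)
n <- 1000
feature <- rnorm(n)
target  <- rbinom(n, 1, plogis(1.5 * feature))  # event rate rises with feature

res <- ob_numerical_udt(
  feature, target,
  min_bins = 3,
  max_bins = 5,
  monotonicity_direction = "auto"
)

res$bin       # interval labels "(lower;upper]"
res$woe       # Weight of Evidence per bin
res$total_iv  # total Information Value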