ai ml · advanced · 12m · 2026-06-01

Day 03 — Gradient Boosted Trees — XGBoost / LightGBM, Loss, Regularisation

On tabular data (still the majority of business ML), GBDTs often match or beat deep nets and are a common default at credit, fraud and ads shops. Knowing the loss math and the…

Gradient boosted decision trees (GBDTs) are an additive model F_M(x) = Σ_{m=1..M} η · f_m(x), where each f_m is a regression tree fit to the negative gradient of the loss w.r.t. the current prediction. They dominate tabular ML because trees handle categorical splits, missing values, non-linearities and interactions natively, and unlike deep nets they need almost no data prep.

🧠 Concept

Why it matters & the mental model.

1. The boosting recursion

At step m, we have predictions ŷ_i = F_{m-1}(x_i). We compute pseudo-residuals r_i = -∂L/∂ŷ_i (for squared error ½(y_i - ŷ_i)² this is just y_i - ŷ_i; for log loss, with ŷ_i the raw log-odds score, it is y_i - σ(ŷ_i)). We fit a small tree f_m to (x_i, r_i), then update F_m = F_{m-1} + η · f_m. The learning rate η (≈ 0.05-0.1) acts as regularisation: a smaller η needs more trees but usually generalises better.
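The recursion above can be sketched in a few lines of plain Python. This is an illustrative toy, not how any library implements it: it assumes a single numeric feature, squared-error loss, and depth-1 trees (stumps), and the helper names `fit_stump`, `boost` and `predict` are hypothetical.

```python
# Toy gradient boosting for squared error with depth-1 trees (stumps).
# Assumes one numeric feature with at least two distinct values.

def fit_stump(xs, rs):
    """Fit a one-split regression tree to (x, r) pairs by minimising squared error."""
    order = sorted(set(xs))
    best = None
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, rs) if x <= thr]
        right = [r for x, r in zip(xs, rs) if x > thr]
        wl, wr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - wl) ** 2 for r in left) + sum((r - wr) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, wl, wr)
    _, thr, wl, wr = best
    return lambda x: wl if x <= thr else wr


def boost(xs, ys, n_trees=50, eta=0.1):
    """Return the constant start F_0 and the list of fitted stumps."""
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(xs)
    trees = []
    for _ in range(n_trees):
        # Pseudo-residuals: negative gradient of 0.5 * (y - pred)^2.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        # F_m = F_{m-1} + eta * f_m
        preds = [p + eta * tree(x) for p, x in zip(preds, xs)]
    return f0, trees


def predict(f0, trees, x, eta=0.1):
    return f0 + eta * sum(t(x) for t in trees)


def mse(ys, ps):
    return sum((y - p) ** 2 for y, p in zip(ys, ps)) / len(ys)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1, 25.3, 35.8, 49.2]
f0, trees = boost(xs, ys)
baseline = mse(ys, [f0] * len(ys))                       # error of the constant model
fitted = mse(ys, [predict(f0, trees, x) for x in xs])    # error after 50 rounds
```

Each round shrinks the training error a little; lowering `eta` makes every step smaller, which is why more trees are needed.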

2. XGBoost's second-order trick

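XGBoost approximates the loss to second order via a Taylor expansion around the current prediction: L ≈ Σ [g_i · f(x_i) + ½ · h_i · f(x_i)²] + Ω(f), where g_i = ∂L/∂ŷ_i and h_i = ∂²L/∂ŷ_i² are evaluated at ŷ_i = F_{m-1}(x_i), and Ω(f) = γT + ½λ‖w‖² regularises tree complexity (T leaves, w leaf weights). With G and H the sums of g_i and h_i over the rows in a leaf, the closed-form optimal leaf weight is w* = -G/(H+λ), and the gain from splitting a leaf into left (L) and right (R) children is ½ · [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ. Because leaf values use curvature (a Newton step rather than a plain gradient step), XGBoost often needs fewer rounds than purely first-order gradient boosting.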
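A small worked example makes the leaf-weight and gain formulas concrete. The sketch below assumes squared-error loss (so g_i = ŷ_i - y_i and h_i = 1) and a starting prediction of 0 for every row; the function names are illustrative, not XGBoost's API.

```python
def leaf_weight(G, H, lam):
    """Closed-form optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)


def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a parent (G = GL + GR, H = HL + HR) into two children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


# Four rows, current prediction 0 for all of them.
ys = [1.0, 1.0, 3.0, 3.0]
g = [0.0 - y for y in ys]      # gradients of 0.5 * (y - pred)^2 w.r.t. pred
h = [1.0] * len(ys)            # hessians are constant for squared error

# Candidate split: first two rows go left, last two go right.
GL, HL = sum(g[:2]), sum(h[:2])    # -2, 2
GR, HR = sum(g[2:]), sum(h[2:])    # -6, 2
lam, gamma = 1.0, 0.0

w_left = leaf_weight(GL, HL, lam)    # 2/3: shrunk towards 0 from the mean 1.0
w_right = leaf_weight(GR, HR, lam)   # 2.0: shrunk towards 0 from the mean 3.0
gain = split_gain(GL, HL, GR, HR, lam, gamma)   # 4/15, positive, so the split helps
```

Raising `lam` shrinks both leaf weights further towards zero, and raising `gamma` subtracts directly from the gain, so weak splits stop being worthwhile.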

3. LightGBM — speed via histograms and leaf-wise growth

Two innovations:

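Histogram-based splits: bucket each continuous feature into at most 255 bins (LightGBM's default max_bin); once the per-bin gradient sums are accumulated, scanning for the best split is O(#bins) instead of O(#unique values). Often 8-20× faster, slightly less accurate at low bin counts.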
Leaf-wise (best-first) growth: split the leaf with the highest gain regardless of depth. Lower loss per tree but easy to overfit → cap with num_leaves (typically 31-127) and min_data_in_leaf (≥ 20).

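XGBoost grows level-wise by default (full layer by layer; grow_policy='depthwise'), which is more uniform and less prone to overfit but slower. Its hist tree method also offers leaf-wise growth via grow_policy='lossguide'.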
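The histogram idea can be illustrated with a short sketch: accumulate gradient and hessian sums per bin once, then scan the bin boundaries. This is a simplification under stated assumptions: it uses equal-width bins (real libraries pick bin edges from the data distribution), omits the γ penalty, and the function names are hypothetical.

```python
def build_histogram(values, grads, hess, n_bins):
    """Accumulate per-bin gradient and hessian sums using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # fall back to 1.0 for a constant feature
    G = [0.0] * n_bins
    H = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        G[b] += g
        H[b] += h
    return G, H


def best_bin_split(G, H, lam=1.0):
    """Scan bin boundaries left to right; cost is O(#bins), not O(#rows)."""
    G_tot, H_tot = sum(G), sum(H)
    parent = G_tot ** 2 / (H_tot + lam)
    best_gain, best_bin = 0.0, None
    GL = HL = 0.0
    for b in range(len(G) - 1):
        GL += G[b]
        HL += H[b]
        GR, HR = G_tot - GL, H_tot - HL
        gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain


# Eight rows, labels flip from 0 to 1 at value 4; current prediction is 0.5.
values = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
grads = [0.5 - y for y in ys]   # squared-error gradient: pred - y
hess = [1.0] * len(ys)

G, H = build_histogram(values, grads, hess, n_bins=4)
best_bin, best_gain = best_bin_split(G, H)   # split after bin 1, i.e. between 3 and 4
```

With only 4 bins the split can land only on a bin edge, which is the accuracy cost the text mentions for low bin counts.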

🛠 Deep Dive

Internals, code, architecture.

4. Regularisation knobs you actually tune

Always use early stopping on a held-out set (early_stopping_rounds=50). It's the single most effective regulariser.
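To make the stopping rule concrete, here is a library-free sketch of the usual patience logic: stop once the validation loss has not improved for a given number of rounds, and keep the best round. The function name and the loss values are made up for illustration; the libraries implement this internally.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_at) for a sequence of per-round validation losses."""
    best_loss = float("inf")
    best_round = -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            # No improvement for `patience` consecutive rounds: stop here.
            return best_round, rnd
    return best_round, len(val_losses) - 1


# Hypothetical validation losses: improvement stalls after round 3.
losses = [0.70, 0.60, 0.55, 0.53, 0.54, 0.535, 0.56, 0.58]
best_round, stopped_at = early_stopping_round(losses, patience=3)
```

Training halts at round 6 but the model to keep is the one from round 3, which is why the held-out set used here should not double as the final test set.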

5. Categorical features

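LightGBM and CatBoost handle high-cardinality categoricals natively. LightGBM sorts the categories by their accumulated gradient statistics (sum of gradients over sum of hessians) and then searches for the best split along that sorted order, following Fisher's 1958 grouping result rather than Fisher's exact test. CatBoost instead encodes categories with ordered target statistics. XGBoost ≥ 1.5 supports enable_categorical=True (initially experimental), but on truly huge cardinality CatBoost often still wins.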
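The sort-then-scan approach to categorical splits can be sketched as follows. This is a simplified illustration, not LightGBM's actual code: it reuses λ as the smoothing term in the sort key, omits the γ penalty, and `best_categorical_split` is a hypothetical name.

```python
from collections import defaultdict


def best_categorical_split(cats, grads, hess, lam=1.0):
    """Sort categories by G/(H + lam), then scan prefixes of that order as the left child."""
    G = defaultdict(float)
    H = defaultdict(float)
    for c, g, h in zip(cats, grads, hess):
        G[c] += g
        H[c] += h
    order = sorted(G, key=lambda c: G[c] / (H[c] + lam))
    G_tot, H_tot = sum(G.values()), sum(H.values())
    parent = G_tot ** 2 / (H_tot + lam)
    best_gain, best_left = 0.0, None
    GL = HL = 0.0
    for i, c in enumerate(order[:-1]):
        GL += G[c]
        HL += H[c]
        GR, HR = G_tot - GL, H_tot - HL
        gain = 0.5 * (GL ** 2 / (HL + lam) + GR ** 2 / (HR + lam) - parent)
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain


# Categories "a" and "c" always have label 1, "b" and "d" always 0; prediction is 0.5.
cats = ["a", "b", "c", "d", "a", "b", "c", "d"]
ys = [1, 0, 1, 0, 1, 0, 1, 0]
grads = [0.5 - y for y in ys]   # squared-error gradient: pred - y
hess = [1.0] * len(ys)

left_set, gain = best_categorical_split(cats, grads, hess)
```

Sorting first means only k - 1 candidate partitions are checked for k categories, instead of every possible subset.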

6. Loss functions worth knowing

MSE / Huber — regression.
Log loss / softmax — binary / multi-class.
Pairwise / LambdaRank — ranking (search, recsys).
Tweedie / Poisson — claim severity, count data.
Quantile loss — prediction intervals (great for forecasting).
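Two of the losses in this list are easy to write out by hand. The sketch below shows the gradient and hessian of log loss with respect to the raw score (the g_i and h_i a second-order booster would consume) and the quantile (pinball) loss; the function names are illustrative.

```python
import math


def logloss_grad_hess(y, raw_score):
    """Gradient and hessian of binary log loss w.r.t. the raw (log-odds) score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y, p * (1.0 - p)


def pinball_loss(y, pred, q):
    """Quantile loss: under-prediction costs q per unit, over-prediction costs 1 - q."""
    diff = y - pred
    return q * diff if diff >= 0 else (q - 1.0) * diff


# Positive example with raw score 0, so the predicted probability is 0.5.
g, h = logloss_grad_hess(1.0, 0.0)

# For the 0.9 quantile, predicting too low is penalised 9x more than too high.
under = pinball_loss(10.0, 8.0, 0.9)    # 0.9 * 2
over = pinball_loss(10.0, 12.0, 0.9)    # 0.1 * 2
```

Note that the negative of `g` is y - σ(ŷ), the log-loss pseudo-residual. Training one model at q = 0.1 and another at q = 0.9 gives an approximate 80% prediction interval.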

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Calibration

GBDTs trained with log-loss are usually reasonably well calibrated, provided the classes were not reweighted or resampled; trained with hinge or focal loss they are not. Use isotonic regression or Platt scaling on a held-out set if downstream uses raw probabilities (e.g. expected revenue).
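Isotonic regression is small enough to sketch directly. The version below is a minimal pool-adjacent-violators pass over held-out (score, label) pairs; the data are made up, `isotonic_fit` is a hypothetical name, and a production setup would more likely use a library implementation.

```python
def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: return sorted scores and non-decreasing calibrated values."""
    pairs = sorted(zip(scores, labels))
    blocks = []   # each block is [sum of labels, count]
    for _, y in pairs:
        blocks.append([float(y), 1])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return [x for x, _ in pairs], fitted


# Held-out model scores and true labels (illustrative values).
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 1, 0, 0, 1, 1]
sorted_scores, calibrated = isotonic_fit(scores, labels)
```

Here the scores 0.3, 0.4 and 0.6 are pooled into one step with calibrated probability 1/3, because only one of those three held-out examples was positive.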

8. Feature importance — careful

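gain is the most informative of the built-in importances; split count is misleading on high-cardinality features, and gain is not immune to that bias either. For attribution prefer SHAP (TreeExplainer is exact and runs in O(TLD²) per row for T trees, L leaves and depth D, which is fast in practice).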

9. Production checklist

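Pin tree_method='hist' and, if a GPU is available, device='cuda' (the device parameter is XGBoost ≥ 2.0; older versions used tree_method='gpu_hist'). Often a 5-10× speedup on large datasets.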
Save models with model.save_model('model.json') not pickle (version-portable).
Monitor feature drift with PSI; retrain when PSI > 0.2 on top features.
Beware data leakage from target-encoded categoricals: encode inside CV folds.
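For the drift item in the checklist, PSI is a one-line sum over bins: Σ (actual - expected) · ln(actual / expected). The sketch below assumes the feature has already been binned into fractions that sum to 1 on both the training and live data; the function name, the `eps` floor and the bin values are illustrative.

```python
import math


def psi(expected_frac, actual_frac, eps=1e-6):
    """Population stability index between two binned distributions."""
    total = 0.0
    for e, a in zip(expected_frac, actual_frac):
        e, a = max(e, eps), max(a, eps)   # floor empty bins to avoid log(0)
        total += (a - e) * math.log(a / e)
    return total


# Share of rows per quartile bin at training time vs. in live traffic.
train_bins = [0.25, 0.25, 0.25, 0.25]
live_bins = [0.10, 0.20, 0.30, 0.40]

drift = psi(train_bins, live_bins)       # roughly 0.23
needs_retrain = drift > 0.2              # threshold from the checklist
```

Bin edges should be fixed from the training data and reused on live data; recomputing them per window would hide the very shift being measured.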

10. Discussion prompts

"Walk me through the math of one XGBoost split." "Why does leaf-wise growth overfit faster?" "When would you pick a neural net over LightGBM on tabular data?" The honest answer to the last is: rarely, unless you have huge sequential / image side-features.
