Day 03 — Gradient Boosted Trees — XGBoost / LightGBM, Loss, Regularisation
On tabular data (still the majority of business ML) GBDTs frequently match or beat deep nets and are a common default at credit, fraud and ads shops. Knowing the loss math and the…
Gradient boosted decision trees (GBDTs) are an additive model F_M(x) = Σ_{m=1..M} η · f_m(x) where each f_m is a regression tree fit to the negative gradient of the loss w.r.t. the current prediction. They dominate tabular ML because trees handle categorical splits, missing values, non-linearities and interactions natively, and unlike deep nets they need almost no data prep.
🧠 Concept
Why it matters & the mental model.
1. The boosting recursion
At step m, we have predictions ŷ_i = F_{m-1}(x_i). We compute pseudo-residuals r_i = -∂L/∂ŷ_i (for MSE written as ½(y_i - ŷ_i)² this is just y_i - ŷ_i; for log loss on raw scores it's y_i - σ(ŷ_i)). We fit a small tree f_m to (x_i, r_i), then update F_m = F_{m-1} + η · f_m. The learning rate η (≈ 0.05-0.1) acts as regularisation (shrinkage): smaller η needs more trees but usually generalises better.
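The recursion above can be sketched in a few lines of pure Python. This is a teaching sketch under simplifying assumptions that are not in the text: a single feature, squared-error loss, and depth-1 trees (stumps). The helper names (`fit_stump`, `boost`) are illustrative and do not belong to any library.

```python
# Minimal gradient boosting for squared error with depth-1 trees (stumps).
# Assumptions: one numeric feature, L = 1/2 * (y - y_hat)^2, at least two distinct x values.

def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # rows with x <= t go left; dropping the max keeps right non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    return best[1:]

def predict_stump(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv

def boost(xs, ys, n_rounds=50, eta=0.1):
    f0 = sum(ys) / len(ys)  # F_0: constant prediction
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Pseudo-residuals: negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # F_m = F_{m-1} + eta * f_m
        preds = [p + eta * predict_stump(stump, x) for p, x in zip(preds, xs)]
    return f0, stumps, preds

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
f0, stumps, preds = boost(xs, ys)
baseline_mse = mse(ys, [f0] * len(ys))
boosted_mse = mse(ys, preds)
```

Lowering `eta` while raising `n_rounds` trades training time for smaller steps per tree, which is the shrinkage effect described above.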
2. XGBoost's second-order trick
XGBoost approximates the loss to second order via Taylor: L ≈ Σ [g_i · f(x_i) + ½ · h_i · f(x_i)²] + Ω(f), where g_i = ∂L/∂ŷ_i and h_i = ∂²L/∂ŷ_i² are evaluated at the current prediction, and Ω(f) = γT + ½λ‖w‖² regularises tree complexity (T leaves, w leaf weights). With G and H the sums of g_i and h_i over the rows in a leaf, the closed-form optimal leaf weight is w* = -G/(H+λ). Split gain: ½ · [G_L²/(H_L+λ) + G_R²/(H_R+λ) - G²/(H+λ)] - γ. The curvature information is one reason XGBoost often needs fewer rounds than first-order GBM.
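A small worked example of the leaf-weight and split-gain formulas, assuming squared-error loss (so g_i = ŷ_i - y_i and h_i = 1). The function names are illustrative, not XGBoost API.

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain = 1/2 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam) - G^2/(H+lam)] - gamma."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Four rows, current prediction 2.0 everywhere, squared-error loss.
ys = [1.0, 1.0, 3.0, 3.0]
y_hat = [2.0, 2.0, 2.0, 2.0]
g = [p - y for p, y in zip(y_hat, ys)]  # [1, 1, -1, -1]
h = [1.0] * len(ys)

# Candidate split: first two rows go left, last two go right.
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])

gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)  # 0.5 * (4/3 + 4/3 - 0) = 4/3
w_left = leaf_weight(G_L, H_L, lam=1.0)    # -2/3: pull left predictions down towards 1
w_right = leaf_weight(G_R, H_R, lam=1.0)   # +2/3: push right predictions up towards 3
```

With λ = 0 the leaf weights would be exactly -1 and +1; the λ term shrinks them towards zero.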
3. LightGBM — speed via histograms and leaf-wise growth
Two innovations:
- Histogram-based splits: bucket continuous features into 255 bins; finding the best split is O(#bins) instead of O(#unique values). 8-20× faster, slightly less accurate at low bin counts.
- Leaf-wise (best-first) growth: split the leaf with the highest gain regardless of depth. Lower loss per tree but easy to overfit → cap with num_leaves (typically 31-127) and min_data_in_leaf (≥ 20).
XGBoost by default grows level-wise (depth-wise, one full layer at a time), which is more uniform and less prone to overfitting but often slower to reach the same loss.
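The histogram idea can be sketched as follows. This is a simplified illustration, not LightGBM's implementation: it assumes equal-width bins (real libraries choose bin boundaries differently), a single feature, and reuses the second-order gain formula with γ = 0. The function name is hypothetical.

```python
def histogram_best_split(xs, g, h, n_bins=4, lam=1.0):
    """Scan bin boundaries instead of every unique feature value."""
    x_min, x_max = min(xs), max(xs)
    width = (x_max - x_min) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def bin_of(x):
        return min(int((x - x_min) / width), n_bins - 1)

    # One pass over the data: accumulate gradient and Hessian sums per bin.
    G_hist = [0.0] * n_bins
    H_hist = [0.0] * n_bins
    for x, gi, hi in zip(xs, g, h):
        b = bin_of(x)
        G_hist[b] += gi
        H_hist[b] += hi

    G, H = sum(G_hist), sum(H_hist)
    best_bin, best_gain = None, 0.0
    G_L = H_L = 0.0
    # One pass over the bins: O(n_bins) candidate splits.
    for b in range(n_bins - 1):
        G_L += G_hist[b]
        H_L += H_hist[b]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - G ** 2 / (H + lam))
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
g = [1.0, 1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
h = [1.0] * len(xs)
best_bin, best_gain = histogram_best_split(xs, g, h)
```

Here the best split falls after bin 1, i.e. between x = 3 and x = 4, where the gradient sign flips.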
🛠 Deep Dive
Internals, code, architecture.
4. Regularisation knobs you actually tune
Always use early stopping on a held-out set (early_stopping_rounds=50). It is often the single most effective regulariser.
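The stopping rule itself is simple. The sketch below shows the logic on a list of per-round validation losses; it is a plain-Python illustration of the patience rule, not the library code, and `early_stopping_round` is a hypothetical helper name.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_round, stopped_at) given per-round validation losses.

    Training stops once `patience` consecutive rounds fail to improve
    on the best validation loss seen so far.
    """
    best_loss, best_round = float("inf"), -1
    for rnd, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, rnd
        elif rnd - best_round >= patience:
            return best_round, rnd
    return best_round, len(val_losses) - 1  # never triggered: ran all rounds

val_losses = [0.70, 0.60, 0.55, 0.53, 0.54, 0.56, 0.55, 0.58]
best_round, stopped_at = early_stopping_round(val_losses, patience=3)
```

In this toy trace the best round is 3 and training stops at round 6, so the model kept for prediction would use the trees up to and including round 3.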
5. Categorical features
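LightGBM handles high-cardinality categoricals natively: it sorts the categories by a gradient statistic (sum of gradients over sum of Hessians, which plays the role of a target mean) and then splits on the sorted order, following Fisher's (1958) grouping result. CatBoost instead encodes categories with ordered target statistics. XGBoost ≥1.5 supports enable_categorical=True, but on truly huge cardinality CatBoost often still wins.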
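The sort-then-scan idea for a categorical feature can be sketched as follows. The sketch orders categories by sum of gradients over sum of Hessians, which is how LightGBM's documentation describes its ordering; the function name, the λ default and the toy data are assumptions for illustration only.

```python
from collections import defaultdict

def best_categorical_split(cats, g, h, lam=1.0):
    """Sort categories by G_c / H_c, then scan prefixes of the sorted order."""
    stats = defaultdict(lambda: [0.0, 0.0])  # category -> [sum g, sum h]
    for c, gi, hi in zip(cats, g, h):
        stats[c][0] += gi
        stats[c][1] += hi

    ordered = sorted(stats, key=lambda c: stats[c][0] / stats[c][1])
    G = sum(s[0] for s in stats.values())
    H = sum(s[1] for s in stats.values())

    best_left, best_gain = None, 0.0
    G_L = H_L = 0.0
    # Only k-1 candidate partitions instead of 2^(k-1) - 1 subsets.
    for i, c in enumerate(ordered[:-1]):
        G_L += stats[c][0]
        H_L += stats[c][1]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - G ** 2 / (H + lam))
        if gain > best_gain:
            best_left, best_gain = set(ordered[: i + 1]), gain
    return best_left, best_gain

cats = ["a", "b", "c", "a", "b", "c"]
g = [1.0, -1.0, 0.9, 1.1, -0.9, 1.0]
h = [1.0] * len(cats)
best_left, best_gain = best_categorical_split(cats, g, h)
```

In this toy data category "b" has negative gradients while "a" and "c" have positive ones, so the best partition isolates "b" on the left.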
6. Loss functions worth knowing
- MSE / Huber — regression.
- Log loss / softmax — binary / multi-class.
- Pairwise / LambdaRank — ranking (search, recsys).
- Tweedie / Poisson — claim severity, count data.
- Quantile loss — prediction intervals (great for forecasting).
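What the booster actually consumes from each of these losses is a per-row gradient (and, for second-order methods, a Hessian) with respect to the raw prediction. A minimal sketch for two of them, with illustrative function names; the quantile (pinball) loss has zero second derivative almost everywhere, which is why libraries treat it specially.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logloss_grad_hess(y, raw_score):
    """Binary log loss on a raw (pre-sigmoid) score: g = p - y, h = p * (1 - p)."""
    p = sigmoid(raw_score)
    return p - y, p * (1.0 - p)

def quantile_grad(y, pred, alpha):
    """Pinball loss: alpha * (y - pred) if y > pred else (1 - alpha) * (pred - y).

    Returns dL/dpred; at y == pred this picks one subgradient.
    """
    return -alpha if y > pred else 1.0 - alpha

g, h = logloss_grad_hess(y=1, raw_score=0.0)          # p = 0.5
g_under = quantile_grad(y=10.0, pred=8.0, alpha=0.9)  # under-prediction is penalised heavily
g_over = quantile_grad(y=8.0, pred=10.0, alpha=0.9)   # over-prediction is penalised lightly
```

The negative of the log-loss gradient, y - p, is exactly the pseudo-residual used in the boosting recursion.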
🚀 In Practice
Trade-offs, exercises, what to ship today.
7. Calibration
GBDTs trained with log-loss are usually well-calibrated; trained with hinge or focal loss they are not. Use isotonic regression or Platt scaling on a held-out set if downstream uses raw probabilities (e.g. expected revenue).
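Isotonic calibration boils down to a monotone least-squares fit of held-out labels against sorted model scores. The sketch below implements the pool-adjacent-violators step in plain Python to show the idea; in practice a library implementation such as scikit-learn's IsotonicRegression would normally be used. The scores and labels are made-up toy values.

```python
def pav(values):
    """Pool-adjacent-violators: non-decreasing least-squares fit to `values`."""
    blocks = []  # each block holds [sum of values, number of items]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return fitted

# Held-out set: raw model scores and the observed binary labels.
scores = [0.3, 0.1, 0.8, 0.5, 0.2, 0.7, 0.4, 0.6]
labels = [1, 0, 1, 0, 0, 1, 0, 1]

order = sorted(range(len(scores)), key=lambda i: scores[i])
sorted_scores = [scores[i] for i in order]
calibrated = pav([labels[i] for i in order])
# (sorted_scores[k], calibrated[k]) pairs define the step-wise calibration map.
```

Scores 0.3, 0.4 and 0.5 are pooled into a single step with calibrated probability 1/3, because the label sequence 1, 0, 0 violates monotonicity there.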
8. Feature importance — careful
gain is the most informative built-in importance; split count is misleading on high-cardinality features. For attribution prefer SHAP (TreeExplainer is exact and runs in O(TLD²) per row for T trees, L leaves per tree and depth D, which is fast).
9. Production checklist
- Pin tree_method='hist', device='cuda' if GPU available (5-10× speedup).
- Save models with model.save_model('model.json') not pickle (version-portable).
- Monitor feature drift with PSI; retrain when PSI > 0.2 on top features.
- Beware data leakage from target-encoded categoricals: encode inside CV folds.
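For the drift check in the list above, PSI (population stability index) can be computed as Σ (a_b - e_b) · ln(a_b / e_b) over bins b, where e_b and a_b are the fractions of training-time and live values in each bin. A minimal sketch, assuming quantile bins taken from the training sample and a small floor to avoid log(0); the function name and defaults are illustrative.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population stability index between a reference and a live sample."""
    srt = sorted(expected)
    # Interior bin edges at the reference sample's quantiles.
    edges = [srt[int(len(srt) * k / n_bins)] for k in range(1, n_bins)]

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            counts[sum(v > e for e in edges)] += 1  # bin index = number of edges below v
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train_feature = [float(v) for v in range(100)]
live_same = list(train_feature)
live_shifted = [v + 50.0 for v in train_feature]

psi_same = psi(train_feature, live_same)        # identical distributions
psi_shifted = psi(train_feature, live_shifted)  # strong shift
```

An unchanged feature gives a PSI of zero, while the shifted sample lands far above the 0.2 retraining threshold mentioned in the checklist.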
10. Discussion prompts
"Walk me through the math of one XGBoost split." "Why does leaf-wise growth overfit faster?" "When would you pick a neural net over LightGBM on tabular data?" The honest answer to the last is: rarely, unless you have huge sequential / image side-features.
Resources
- 🎥 StatQuest — Gradient Boost Part 1-4
- 📖 XGBoost paper — Chen & Guestrin 2016
- 📖 LightGBM paper — Ke et al. 2017
- 📖 Kaggle notebook — XGBoost hyperparameter tuning
Practice Problem: Subtree of Another Tree (Easy)