GBDT Part 2 — XGBoost, LightGBM, Regularisation, In-Practice Tuning
Session 14 of the 48-session learning series.
Date: Sat, 2026-06-20 · Time: 14:30–16:30 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 03 · Est. read: 2 h
Why this session matters
This is Session 14 of 48 in the Machine Learning track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- XGBoost — second-order Taylor expansion, exact greedy + approximate split
- LightGBM — histogram-based, GOSS, EFB, leaf-wise growth
- Regularisation — L1, L2, gamma, min_child_weight, max_depth
- Hyperparameter tuning — the order that actually matters
- When NOT to use GBDT — high-cardinality categoricals, sparse text, sequential data
Pre-read (skim before the session)
- XGBoost — A Scalable Tree Boosting System (Chen & Guestrin, 2016)
- LightGBM — A Highly Efficient Gradient Boosting Decision Tree (Ke et al., 2017)
- CatBoost paper (Prokhorenkova et al., 2018)
- Laurae's XGBoost tuning notes
Deep dive
1. Why XGBoost won Kaggle
Session 5 covered the math of boosting: fit a tree to the gradient, add it scaled by learning rate, repeat. XGBoost added three things that turned a textbook algorithm into a production weapon:
- Second-order objective: uses both the gradient and the Hessian, so the loss approximation is quadratic rather than linear. Split gains and leaf weights then come from a Newton-style step instead of a plain gradient step.
- System engineering: sparse-aware split-finding, column-block pre-sorting, and out-of-core training, so it can train on data that does not fit in memory.
- Regularisation baked in: α (L1) and λ (L2) on leaf weights; γ (min-loss-reduction) for pruning. Generalises out of the box.
2. XGBoost's objective
The objective at step t:
Obj^(t) = Σ_i loss(y_i, ŷ_i^(t-1) + f_t(x_i)) + Ω(f_t)
≈ Σ_i [ g_i · f_t(x_i) + ½ h_i · f_t(x_i)² ] + Ω(f_t)
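Where g_i = ∂loss/∂ŷ and h_i = ∂²loss/∂ŷ², both evaluated at ŷ_i^(t-1), and Ω(f_t) = γ T + ½ λ Σ_j w_j² (the L1 term is left out here for brevity; T is the number of leaves). For a tree with leaves j and weights w_j: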
Obj^(t) = Σ_j [ (Σ_{i∈j} g_i) · w_j + ½ (Σ_{i∈j} h_i + λ) · w_j² ] + γ T
Writing G_j = Σ_{i∈j} g_i and H_j = Σ_{i∈j} h_i, the closed-form optimal leaf weight is:
w_j* = - G_j / (H_j + λ)
And the gain of a candidate split:
Gain = ½ [ G_L² / (H_L + λ) + G_R² / (H_R + λ) - (G_L + G_R)² / (H_L + H_R + λ) ] - γ
If Gain < 0, don't split. That's γ pruning. This is the core formula in tree boosting.
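The formulas above can be checked by hand on a tiny example. The sketch below assumes log-loss on a raw (pre-sigmoid) score and ignores the L1 term; the function names and the four-row toy data are illustrative only and are not XGBoost's API.

```python
import math

def logistic_grad_hess(y, raw_score):
    """Gradient g and Hessian h of log-loss w.r.t. the raw score."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into left/right children."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Toy node: every row starts at raw score 0 (p = 0.5).
# Candidate split sends the two positives left and the two negatives right.
left = [logistic_grad_hess(y, 0.0) for y in (1, 1)]
right = [logistic_grad_hess(y, 0.0) for y in (0, 0)]
G_L, H_L = sum(g for g, _ in left), sum(h for _, h in left)     # -1.0, 0.5
G_R, H_R = sum(g for g, _ in right), sum(h for _, h in right)   #  1.0, 0.5

gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
print(round(gain, 3))                            # 0.667
print(round(leaf_weight(G_L, H_L, lam=1.0), 3))  # 0.667 (pushes positives up)
```

With γ = 1.0 the same split has negative gain (about -0.333), so it would be pruned.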
3. Exact vs approximate split finding
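Exact greedy: for each feature, sort the values, scan them in order, and compute the gain at every candidate split point. O(n × d) per node (n rows in the node, d features). Great for small data, brutal for billions of rows.
Approximate (histogram): bucket each feature into max_bin bins (default 256 in XGBoost, 255 in LightGBM). Building the histograms still takes one pass over the rows, but the split scan is now O(max_bin × d) per node, independent of n. XGBoost's tree_method='hist' (the default since 2.0) and all of LightGBM use this.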
You barely lose accuracy (the buckets are quantile-based) and gain 10–100× speedup. There's no reason to use exact-greedy on >100K rows.
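A minimal sketch of histogram-based split finding for a single feature follows. It assumes equal-width bins to stay short (the libraries use quantile-based bins), and the function name and toy data are hypothetical.

```python
def best_histogram_split(values, grads, hess, n_bins=4, lam=1.0):
    """Return (best_gain, threshold) for one feature using binned statistics."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    # One pass over the rows: accumulate gradient/Hessian sums per bin.
    G_hist = [0.0] * n_bins
    H_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        G_hist[b] += g
        H_hist[b] += h

    G_tot, H_tot = sum(G_hist), sum(H_hist)
    parent = G_tot ** 2 / (H_tot + lam)

    # Scan bin boundaries only: cost depends on n_bins, not on the row count.
    best_gain, best_thr = 0.0, None
    G_L = H_L = 0.0
    for b in range(n_bins - 1):
        G_L += G_hist[b]
        H_L += H_hist[b]
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best_gain:
            best_gain, best_thr = gain, lo + (b + 1) * width
    return best_gain, best_thr

values = [0.0, 0.5, 1.0, 1.5, 2.5, 3.0, 3.5, 4.0]
grads = [-0.5] * 4 + [0.5] * 4   # log-loss gradients at p = 0.5
hess = [0.25] * 8
hist_gain, hist_thr = best_histogram_split(values, grads, hess)
print(hist_gain, hist_thr)  # 2.0 2.0
```

Only three boundaries are evaluated here, however many rows fall into the bins.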
4. LightGBM's additions
LightGBM doubles down on histogram + ships two more tricks:
- GOSS (Gradient-based One-Side Sampling): keep all rows with large gradients (the hard examples) and randomly sample the small-gradient rows, up-weighting the sampled ones so the gradient statistics stay approximately unbiased. Cuts training time with little loss in accuracy.
- EFB (Exclusive Feature Bundling): bundle sparse features that are rarely non-zero at the same time into a single feature. Saves memory and time on wide sparse data (one-hot encodings).
- Leaf-wise growth (vs level-wise): always split the leaf with the highest loss reduction, not the whole level. Deeper, lopsided trees per iteration; gives more loss reduction per tree but can overfit small data. Control it with num_leaves and min_data_in_leaf.
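The GOSS idea can be sketched in a few lines. This is an illustration of the sampling rule from the LightGBM paper (keep the top fraction a, sample a fraction b of the rest, scale the sampled rows by (1 - a) / b), not LightGBM's actual implementation; the function name and defaults are assumptions.

```python
import random

def goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {row_index: weight} for the rows kept by GOSS."""
    n = len(grads)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top = int(round(top_rate * n))
    n_other = int(round(other_rate * n))

    top = order[:n_top]    # always kept, weight 1
    rest = order[n_top:]   # small-gradient rows, sampled at random
    sampled = random.Random(seed).sample(rest, n_other)

    # Up-weight sampled rows so their gradient sum stands in for all of `rest`.
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights

toy_grads = [i / 100 for i in range(100)]   # rows 80..99 have the largest gradients
goss_weights = goss_sample(toy_grads)
print(len(goss_weights))  # 30 rows kept out of 100
```

Each tree then sees 30% of the rows, with the sampled easy rows weighted about 8× when gradient sums are accumulated.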
Practical: LightGBM is often 2–10× faster than XGBoost on the same data, usually within 0.5% AUC; the gap depends on the dataset and narrows when XGBoost also uses tree_method='hist'.
5. Regularisation — what each knob does
| Param | What it does | How to adjust |
|---|---|---|
| max_depth / num_leaves | Tree size (capacity) | Decrease if overfitting |
| min_child_weight / min_data_in_leaf | Minimum Hessian sum (XGBoost) or row count (LightGBM) per leaf | Increase if overfitting |
| λ / reg_lambda | L2 on leaf weights | Increase if overfitting |
| α / reg_alpha | L1 on leaf weights | Increase to drive weights to zero |
| γ / min_split_gain | Min loss reduction to split | Increase to prune more |
| subsample | Row sampling per tree | Decrease (e.g., 0.8) to reduce variance |
| colsample_bytree | Column sampling per tree | Decrease (e.g., 0.8) to reduce variance |
| learning_rate (η) | Step size | Decrease, and raise n_estimators to compensate |
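Because the table mixes XGBoost and LightGBM names, a side-by-side mapping can help. The dictionaries below use the parameter names of the two scikit-learn style wrappers; the values are placeholders for illustration, not tuned recommendations.

```python
# Knobs from the table, as named in xgboost.XGBClassifier.
xgboost_params = {
    "max_depth": 6,
    "min_child_weight": 1.0,   # minimum Hessian sum in a child
    "reg_lambda": 1.0,         # L2 on leaf weights
    "reg_alpha": 0.0,          # L1 on leaf weights
    "gamma": 0.0,              # minimum loss reduction to split
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "learning_rate": 0.1,
}

# The corresponding knobs in lightgbm.LGBMClassifier.
lightgbm_params = {
    "num_leaves": 31,
    "min_child_samples": 20,   # alias of min_data_in_leaf
    "reg_lambda": 0.0,
    "reg_alpha": 0.0,
    "min_split_gain": 0.0,
    "subsample": 0.8,
    "subsample_freq": 1,       # row sampling is only active when this is > 0
    "colsample_bytree": 0.8,
    "learning_rate": 0.1,
}

# A level-wise tree of depth d has at most 2**d leaves, so a leaf-wise
# num_leaves well below that bound is the more conservative setting.
max_leaves_at_depth = 2 ** xgboost_params["max_depth"]
print(lightgbm_params["num_leaves"] < max_leaves_at_depth)  # True
```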
6. Tuning order that actually works
Most online tuning guides have it wrong (GridSearchCV over everything). The real order:
- Fix learning_rate=0.1, n_estimators=1000 with early stopping on a validation set.
- Tune max_depth / num_leaves first: biggest impact on capacity.
- Then min_child_weight / min_data_in_leaf: controls overfit.
- Then subsample, colsample_bytree: variance reduction.
- Then reg_alpha, reg_lambda, gamma: fine-tuning.
- Finally drop learning_rate to 0.01 or 0.05 and bump n_estimators proportionally for the final model.
Use Optuna with TPE — Bayesian search beats grid for >3 hyperparams.
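import lightgbm as lgb
import optuna

# Assumes dtrain and dval are lgb.Dataset objects built beforehand from your
# own train/validation split; they are not defined in this snippet.
def objective(trial):
    params = {
        "objective": "binary",
        "metric": "auc",
        "verbosity": -1,
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "max_depth": trial.suggest_int("max_depth", 4, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "learning_rate": 0.05,  # fixed during the search; lowered for the final fit
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0),
        "bagging_freq": 5,  # bagging_fraction only takes effect when this is > 0
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-3, 10, log=True),
    }
    # LightGBM 4.x removed the early_stopping_rounds / verbose_eval arguments
    # of lgb.train; early stopping is configured through a callback instead.
    model = lgb.train(
        params,
        train_set=dtrain,
        valid_sets=[dval],
        num_boost_round=2000,
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    return model.best_score["valid_0"]["auc"]

study = optuna.create_study(direction="maximize")  # TPE is Optuna's default sampler
study.optimize(objective, n_trials=50)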
7. Categorical handling
- XGBoost: traditionally needs one-hot or target encoding. Native categorical support (enable_categorical=True) was added more recently and is still less mature.
- LightGBM: native categorical handling via the categorical_feature param. Splits look like {a, c, f} vs {b, d, e} instead of one-hot.
- CatBoost: best-in-class; uses ordered target statistics and was built for categoricals.
For high-cardinality categoricals (>1000 levels), use target encoding with leave-one-out or k-fold to avoid leakage.
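A minimal sketch of k-fold (out-of-fold) target encoding, assuming a binary or numeric target and a simple additive smoothing toward the fold's prior. The function name, the interleaved fold assignment, and the smoothing scheme are illustrative choices; in real use you would shuffle rows before assigning folds.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_folds=3, smoothing=1.0):
    """Encode each row using target statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        # Statistics come from rows outside the current fold.
        for i in range(n):
            if i % n_folds != fold:
                sums[categories[i]] += targets[i]
                counts[categories[i]] += 1
                total += targets[i]
                seen += 1
        prior = total / seen
        # Rows inside the fold never see their own target.
        for i in range(fold, n, n_folds):
            c = categories[i]
            encoded[i] = (sums[c] + smoothing * prior) / (counts[c] + smoothing)
    return encoded

cats = ["a", "a", "b", "b", "a", "b"]
ys = [1, 0, 1, 1, 0, 0]
enc = kfold_target_encode(cats, ys)
print(round(enc[0], 4))  # 0.0833
```

The first row is a positive "a", but its encoding is built only from the other folds, so its own label cannot leak into the feature. A category unseen in the training folds falls back to the prior.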
8. When GBDT is the wrong tool
GBDT dominates tabular. It's the wrong tool for:
- Text / images / audio — no spatial or sequential prior. Neural nets win.
- Very high-cardinality sparse features — embeddings handle this better.
- Online learning — GBDT is batch by nature. Use SGD or factorisation machines.
- Strict latency budget (<1 ms per inference at high QPS) — logistic regression or a tiny MLP is simpler.
9. Production checklist
- Monotonic constraints (monotone_constraints): force the model to be monotonic in features like price or bid, which keeps business logic sane.
- Feature importance: prefer SHAP over the built-in gain importance. SHAP attributions are consistent, whereas the built-in gain, split-count, and cover importances can rank the same features very differently.
- Calibration — gradient boosting on log-loss is reasonably calibrated, but verify with reliability diagrams. Use isotonic or Platt if needed.
- Drift monitoring — track feature distributions in production vs training; alert on PSI > 0.2.
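For the drift check, here is a minimal sketch of the Population Stability Index for one numeric feature. It assumes bins taken from quantiles of the training sample and a small floor to avoid log(0); the function name, bin count, and floor value are assumptions rather than a standard API.

```python
import math

def psi(expected, actual, n_bins=10, eps=1e-6):
    """PSI = sum over bins of (q - p) * ln(q / p), p = training share, q = production share."""
    s = sorted(expected)
    # Interior bin edges at the training quantiles.
    edges = [s[int(len(s) * k / n_bins)] for k in range(1, n_bins)]

    def fractions(xs):
        counts = [0] * n_bins
        for x in xs:
            counts[sum(x > e for e in edges)] += 1   # bin = number of edges below x
        return [max(c / len(xs), eps) for c in counts]

    p, q = fractions(expected), fractions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))

train_feature = [float(i) for i in range(1000)]
prod_same = list(train_feature)
prod_shifted = [x + 500.0 for x in train_feature]

print(psi(train_feature, prod_same))           # 0.0
print(psi(train_feature, prod_shifted) > 0.2)  # True, would trigger the alert
```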
Reading material
In-depth research material
- ESL Ch. 10 — Boosting
- SHAP — A Unified Approach
- Friedman 1999 — Stochastic Gradient Boosting
- Laurae's hyperparameter notes
Video reference
▶︎ StatQuest — XGBoost Part 1: Regression
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Split Array Largest Sum
- Link: https://leetcode.com/problems/split-array-largest-sum/
- Difficulty: Hard
- Why this problem: Binary-search on answer; greedy feasibility — same shape as max-leaf objective tuning.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Derive the XGBoost gain formula from the second-order Taylor expansion.
- Explain leaf-wise vs level-wise growth and when each overfits.
- List the tuning order (depth → leaf-min → subsample → reg → η).
- Compare LightGBM, XGBoost, CatBoost on speed and categorical handling.
- Identify two problem types where GBDT is the wrong choice.
- Solve split-array-largest-sum: binary search on the answer, the same shape as tuning a split threshold.