ai · ml · intermediate · 12m · 2026-06-09

GBDT Part 2 — XGBoost, LightGBM, Regularisation, In-Practice Tuning

Session 14 of the 48-session learning series.

Date: Sat, 2026-06-20 · Time: 14:30–16:30 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 03 · Est. read: 2 h

Why this session matters

This is Session 14 of 48 in the Machine Learning track. Each session covers one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

XGBoost — second-order Taylor expansion, exact greedy + approximate split
LightGBM — histogram-based, GOSS, EFB, leaf-wise growth
Regularisation — L1, L2, gamma, min_child_weight, max_depth
Hyperparameter tuning — the order that actually matters
When NOT to use GBDT — high-cardinality categoricals, sparse text, sequential data

Pre-read (skim before the session)

Deep dive

1. Why XGBoost won Kaggle

Session 5 covered the math of boosting: fit a tree to the gradient, add it scaled by learning rate, repeat. XGBoost added three things that turned a textbook algorithm into a production weapon:

Second-order objective: uses both the gradient and the Hessian, so the loss approximation is quadratic, not linear. Split gains and leaf weights come from a closer local fit to the loss.
System engineering: sparse-aware split-finding, column block pre-sort, out-of-core training. It can train on data that does not fit in memory.
Regularisation baked in: α (L1), λ (L2) on leaf weights; γ (min-loss-reduction) for pruning. The defaults already generalise reasonably well before any tuning.

2. XGBoost's objective

The objective at step t:

Obj^(t) = Σ_i  loss(y_i, ŷ_i^(t-1) + f_t(x_i))  +  Ω(f_t)
        ≈ Σ_i  [ g_i · f_t(x_i)  +  ½ h_i · f_t(x_i)² ]  +  Ω(f_t)

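Where g_i = ∂loss/∂ŷ and h_i = ∂²loss/∂ŷ², both evaluated at ŷ_i^(t-1); the constant term loss(y_i, ŷ_i^(t-1)) is dropped because it does not depend on f_t. For a tree with T leaves indexed by j and leaf weights w_j, with Ω(f_t) = γT + ½ λ Σ_j w_j²: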

Obj^(t) = Σ_j  [ (Σ_{i∈j} g_i) · w_j  +  ½ (Σ_{i∈j} h_i + λ) · w_j² ]  +  γ T

Writing G_j = Σ_{i∈j} g_i and H_j = Σ_{i∈j} h_i, the closed-form optimal leaf weight is:

w_j* = - G_j / (H_j + λ)

And the gain of a candidate split:

Gain = ½ [ G_L² / (H_L + λ)  +  G_R² / (H_R + λ)  -  (G_L + G_R)² / (H_L + H_R + λ) ] - γ

If Gain < 0, don't split: the loss reduction does not exceed γ, so the split is pruned. That's γ pruning. This is the core formula in tree boosting.
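As a worked check of the leaf-weight and gain formulas, here is a minimal sketch in plain Python. It assumes log-loss on a binary target, and the function names (grad_hess_logloss, leaf_weight, split_gain) are made up for this illustration, not taken from XGBoost.

```python
import math

def grad_hess_logloss(y, raw_score):
    """Gradient and Hessian of log-loss with respect to the raw score (logit)."""
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y, p * (1.0 - p)

def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting one node into left and right children."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R)
                  - score(G_L + G_R, H_L + H_R)) - gamma

# Toy node: four rows, all currently predicted at raw score 0 (p = 0.5).
labels = [1, 1, 0, 0]
stats = [grad_hess_logloss(y, 0.0) for y in labels]

# Candidate split: the two positives go left, the two negatives go right.
G_L = sum(g for g, _ in stats[:2])
H_L = sum(h for _, h in stats[:2])
G_R = sum(g for g, _ in stats[2:])
H_R = sum(h for _, h in stats[2:])

gain_no_gamma = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)    # 2/3
gain_with_gamma = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=1.0)  # -1/3
w_left = leaf_weight(G_L, H_L, lam=1.0)    # positive: pushes scores up
w_right = leaf_weight(G_R, H_R, lam=1.0)   # negative: pushes scores down
```

With γ = 0 the split is worth taking; with γ = 1 the same split has negative gain and would be pruned.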

3. Exact vs approximate split finding

Exact greedy: for each feature, sort the values, scan them in order, and compute the gain at every candidate split point. With pre-sorted columns the scan costs O(n × d) per node. Great for small data, brutal for billions of rows.

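Approximate (histogram): bucket each feature into at most max_bin bins (default 256 in XGBoost, 255 in LightGBM). Building the histograms still takes one pass over the rows, but the scan for the best split is now O(max_bin × d), independent of n. XGBoost's tree_method='hist' (the default since 2.0) and all of LightGBM use this.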

You barely lose accuracy (the buckets are quantile-based) and gain 10–100× speedup. There's no reason to use exact-greedy on >100K rows.
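The histogram idea can be sketched for a single feature in plain Python. This is an illustration under simplifying assumptions (one feature, no missing values, γ = 0), not the binning code of either library; quantile_bin_edges and best_histogram_split are hypothetical names.

```python
from bisect import bisect_right

def quantile_bin_edges(values, max_bin):
    """Bin boundaries taken at evenly spaced quantiles of the values."""
    ordered = sorted(values)
    n = len(ordered)
    edges = []
    for b in range(1, max_bin):
        edge = ordered[(b * n) // max_bin]
        if not edges or edge > edges[-1]:   # skip duplicate boundaries
            edges.append(edge)
    return edges

def best_histogram_split(values, grads, hess, max_bin=4, lam=1.0):
    edges = quantile_bin_edges(values, max_bin)
    n_bins = len(edges) + 1
    G = [0.0] * n_bins
    H = [0.0] * n_bins
    # One pass over the rows accumulates gradient statistics per bin.
    for v, g, h in zip(values, grads, hess):
        k = bisect_right(edges, v)
        G[k] += g
        H[k] += h
    G_tot, H_tot = sum(G), sum(H)
    parent = G_tot ** 2 / (H_tot + lam)
    best_gain, best_threshold = 0.0, None
    G_L = H_L = 0.0
    # The scan over candidate thresholds touches bins only, never rows.
    for k in range(n_bins - 1):
        G_L += G[k]
        H_L += H[k]
        G_R, H_R = G_tot - G_L, H_tot - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam) + G_R ** 2 / (H_R + lam) - parent)
        if gain > best_gain:
            best_gain, best_threshold = gain, edges[k]
    return best_threshold, best_gain   # rows with value < threshold go left

values = [1, 2, 3, 4, 5, 6, 7, 8]
grads = [-0.5] * 4 + [0.5] * 4     # first half wants a higher score
hess = [0.25] * 8
threshold, gain = best_histogram_split(values, grads, hess)
```

On this toy data the scan evaluates three candidate thresholds instead of seven and still finds the boundary between the two halves.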

4. LightGBM's additions

LightGBM doubles down on histogram + ships two more tricks:

GOSS (Gradient-based One-Side Sampling): keep all rows with large gradient (they're hard examples); randomly sample the easy ones (small gradient) and up-weight the sampled rows so the gradient statistics stay roughly unbiased. Cuts training time with little loss in accuracy.
EFB (Exclusive Feature Bundling): bundle sparse features that are rarely non-zero at the same time into a single feature. Saves memory and time on wide sparse data (one-hot encodings).
Leaf-wise growth (vs level-wise): always split the leaf with highest loss reduction, not the whole level. Deeper, lopsided trees per iteration; gives more loss reduction per tree but can overfit small data — control with num_leaves and min_data_in_leaf.
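The GOSS sampling step can be sketched in a few lines of plain Python. The parameter names top_rate and other_rate mirror LightGBM's, but goss_sample itself is a hypothetical helper written for this note, not library code.

```python
import random

def goss_sample(grads, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the rows kept for one boosting iteration."""
    n = len(grads)
    # Rank rows by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top = order[:n_top]            # hard examples: always kept
    rest = order[n_top:]           # easy examples: randomly subsampled
    rng = random.Random(seed)
    sampled = rng.sample(rest, n_other)
    # Up-weight the sampled easy rows so their total contribution to the
    # gradient sums matches what the full easy set would have contributed.
    amplify = (1.0 - top_rate) / other_rate
    indices = top + sampled
    weights = [1.0] * n_top + [amplify] * n_other
    return indices, weights

grads = [0.01 * i for i in range(100)]   # toy gradients, larger index = harder row
indices, weights = goss_sample(grads)
```

With the rates above, each tree is built from 30% of the rows, and each sampled easy row counts eight times when gradient and Hessian sums are accumulated.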

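Practical: as a rough rule of thumb, LightGBM trains 2–10× faster than XGBoost on the same data, usually within 0.5% AUC. Measure on your own data, because the gap depends on dataset shape, tree_method, and library version.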

5. Regularisation — what each knob does

Param	What it does	How to adjust
`max_depth` / `num_leaves`	Tree size (depth / number of leaves)	Decrease if overfitting
`min_child_weight` / `min_data_in_leaf`	Min Hessian sum (XGBoost) / min row count (LightGBM) per leaf	Increase if overfitting
`λ` / `reg_lambda`	L2 on leaf weights	Increase if overfitting
`α` / `reg_alpha`	L1 on leaf weights	Increase to drive weights to zero
`γ` / `min_split_gain`	Min loss reduction to split	Increase to prune more
`subsample`	Row sampling per tree	Decrease (e.g., 0.8) to reduce variance
`colsample_bytree`	Column sampling per tree	Decrease (e.g., 0.8) to reduce variance
`learning_rate` (`η`)	Step size	Decrease and increase `n_estimators` together
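To make the table concrete, here is how the same knobs are spelled in the scikit-learn wrappers of the two libraries. The values are illustrative starting points chosen for this sketch, not recommendations from either project.

```python
# Keyword arguments for xgboost.XGBClassifier (values are illustrative).
xgb_params = {
    "max_depth": 6,
    "min_child_weight": 1.0,   # minimum Hessian sum in a child
    "reg_lambda": 1.0,         # L2 on leaf weights
    "reg_alpha": 0.0,          # L1 on leaf weights
    "gamma": 0.0,              # minimum loss reduction to split
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "learning_rate": 0.1,
    "n_estimators": 1000,
}

# Keyword arguments for lightgbm.LGBMClassifier (values are illustrative).
lgb_params = {
    "num_leaves": 63,
    "max_depth": -1,           # -1 means no depth limit; num_leaves controls size
    "min_child_samples": 20,   # alias of min_data_in_leaf
    "reg_lambda": 1.0,
    "reg_alpha": 0.0,
    "min_split_gain": 0.0,
    "subsample": 0.8,
    "subsample_freq": 1,       # row sampling is off unless this is > 0
    "colsample_bytree": 0.8,
    "learning_rate": 0.1,
    "n_estimators": 1000,
}

# Knobs that share a name across the two wrappers.
shared_knobs = sorted(set(xgb_params) & set(lgb_params))
```

One detail that is easy to miss: in LightGBM, subsample has no effect while subsample_freq is left at its default of 0.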

6. Tuning order that actually works

Most online tuning guides have it wrong (GridSearchCV over everything). The real order:

Fix learning_rate=0.1, n_estimators=1000 with early stopping on a validation set.
Tune max_depth / num_leaves first — biggest impact on capacity.
Then min_child_weight / min_data_in_leaf — controls overfit.
Then subsample, colsample_bytree — variance reduction.
Then reg_alpha, reg_lambda, gamma — fine-tuning.
Finally drop learning_rate to 0.01 or 0.05, bump n_estimators proportionally for the final model.

Use Optuna with its default TPE sampler: Bayesian search is usually more sample-efficient than grid search once you tune more than about 3 hyperparameters.

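import lightgbm as lgb
import optuna

# dtrain and dval are assumed to be lgb.Dataset objects built from your own
# train/validation split, e.g. lgb.Dataset(X_train, label=y_train).

def objective(trial):
    params = {
        "objective": "binary",
        "metric": "auc",
        "verbosity": -1,  # silence per-trial training logs
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "max_depth": trial.suggest_int("max_depth", 4, 12),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "learning_rate": 0.05,
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.5, 1.0),
        "bagging_freq": 5,
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-3, 10, log=True),
    }
    # LightGBM >= 4.0 takes early stopping and logging as callbacks; the old
    # early_stopping_rounds / verbose_eval keyword arguments were removed.
    model = lgb.train(
        params,
        train_set=dtrain,
        num_boost_round=2000,
        valid_sets=[dval],
        callbacks=[lgb.early_stopping(50, verbose=False), lgb.log_evaluation(0)],
    )
    # Validation AUC at the best iteration found by early stopping.
    return model.best_score["valid_0"]["auc"]

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)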

7. Categorical handling

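XGBoost: traditionally needs one-hot or target encoding. Recent versions add native categorical support (enable_categorical=True with tree_method='hist'), but it is newer and less battle-tested than LightGBM's or CatBoost's.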
LightGBM: native categorical via categorical_feature param. Splits like {a, c, f} vs {b, d, e} instead of one-hot.
CatBoost: best-in-class, uses ordered target statistics; built for categoricals.

For high-cardinality categoricals (>1000 levels), use target encoding with leave-one-out or k-fold to avoid leakage.
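A minimal sketch of k-fold target encoding in plain Python follows. It assumes the rows are already shuffled (folds are taken by index modulo n_splits) and adds a smoothing prior, which the text above does not mention; kfold_target_encode is a hypothetical helper, not a library function.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_splits=5, smoothing=10.0):
    """Encode each row using target statistics from the other folds only."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_splits):
        sums = defaultdict(float)
        counts = defaultdict(int)
        total, seen = 0.0, 0
        # Collect per-category statistics from every row outside this fold.
        for i in range(n):
            if i % n_splits == fold:
                continue
            sums[categories[i]] += targets[i]
            counts[categories[i]] += 1
            total += targets[i]
            seen += 1
        prior = total / seen   # mean target of the out-of-fold rows
        # Encode the held-out rows; their own targets are never used.
        for i in range(fold, n, n_splits):
            c = categories[i]
            encoded[i] = ((sums.get(c, 0.0) + smoothing * prior)
                          / (counts.get(c, 0) + smoothing))
    return encoded

categories = ["a", "a", "b", "b"]
targets = [1, 0, 1, 1]
encoded = kfold_target_encode(categories, targets, n_splits=2, smoothing=1.0)
```

Row 0 has target 1, yet its encoding is built only from row 1 (same category, target 0) and the out-of-fold prior, which is what keeps the label from leaking into the feature. Categories unseen in the other folds fall back to the prior.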

8. When GBDT is the wrong tool

GBDT dominates tabular. It's the wrong tool for:

Text / images / audio — no spatial or sequential prior. Neural nets win.
Very high-cardinality sparse features — embeddings handle this better.
Online learning — GBDT is batch by nature. Use SGD or factorisation machines.
Strict latency budget (<1 ms per inference at high QPS) — logistic regression or a tiny MLP is simpler.

9. Production checklist

Monotonic constraints (monotone_constraints) — force the model to be monotonic in features like price or bid — keeps business logic sane.
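Feature importance: use SHAP, not the built-in gain. SHAP values are consistent (a feature's attribution does not drop when the model relies on it more), whereas built-in gain and split-count importances can rank features inconsistently and tend to favour high-cardinality features.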
Calibration — gradient boosting on log-loss is reasonably calibrated, but verify with reliability diagrams. Use isotonic or Platt if needed.
Drift monitoring — track feature distributions in production vs training; alert on PSI > 0.2.
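The PSI check in the last item can be sketched as follows. This assumes a single numeric feature, decile bins derived from the training sample, and a small floor (eps) for empty bins; psi is a hypothetical helper and the binning choices are assumptions of this sketch.

```python
import math
from bisect import bisect_right

def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` sample."""
    ordered = sorted(expected)
    n = len(ordered)
    # Bin boundaries at the quantiles of the training (expected) sample.
    edges = sorted({ordered[(b * n) // n_bins] for b in range(1, n_bins)})

    def fractions(sample):
        counts = [0] * (len(edges) + 1)
        for v in sample:
            counts[bisect_right(edges, v)] += 1
        # Floor empty bins at eps so the logarithm stays finite.
        return [max(c / len(sample), eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

train = list(range(1000))               # toy training feature values
shifted = [v + 500 for v in train]      # toy production values after a shift
psi_same = psi(train, train)
psi_shifted = psi(train, shifted)
alert = psi_shifted > 0.2               # threshold from the checklist
```

An identical distribution scores zero, while the shifted sample lands far above the 0.2 alert threshold.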

Link: https://leetcode.com/problems/split-array-largest-sum/
Difficulty: Hard
Why this problem: Binary-search on answer; greedy feasibility — same shape as max-leaf objective tuning.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Derive the XGBoost gain formula from the second-order Taylor expansion.
Explain leaf-wise vs level-wise growth and when each overfits.
List the tuning order (depth → leaf-min → subsample → reg → η).
Compare LightGBM, XGBoost, CatBoost on speed and categorical handling.
Identify two problem types where GBDT is the wrong choice.
Solve split-array-largest-sum — binary-search on the answer; same shape as tuning a split threshold.

