ai · ml · intermediate · 12m · 2026-06-09

Gradient Boosted Trees Part 1 — Boosting Intuition, Trees, Loss

Session 5 of the 48-session learning series.

Date: Sat, 2026-06-13 · Time: 14:30–16:30 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 03 · Est. read: 2 h

Why this session matters

This is Session 05 of 48, on the Machine Learning track. The series follows a rhythm of one focused topic per session, paced so you have time to absorb it rather than rush.

Agenda

Why gradient boosting still beats neural nets on tabular data
The boosting idea — fit a tree to the residual, repeat, sum the trees
Trees as base learners — splits, gain, depth, leaf values
Loss functions — MSE for regression, log-loss for classification
What goes wrong without regularisation (and what Part 2 will fix)

Pre-read (skim before the session)

Deep dive

1. Why GBDT still wins on tabular data

In 2024–2026, transformer-everything is the headline. But:

For tabular data — features, joins, categoricals — XGBoost / LightGBM / CatBoost still beat deep nets in 80%+ of Kaggle competitions and most enterprise ML jobs.
They handle missing values, mixed types, and skewed targets without much preprocessing.
They train on a CPU in minutes for tens of millions of rows.
They give you feature importance and SHAP explanations for free.

If your ML role touches click prediction, ranking, churn, risk scoring, demand forecasting — GBDT is the workhorse. Knowing it cold is non-negotiable.

2. The boosting idea in one line

Fit a tree to the residuals (errors) of the previous model. Add it. Repeat M times. The final prediction is the sum.

F_0(x) = mean(y)                       # initial guess
for m in 1..M:
    r_m = y - F_{m-1}(x)               # residuals
    h_m = tree.fit(x, r_m)             # tree on residuals
    F_m(x) = F_{m-1}(x) + lr * h_m(x)  # add (with learning rate)

Each tree is a correction. Many small corrections (M = 500–2000) work better than a few big ones.
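
The loop above can be sketched in runnable form. This is a minimal illustration for squared loss only, assuming scikit-learn's `DecisionTreeRegressor` as the base learner; the helper names (`fit_boosted_trees`, `predict_boosted`) and the synthetic data are made up for this example and are not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=200, lr=0.1, max_depth=2):
    """Squared-loss gradient boosting: returns (initial guess, list of trees)."""
    f0 = float(np.mean(y))                      # F_0(x) = mean(y)
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                     # r_m = y - F_{m-1}(x)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)                   # h_m is fitted to the residuals
        pred = pred + lr * tree.predict(X)      # F_m = F_{m-1} + lr * h_m
        trees.append(tree)
    return f0, trees


def predict_boosted(f0, trees, X, lr=0.1):
    """Sum the initial guess and every shrunken tree correction."""
    return f0 + lr * sum(tree.predict(X) for tree in trees)


# Synthetic data, made up purely for illustration.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=400)

f0, trees = fit_boosted_trees(X, y)
mse_start = np.mean((y - f0) ** 2)
mse_end = np.mean((y - predict_boosted(f0, trees, X)) ** 2)
print(f"train MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

The same `lr` must be used for fitting and predicting. Production libraries store it with the model, which this sketch skips for brevity.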

3. Why this works — the gradient-descent view

For squared loss L(y, F) = (y - F)²/2, the negative gradient is exactly the residual (y - F). So fitting a tree to the residual IS approximating one step of gradient descent in function space.

For other losses (log-loss for classification, Huber for robust regression, quantile loss for prediction intervals), you fit a tree to the pseudo-residual = negative gradient of the loss w.r.t. the current prediction. Same algorithm, different "residual".

This is the Friedman 2001 insight: boosting is gradient descent where each "step" is a small tree.
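
Written out, the squared-loss case is a one-line derivation. The symbol ν for the learning rate is introduced here for illustration; it plays the role of `lr` in the pseudocode of section 2.

```latex
\[
L(y, F) = \tfrac{1}{2}(y - F)^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial F} = -(y - F)
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F .
\]
\[
F_m(x) = F_{m-1}(x) + \nu\, h_m(x),
\qquad
h_m(x) \approx -\left.\frac{\partial L(y, F)}{\partial F}\right|_{F = F_{m-1}(x)} .
\]
```

The second line is the function-space analogue of a gradient step: the tree h_m stands in for the negative gradient, evaluated at the current model.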

4. The base learner — what makes a single tree

A regression tree recursively splits the feature space:

              x_3 < 4.5?
              /        \
           yes          no
           /              \
   x_1 < 2.0?       x_5 in {A, B}?
    /       \          /        \
   …         …       leaf      leaf
                    (val=2.1) (val=-0.7)

For each candidate split (feature × threshold), compute the gain:

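gain = MSE(parent) - [w_L · MSE(left) + w_R · MSE(right)]

Here w_L and w_R are the fractions of the parent's rows that go to the left and right child. Choose the split with the maximum gain, and recurse until a depth limit or minimum-gain threshold is hit. For MSE loss the leaf value is the mean of the tree's targets in that leaf; inside the boosting loop those targets are the residuals, not the raw y.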

XGBoost/LightGBM use a smarter gain that includes regularisation — Session 14.
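
The plain MSE gain above can be sketched as an exhaustive search over thresholds on a single numeric feature. This is an illustrative, unoptimised version that assumes the child weights are the fractions of rows in each child; `best_split` is a hypothetical helper, and real libraries use sorted prefix sums or histograms instead of recomputing variances.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, gain) of the best MSE split on one numeric feature."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    parent_mse = np.var(y)                      # MSE around the parent mean
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        if x_sorted[i] == x_sorted[i - 1]:
            continue                            # cannot split between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        w_left, w_right = i / n, (n - i) / n    # fraction of rows in each child
        child_mse = w_left * np.var(left) + w_right * np.var(right)
        gain = parent_mse - child_mse
        if gain > best_gain:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_gain = gain
    return best_threshold, best_gain


# Toy data: the target jumps from about 1 to about 5 between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.0, 1.2, 0.8, 5.0, 5.2, 4.8])
threshold, gain = best_split(x, y)
print(threshold, round(gain, 3))
```

On this toy data the search picks the midpoint 3.5, where the gain is about 4.0: the parent MSE is roughly 4.03 and each child has an MSE of roughly 0.03.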

5. Loss functions you'll actually use

| Task | Loss | Pseudo-residual |
| --- | --- | --- |
| Regression | MSE: `½(y − F)²` | `y − F` |
| Binary classification | Log-loss: `-y·log(p) − (1-y)·log(1-p)` (p = σ(F)) | `y − p` |
| Robust regression | Huber | clipped residual |
| Quantile (e.g. P90) | Pinball | `α` if `y > F` else `α − 1` |
| Ranking (LambdaMART) | Pairwise / listwise | gradient ∝ swap utility |

For classification, you typically operate on logits (raw F), and the link function σ(F) gives probabilities. Same gradient-descent shell.
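
One way to convince yourself of the pseudo-residual column is to compare each closed form against a numerical derivative. The sketch below does that for MSE, log-loss, and pinball loss using NumPy only. All function names are made up for this example, and the pinball loss is assumed to be `α·(y − F)` when `y > F` and `(1 − α)·(F − y)` otherwise.

```python
import numpy as np


def sigmoid(f):
    return 1.0 / (1.0 + np.exp(-f))


# Losses as functions of the label y and the raw score F.
def mse_loss(y, f):
    return 0.5 * (y - f) ** 2


def log_loss(y, f):
    p = sigmoid(f)
    return -y * np.log(p) - (1 - y) * np.log(1 - p)


def pinball_loss(y, f, alpha=0.9):
    return np.where(y > f, alpha * (y - f), (1 - alpha) * (f - y))


# Closed-form pseudo-residuals (negative gradients w.r.t. F).
def mse_residual(y, f):
    return y - f


def log_loss_residual(y, f):
    return y - sigmoid(f)


def pinball_residual(y, f, alpha=0.9):
    return np.where(y > f, alpha, alpha - 1)


def numeric_neg_gradient(loss, y, f, eps=1e-6):
    """Central finite difference of -dL/dF."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)


y = np.array([0.0, 1.0, 1.0, 0.0])
f = np.array([-1.5, 0.3, 2.0, 0.7])   # raw scores (logits for log-loss)

for name, loss, residual in [
    ("mse", mse_loss, mse_residual),
    ("log-loss", log_loss, log_loss_residual),
    ("pinball", pinball_loss, pinball_residual),
]:
    max_gap = np.max(np.abs(numeric_neg_gradient(loss, y, f) - residual(y, f)))
    print(f"{name}: max gap = {max_gap:.2e}")
```

The gaps should be at the level of floating-point noise. The pinball check only holds away from the kink at `y = F`, which is why none of the sample points sit on it.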

6. Learning rate, number of trees, depth — the three knobs

| Knob | Typical range | Effect |
| --- | --- | --- |
| `learning_rate` | 0.01 – 0.3 | Smaller = more trees needed, less overfit, more robust. |
| `n_estimators` | 100 – 5000 | Higher with smaller LR. Use early stopping. |
| `max_depth` | 3 – 8 | Deeper trees overfit faster; shallow trees usually suffice (XGBoost's default is 6). |

Standard recipe:

learning_rate = 0.05, n_estimators = 2000, max_depth = 6, early stopping after 50 rounds without improvement on a validation set.
Treat n_estimators as a ceiling and let early stopping pick the actual tree count.
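
The recipe can be tried without XGBoost. The sketch below swaps in scikit-learn's `GradientBoostingRegressor`, whose `n_iter_no_change` and `validation_fraction` parameters provide a built-in form of early stopping. The data is synthetic, and the stopping point will differ on a real dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data, made up for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = X[:, 0] ** 2 + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=600)

model = GradientBoostingRegressor(
    learning_rate=0.05,
    n_estimators=2000,        # upper bound; early stopping usually ends sooner
    max_depth=6,
    validation_fraction=0.2,  # rows held out internally for the stopping check
    n_iter_no_change=50,      # stop after 50 rounds without improvement
    random_state=0,
)
model.fit(X, y)

print("trees requested:", model.n_estimators)
print("trees actually fitted:", model.n_estimators_)
```

`n_estimators_` (with the trailing underscore) is the count selected by early stopping, so comparing it with `n_estimators` shows how much of the budget was actually used.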

7. A worked example — predict house prices in 30 lines

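import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Load the Ames housing data and split off the target column.
df = pd.read_csv("ames-housing.csv")
y = df.pop("SalePrice")
# One-hot encode categoricals; XGBoost handles the remaining NaNs itself.
X = pd.get_dummies(df, drop_first=True)

# Hold out 20% of rows as the validation set used for early stopping.
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.2, random_state=0)

# Passing eval_metric and early_stopping_rounds to the constructor
# needs a recent XGBoost (1.6 or later).
model = xgb.XGBRegressor(
    n_estimators=2000,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.9,
    colsample_bytree=0.9,
    eval_metric="rmse",
    early_stopping_rounds=50,
)
model.fit(X_tr, y_tr, eval_set=[(X_v, y_v)], verbose=False)

# Report the round early stopping settled on and its validation RMSE.
print("best iter:", model.best_iteration)
print("RMSE:", model.evals_result()["validation_0"]["rmse"][model.best_iteration])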

That's a baseline that's very hard to beat without per-feature engineering.

8. The intuition for feature importance

GBDT gives you three importance metrics:

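Gain: the reduction in loss from splits that use this feature. The most useful one for "which features moved the needle?" XGBoost's `gain` importance type reports the average per split; `total_gain` reports the sum.
Cover: the number of training rows that pass through splits on this feature. XGBoost weights each row by its second-order gradient, which is 1 per row for squared error.
Weight: the count of splits that use this feature. Misleading on high-cardinality categoricals.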

For a deeper explanation, use SHAP (Session 14 will touch on it; full SHAP is its own rabbit hole). For a first pass, gain is fine.
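
To make the three metrics concrete, the sketch below computes total gain, cover, and weight by hand from the trees of a scikit-learn `GradientBoostingRegressor`. It illustrates the definitions rather than reproducing XGBoost's exact numbers: it assumes squared-error impurity and unit row weights, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Made-up target: feature 0 matters a lot, feature 1 a little, feature 2 not at all.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X, y)

n_features = X.shape[1]
gain = np.zeros(n_features)
cover = np.zeros(n_features)
weight = np.zeros(n_features)

for est in model.estimators_[:, 0]:             # one regression tree per round
    t = est.tree_
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:                          # leaf: no split here
            continue
        feat = t.feature[node]
        n_node = t.weighted_n_node_samples[node]
        # Drop in summed squared error produced by this split.
        split_gain = (
            n_node * t.impurity[node]
            - t.weighted_n_node_samples[left] * t.impurity[left]
            - t.weighted_n_node_samples[right] * t.impurity[right]
        )
        gain[feat] += split_gain
        cover[feat] += n_node
        weight[feat] += 1

for j in range(n_features):
    print(f"feature {j}: gain={gain[j]:.1f} cover={cover[j]:.0f} weight={weight[j]:.0f}")
```

Feature 0 should dominate the gain column. The weight column is typically much less lopsided, which is one reason split counts alone can mislead.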

9. What goes wrong without regularisation (preview of Part 2)

A deep tree on noisy data memorises. Combine 500 deep trees and you get a model that scores 0.01 RMSE on train and 5.0 on test. The classic fix list, covered in detail in Part 2 (Session 14):

Shrinkage (= learning rate < 1)
Subsample rows per tree (stochastic gradient boosting)
Subsample columns (features) per tree (XGBoost's colsample_bytree)
L1 / L2 leaf penalties (XGBoost's reg_alpha, reg_lambda)
Min child weight / min samples leaf
Early stopping on a held-out set
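
The train/test gap is easy to reproduce on toy data. The sketch below fits two scikit-learn `GradientBoostingRegressor` models on a noisy synthetic target: one with deep trees and no shrinkage, one with several items from the list above switched on. scikit-learn is used only for convenience and has no L1/L2 leaf penalties, so those items are not shown. Exact numbers depend on the data and seed.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(600, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.5, size=600)   # noisy, made-up target
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)


def rmse(model, X, y):
    return float(np.sqrt(np.mean((y - model.predict(X)) ** 2)))


# No shrinkage, deep trees, every row used by every tree.
deep = GradientBoostingRegressor(
    n_estimators=100, learning_rate=1.0, max_depth=10, random_state=0
).fit(X_tr, y_tr)

# Shrinkage, shallow trees, row subsampling, minimum leaf size.
regularised = GradientBoostingRegressor(
    n_estimators=100, learning_rate=0.05, max_depth=2,
    subsample=0.8, min_samples_leaf=10, random_state=0,
).fit(X_tr, y_tr)

for name, model in [("deep", deep), ("regularised", regularised)]:
    print(f"{name}: train RMSE {rmse(model, X_tr, y_tr):.3f}, "
          f"test RMSE {rmse(model, X_te, y_te):.3f}")
```

The deep model should show a train RMSE near zero with a much larger test RMSE, while the regularised model's two numbers should sit close together.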

10. What's next (Session 14 — GBDT Part 2)

XGBoost — the second-order Newton step, sparsity-aware splits, regularisation in the gain function
LightGBM — histogram-based splits, leaf-wise growth, GOSS, EFB
CatBoost — ordered boosting, native categorical handling
Tuning recipes (Optuna)
Production gotchas (feature drift, label leakage, monitoring)

Link: https://leetcode.com/problems/binary-tree-maximum-path-sum/
Difficulty: Hard
Why this problem: DFS returning max gain ending at node; track global best across left+node+right.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain the boosting loop in 5 lines of pseudocode.
Derive that the negative gradient of MSE is the residual.
Pick a loss for: regression, binary classification, quantile prediction.
Tune learning_rate, n_estimators, max_depth with early stopping.
Read a feature-importance plot (gain) and call out top drivers.

