ai ml · intermediate · 12m · 2026-06-09

Gradient Boosted Trees Part 1 — Boosting Intuition, Trees, Loss

Session 5 of the 48-session learning series.

Date: Sat, 2026-06-13 · Time: 14:30–16:30 IST · Track: 📈 Machine Learning (ML) · Parent 28-day topic: Day 03 · Est. read: 2 h

Why this session matters

This is Session 05 of 48 in the Machine Learning track. Each session covers one focused topic, paced so you have time to absorb it rather than rush.

Agenda

  • Why gradient boosting still beats neural nets on tabular data
  • The boosting idea — fit a tree to the residual, repeat, sum the trees
  • Trees as base learners — splits, gain, depth, leaf values
  • Loss functions — MSE for regression, log-loss for classification
  • What goes wrong without regularisation (and what Part 2 will fix)

Deep dive

1. Why GBDT still wins on tabular data

In 2024–2026, transformer-everything is the headline. But:

  • For tabular data (features, joins, categoricals), XGBoost / LightGBM / CatBoost still match or beat deep nets in most Kaggle tabular competitions and in most enterprise ML jobs.
  • They handle missing values, mixed types, and skewed targets without much preprocessing.
  • They train on a CPU in minutes for tens of millions of rows.
  • They give you feature importance and SHAP explanations for free.

If your ML role touches click prediction, ranking, churn, risk scoring, demand forecasting — GBDT is the workhorse. Knowing it cold is non-negotiable.

2. The boosting idea in one line

Fit a tree to the residuals (errors) of the previous model. Add it. Repeat M times. The final prediction is the sum.

F_0(x) = mean(y)                       # initial guess
for m in 1..M:
    r_m = y - F_{m-1}(x)               # residuals
    h_m = tree.fit(x, r_m)             # tree on residuals
    F_m(x) = F_{m-1}(x) + lr * h_m(x)  # add (with learning rate)

Each tree is a correction. Many small corrections (M = 500–2000) work better than a few big ones.
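The loop above can be sketched in runnable form. This is a minimal illustration, assuming a single numeric feature and depth-1 trees (stumps) as base learners; the helper names `fit_stump` and `boost` and the synthetic sine data are made up for this example and are not from any library.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a depth-1 regression tree (one split) to targets r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:            # skip the max so the right side is never empty
        left = x <= t
        v_left, v_right = r[left].mean(), r[~left].mean()
        sse = np.sum((r - np.where(left, v_left, v_right)) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, v_left, v_right)
    _, t, v_left, v_right = best
    return lambda z: np.where(z <= t, v_left, v_right)

def boost(x, y, n_trees=200, lr=0.1):
    """Gradient boosting for squared loss: each stump is fitted to the current residuals."""
    f0 = y.mean()                          # F_0: initial guess
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        residual = y - pred                # r_m = y - F_{m-1}(x)
        h = fit_stump(x, residual)         # h_m: tree on residuals
        pred = pred + lr * h(x)            # F_m = F_{m-1} + lr * h_m
        trees.append(h)
    return lambda z: f0 + lr * sum(h(z) for h in trees)

# Synthetic data (an assumption for the demo): a noisy sine wave.
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(0, 0.1, size=200)

model = boost(x, y)
baseline_mse = np.mean((y - y.mean()) ** 2)   # error of the initial guess alone
boosted_mse = np.mean((y - model(x)) ** 2)    # training error after 200 corrections
print(baseline_mse, boosted_mse)
```

Both numbers are training errors, so the drop only shows that the corrections accumulate; it says nothing about generalisation.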

3. Why this works — the gradient-descent view

For squared loss L(y, F) = (y - F)²/2, the negative gradient is exactly the residual (y - F). So fitting a tree to the residual IS approximating one step of gradient descent in function space.
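Written out, the derivation is a single differentiation with respect to the current prediction F:

```latex
\[
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
\frac{\partial L}{\partial F} = -(y - F)
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F .
\]
```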

For other losses (log-loss for classification, Huber for robust regression, quantile loss for prediction intervals), you fit a tree to the pseudo-residual = negative gradient of the loss w.r.t. the current prediction. Same algorithm, different "residual".
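As a rough sketch of "same algorithm, different residual", the function below returns the pseudo-residual for three of the losses mentioned. The function name, the `delta` and `alpha` defaults, and the sample numbers are illustrative assumptions, not part of any library API.

```python
import numpy as np

def pseudo_residual(y, f, loss, delta=1.0, alpha=0.9):
    """Negative gradient of the loss with respect to the current prediction f."""
    r = y - f
    if loss == "mse":        # L = (y - f)^2 / 2
        return r
    if loss == "huber":      # quadratic near zero, linear beyond delta
        return np.clip(r, -delta, delta)
    if loss == "quantile":   # pinball loss at quantile alpha
        return np.where(r > 0, alpha, alpha - 1.0)
    raise ValueError(f"unknown loss: {loss}")

y = np.array([3.0, 0.0, 10.0])
f = np.array([2.5, 2.0, 4.0])

res_mse = pseudo_residual(y, f, "mse")            # plain residuals
res_huber = pseudo_residual(y, f, "huber")        # large residuals clipped to +/- delta
res_quantile = pseudo_residual(y, f, "quantile")  # only the sign of the residual matters
```

Note how the outlier (residual 6.0) dominates under MSE but is capped under Huber, which is why Huber is the robust choice.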

This is the Friedman 2001 insight: boosting is gradient descent where each "step" is a small tree.

4. The base learner — what makes a single tree

A regression tree recursively splits the feature space:

              x_3 < 4.5?
              /        \
           yes          no
           /              \
   x_1 < 2.0?       x_5 in {A, B}?
    /       \          /        \
   …         …       leaf      leaf
                    (val=2.1) (val=-0.7)

For each candidate split (feature × threshold), compute the gain, where w_L and w_R are the fractions of the parent's rows sent to the left and right child:

gain = MSE(parent) - [w_L · MSE(left) + w_R · MSE(right)]

Choose the split with max gain. Recurse until depth limit or min-gain threshold. Leaf value = mean of y at that leaf (for MSE loss).
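A minimal sketch of that split search on a single numeric feature, assuming midpoints between sorted unique values as candidate thresholds (real libraries use faster, smarter schemes). The helper names are made up for this example.

```python
import numpy as np

def mse(y):
    """Mean squared error of predicting the mean of y."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, gain) of the best 'x <= threshold' split on one feature."""
    parent = mse(y)
    best_t, best_gain = None, 0.0
    values = np.unique(x)
    for lo, hi in zip(values[:-1], values[1:]):
        t = (lo + hi) / 2.0                                  # midpoint between neighbours
        left, right = y[x <= t], y[x > t]
        w_l, w_r = len(left) / len(y), len(right) / len(y)   # row fractions
        gain = parent - (w_l * mse(left) + w_r * mse(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.0, 5.0, 20.0, 20.0, 20.0])
threshold, gain = best_split(x, y)
print(threshold, gain)   # 6.5 56.25
```

Here the split at 6.5 separates the two groups perfectly, so both children have zero MSE and the gain equals the parent's MSE.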

XGBoost/LightGBM use a smarter gain that includes regularisation; that is covered in Session 14.

5. Loss functions you'll actually use

Task                   | Loss                                                  | Pseudo-residual
Regression             | MSE: ½(y − F)²                                        | y − F
Binary classification  | Log-loss: −y·log(p) − (1−y)·log(1−p), with p = σ(F)   | y − p
Robust regression      | Huber                                                 | clipped residual
Quantile (e.g. P90)    | Pinball                                               | α if y > F else α − 1
Ranking (LambdaMART)   | Pairwise / listwise                                   | gradient ∝ swap utility

For classification, you typically operate on logits (raw F), and the link function σ(F) gives probabilities. Same gradient-descent shell.
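The claim that the log-loss pseudo-residual is y − p can be checked numerically. The sketch below compares it against a finite-difference gradient taken with respect to the logit F; the function names and the sample values are assumptions chosen for illustration.

```python
import math

def sigmoid(f):
    """Link function: logit -> probability."""
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    """Binary log-loss evaluated on the raw logit f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual(y, f):
    """Negative gradient of log-loss w.r.t. the logit: y - p."""
    return y - sigmoid(f)

y, f, eps = 1.0, 0.3, 1e-6
# Central finite difference of the loss, negated to get the descent direction.
numeric = -(log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)
analytic = pseudo_residual(y, f)
print(numeric, analytic)   # the two values agree to several decimal places
```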

6. Learning rate, number of trees, depth — the three knobs

Knob           | Typical range | Effect
learning_rate  | 0.01 – 0.3    | Smaller = more trees needed, less overfit, more robust.
n_estimators   | 100 – 5000    | Higher with smaller LR. Use early stopping.
max_depth      | 3 – 8         | Deeper trees overfit faster; XGBoost prefers shallow.

Standard recipe:

  • lr = 0.05, n_estimators = 2000, max_depth = 6, early stopping with 50 rounds.
  • Let early stopping pick the actual tree count.
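To make "let early stopping pick the tree count" concrete, here is a simplified sketch of the patience rule applied to a list of per-round validation losses. It approximates the idea rather than reproducing any library's exact implementation; `best_iteration` and the loss values are hypothetical.

```python
def best_iteration(val_losses, patience=50):
    """Index of the best validation loss, stopping once `patience`
    consecutive rounds pass without a new best."""
    best_idx, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = i, loss
        elif i - best_idx >= patience:
            break                      # no improvement for `patience` rounds
    return best_idx

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.69, 0.75, 0.76, 0.77, 0.60]
stopped = best_iteration(losses, patience=3)    # stops before seeing the final 0.60
patient = best_iteration(losses, patience=10)   # sees every round
print(stopped, patient)   # 5 9
```

The example also shows the trade-off: too little patience can stop before a later improvement, which is one reason a value like 50 rounds is paired with a small learning rate.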

7. A worked example — predict house prices in 30 lines

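import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Load the data and separate the target from the features.
df = pd.read_csv("ames-housing.csv")
y = df.pop("SalePrice")
# One-hot encode categoricals; numeric columns pass through unchanged.
X = pd.get_dummies(df, drop_first=True)

# Hold out 20% of the rows as the validation set used for early stopping.
X_tr, X_v, y_tr, y_v = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=2000,          # upper bound; early stopping picks the real count
    learning_rate=0.05,
    max_depth=5,
    subsample=0.9,              # row sampling per tree
    colsample_bytree=0.9,       # feature sampling per tree
    eval_metric="rmse",
    early_stopping_rounds=50,   # constructor argument in xgboost >= 1.6
)
model.fit(X_tr, y_tr, eval_set=[(X_v, y_v)], verbose=False)

# Report the round with the lowest validation RMSE.
print("best iter:", model.best_iteration)
print("RMSE:", model.evals_result()["validation_0"]["rmse"][model.best_iteration])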

That's a baseline that's very hard to beat without per-feature engineering.

8. The intuition for feature importance

GBDT gives you three importance metrics:

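  1. Gain: total reduction in loss across all splits using this feature. The most useful one for "which features moved the needle?" (XGBoost exposes the total as total_gain; its plain gain is the per-split average.)
  2. Cover: number of training rows that fell into splits using this feature. (XGBoost weights rows by their Hessian, which reduces to a plain row count for squared loss.)
  3. Weight: count of splits using this feature. Misleading on high-cardinality categoricals.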

For deeper explanation use SHAP (Session 14 will mention; full SHAP is its own rabbit hole). For a first pass, gain is fine.
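A toy tally shows how the three metrics can disagree. The split records below are entirely hypothetical (feature name, loss reduction, rows reaching the split); a real model would supply these from its trees.

```python
from collections import defaultdict

# Hypothetical split records: (feature, loss reduction, rows reaching the split).
splits = [
    ("area", 40.0, 1000),
    ("area", 10.0, 400),
    ("quality", 30.0, 600),
    ("zip_code", 1.0, 50),
    ("zip_code", 0.5, 30),
    ("zip_code", 0.5, 20),
]

gain, cover, weight = defaultdict(float), defaultdict(int), defaultdict(int)
for feature, loss_reduction, n_rows in splits:
    gain[feature] += loss_reduction   # total loss reduction
    cover[feature] += n_rows          # rows passing through the feature's splits
    weight[feature] += 1              # number of splits

by_gain = sorted(gain, key=gain.get, reverse=True)
by_weight = sorted(weight, key=weight.get, reverse=True)
print(by_gain)     # ['area', 'quality', 'zip_code']
print(by_weight)   # ['zip_code', 'area', 'quality']
```

The made-up `zip_code` feature is split on most often yet contributes the least gain, which is the sense in which weight can mislead.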

9. What goes wrong without regularisation (preview of Part 2)

A deep tree on noisy data memorises. Combine 500 deep trees and you can get a model that scores something like 0.01 RMSE on train and 5.0 on test. The classic fix list, covered in detail in Part 2 (Session 14):

  • Shrinkage (= learning rate < 1)
  • Subsample rows per tree (stochastic gradient boosting)
  • Colsample features per tree
  • L1 / L2 leaf penalties (XGBoost's reg_alpha, reg_lambda)
  • Min child weight / min samples leaf
  • Early stopping on a held-out set
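As a small preview of one item on that list, the sketch below shows how an L2 leaf penalty shrinks leaf values under squared loss. It assumes the XGBoost-style optimal leaf weight with the L1 term set to zero, which for squared loss reduces to sum(residuals) / (n + lambda). The function name and numbers are illustrative.

```python
def leaf_value(residuals, reg_lambda=0.0):
    """Leaf value under squared loss with an L2 penalty on the leaf weight.
    With reg_lambda = 0 this is just the mean residual."""
    return sum(residuals) / (len(residuals) + reg_lambda)

noisy_leaf = [4.0]        # a leaf that isolated one noisy row
big_leaf = [1.0] * 100    # a leaf backed by plenty of data

plain_small = leaf_value(noisy_leaf)                   # 4.0
shrunk_small = leaf_value(noisy_leaf, reg_lambda=1.0)  # 2.0
shrunk_big = leaf_value(big_leaf, reg_lambda=1.0)      # about 0.99
```

The penalty halves the single-row leaf but barely touches the well-supported one, so it targets exactly the leaves most likely to be memorising noise.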

10. What's next (Session 14 — GBDT Part 2)

  • XGBoost — the second-order Newton step, sparsity-aware splits, regularisation in the gain function
  • LightGBM — histogram-based splits, leaf-wise growth, GOSS, EFB
  • CatBoost — ordered boosting, native categorical handling
  • Tuning recipes (Optuna)
  • Production gotchas (feature drift, label leakage, monitoring)

Video reference

▶︎ StatQuest — Gradient Boost Part 1: Regression Main Ideas

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Binary Tree Maximum Path Sum

Post-session checklist

By the end of this session you should be able to:

  • Explain the boosting loop in 5 lines of pseudocode.
  • Derive that the negative gradient of MSE is the residual.
  • Pick a loss for: regression, binary classification, quantile prediction.
  • Tune learning_rate, n_estimators, max_depth with early stopping.
  • Read a feature-importance plot (gain) and call out top drivers.
