Search Tech Journey

Find topics, journeys and posts

back to blog
ai mlbeginner 6m2025-03-03

Stanford CS229: Machine Learning Course

CS229 provides a broad introduction to statistical machine learning (at an intermediate / advanced level) and covers supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs, practical ); and reinforcement learning among other topics

Documenting the Stanford CS229 ML Course and gathering insights

  • Although I have an informal understanding of ML and deep learning, I decided to begin this in-depth course on a friend's recommendation and am documenting my progress as part of the "Build in Public" movement.

Lecture 1 - Introduction and Linear Algebra

Course lecture : Lecture - 1

  • Slides : Slides
  • Linear Algebra Reading : pdf

Background & Prerequisites — What You Need to Know Before Completing This Blog

CS229 is an intermediate/advanced course. Before blogging through each lecture, you need a strong foundation in the following areas. Each section explains the topic, why it matters for CS229, and the depth required.


1. Linear Algebra (Critical)

Why: ML algorithms are fundamentally linear algebra operations. Gradient descent, PCA, SVD, neural networks — all are matrix/vector operations.

  • Vectors — Column vectors, row vectors, norms (L1, L2, L∞). Geometric interpretation: direction + magnitude.
  • Matrices — Addition, multiplication (row-by-column), transpose, trace. Understand matrix multiplication as a composition of linear transformations.
  • Systems of linear equationsAx=bAx = b. Solving via Gaussian elimination, LU decomposition. When solutions exist (rank conditions).
  • Eigenvalues & EigenvectorsAv=λvAv = \lambda v. Eigendecomposition, spectral theorem for symmetric matrices. Used in PCA, covariance analysis, stability analysis.
  • Singular Value Decomposition (SVD)A=UΣVTA = U\Sigma V^T. Fundamental for dimensionality reduction, matrix approximation, and understanding matrix rank.
  • Positive definite matricesxTAx>0x^T A x > 0 for all nonzero xx. Guarantees convexity of quadratic forms. Hessian positive definite → local minimum.
  • Matrix calculus — Gradient of a scalar w.r.t. a vector (xf\nabla_x f), Jacobian (vector-to-vector), Hessian (second derivatives). Chain rule for matrix operations.

2. Probability & Statistics (Critical)

Why: ML is applied statistics. Every model makes probabilistic assumptions about data.

  • Probability axioms — Sample space, events, probability measure. Conditional probability, Bayes' theorem: P(AB)={P(BA)P(A)}{P(B)}P(A|B) = \frac\{P(B|A)P(A)\}\{P(B)\}.
  • Random variables — Discrete (PMF) and continuous (PDF). CDF. Expected value E[X]E[X], variance {Var}(X)=E[X2](E[X])2\text\{Var\}(X) = E[X^2] - (E[X])^2.
  • Common distributions — Bernoulli, Binomial, Poisson (discrete). Gaussian/Normal, Exponential, Uniform (continuous). Multivariate Gaussian {N}(μ,Σ)\mathcal\{N\}(\mu, \Sigma).
  • Maximum Likelihood Estimation (MLE) — Given data, find parameters θ\theta that maximize P({data}θ)P(\text\{data\}|\theta). Log-likelihood trick. Applies to linear regression, logistic regression, Gaussian models.
  • Maximum A Posteriori (MAP) — MLE with a prior: maximize P(θ{data})P({data}θ)P(θ)P(\theta|\text\{data\}) \propto P(\text\{data\}|\theta)P(\theta). Equivalent to regularized MLE.
  • Information theory basics — Entropy H(X)=p(x)logp(x)H(X) = -\sum p(x)\log p(x), KL divergence, cross-entropy (used as loss functions).

3. Multivariable Calculus (Critical)

Why: Optimization (gradient descent) requires computing and understanding gradients.

  • Partial derivatives — Rate of change of a function w.r.t. one variable, holding others constant.
  • Gradient — Vector of all partial derivatives: f=[{f}{x1},,{f}{xn}]T\nabla f = [\frac\{\partial f\}\{\partial x_1\}, \ldots, \frac\{\partial f\}\{\partial x_n\}]^T. Points in the direction of steepest ascent.
  • Chain rule — For composite functions: {}{x}f(g(x))=f(g(x))g(x)\frac\{\partial\}\{\partial x\} f(g(x)) = f'(g(x)) \cdot g'(x). Extends to multivariate (Jacobian chain rule). Foundation of backpropagation.
  • Hessian matrix — Matrix of second partial derivatives. Determines curvature. Positive definite Hessian → convex function → gradient descent converges.
  • Taylor expansion — First-order: linear approximation (used in gradient descent). Second-order: quadratic approximation (used in Newton's method).

4. Optimization Theory (Important)

Why: Training any ML model is an optimization problem — minimizing a loss function.

  • Convex functions — A function ff is convex if f(αx+(1α)y)αf(x)+(1α)f(y)f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y). Convex problems have a single global minimum. Linear regression loss is convex; neural network loss is not.
  • Gradient descent — Update rule: θ:=θαθJ(θ)\theta := \theta - \alpha \nabla_\theta J(\theta). Learning rate α\alpha controls step size. Too large → diverge, too small → slow.
  • Stochastic Gradient Descent (SGD) — Use a random subset (mini-batch) of data for each gradient update. Faster per iteration, noisier but often converges to good solutions.
  • Newton's method — Second-order: θ:=θH{1}J\theta := \theta - H^\{-1\}\nabla J. Converges faster (quadratically) but expensive to compute/invert the Hessian.
  • Lagrange multipliers — For constrained optimization. Used in deriving SVMs.

5. CS229 Course Structure — Lecture Roadmap

Why: To plan which lectures to blog and in what order.

  • Supervised Learning — Linear regression, logistic regression, GDA (Generative Discriminative Analysis), Naive Bayes, SVMs, neural networks, decision trees, ensemble methods.
  • Unsupervised Learning — K-means clustering, EM algorithm, PCA, ICA, factor analysis.
  • Learning Theory — Bias/variance tradeoff, VC dimension, PAC learning, regularization, model selection.
  • Reinforcement Learning — MDPs, value iteration, policy iteration, Q-learning, policy gradient.
  • Practical Advice — Debugging ML algorithms, error analysis, ceiling analysis.

6. Python & NumPy (Practical)

Why: Implementing algorithms from the course requires efficient numerical computing.

  • NumPy arrays — Vectorized operations, broadcasting, slicing. Avoid Python loops for numerical work.
  • Matrix operationsnp.dot(), np.linalg.inv(), np.linalg.eig(), np.linalg.svd().
  • Matplotlib — Plotting decision boundaries, loss curves, data distributions.
  • Jupyter notebooks — Interactive development for experimentation and documentation.

TODO / Remaining Work

  • Complete Lecture 1 notes: Linear algebra review with worked examples
  • Blog Lecture 2: Linear regression, normal equations, gradient descent derivation
  • Blog Lecture 3: Locally weighted regression, logistic regression
  • Blog Lecture 4: Newton's method, perceptron, exponential family
  • Blog Lecture 5: Generative learning algorithms (GDA, Naive Bayes)
  • Blog Lecture 6-7: Support Vector Machines (kernels, SMO algorithm)
  • Blog Lecture 8-9: Neural networks and backpropagation
  • Blog Lecture 10: Bias/variance, regularization, model selection
  • Blog Lecture 11-12: Unsupervised learning (K-means, EM, PCA)
  • Blog Lecture 13-14: Reinforcement learning (MDPs, Q-learning)
  • Add Python implementations for each lecture's key algorithm

Lecture 1 — Summary & Key Takeaways

The Three Types of Machine Learning (Andrew Ng's Framing)

  1. Supervised Learning — given labelled pairs (x{(i)},y{(i)})(x^\{(i)\}, y^\{(i)\}), learn h:XYh: X \to Y.
    • Regression: y{R}y \in \mathbb\{R\} (house prices, watch time).
    • Classification: y{0,1,,K}y \in \{0, 1, \ldots, K\} (spam/not-spam, digit recognition).
  2. Unsupervised Learning — only x{(i)}x^\{(i)\}, no labels. Find structure: clusters, lower-dim representations, density.
  3. Reinforcement Learning — an agent interacts with an environment, gets rewards, learns a policy π(as)\pi(a|s) to maximise long-run reward.

Core Notation Used Throughout the Course

SymbolMeaning
mmNumber of training examples
nnNumber of input features (dimension of xx)
x{(i)}{R}nx^\{(i)\} \in \mathbb\{R\}^nThe ii-th training input
y{(i)}y^\{(i)\}The ii-th training label
X{R}{m×n}X \in \mathbb\{R\}^\{m \times n\}Design matrix (rows = examples)
θ{R}n\theta \in \mathbb\{R\}^nModel parameters
hθ(x)h_\theta(x)Hypothesis function, e.g., θTx\theta^T x
J(θ)J(\theta)Cost / loss function

Linear Algebra Essentials for the Rest of the Course

Lecture 1's linear-algebra review exists because almost every algorithm in CS229 can be written as matrix/vector operations:

  • Vector inner product: uTv=iuiviu^T v = \sum_i u_i v_i — used for every prediction hθ(x)=θTxh_\theta(x) = \theta^T x.
  • Matrix-vector product: (Ax)i=jA{ij}xj(Ax)_i = \sum_j A_\{ij\} x_j — how linear layers of a neural net work.
  • Gradient: θJ{R}n\nabla_\theta J \in \mathbb\{R\}^n, the vector of partial derivatives. Gradient descent steps in θJ-\nabla_\theta J.
  • Hessian: H{ij}=2J/θiθjH_\{ij\} = \partial^2 J / \partial \theta_i \partial \theta_j. Newton's method uses θθH{1}J\theta \leftarrow \theta - H^\{-1\} \nabla J.
  • Positive semi-definiteness: xTAx0x^T A x \geq 0 for all xx. Guarantees convex JJ when JJ is quadratic with Hessian AA.

My Plan for Blogging the Rest of the Course

I'm following the official lecture order (linked per-lecture above). For each lecture I commit to producing:

  1. A one-page summary — key concepts, definitions, and intuitions.
  2. The math derivation of the central algorithm (not a copy of the slides; re-derived in my own notation).
  3. A Python/NumPy implementation of the algorithm from scratch — no scikit-learn except for sanity-checking.
  4. A worked example on a real dataset (UCI, Kaggle, or a small dataset I generate).
  5. "Where I got stuck" — the specific concept that required re-watching or extra reading, and how I unstuck myself.

Build-in-public reality: the value is not in re-explaining Andrew Ng — the value is in the friction points I hit and how a practising engineer (not a PhD student) works through them.

When every lecture has a published post with all five artefacts above, flip this post to status: published.

  • Create summary comparison table of all algorithms covered