Stanford CS229: Machine Learning Course
CS229 provides a broad introduction to statistical machine learning (at an intermediate / advanced level) and covers supervised learning (generative/discriminative learning, parametric/non-parametric learning, neural networks, support vector machines); unsupervised learning (clustering, dimensionality reduction, kernel methods); learning theory (bias/variance tradeoffs, practical ); and reinforcement learning among other topics
Documenting the Stanford CS229 ML Course and gathering insights
- Although I have an informal understanding of ML and deep learning, I decided to begin this in-depth course on a friend's recommendation and am documenting my progress as part of the "Build in Public" movement.
Lecture 1 - Introduction and Linear Algebra
Course lecture : Lecture - 1
Background & Prerequisites — What You Need to Know Before Completing This Blog
CS229 is an intermediate/advanced course. Before blogging through each lecture, you need a strong foundation in the following areas. Each section explains the topic, why it matters for CS229, and the depth required.
1. Linear Algebra (Critical)
Why: ML algorithms are fundamentally linear algebra operations. Gradient descent, PCA, SVD, neural networks — all are matrix/vector operations.
- Vectors — Column vectors, row vectors, norms (L1, L2, L∞). Geometric interpretation: direction + magnitude.
- Matrices — Addition, multiplication (row-by-column), transpose, trace. Understand matrix multiplication as a composition of linear transformations.
- Systems of linear equations — . Solving via Gaussian elimination, LU decomposition. When solutions exist (rank conditions).
- Eigenvalues & Eigenvectors — . Eigendecomposition, spectral theorem for symmetric matrices. Used in PCA, covariance analysis, stability analysis.
- Singular Value Decomposition (SVD) — . Fundamental for dimensionality reduction, matrix approximation, and understanding matrix rank.
- Positive definite matrices — for all nonzero . Guarantees convexity of quadratic forms. Hessian positive definite → local minimum.
- Matrix calculus — Gradient of a scalar w.r.t. a vector (), Jacobian (vector-to-vector), Hessian (second derivatives). Chain rule for matrix operations.
2. Probability & Statistics (Critical)
Why: ML is applied statistics. Every model makes probabilistic assumptions about data.
- Probability axioms — Sample space, events, probability measure. Conditional probability, Bayes' theorem: .
- Random variables — Discrete (PMF) and continuous (PDF). CDF. Expected value , variance .
- Common distributions — Bernoulli, Binomial, Poisson (discrete). Gaussian/Normal, Exponential, Uniform (continuous). Multivariate Gaussian .
- Maximum Likelihood Estimation (MLE) — Given data, find parameters that maximize . Log-likelihood trick. Applies to linear regression, logistic regression, Gaussian models.
- Maximum A Posteriori (MAP) — MLE with a prior: maximize . Equivalent to regularized MLE.
- Information theory basics — Entropy , KL divergence, cross-entropy (used as loss functions).
3. Multivariable Calculus (Critical)
Why: Optimization (gradient descent) requires computing and understanding gradients.
- Partial derivatives — Rate of change of a function w.r.t. one variable, holding others constant.
- Gradient — Vector of all partial derivatives: . Points in the direction of steepest ascent.
- Chain rule — For composite functions: . Extends to multivariate (Jacobian chain rule). Foundation of backpropagation.
- Hessian matrix — Matrix of second partial derivatives. Determines curvature. Positive definite Hessian → convex function → gradient descent converges.
- Taylor expansion — First-order: linear approximation (used in gradient descent). Second-order: quadratic approximation (used in Newton's method).
4. Optimization Theory (Important)
Why: Training any ML model is an optimization problem — minimizing a loss function.
- Convex functions — A function is convex if . Convex problems have a single global minimum. Linear regression loss is convex; neural network loss is not.
- Gradient descent — Update rule: . Learning rate controls step size. Too large → diverge, too small → slow.
- Stochastic Gradient Descent (SGD) — Use a random subset (mini-batch) of data for each gradient update. Faster per iteration, noisier but often converges to good solutions.
- Newton's method — Second-order: . Converges faster (quadratically) but expensive to compute/invert the Hessian.
- Lagrange multipliers — For constrained optimization. Used in deriving SVMs.
5. CS229 Course Structure — Lecture Roadmap
Why: To plan which lectures to blog and in what order.
- Supervised Learning — Linear regression, logistic regression, GDA (Generative Discriminative Analysis), Naive Bayes, SVMs, neural networks, decision trees, ensemble methods.
- Unsupervised Learning — K-means clustering, EM algorithm, PCA, ICA, factor analysis.
- Learning Theory — Bias/variance tradeoff, VC dimension, PAC learning, regularization, model selection.
- Reinforcement Learning — MDPs, value iteration, policy iteration, Q-learning, policy gradient.
- Practical Advice — Debugging ML algorithms, error analysis, ceiling analysis.
6. Python & NumPy (Practical)
Why: Implementing algorithms from the course requires efficient numerical computing.
- NumPy arrays — Vectorized operations, broadcasting, slicing. Avoid Python loops for numerical work.
- Matrix operations —
np.dot(),np.linalg.inv(),np.linalg.eig(),np.linalg.svd(). - Matplotlib — Plotting decision boundaries, loss curves, data distributions.
- Jupyter notebooks — Interactive development for experimentation and documentation.
TODO / Remaining Work
- Complete Lecture 1 notes: Linear algebra review with worked examples
- Blog Lecture 2: Linear regression, normal equations, gradient descent derivation
- Blog Lecture 3: Locally weighted regression, logistic regression
- Blog Lecture 4: Newton's method, perceptron, exponential family
- Blog Lecture 5: Generative learning algorithms (GDA, Naive Bayes)
- Blog Lecture 6-7: Support Vector Machines (kernels, SMO algorithm)
- Blog Lecture 8-9: Neural networks and backpropagation
- Blog Lecture 10: Bias/variance, regularization, model selection
- Blog Lecture 11-12: Unsupervised learning (K-means, EM, PCA)
- Blog Lecture 13-14: Reinforcement learning (MDPs, Q-learning)
- Add Python implementations for each lecture's key algorithm
Lecture 1 — Summary & Key Takeaways
The Three Types of Machine Learning (Andrew Ng's Framing)
- Supervised Learning — given labelled pairs , learn .
- Regression: (house prices, watch time).
- Classification: (spam/not-spam, digit recognition).
- Unsupervised Learning — only , no labels. Find structure: clusters, lower-dim representations, density.
- Reinforcement Learning — an agent interacts with an environment, gets rewards, learns a policy to maximise long-run reward.
Core Notation Used Throughout the Course
| Symbol | Meaning |
|---|---|
| Number of training examples | |
| Number of input features (dimension of ) | |
| The -th training input | |
| The -th training label | |
| Design matrix (rows = examples) | |
| Model parameters | |
| Hypothesis function, e.g., | |
| Cost / loss function |
Linear Algebra Essentials for the Rest of the Course
Lecture 1's linear-algebra review exists because almost every algorithm in CS229 can be written as matrix/vector operations:
- Vector inner product: — used for every prediction .
- Matrix-vector product: — how linear layers of a neural net work.
- Gradient: , the vector of partial derivatives. Gradient descent steps in .
- Hessian: . Newton's method uses .
- Positive semi-definiteness: for all . Guarantees convex when is quadratic with Hessian .
My Plan for Blogging the Rest of the Course
I'm following the official lecture order (linked per-lecture above). For each lecture I commit to producing:
- A one-page summary — key concepts, definitions, and intuitions.
- The math derivation of the central algorithm (not a copy of the slides; re-derived in my own notation).
- A Python/NumPy implementation of the algorithm from scratch — no scikit-learn except for sanity-checking.
- A worked example on a real dataset (UCI, Kaggle, or a small dataset I generate).
- "Where I got stuck" — the specific concept that required re-watching or extra reading, and how I unstuck myself.
Build-in-public reality: the value is not in re-explaining Andrew Ng — the value is in the friction points I hit and how a practising engineer (not a PhD student) works through them.
When every lecture has a published post with all five artefacts above, flip this post to status: published.
- Create summary comparison table of all algorithms covered