Documenting the Stanford CS229 ML Course and gathering insights
- Although I have an informal understanding of ML and deep learning, I decided to work through this in-depth course on a friend's recommendation, and I am documenting my progress as part of the "Build in Public" movement.
Lecture 1 - Introduction and Linear Algebra
Course lecture: Lecture 1
Background & Prerequisites — What You Need to Know Before Reading This Blog
CS229 is an intermediate/advanced course. Before working through each lecture, you need a strong foundation in the following areas. Each section explains the topic, why it matters for CS229, and the depth required.
1. Linear Algebra (Critical)
Why: ML algorithms are fundamentally linear algebra operations. Gradient descent, PCA, SVD, neural networks — all are matrix/vector operations.
- Vectors — Column vectors, row vectors, norms (L1, L2, L∞). Geometric interpretation: direction + magnitude.
- Matrices — Addition, multiplication (row-by-column), transpose, trace. Understand matrix multiplication as a composition of linear transformations.
- Systems of linear equations — $Ax = b$. Solving via Gaussian elimination, LU decomposition. When solutions exist (rank conditions).
- Eigenvalues & Eigenvectors — $Av = \lambda v$. Eigendecomposition, spectral theorem for symmetric matrices. Used in PCA, covariance analysis, stability analysis.
- Singular Value Decomposition (SVD) — $A = U\Sigma V^T$. Fundamental for dimensionality reduction, matrix approximation, and understanding matrix rank.
- Positive definite matrices — $x^T A x > 0$ for all nonzero $x$. Guarantees convexity of quadratic forms. Hessian positive definite → local minimum.
- Matrix calculus — Gradient of a scalar w.r.t. a vector ($\nabla_x f$), Jacobian (vector-to-vector), Hessian (second derivatives). Chain rule for matrix operations.
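To make these ideas concrete, here is a minimal NumPy sketch (my own illustration, not from the lecture) tying together eigendecomposition, SVD, and the positive-definiteness test on one small symmetric matrix:

```python
import numpy as np

# A small symmetric matrix: by the spectral theorem it has real
# eigenvalues and orthonormal eigenvectors.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigendecomposition: A v = lambda v (eigh is for symmetric matrices,
# and returns eigenvalues in ascending order)
eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals)  # [1. 3.]

# Reconstruct A from its eigendecomposition: A = Q diag(lambda) Q^T
A_rebuilt = eigvecs @ np.diag(eigvals) @ eigvecs.T
assert np.allclose(A, A_rebuilt)

# SVD: A = U Sigma V^T. For a symmetric positive definite matrix,
# the singular values coincide with the eigenvalues.
U, S, Vt = np.linalg.svd(A)
assert np.allclose(U @ np.diag(S) @ Vt, A)

# Positive definiteness check: x^T A x > 0 for a nonzero x
x = np.array([1.0, -1.0])
print(x @ A @ x)  # 2.0 > 0
```

Since both eigenvalues are positive, this $A$ is positive definite, which is exactly the condition that makes quadratic forms like $x^T A x$ convex.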
2. Probability & Statistics (Critical)
Why: ML is applied statistics. Every model makes probabilistic assumptions about data.
- Probability axioms — Sample space, events, probability measure. Conditional probability, Bayes' theorem: $P(A|B) = \frac{P(B|A)P(A)}{P(B)}$.
- Random variables — Discrete (PMF) and continuous (PDF). CDF. Expected value $E[X]$, variance $\text{Var}(X) = E[X^2] - (E[X])^2$.
- Common distributions — Bernoulli, Binomial, Poisson (discrete). Gaussian/Normal, Exponential, Uniform (continuous). Multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$.
- Maximum Likelihood Estimation (MLE) — Given data, find parameters $\theta$ that maximize $P(\text{data}|\theta)$. Log-likelihood trick. Applies to linear regression, logistic regression, Gaussian models.
- Maximum A Posteriori (MAP) — MLE with a prior: maximize $P(\theta|\text{data}) \propto P(\text{data}|\theta)P(\theta)$. Equivalent to regularized MLE.
- Information theory basics — Entropy $H(X) = -\sum p(x)\log p(x)$, KL divergence, cross-entropy (used as loss functions).
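A quick MLE sanity check (my own sketch, assuming Gaussian data): for a Gaussian, maximizing the log-likelihood has a closed-form answer, namely the sample mean and the 1/n sample variance, which we can verify by estimating from simulated data:

```python
import numpy as np

# Simulate data from a known Gaussian N(mu=5, sigma=2)
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=100_000)

# Gaussian MLE has closed forms (derived by setting the gradient of
# the log-likelihood to zero):
#   mu_hat     = sample mean
#   sigma2_hat = (1/n) * sum((x - mu_hat)^2)   <- note 1/n, not 1/(n-1)
mu_hat = data.mean()
sigma2_hat = ((data - mu_hat) ** 2).mean()

print(mu_hat)      # close to 5.0
print(sigma2_hat)  # close to 4.0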
3. Multivariable Calculus (Critical)
Why: Optimization (gradient descent) requires computing and understanding gradients.
- Partial derivatives — Rate of change of a function w.r.t. one variable, holding others constant.
- Gradient — Vector of all partial derivatives: $\nabla f = [\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}]^T$. Points in the direction of steepest ascent.
- Chain rule — For composite functions: $\frac{\partial}{\partial x} f(g(x)) = f'(g(x)) \cdot g'(x)$. Extends to multivariate (Jacobian chain rule). Foundation of backpropagation.
- Hessian matrix — Matrix of second partial derivatives. Determines curvature. Positive definite Hessian → convex function → gradient descent converges.
- Taylor expansion — First-order: linear approximation (used in gradient descent). Second-order: quadratic approximation (used in Newton's method).
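A useful habit when working with gradients is the finite-difference check: compare an analytic gradient against a numerical approximation. Here is a small sketch of my own (the function $f$ is an arbitrary example, not from the course):

```python
import numpy as np

def f(x):
    # f(x) = x0^2 + 3*x0*x1 + 2*x1^2
    return x[0]**2 + 3*x[0]*x[1] + 2*x[1]**2

def grad_f(x):
    # Analytic gradient via partial derivatives: [2x0 + 3x1, 3x0 + 4x1]
    return np.array([2*x[0] + 3*x[1], 3*x[0] + 4*x[1]])

def numeric_grad(f, x, eps=1e-6):
    # Central finite differences approximate each partial derivative
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

x = np.array([1.0, -2.0])
print(grad_f(x))  # [-4. -5.]
assert np.allclose(grad_f(x), numeric_grad(f, x), atol=1e-4)
```

The same check is standard practice for debugging backpropagation implementations.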
4. Optimization Theory (Important)
Why: Training any ML model is an optimization problem — minimizing a loss function.
- Convex functions — A function $f$ is convex if $f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y)$. Convex problems have a single global minimum. Linear regression loss is convex; neural network loss is not.
- Gradient descent — Update rule: $\theta := \theta - \alpha \nabla_\theta J(\theta)$. Learning rate $\alpha$ controls step size. Too large → diverge, too small → slow.
- Stochastic Gradient Descent (SGD) — Use a random subset (mini-batch) of data for each gradient update. Faster per iteration, noisier but often converges to good solutions.
- Newton's method — Second-order: $\theta := \theta - H^{-1}\nabla J$. Converges faster (quadratically) but expensive to compute/invert the Hessian.
- Lagrange multipliers — For constrained optimization. Used in deriving SVMs.
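The gradient descent update rule above can be sketched on a convex quadratic, where the true minimizer is known in closed form (my own toy example, with an arbitrary positive definite $A$ and vector $b$):

```python
import numpy as np

# Minimize J(theta) = 0.5 * theta^T A theta - b^T theta with A positive
# definite, so the unique global minimizer is theta* = A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta_star = np.linalg.solve(A, b)  # closed-form answer for comparison

theta = np.zeros(2)
alpha = 0.1  # learning rate: too large diverges, too small is slow
for _ in range(500):
    grad = A @ theta - b          # gradient of J at the current theta
    theta = theta - alpha * grad  # the update rule theta := theta - alpha * grad

print(theta)  # converges to theta_star = [0.2, 0.4]
assert np.allclose(theta, theta_star, atol=1e-6)
```

Because $J$ is convex (positive definite Hessian $A$), any sufficiently small learning rate converges to the single global minimum; for a neural network loss the same loop would only find a local one.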
5. CS229 Course Structure — Lecture Roadmap
Why: To plan which lectures to blog and in what order.
- Supervised Learning — Linear regression, logistic regression, GDA (Gaussian Discriminant Analysis), Naive Bayes, SVMs, neural networks, decision trees, ensemble methods.
- Unsupervised Learning — K-means clustering, EM algorithm, PCA, ICA, factor analysis.
- Learning Theory — Bias/variance tradeoff, VC dimension, PAC learning, regularization, model selection.
- Reinforcement Learning — MDPs, value iteration, policy iteration, Q-learning, policy gradient.
- Practical Advice — Debugging ML algorithms, error analysis, ceiling analysis.
6. Python & NumPy (Practical)
Why: Implementing algorithms from the course requires efficient numerical computing.
- NumPy arrays — Vectorized operations, broadcasting, slicing. Avoid Python loops for numerical work.
- Matrix operations — np.dot(), np.linalg.inv(), np.linalg.eig(), np.linalg.svd().
- Matplotlib — Plotting decision boundaries, loss curves, data distributions.
- Jupyter notebooks — Interactive development for experimentation and documentation.
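To show what "vectorized, no Python loops" looks like in practice, here is a short sketch of my own combining broadcasting, np.dot(), and np.linalg.eig() (the data matrix is arbitrary, chosen only for illustration):

```python
import numpy as np

# A tiny data matrix: 3 examples, 2 features
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Broadcasting: subtract the per-column mean from every row without a loop
X_centered = X - X.mean(axis=0)

# Matrix product via np.dot (equivalently the @ operator):
# sample covariance matrix of the centered data
cov = np.dot(X_centered.T, X_centered) / (X.shape[0] - 1)

# Eigendecomposition of the covariance matrix, the core step of PCA
eigvals, eigvecs = np.linalg.eig(cov)
print(np.sort(eigvals))
```

Every step operates on whole arrays at once, which is both faster and closer to the matrix notation used in the lectures.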
TODO / Remaining Work
- [ ] Complete Lecture 1 notes: Linear algebra review with worked examples
- [ ] Blog Lecture 2: Linear regression, normal equations, gradient descent derivation
- [ ] Blog Lecture 3: Locally weighted regression, logistic regression
- [ ] Blog Lecture 4: Newton's method, perceptron, exponential family
- [ ] Blog Lecture 5: Generative learning algorithms (GDA, Naive Bayes)
- [ ] Blog Lecture 6-7: Support Vector Machines (kernels, SMO algorithm)
- [ ] Blog Lecture 8-9: Neural networks and backpropagation
- [ ] Blog Lecture 10: Bias/variance, regularization, model selection
- [ ] Blog Lecture 11-12: Unsupervised learning (K-means, EM, PCA)
- [ ] Blog Lecture 13-14: Reinforcement learning (MDPs, Q-learning)
- [ ] Add Python implementations for each lecture's key algorithm
- [ ] Create summary comparison table of all algorithms covered