09 Optim: Prerequisites and Gradient Descent
լուսանկարի հղումը, Հեղինակ՝ Artist Name
📚 Նյութը
This section was autogenerated. If you spot a mistake, please let me know!
Դասախոսություն
Գործնական
Jupyter Notebooks
🏡 Տնային
- ❗❗❗ DON’T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗
- Please don’t hesitate to ask questions, never forget about the 🍊karalyok🍊 principle!
- The harder the problem is, the more 🧀cheeses🧀 it has.
- Problems with 🎁 are just extra bonuses. It would be good to try to solve them, but also it’s not the highest priority task.
- If the problem involve many boring calculations, feel free to skip them - important part is understanding the concepts.
- Submit your solutions here (even if it’s unfinished)
Gradient Descent
01 Condition number by hand
Consider the quadratic \[f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T Q \mathbf{x}, \quad Q = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}.\]
- Compute the gradient \(\nabla f(\mathbf{x})\) and the Hessian \(H = \nabla^2 f(\mathbf{x})\).
- Find the eigenvalues of \(H\).
- Report the condition number \(\kappa(H) = \lambda_{\max} / \lambda_{\min}\).
- Explain in 2-3 sentences why a large \(\kappa\) predicts slow convergence of gradient descent.
02 Implement vanilla GD from scratch
Implement gradient descent in pure numpy (no scikit-learn, no PyTorch). Minimize \(f(\mathbf{x}) = x_1^2 + x_2^2\) starting from \(\mathbf{x}_0 = (5, 5)\).
- Write a function
gd(grad_f, x0, lr, n_iter)that returns the trajectory of \(\mathbf{x}\) values and the loss at each iteration. - Run with five different learning rates \(\eta \in \{0.01,\ 0.1,\ 0.5,\ 1.0,\ 1.5\}\) for 100 iterations each.
- On a single matplotlib figure, plot loss vs iteration for each \(\eta\) (use a log-scale y-axis).
- Which \(\eta\) converge fastest? Which (if any) diverge? Explain in one sentence why.
03 Find the divergence threshold
For the ill-conditioned quadratic \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) starting from \(\mathbf{x}_0 = (1, 1)\), find experimentally the largest learning rate \(\eta^*\) for which GD still converges within 1000 iterations. Three decimal places is fine.
04 Beat constant LR with a custom schedule
Take the ill-conditioned quadratic from HW 03: \[f(\mathbf{x}) = 100 x_1^2 + x_2^2, \quad \mathbf{x}_0 = (5, 1).\]
A constant learning rate just below the divergence threshold (\(\eta \approx 0.009\)) takes a few hundred iterations to reach a loss below \(10^{-6}\). Your task: design a learning-rate schedule \(\eta(t)\) that converges in fewer than 50 iterations to loss below \(10^{-6}\).
Things to try: step decay, exponential decay, cosine annealing, warmup + decay, or your own custom function of iteration \(t\).
Report:
- Your final iteration count and final loss.
- A plot of loss vs iteration (log y-axis).
- A plot of \(\eta(t)\) vs \(t\) for your schedule.
Tip: there’s a starter cell in Lectures/optim/03_gd_step_size.ipynb with helper code if you want a scaffold.
05 Visualize the zig-zag
Plot the GD trajectory on a contour plot for both:
- well-conditioned: \(f(\mathbf{x}) = x_1^2 + x_2^2\)
- ill-conditioned: \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\)
Both starting from \(\mathbf{x}_0 = (5, 1)\) with the same learning rate \(\eta = 0.009\) (just below the divergence threshold from HW 03) for 50 iterations.
- Make a side-by-side figure (2 subplots) with the level curves and the GD trajectory overlaid on each.
- Describe in 2-3 sentences why the ill-conditioned trajectory zig-zags. Relate your answer to the eigenvectors of the Hessian.
06 Stuck at a saddle?
Consider the saddle-point function \[f(x, y) = x^2 - y^2.\]
- Compute the gradient. Where is \(\nabla f = \mathbf{0}\)?
- Run GD from three starting points: \((0, 0)\), \((0, 0.001)\), and \((0.001, 0)\). Use \(\eta = 0.1\) and 100 iterations. Report the final point in each case.
- Explain in 2-3 sentences what each starting point illustrates. Which gets stuck at the saddle? Which escapes downward? Which drifts away along an unstable direction?
- ML connection: why are saddle points a much bigger concern for neural networks (deep, high-dimensional) than they are for the 2D problems above?
Hint for part 4: think about what fraction of critical points are saddles vs. true local minima in high dimensions, and why that fraction changes as dimension grows.
07 GD for linear regression
Apply gradient descent to fit a linear regression model on synthetic data.
Generate the data (run this exactly, so we both see the same numbers):
import numpy as np
np.random.seed(509)
n = 200
X = np.random.uniform(0, 10, size=(n, 3)) # 3 features
true_w = np.array([2.5, -1.0, 0.5])
y = X @ true_w + np.random.normal(0, 0.5, size=n) # add noiseTasks:
- Write the MSE loss \(L(\mathbf{w}) = \tfrac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i - y_i)^2\) and derive its gradient \(\nabla_{\mathbf{w}} L\) by hand. State the result in vectorized form.
- Implement GD to minimize \(L\) starting from \(\mathbf{w} = \mathbf{0}\). Tune the learning rate — something in \([10^{-4}, 10^{-2}]\) should work. Run until \(\|\nabla L\| < 10^{-4}\).
- Compute the closed-form normal-equations solution: \(\mathbf{w}^* = (X^T X)^{-1} X^T y\). In practice use
np.linalg.solve(X.T @ X, X.T @ y)rather than computing the inverse explicitly. - Compare your GD solution to \(\mathbf{w}^*\) and to
true_w. How close are they? - Plot training loss vs iteration.
- In 2-3 sentences, explain when you’d prefer GD over the closed-form solution. What changes as the number of features grows from 3 to 3000 to 3 million?
Hint for part 6: how expensive is inverting (or solving with) a \(p \times p\) matrix as \(p\) grows?
🛠️ Գործնական
🎲 xx+37 (xx)
- ▶️ToDo
- 🔗Random link ToDo
- 🇦🇲🎶ToDo
- 🌐🎶ToDo
- 🤌Կարգին ToDo