10 Optim: Momentum + First Order Methods
լուսանկարի հղումը, Հեղինակ՝ Artist Name
📚 Նյութը
This section was autogenerated. If you spot a mistake, please let me know!
Դասախոսություն
Գործնական
Jupyter Notebooks
🏡 Տնային
- ❗❗❗ DON’T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗
- Please don’t hesitate to ask questions, never forget about the 🍊karalyok🍊 principle!
- The harder the problem is, the more 🧀cheeses🧀 it has.
- Problems with 🎁 are just extra bonuses. It would be good to try to solve them, but also it’s not the highest priority task.
- If the problem involve many boring calculations, feel free to skip them - important part is understanding the concepts.
- Submit your solutions here (even if it’s unfinished)
Momentum, Adam, and Friends
01 Write out the update rules
For each of these algorithms, write out the per-step update \(\mathbf{x}_{t+1} = \ldots\) given a gradient \(\mathbf{g}_t = \nabla f(\mathbf{x}_t)\) at iteration \(t\). For each one, also state in one phrase what the algorithm “remembers” between iterations (e.g. “nothing”, “the 1st moment of past gradients”, “1st and 2nd moments”, etc.):
- Vanilla gradient descent
- GD with momentum (coefficient \(\beta\))
- AdaGrad
- RMSProp (coefficient \(\gamma\))
- Adam (with \(\beta_1\), \(\beta_2\), and the bias-correction step)
Write them from memory or your own notes first, then verify against the reference implementations in Lectures/optim/04_adam_in_class.ipynb.
02 Implement momentum from scratch
Take the vanilla GD function you wrote in chapter 09 HW 02 and add a momentum term. Use this formulation: \[\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_t, \quad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \mathbf{v}_t, \quad \mathbf{v}_0 = \mathbf{0}.\]
Apply it to the ill-conditioned quadratic from chapter 09 HW 05: \[f(\mathbf{x}) = 100 x_1^2 + x_2^2, \quad \mathbf{x}_0 = (5, 1), \quad \eta = 0.009.\]
- Run with momentum coefficient \(\beta \in \{0, 0.5, 0.9, 0.99\}\) for 200 iterations each.
- Plot loss vs iteration for each \(\beta\) on a single figure with a log y-axis.
- Report a small table: \(\beta\) vs number of iterations to reach loss below \(10^{-6}\) (write “did not converge” if it doesn’t).
- Why does \(\beta = 0.99\) behave qualitatively differently from \(\beta = 0.9\)? Two sentences.
03 Momentum kills the zig-zag
Building directly on chapter 09 HW 05, now overlay GD+Momentum on the same plot.
For \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) with \(\eta = 0.009\) for 100 iterations, plot three trajectories on the same contour plot:
- vanilla GD
- GD+Momentum with \(\beta = 0.9\)
- GD+Momentum with \(\beta = 0.99\)
Then in 2-3 sentences, describe how momentum changes the trajectory shape. What’s the physical intuition? (Hint: think of a ball rolling down a hill with friction.)
04 Implement Adam from scratch
Implement the full Adam algorithm including the bias-correction step. The update rules at iteration \(t\) (using 1-indexed \(t\), so \(t = 1\) on the first step): \[\mathbf{v}_t = \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t\] \[\mathbf{G}_t = \beta_2 \mathbf{G}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \quad (\text{element-wise square})\] \[\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_1^t), \quad \hat{\mathbf{G}}_t = \mathbf{G}_t / (1 - \beta_2^t)\] \[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \cdot \hat{\mathbf{v}}_t / (\sqrt{\hat{\mathbf{G}}_t} + \epsilon)\]
with \(\mathbf{v}_0 = \mathbf{G}_0 = \mathbf{0}\) and PyTorch defaults: \(\eta = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\).
- Apply Adam to the Rosenbrock function \(f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2\) from \(\mathbf{x}_0 = (-1.5, 2.5)\) for 2000 iterations.
- Plot the loss curve (log y-axis).
- Plot the trajectory overlaid on the Rosenbrock contour plot. Does it reach the global minimum at \((1, 1)\)?
05 The bias correction matters
Implement two versions of Adam: one with the bias-correction step (\(\hat{\mathbf{v}}_t, \hat{\mathbf{G}}_t\)), one without (use the raw \(\mathbf{v}_t, \mathbf{G}_t\) directly in the update).
- Apply both to \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) for 30 iterations. Use defaults from HW 04.
- Plot the loss vs iteration for both on the same figure.
- Explain in 2-3 sentences why the bias correction matters most in the first few iterations. (Hint: at \(t = 1\) with \(\beta_1 = 0.9\) and \(\mathbf{v}_0 = \mathbf{0}\), compute \(\mathbf{v}_1\) vs \(\hat{\mathbf{v}}_1\). Which one would you want to actually use?)
06 When does Adam fail?
The default ML wisdom is “use Adam.” But Adam doesn’t always win. Reproduce a case where it loses.
Consider the Rastrigin function \[f(x_1, x_2) = 20 + (x_1^2 - 10 \cos 2\pi x_1) + (x_2^2 - 10 \cos 2\pi x_2)\] from \(\mathbf{x}_0 = (4, 4)\) for 1000 iterations.
- AdaGrad: implement it using your formula from HW 01, or use the reference implementation from
Lectures/optim/04_adam_in_class.ipynb. Use \(\eta = 0.5\). - Adam: use your implementation from HW 04 with the defaults.
- Report the final loss and number of iterations to converge (or “did not converge”) for each.
- In 2-3 sentences, speculate why a “simpler” adaptive method beats Adam on this function. (Hint: Rastrigin is highly multimodal. What does Adam’s \(\hat{\mathbf{G}}_t\) do in a flat region of \(f\)?)
07 Robustness to the learning rate
A common claim about Adam: “the defaults usually just work.” Test how sensitive vanilla GD and Adam are to the learning rate \(\eta\).
For \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) for 500 iterations:
- Run vanilla GD with \(\eta \in \{10^{-4},\ 10^{-3},\ 10^{-2},\ 10^{-1},\ 1.0\}\).
- Run Adam (your implementation from HW 04) with \(\eta \in \{10^{-4},\ 10^{-3},\ 10^{-2},\ 10^{-1},\ 1.0\}\). Keep \(\beta_1, \beta_2, \epsilon\) at defaults.
- For each algorithm, plot loss vs iteration for all 5 learning rates on one figure (log y-axis). Two figures total.
- For each algorithm, count how many of the 5 learning rates produced “reasonable” convergence — define this as final loss below \(10^{-3}\).
- In 2-3 sentences, summarize which algorithm is more robust to learning-rate choice, and the practical implication for someone tuning a new model.
08 Adam for linear regression
Revisit chapter 09 HW 07 (linear regression on synthetic data using vanilla GD), but now use Adam instead.
- Generate the same data exactly as in chapter 09 HW 07 (same seed, same
X,y,true_w). - Apply your Adam implementation (from HW 04) to minimize the MSE loss starting from \(\mathbf{w} = \mathbf{0}\). Run for 2000 iterations.
- Plot loss vs iteration for both vanilla GD (from chapter 09 HW 07) and Adam on the same figure (log y-axis).
- For this strictly convex problem with a single global minimum, does Adam beat vanilla GD? Write 2-3 sentences on what this tells you about when adaptive methods earn their complexity.
🛠️ Գործնական
🎲 xx+37 (xx)
- ▶️ToDo
- 🔗Random link ToDo
- 🇦🇲🎶ToDo
- 🌐🎶ToDo
- 🤌Կարգին ToDo