10 Optim: Momentum + First Order Methods

image.png լուսանկարի հղումը, Հեղինակ՝ Artist Name

📚 Նյութը

This section was autogenerated. If you spot a mistake, please let me know!

Դասախոսություն

Գործնական

Jupyter Notebooks

🏡 Տնային

Note
  1. ❗❗❗ DON’T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗
  2. Please don’t hesitate to ask questions, never forget about the 🍊karalyok🍊 principle!
  3. The harder the problem is, the more 🧀cheeses🧀 it has.
  4. Problems with 🎁 are just extra bonuses. It would be good to try to solve them, but also it’s not the highest priority task.
  5. If the problem involve many boring calculations, feel free to skip them - important part is understanding the concepts.
  6. Submit your solutions here (even if it’s unfinished)

Momentum, Adam, and Friends

01 Write out the update rules

For each of these algorithms, write out the per-step update \(\mathbf{x}_{t+1} = \ldots\) given a gradient \(\mathbf{g}_t = \nabla f(\mathbf{x}_t)\) at iteration \(t\). For each one, also state in one phrase what the algorithm “remembers” between iterations (e.g. “nothing”, “the 1st moment of past gradients”, “1st and 2nd moments”, etc.):

  1. Vanilla gradient descent
  2. GD with momentum (coefficient \(\beta\))
  3. AdaGrad
  4. RMSProp (coefficient \(\gamma\))
  5. Adam (with \(\beta_1\), \(\beta_2\), and the bias-correction step)

Write them from memory or your own notes first, then verify against the reference implementations in Lectures/optim/04_adam_in_class.ipynb.

02 Implement momentum from scratch

Take the vanilla GD function you wrote in chapter 09 HW 02 and add a momentum term. Use this formulation: \[\mathbf{v}_t = \beta \mathbf{v}_{t-1} + \mathbf{g}_t, \quad \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \mathbf{v}_t, \quad \mathbf{v}_0 = \mathbf{0}.\]

Apply it to the ill-conditioned quadratic from chapter 09 HW 05: \[f(\mathbf{x}) = 100 x_1^2 + x_2^2, \quad \mathbf{x}_0 = (5, 1), \quad \eta = 0.009.\]

  1. Run with momentum coefficient \(\beta \in \{0, 0.5, 0.9, 0.99\}\) for 200 iterations each.
  2. Plot loss vs iteration for each \(\beta\) on a single figure with a log y-axis.
  3. Report a small table: \(\beta\) vs number of iterations to reach loss below \(10^{-6}\) (write “did not converge” if it doesn’t).
  4. Why does \(\beta = 0.99\) behave qualitatively differently from \(\beta = 0.9\)? Two sentences.

03 Momentum kills the zig-zag

Building directly on chapter 09 HW 05, now overlay GD+Momentum on the same plot.

For \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) with \(\eta = 0.009\) for 100 iterations, plot three trajectories on the same contour plot:

  • vanilla GD
  • GD+Momentum with \(\beta = 0.9\)
  • GD+Momentum with \(\beta = 0.99\)

Then in 2-3 sentences, describe how momentum changes the trajectory shape. What’s the physical intuition? (Hint: think of a ball rolling down a hill with friction.)

04 Implement Adam from scratch

Implement the full Adam algorithm including the bias-correction step. The update rules at iteration \(t\) (using 1-indexed \(t\), so \(t = 1\) on the first step): \[\mathbf{v}_t = \beta_1 \mathbf{v}_{t-1} + (1 - \beta_1) \mathbf{g}_t\] \[\mathbf{G}_t = \beta_2 \mathbf{G}_{t-1} + (1 - \beta_2) \mathbf{g}_t^2 \quad (\text{element-wise square})\] \[\hat{\mathbf{v}}_t = \mathbf{v}_t / (1 - \beta_1^t), \quad \hat{\mathbf{G}}_t = \mathbf{G}_t / (1 - \beta_2^t)\] \[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \cdot \hat{\mathbf{v}}_t / (\sqrt{\hat{\mathbf{G}}_t} + \epsilon)\]

with \(\mathbf{v}_0 = \mathbf{G}_0 = \mathbf{0}\) and PyTorch defaults: \(\eta = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\).

  1. Apply Adam to the Rosenbrock function \(f(x_1, x_2) = (1 - x_1)^2 + 100(x_2 - x_1^2)^2\) from \(\mathbf{x}_0 = (-1.5, 2.5)\) for 2000 iterations.
  2. Plot the loss curve (log y-axis).
  3. Plot the trajectory overlaid on the Rosenbrock contour plot. Does it reach the global minimum at \((1, 1)\)?

05 The bias correction matters

Implement two versions of Adam: one with the bias-correction step (\(\hat{\mathbf{v}}_t, \hat{\mathbf{G}}_t\)), one without (use the raw \(\mathbf{v}_t, \mathbf{G}_t\) directly in the update).

  1. Apply both to \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) for 30 iterations. Use defaults from HW 04.
  2. Plot the loss vs iteration for both on the same figure.
  3. Explain in 2-3 sentences why the bias correction matters most in the first few iterations. (Hint: at \(t = 1\) with \(\beta_1 = 0.9\) and \(\mathbf{v}_0 = \mathbf{0}\), compute \(\mathbf{v}_1\) vs \(\hat{\mathbf{v}}_1\). Which one would you want to actually use?)

06 When does Adam fail?

The default ML wisdom is “use Adam.” But Adam doesn’t always win. Reproduce a case where it loses.

Consider the Rastrigin function \[f(x_1, x_2) = 20 + (x_1^2 - 10 \cos 2\pi x_1) + (x_2^2 - 10 \cos 2\pi x_2)\] from \(\mathbf{x}_0 = (4, 4)\) for 1000 iterations.

  1. AdaGrad: implement it using your formula from HW 01, or use the reference implementation from Lectures/optim/04_adam_in_class.ipynb. Use \(\eta = 0.5\).
  2. Adam: use your implementation from HW 04 with the defaults.
  3. Report the final loss and number of iterations to converge (or “did not converge”) for each.
  4. In 2-3 sentences, speculate why a “simpler” adaptive method beats Adam on this function. (Hint: Rastrigin is highly multimodal. What does Adam’s \(\hat{\mathbf{G}}_t\) do in a flat region of \(f\)?)

07 Robustness to the learning rate

A common claim about Adam: “the defaults usually just work.” Test how sensitive vanilla GD and Adam are to the learning rate \(\eta\).

For \(f(\mathbf{x}) = 100 x_1^2 + x_2^2\) from \(\mathbf{x}_0 = (5, 1)\) for 500 iterations:

  1. Run vanilla GD with \(\eta \in \{10^{-4},\ 10^{-3},\ 10^{-2},\ 10^{-1},\ 1.0\}\).
  2. Run Adam (your implementation from HW 04) with \(\eta \in \{10^{-4},\ 10^{-3},\ 10^{-2},\ 10^{-1},\ 1.0\}\). Keep \(\beta_1, \beta_2, \epsilon\) at defaults.
  3. For each algorithm, plot loss vs iteration for all 5 learning rates on one figure (log y-axis). Two figures total.
  4. For each algorithm, count how many of the 5 learning rates produced “reasonable” convergence — define this as final loss below \(10^{-3}\).
  5. In 2-3 sentences, summarize which algorithm is more robust to learning-rate choice, and the practical implication for someone tuning a new model.

08 Adam for linear regression

Revisit chapter 09 HW 07 (linear regression on synthetic data using vanilla GD), but now use Adam instead.

  1. Generate the same data exactly as in chapter 09 HW 07 (same seed, same X, y, true_w).
  2. Apply your Adam implementation (from HW 04) to minimize the MSE loss starting from \(\mathbf{w} = \mathbf{0}\). Run for 2000 iterations.
  3. Plot loss vs iteration for both vanilla GD (from chapter 09 HW 07) and Adam on the same figure (log y-axis).
  4. For this strictly convex problem with a single global minimum, does Adam beat vanilla GD? Write 2-3 sentences on what this tells you about when adaptive methods earn their complexity.

🛠️ Գործնական

🎲 xx+37 (xx)

Flag Counter