09 Optim: Prerequisites and Gradient Descent

լուսանկարի հղումը, Հեղինակ՝ Artist Name

📚 Նյութը

⚠️ Note

This section was autogenerated. If you spot a mistake, please let me know!

Դասախոսություն

Գործնական

Jupyter Notebooks

🏡 Տնային

Note

❗❗❗ DON’T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗
Please don’t hesitate to ask questions, never forget about the 🍊karalyok🍊 principle!
The harder the problem is, the more 🧀cheeses🧀 it has.
Problems with 🎁 are just extra bonuses. It would be good to try to solve them, but also it’s not the highest priority task.
If the problem involve many boring calculations, feel free to skip them - important part is understanding the concepts.
Submit your solutions here (even if it’s unfinished)

Gradient Descent

01 Condition number by hand

Consider the quadratic \[f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T Q \mathbf{x}, \quad Q = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}.\]

Compute the gradient $\nabla f(\mathbf{x})$ and the Hessian $H = \nabla^2 f(\mathbf{x})$.
Find the eigenvalues of $H$.
Report the condition number $\kappa(H) = \lambda_{\max} / \lambda_{\min}$.
Explain in 2-3 sentences why a large $\kappa$ predicts slow convergence of gradient descent.

02 Implement vanilla GD from scratch

Implement gradient descent in pure numpy (no scikit-learn, no PyTorch). Minimize $f(\mathbf{x}) = x_1^2 + x_2^2$ starting from $\mathbf{x}_0 = (5, 5)$.

Write a function gd(grad_f, x0, lr, n_iter) that returns the trajectory of $\mathbf{x}$ values and the loss at each iteration.
Run with five different learning rates $\eta \in \{0.01,\ 0.1,\ 0.5,\ 1.0,\ 1.5\}$ for 100 iterations each.
On a single matplotlib figure, plot loss vs iteration for each $\eta$ (use a log-scale y-axis).
Which $\eta$ converge fastest? Which (if any) diverge? Explain in one sentence why.

03 Find the divergence threshold

For the ill-conditioned quadratic $f(\mathbf{x}) = 100 x_1^2 + x_2^2$ starting from $\mathbf{x}_0 = (1, 1)$, find experimentally the largest learning rate $\eta^*$ for which GD still converges within 1000 iterations. Three decimal places is fine.

04 Beat constant LR with a custom schedule

Take the ill-conditioned quadratic from HW 03: \[f(\mathbf{x}) = 100 x_1^2 + x_2^2, \quad \mathbf{x}_0 = (5, 1).\]

A constant learning rate just below the divergence threshold ($\eta \approx 0.009$) takes a few hundred iterations to reach a loss below $10^{-6}$. Your task: design a learning-rate schedule $\eta(t)$ that converges in fewer than 50 iterations to loss below $10^{-6}$.

Things to try: step decay, exponential decay, cosine annealing, warmup + decay, or your own custom function of iteration $t$.

Report:

Your final iteration count and final loss.
A plot of loss vs iteration (log y-axis).
A plot of $\eta(t)$ vs $t$ for your schedule.

Tip: there’s a starter cell in Lectures/optim/03_gd_step_size.ipynb with helper code if you want a scaffold.

05 Visualize the zig-zag

Plot the GD trajectory on a contour plot for both:

well-conditioned: $f(\mathbf{x}) = x_1^2 + x_2^2$
ill-conditioned: $f(\mathbf{x}) = 100 x_1^2 + x_2^2$

Both starting from $\mathbf{x}_0 = (5, 1)$ with the same learning rate $\eta = 0.009$ (just below the divergence threshold from HW 03) for 50 iterations.

Make a side-by-side figure (2 subplots) with the level curves and the GD trajectory overlaid on each.
Describe in 2-3 sentences why the ill-conditioned trajectory zig-zags. Relate your answer to the eigenvectors of the Hessian.

06 Stuck at a saddle?

Consider the saddle-point function \[f(x, y) = x^2 - y^2.\]

Compute the gradient. Where is $\nabla f = \mathbf{0}$?
Run GD from three starting points: $(0, 0)$, $(0, 0.001)$, and $(0.001, 0)$. Use $\eta = 0.1$ and 100 iterations. Report the final point in each case.
Explain in 2-3 sentences what each starting point illustrates. Which gets stuck at the saddle? Which escapes downward? Which drifts away along an unstable direction?
ML connection: why are saddle points a much bigger concern for neural networks (deep, high-dimensional) than they are for the 2D problems above?

Hint for part 4: think about what fraction of critical points are saddles vs. true local minima in high dimensions, and why that fraction changes as dimension grows.

07 GD for linear regression

Apply gradient descent to fit a linear regression model on synthetic data.

Generate the data (run this exactly, so we both see the same numbers):

import numpy as np
np.random.seed(509)
n = 200
X = np.random.uniform(0, 10, size=(n, 3))      # 3 features
true_w = np.array([2.5, -1.0, 0.5])
y = X @ true_w + np.random.normal(0, 0.5, size=n)  # add noise

Tasks:

Write the MSE loss $L(\mathbf{w}) = \tfrac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i - y_i)^2$ and derive its gradient $\nabla_{\mathbf{w}} L$ by hand. State the result in vectorized form.
Implement GD to minimize $L$ starting from $\mathbf{w} = \mathbf{0}$. Tune the learning rate — something in $[10^{-4}, 10^{-2}]$ should work. Run until $\|\nabla L\| < 10^{-4}$.
Compute the closed-form normal-equations solution: $\mathbf{w}^* = (X^T X)^{-1} X^T y$. In practice use np.linalg.solve(X.T @ X, X.T @ y) rather than computing the inverse explicitly.
Compare your GD solution to $\mathbf{w}^*$ and to true_w. How close are they?
Plot training loss vs iteration.
In 2-3 sentences, explain when you’d prefer GD over the closed-form solution. What changes as the number of features grows from 3 to 3000 to 3 million?

Hint for part 6: how expensive is inverting (or solving with) a $p \times p$ matrix as $p$ grows?

🛠️ Գործնական

🎲 xx+37 (xx)

▶️ToDo
🔗Random link ToDo
🇦🇲🎶ToDo
🌐🎶ToDo
🤌Կարգին ToDo

--- title: "09 Optim: Prerequisites and Gradient Descent" format: html: css: homework-styles.css --- <script src="homework-scripts.js"></script> ![image.png](../background_photos/) [լուսանկարի հղումը](https://unsplash.com/), Հեղինակ՝ [Artist Name](https://unsplash.com/) # 📚 Նյութը ::: {.callout-tip collapse="true"} ## ⚠️ Note This section was autogenerated. If you spot a mistake, please let me know! ::: ## Դասախոսություն - [📺 Դասախոսություն 15 — Prerequisites, 1st/2nd order conditions](https://youtu.be/7bl1k6TZ9_Q) - [📺 Դասախոսություն 16 — (In)Exact Line Search, Gradient Descent intro](https://youtu.be/UjGD31i8Ji4) - [🎞️ Սլայդեր — Optimization Prerequisites](Lectures/optim/02_prereqs.pdf) - [🎞️ Սլայդեր — Gradient Descent Step Size](Lectures/optim/03_gd_step_size.pdf) - [🎞️ LMU — Math Prerequisites (Hessian, Quadratic Forms)](Lectures/optim/slides_lmu/01_math.pdf) - [🎞️ LMU — First Order Methods](Lectures/optim/slides_lmu/04_first_order.pdf) ## Գործնական - [📺 Գործնական 08 — Gradient descent, Hessian, multivar Taylor](https://youtu.be/XFF03oPjAyE) - [📺 Գործնական 09 — Golden-section search](https://youtu.be/ZKdCO6o3i4E) - [📺 Գործնական 10 — GD, learning rate schedule](https://youtu.be/H8M6p-uSxCM) ## Jupyter Notebooks - [📓 Common Optimization Functions](Lectures/optim/02_common_functions.ipynb) - [📓 Gradient Descent — Step Size](Lectures/optim/03_gd_step_size.ipynb) - [📓 GD Step Size — In Class](Lectures/optim/03_gd_step_in_class.ipynb) - [📓 Benchmark: First Order Methods](Lectures/optim/benchmark_first_order_methods.ipynb) # 🏡 Տնային ::: {.callout-note collapse="false"} 1. ❗❗❗ DON'T CHECK THE SOLUTIONS BEFORE TRYING TO DO THE HOMEWORK BY YOURSELF❗❗❗ 2. Please don't hesitate to ask questions, never forget about the 🍊karalyok🍊 principle! 3. The harder the problem is, the more 🧀cheeses🧀 it has. 4. Problems with 🎁 are just extra bonuses. It would be good to try to solve them, but also it's not the highest priority task. 5. If the problem involve many boring calculations, feel free to skip them - important part is understanding the concepts. 6. Submit your solutions [here](https://forms.gle/CFEvNqFiTSsDLiFc6) (even if it's unfinished) ::: ## Gradient Descent ### 01 Condition number by hand {data-difficulty="2"} Consider the quadratic $$f(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T Q \mathbf{x}, \quad Q = \begin{pmatrix} 100 & 0 \\ 0 & 1 \end{pmatrix}.$$ 1. Compute the gradient $\nabla f(\mathbf{x})$ and the Hessian $H = \nabla^2 f(\mathbf{x})$. 2. Find the eigenvalues of $H$. 3. Report the condition number $\kappa(H) = \lambda_{\max} / \lambda_{\min}$. 4. Explain in 2-3 sentences why a large $\kappa$ predicts slow convergence of gradient descent. ### 02 Implement vanilla GD from scratch {data-difficulty="2"} Implement gradient descent in pure numpy (no scikit-learn, no PyTorch). Minimize $f(\mathbf{x}) = x_1^2 + x_2^2$ starting from $\mathbf{x}_0 = (5, 5)$. 1. Write a function `gd(grad_f, x0, lr, n_iter)` that returns the trajectory of $\mathbf{x}$ values and the loss at each iteration. 2. Run with five different learning rates $\eta \in \{0.01,\ 0.1,\ 0.5,\ 1.0,\ 1.5\}$ for 100 iterations each. 3. On a single matplotlib figure, plot loss vs iteration for each $\eta$ (use a log-scale y-axis). 4. Which $\eta$ converge fastest? Which (if any) diverge? Explain in one sentence why. ### 03 Find the divergence threshold {data-difficulty="2"} For the ill-conditioned quadratic $f(\mathbf{x}) = 100 x_1^2 + x_2^2$ starting from $\mathbf{x}_0 = (1, 1)$, find experimentally the largest learning rate $\eta^*$ for which GD still converges within 1000 iterations. Three decimal places is fine. ### 04 Beat constant LR with a custom schedule {data-difficulty="3"} Take the ill-conditioned quadratic from HW 03: $$f(\mathbf{x}) = 100 x_1^2 + x_2^2, \quad \mathbf{x}_0 = (5, 1).$$ A constant learning rate just below the divergence threshold ($\eta \approx 0.009$) takes a few hundred iterations to reach a loss below $10^{-6}$. Your task: design a learning-rate schedule $\eta(t)$ that converges in **fewer than 50 iterations** to loss below $10^{-6}$. **Things to try**: step decay, exponential decay, cosine annealing, warmup + decay, or your own custom function of iteration $t$. Report: 1. Your final iteration count and final loss. 2. A plot of loss vs iteration (log y-axis). 3. A plot of $\eta(t)$ vs $t$ for your schedule. *Tip*: there's a starter cell in `Lectures/optim/03_gd_step_size.ipynb` with helper code if you want a scaffold. ### 05 Visualize the zig-zag {data-difficulty="2"} Plot the GD trajectory on a contour plot for both: - well-conditioned: $f(\mathbf{x}) = x_1^2 + x_2^2$ - ill-conditioned: $f(\mathbf{x}) = 100 x_1^2 + x_2^2$ Both starting from $\mathbf{x}_0 = (5, 1)$ with the same learning rate $\eta = 0.009$ (just below the divergence threshold from HW 03) for 50 iterations. 1. Make a side-by-side figure (2 subplots) with the level curves and the GD trajectory overlaid on each. 2. Describe in 2-3 sentences why the ill-conditioned trajectory zig-zags. Relate your answer to the eigenvectors of the Hessian. ### 06 Stuck at a saddle? {data-difficulty="2"} Consider the saddle-point function $$f(x, y) = x^2 - y^2.$$ 1. Compute the gradient. Where is $\nabla f = \mathbf{0}$? 2. Run GD from three starting points: $(0, 0)$, $(0, 0.001)$, and $(0.001, 0)$. Use $\eta = 0.1$ and 100 iterations. Report the final point in each case. 3. Explain in 2-3 sentences what each starting point illustrates. Which gets stuck at the saddle? Which escapes downward? Which drifts away along an unstable direction? 4. **ML connection**: why are saddle points a much bigger concern for neural networks (deep, high-dimensional) than they are for the 2D problems above? *Hint for part 4*: think about what fraction of critical points are saddles vs. true local minima in high dimensions, and why that fraction changes as dimension grows. ### 07 GD for linear regression {data-difficulty="3"} Apply gradient descent to fit a linear regression model on synthetic data. **Generate the data** (run this exactly, so we both see the same numbers): ```python import numpy as np np.random.seed(509) n = 200 X = np.random.uniform(0, 10, size=(n, 3)) # 3 features true_w = np.array([2.5, -1.0, 0.5]) y = X @ true_w + np.random.normal(0, 0.5, size=n) # add noise ``` **Tasks**: 1. Write the MSE loss $L(\mathbf{w}) = \tfrac{1}{n} \sum_{i=1}^{n} (\mathbf{w}^T \mathbf{x}_i - y_i)^2$ and derive its gradient $\nabla_{\mathbf{w}} L$ by hand. State the result in vectorized form. 2. Implement GD to minimize $L$ starting from $\mathbf{w} = \mathbf{0}$. Tune the learning rate — something in $[10^{-4}, 10^{-2}]$ should work. Run until $\|\nabla L\| < 10^{-4}$. 3. Compute the closed-form normal-equations solution: $\mathbf{w}^* = (X^T X)^{-1} X^T y$. In practice use `np.linalg.solve(X.T @ X, X.T @ y)` rather than computing the inverse explicitly. 4. Compare your GD solution to $\mathbf{w}^*$ and to `true_w`. How close are they? 5. Plot training loss vs iteration. 6. In 2-3 sentences, explain when you'd prefer GD over the closed-form solution. What changes as the number of features grows from 3 to 3000 to 3 million? *Hint for part 6*: how expensive is inverting (or solving with) a $p \times p$ matrix as $p$ grows? # 🛠️ Գործնական - [🛠️📺 Գործնականի տեսագրությունը](https://youtu.be/H8M6p-uSxCM) - [🛠️🗂️ Գործնականի PDF-ը]() # 🎲 xx+37 (xx) - ▶️[ToDo]() - 🔗[Random link ToDo]() - 🇦🇲🎶[ToDo]() - 🌐🎶[ToDo]() - 🤌[Կարգին ToDo]() <a href="http://s01.flagcounter.com/more/1oO"><img src="https://s01.flagcounter.com/count2/1oO/bg_FFFFFF/txt_000000/border_CCCCCC/columns_2/maxflags_10/viewers_0/labels_0/pageviews_1/flags_0/percent_0/" alt="Flag Counter"></a>