Understanding Gradient Descent Oscillations and the Momentum Solution
Gradient descent (GD) often struggles on loss surfaces with uneven curvature, leading to a frustrating zigzag pattern. This occurs when the surface is steep in one direction and flat in another, causing GD to either overshoot or stagnate. Momentum is a powerful optimization technique that smooths out these oscillations by maintaining a velocity from past gradients. Below, we answer key questions about this behavior and how momentum fixes it, using a controlled anisotropic surface as a concrete example.
Why does gradient descent produce a zigzag pattern on certain loss surfaces?
Gradient descent zigzags when the loss surface has uneven curvature—steep in one direction and flat in another. Imagine a narrow, elongated valley: the gradient is large along the steep axis, causing GD to take big jumps that overshoot the bottom, then reverse direction. On the flat axis, the gradient is tiny, so progress is slow. This trade-off is fundamental: a high learning rate quickens flat-axis movement but amplifies steep-axis oscillations; a low learning rate stabilizes the steep axis but makes flat-axis convergence painfully slow. The result is a back-and-forth path that wastes steps and slows overall convergence.
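This per-axis behavior can be seen directly on the surface used in this article's example, L(x, y) = 0.05x² + 5y². The start point (10, 1) below is an assumption chosen for illustration; each axis is simply scaled by (1 − η·λ) per step, where λ is that axis's curvature.

```python
# Per-axis view of vanilla GD on L(x, y) = 0.05*x**2 + 5*y**2.
# Gradient: (0.1*x, 10*y). Start point (10, 1) is an assumed example.
eta = 0.18
x, y = 10.0, 1.0
for step in range(5):
    x -= eta * 0.1 * x    # flat axis: factor 0.982 -> slow crawl
    y -= eta * 10 * y     # steep axis: factor -0.8 -> sign flips (overshoot)
    print(f"step {step + 1}: x = {x:+.3f}, y = {y:+.3f}")
```

After five steps x has barely moved while y has flipped sign five times, which is exactly the zigzag described above.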

What is the condition number, and how does it relate to GD inefficiency?
The condition number is the ratio of the largest to smallest curvature of the loss surface (for a quadratic, the ratio of the Hessian's extreme eigenvalues). In our example, the Hessian has eigenvalues 10 (steep direction) and 0.1 (flat direction), giving a condition number of 100. A high condition number means the surface is much more curved along one axis than the other, which forces GD to zigzag. The stability limit for GD is 2 / λ_max (here 0.2). With a learning rate of 0.18, the steep-axis error is multiplied by −0.8 each step (it overshoots the minimum and flips sign), while the flat axis closes only 1.8% of the remaining distance per step. This imbalance is the core reason for inefficiency.
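These quantities follow directly from the surface's Hessian, which for L(x, y) = 0.05x² + 5y² is diagonal, so its eigenvalues can be read off:

```python
# Condition number and GD stability limit for L(x, y) = 0.05*x**2 + 5*y**2.
# The Hessian is diag(0.1, 10), so its eigenvalues are just those entries.
lam_min, lam_max = 0.1, 10.0
condition_number = lam_max / lam_min   # 100: highly anisotropic surface
stability_limit = 2.0 / lam_max        # 0.2: GD diverges if eta exceeds this
eta = 0.18                             # close to the limit
steep_factor = 1 - eta * lam_max       # -0.8: overshoot with a sign flip
flat_factor = 1 - eta * lam_min        # 0.982: only 1.8% progress per step
print(condition_number, stability_limit, steep_factor, flat_factor)
```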
How does momentum mitigate the zigzag problem?
Momentum introduces a velocity term that accumulates past gradients. Instead of updating parameters solely based on the current gradient, it maintains an exponentially weighted sum of past gradients. This means consistent gradients (like those along the flat direction) reinforce each other, accelerating movement. Oscillating gradients (on the steep axis) tend to cancel out because their directions reverse, reducing instability. The result is a smoother, more direct path toward the minimum. Mathematically, momentum uses a hyperparameter β (e.g., 0.9) to blend previous velocity with the current gradient, effectively dampening oscillations while preserving progress in consistent directions.
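The reinforce-versus-cancel claim is easy to check in isolation. This tiny sketch (gradient values ±1 are assumed for illustration) drives one velocity with a constant gradient and another with an alternating one:

```python
# A velocity fed consistent gradients grows toward -eta/(1-beta),
# while one fed alternating gradients hovers near zero (~eta/(1+beta)).
beta, eta = 0.9, 0.1
v_consistent, v_oscillating = 0.0, 0.0
for step in range(100):
    v_consistent = beta * v_consistent - eta * 1.0               # always +1
    v_oscillating = beta * v_oscillating - eta * (-1.0) ** step  # +1, -1, ...
print(v_consistent, v_oscillating)
```

The consistent velocity approaches −1.0 in magnitude, while the oscillating one stays an order of magnitude smaller, which is the smoothing effect described above.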
What are the key differences between vanilla GD and momentum update equations?
Vanilla gradient descent updates parameters θ as:
θ = θ - η ∇L(θ).
Momentum modifies this with a velocity v:
v = β v - η ∇L(θ)
θ = θ + v.
Here, η is the learning rate and β (typically 0.9) controls how much past velocity is retained. When β=0, momentum reduces to vanilla GD. The velocity term integrates gradients over time, so if gradients point consistently in one direction, v grows, accelerating descent. If gradients oscillate, v averages toward zero, smoothing out the path. This simple change is why momentum can converge in fewer steps, as shown in our simulation.
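The two update rules can be written as a minimal sketch; the function names here are my own, not from any particular library:

```python
# Minimal sketch of the two update rules from the text.
# `grad` is the gradient of the loss at theta.

def gd_step(theta, grad, eta):
    """Vanilla gradient descent: theta <- theta - eta * grad."""
    return [t - eta * g for t, g in zip(theta, grad)]

def momentum_step(theta, v, grad, eta, beta):
    """Momentum: v <- beta*v - eta*grad; theta <- theta + v."""
    v = [beta * vi - eta * g for vi, g in zip(v, grad)]
    theta = [t + vi for t, vi in zip(theta, v)]
    return theta, v

# With beta = 0 and zero initial velocity, momentum reduces exactly to GD:
theta0, grad0 = [10.0, 1.0], [1.0, 10.0]
assert momentum_step(theta0, [0.0, 0.0], grad0, 0.18, 0.0)[0] == gd_step(theta0, grad0, 0.18)
```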
Can you show a concrete simulation example comparing GD and momentum?
Yes—we tested both optimizers on the loss surface L(x,y) = 0.05x² + 5y², starting from a point far from the minimum. Using a learning rate of 0.18, vanilla GD required 185 steps to converge. Momentum with β=0.9 completed the same task in 159 steps—a 14% improvement. However, when β was set too high (e.g., 0.99), momentum failed to converge entirely because the velocity accumulated too much inertia, causing overshooting. These results highlight the practical benefit of momentum but also the need for careful tuning of β.
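A comparison of this kind can be sketched as follows. The article does not state its start point, tolerance, or step cap, so those are assumptions here and the exact step counts will differ from the 185 vs. 159 reported above; the qualitative result (momentum converges in fewer steps) is what the sketch reproduces.

```python
# Compare vanilla GD (beta=0) and momentum on L(x, y) = 0.05*x**2 + 5*y**2.
# Start point, tolerance, and step cap are assumed, not from the article.

def run(eta=0.18, beta=0.0, start=(10.0, 1.0), tol=1e-3, max_steps=2000):
    """Return steps until ||(x, y)|| < tol, or max_steps if never reached."""
    x, y = start
    vx, vy = 0.0, 0.0
    for step in range(1, max_steps + 1):
        gx, gy = 0.1 * x, 10.0 * y          # gradient of the loss
        vx, vy = beta * vx - eta * gx, beta * vy - eta * gy
        x, y = x + vx, y + vy               # beta=0 reproduces vanilla GD
        if (x * x + y * y) ** 0.5 < tol:
            return step
    return max_steps

gd_steps = run(beta=0.0)
mom_steps = run(beta=0.9)
print(f"vanilla GD: {gd_steps} steps, momentum (beta=0.9): {mom_steps} steps")
```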
What happens if the momentum coefficient is set too high?
If β is too large (approaching 1), the velocity responds sluggishly to changes in the gradient. While this helps cancel oscillations, it also resists changes in direction, leading to overshooting and potential divergence. In our simulation, β=0.99 prevented convergence because the accumulated velocity kept carrying the optimizer past the minimum faster than the gradient could correct it. This is a classic trade-off: too little momentum (low β) gives little benefit, while too much destabilizes the algorithm. Typical values range from 0.5 to 0.99, with 0.9 being a common starting point. The optimal β depends on the surface curvature and the learning rate.
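The inertia effect can be isolated in a few lines. This sketch (gradient magnitude 1 and a 50-step build-up are assumed, purely for illustration) builds velocity with a constant gradient, then reverses the gradient once and measures how little the velocity changes:

```python
# At high beta, one reversed gradient barely dents the accumulated velocity.
# Numbers are illustrative, not from the article's simulation.

def velocity_after(beta, n=50, eta=0.18, g=1.0):
    v = 0.0
    for _ in range(n):
        v = beta * v - eta * g           # constant gradient builds velocity
    return v, beta * v + eta * g         # then the gradient reverses once

for beta in (0.9, 0.99):
    v, v_flip = velocity_after(beta)
    print(f"beta={beta}: v={v:.2f} -> {v_flip:.2f} after one reversed gradient")
```

At β=0.99 the velocity is both much larger in magnitude and much harder to turn around than at β=0.9, which is why the optimizer keeps sailing past the minimum.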
Is the zigzag problem limited to simple quadratic surfaces?
No—this problem is widespread in real-world optimization. Many neural network loss surfaces have highly anisotropic (direction-dependent) curvature, especially in deep architectures. For instance, in training a deep network, some parameter directions may be extremely flat while others are steep, exactly the scenario that causes GD to zigzag. Momentum helps, but advanced variants like Nesterov accelerated gradient (NAG) or adaptive methods (Adam, RMSprop) are often used for even better handling of varying curvature. The core insight—that maintaining a velocity from past gradients smooths updates—remains foundational to modern optimization.