Why do some optimizers work better than others? Why do adaptive methods like Adam and Muon outperform simpler alternatives like SGD when training large models? Can we identify fundamental principles that explain the success of modern optimizers, predict when they will fail, and guide the design of new ones?
A major thread in the last decade or so of applied deep learning has been the development of better and better optimizers for deep neural networks. AlexNet in 2012 was trained with plain SGD with momentum and weight decay. Since then, we’ve seen a succession of increasingly performant optimizers found by a field-wide process of empirical trial, error, and iteration (itself a kind of meta-optimization!). A line of adaptive optimizers produced Adam (Kingma and Ba, 2014), whose variants became the industry standard for training large models. More recently, Muon (Jordan et al., 2024), an adaptive optimizer that whitens gradient matrices, has started gaining ground. These adaptive optimizers are practically very important: compared to vanilla SGD, they make large models train faster and more stably.
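To make “whitens gradient matrices” concrete: the core operation takes a layer’s gradient matrix and sets all of its singular values to one, keeping the singular directions. Here is a minimal NumPy sketch of that operation, for illustration only; the actual optimizer applies it to a momentum buffer and approximates it with a Newton-Schulz iteration rather than an exact SVD:

```python
import numpy as np

def whiten(G):
    # Spectral whitening / orthogonalization: with G = U diag(s) V^T,
    # return U V^T, i.e. G with every singular value replaced by 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt
```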
It is a big mystery why these adaptive optimizers work so well. An instructive exercise is to go look at the update rule for Adam and try to build some intuition for the role of each quantity. One quickly finds that, although there is a plausible story that can be told about every term, it is far from obvious why that particular functional form should be the right one: one could easily construct many “cousins of Adam,” each justified by an equally plausible story. Nonetheless, an enormous number of these variants have been tried, and only a few consistently give meaningful gains. The genesis of Muon is a bit different: its spectral whitening is arguably more mathematically “natural” than Adam’s elementwise adaptation. But it’s still quite unclear why this “natural” operation should be good for optimizing neural networks!
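For reference, here is that update rule as a minimal NumPy sketch (hyperparameter names and default values follow Kingma and Ba, 2014; `t` is the 1-indexed step count):

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment: exponential moving average of gradients (momentum).
    m = beta1 * m + (1 - beta1) * g
    # Second moment: exponential moving average of squared gradients.
    v = beta2 * v + (1 - beta2) * g * g
    # Bias corrections, since m and v start at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Elementwise adaptive step: each parameter gets its own effective step size.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v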
Can we identify general principles — even empirical ones — that allow us to falsifiably predict (or even retrodict) which optimizers work better for training large models? Ideally, we’d like to get some mathematical characterization of the nature of the optimization problem, involving both the structure of the network and the resulting loss landscape, and be able to work backwards from the problem to an optimizer that efficiently solves it. How do we work towards that goal?
Discussion