When a deep neural network is trained, the updates to each weight matrix at any given time are approximately low-rank. Furthermore, as initialization scale decreases, discrete steps appear in the loss trajectory. Can we unify these observations in a saddle-to-saddle picture of neural network training? Can we use this picture to shed light on the training of realistic networks?
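The low-rank claim is easy to check numerically in the simplest setting. The sketch below (all choices of dimension, rank, and initialization scale are illustrative assumptions, not the experiments behind this post) trains nothing at all: it just computes the gradient of the squared loss for a two-layer linear network at small initialization, where the residual is dominated by the low-rank target, and confirms that each weight matrix's gradient inherits that low rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2  # dimension and target rank (illustrative choices)

# Rank-r target linear map.
W_star = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))

# Two-layer linear network f(x) = W2 @ W1 @ x, tiny initialization.
scale = 1e-4
W1 = scale * rng.standard_normal((d, d))
W2 = scale * rng.standard_normal((d, d))

# Gradients of L = 0.5 * ||W2 @ W1 - W_star||_F^2 (whitened inputs).
E = W2 @ W1 - W_star  # residual; at small init, E ≈ -W_star, so rank ≈ r
g2 = E @ W1.T         # dL/dW2
g1 = W2.T @ E         # dL/dW1

# The weight updates inherit the residual's low rank: singular values
# beyond the r-th are negligible relative to the top one.
s2 = np.linalg.svd(g2, compute_uv=False)
s1 = np.linalg.svd(g1, compute_uv=False)
print(s2[r] / s2[0], s1[r] / s1[0])  # both ratios are tiny
```

The same bound holds one step at a time during training: each update's rank is controlled by the rank of the current residual, which is what makes the mode-by-mode picture plausible.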
When a neural network is trained in a typical fashion, its loss drops relatively smoothly and continuously. However, when trained from small initialization, its loss tends to drop in a series of discrete steps with plateaus in between, much like that of a deep linear network. See examples below. These steps blur together and eventually disappear as the initialization scale increases. This suggests the tantalizing possibility that this behavior was “in there all along” and that taking the initialization scale to be small simply makes it visible at the level of the loss curve.
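The staircase loss curve is easy to reproduce in the deep linear case. The sketch below (dimensions, learning rate, and target spectrum are arbitrary illustrative choices, not this post's experiments) runs gradient descent on a two-layer linear network from tiny initialization, with a target whose two nonzero singular values are well separated; the network learns one mode at a time, so the loss descends in discrete steps with plateaus between them.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Target map with two well-separated singular values (4 and 1).
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
W_star = U @ np.diag([4.0, 1.0, 0.0, 0.0, 0.0]) @ V.T

# Two-layer linear network, tiny initialization scale.
scale = 1e-4
W1 = scale * rng.standard_normal((d, d))
W2 = scale * rng.standard_normal((d, d))

lr, losses = 0.05, []
for step in range(4000):
    E = W2 @ W1 - W_star  # residual of the end-to-end map
    losses.append(0.5 * np.sum(E * E))
    # Simultaneous gradient descent update on both factors.
    W2, W1 = W2 - lr * (E @ W1.T), W1 - lr * (W2.T @ E)

# The loss starts near 0.5*(4^2 + 1^2) = 8.5, drops to 0.5*1^2 = 0.5
# once the first mode is learned, plateaus, then drops toward 0.
plateau = sum(0.4 < l < 0.6 for l in losses)  # length of middle plateau
print(f"initial {losses[0]:.2f}, plateau steps {plateau}, "
      f"final {losses[-1]:.2e}")
```

The plateau values are analytic: while the top $k$ modes are fitted and the rest are still near zero, the loss sits at half the sum of the remaining squared singular values. Increasing `scale` toward ordinary initialization blurs these steps together, matching the phenomenon described above.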
This would be big if true: the training process of a deep neural network is messy, continuous, high-dimensional, and defies simple characterization. However, a series of discrete steps which can be cleanly separated from one another seems much easier to deal with, since each step could then be studied in isolation. This is exactly the story in deep linear networks: the learning process naturally decomposes into a sequence of subproblems, each with low-rank dynamics, which may be studied mostly independently.
So: is it true? Does a deep neural network trained from infinitesimal initialization follow a perfect saddle-to-saddle trajectory? When the same network is trained normally, does this idealized saddle-to-saddle trajectory remain a useful approximation of the learning process, and can we thereby gain insight into what is learned and how?
Discussion