Deep linear networks and kernel methods are the two main workhorse solvable models of learning mechanics. The first captures nonlinear dynamics of the parameters, and the second learns nonlinear functions of the data. Is there a class of solvable models that captures both deep, nonlinear dynamics and nonlinear function learning, while still maintaining some level of generality?
Physics shows us that a great toy model is a wonderful thing. The harmonic oscillator is dead simple, but it’s useful for understanding not only an enormous suite of systems that oscillate (pendulums, strings, drumheads, etc.) but also, with minimal modifications, damped and driven oscillation, feedback and control, and perturbatively anharmonic oscillation. Similarly, the wavefunction of an electron orbiting a central charge gives us not just an understanding of the hydrogen atom but also access to an amazing breadth of effects one conceptual step away: fine and hyperfine structure, the Stark and Zeeman effects, and more. There’s a reason these models are taught to young students.
There should be a toy model like this for deep neural networks. Right now, the two most cleanly solvable models of neural-network-like learning are deep linear networks and kernel regression. The first has rich, modewise learning dynamics, but it can ultimately express only linear functions. The second can represent fully nonlinear functions, but its learning dynamics are linear. Each of these is a useful toy model for a subset of learning phenomena (stepwise learning dynamics for deep linear networks; overfitting, spectral inductive bias, and more for kernel regression), but the most interesting stuff needs both blessings with neither drawback.
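To make the contrast concrete, here is what “solvable” means in the simplest case of each model (the notation here is illustrative, not fixed by anything above). Under gradient flow on squared loss, kernel regression learns each kernel eigenmode independently and exponentially:

$$\hat v_i(t) = \bar v_i \left(1 - e^{-\lambda_i t}\right),$$

where $\bar v_i$ is the target’s coefficient along the $i$-th kernel eigenfunction and $\lambda_i$ is the corresponding eigenvalue. A two-layer linear network with small balanced initialization instead learns each singular mode of the target map along a sigmoid (cf. Saxe et al., 2014):

$$u(t) = \frac{\bar s}{1 + \left(\bar s/u_0 - 1\right) e^{-2\bar s t/\tau}},$$

which stays near its small initial value $u_0$ for a time of order $\frac{\tau}{2\bar s}\ln(\bar s/u_0)$ and then rises sharply to the target strength $\bar s$: this is the stepwise learning just mentioned. The first model is nonlinear in its inputs but has linear dynamics; the second has nonlinear dynamics but is linear in its inputs.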
For example, it is known in folklore that neural networks exploit hierarchical structure in data, and that this is a key advantage over other learning methods. Neither of these two known toy models can do this: linear functions just aren’t a rich enough class to have much hierarchical structure, and kernel regression learns eigenmodes one at a time, independently of each other, with no knockon effects from hierarchical structure. It would be wonderful if we had a simple, exactly solvable toy model of deep nonlinear learning, ideally simpler than an MLP, which we could solve on many different types of task. A model like that would probably let us get insight into many different learning phenomena in one unified setting.
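As a toy illustration of that modewise independence, here is a minimal numerical sketch of the two dynamics written out above (all names and values are made up for illustration, not taken from any particular paper’s setup): each kernel eigenmode relaxes at a fixed rate set by its eigenvalue, while each deep-linear mode learns at a rate proportional to its own current strength, so it sits near zero and then takes off.

```python
import numpy as np

# Minimal sketch, assuming the standard diagonal setups: gradient-flow
# dynamics for kernel regression vs. a two-layer deep linear network on
# the same three target modes. All values here are illustrative.

s_bar = np.array([4.0, 2.0, 1.0])   # target mode strengths
lam   = np.array([1.0, 0.5, 0.25])  # assumed kernel eigenvalues
u = np.full(3, 1e-4)                # deep linear modes: small balanced init
v = np.zeros(3)                     # kernel regression eigencoefficients
dt, steps = 1e-3, 20_000

for step in range(steps):
    v += dt * lam * (s_bar - v)     # exponential: fixed rate per eigenmode
    u += dt * 2 * u * (s_bar - u)   # sigmoidal: rate grows with current strength
    if step % 4000 == 0:
        print(f"t={step * dt:5.1f}  kernel v={np.round(v, 2)}  deep linear u={np.round(u, 2)}")
```

Plotting v(t) gives three smooth exponentials; plotting u(t) gives a staircase, each mode flat and then abruptly learned, one after another. In both cases the modes never interact, which is exactly why neither model can express knock-on effects from hierarchical structure.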
Discussion