🔮 Open Direction 1: What are simple, solvable models of genuinely deep, nonlinear learning?

Deep linear networks and kernel methods are the two main workhorse solvable models of learning mechanics. The first captures nonlinear dynamics of the parameters, and the second learns nonlinear functions of the data. Is there a class of solvable models that captures both deep, nonlinear dynamics and nonlinear function learning, while still maintaining some level of generality?
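
As a small illustration of the first kind of model, here is a numpy sketch of a two-layer deep linear network, with scalar weights and data purely for brevity: the function it computes is linear in the input, yet the gradient-descent dynamics of the parameters are nonlinear because the two weights multiply one another.

```python
import numpy as np

# A two-layer deep linear network with scalar weights: f(x) = w2 * w1 * x.
# The model is linear in x, but gradient descent couples w1 and w2 nonlinearly.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
y = 1.5 * x                        # target: a linear map the model can represent exactly

w1, w2, lr = 0.1, 0.1, 0.05        # small initialization: a slow phase, then rapid growth
for step in range(101):
    err = w2 * w1 * x - y
    g1 = np.mean(err * w2 * x)     # dL/dw1 depends on w2 ...
    g2 = np.mean(err * w1 * x)     # ... and dL/dw2 depends on w1: nonlinear dynamics
    w1, w2 = w1 - lr * g1, w2 - lr * g2
    if step % 10 == 0:
        print(step, w2 * w1)       # the end-to-end coefficient approaches 1.5 sigmoidally in time
```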

๐Ÿ˜ Open Direction 2: What would a theory capable of capturing natural data look like?

Deep neural networks find and exploit structure in natural data. This means that the structure of the data must somehow enter into our theories. What is this structure, and how do we find it?

🧮 Open Direction 3: Does deep learning implicitly minimize some notion of functional complexity?

Deep networks are widely believed to have a bias towards learning simple functions, but this bias has only been characterized precisely in highly specific settings, and a general picture has not been found. Do deep neural networks broadly seek to minimize some precise notion of complexity among functions with low loss?

🔬 Open Direction 4: How do we formally define the features learned by neural networks?

Mechanistic interpretability seeks to identify and disentangle the features, circuits, and mechanisms learned by neural networks. Can these concepts be given precise mathematical definitions grounded in first principles? What formal structures naturally emerge from such a definition? Can we use these notions to evaluate and formalize central assumptions of mechanistic interpretability, including linear representability, locality, sparsity, and compositionality?
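
To make one of these assumptions concrete, here is a numpy sketch of the linear representability hypothesis, in which a feature is identified with a direction in activation space and recovered with a linear probe. The activations and planted concept direction below are synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_true = rng.standard_normal(128)                         # a planted "feature direction"
H = rng.standard_normal((1000, 128))                      # hidden activations (synthetic)
c = (H @ d_true > 0).astype(float)                        # concept labels, linear-in-H by construction

# A linear probe: least-squares regression from activations to the (centered) concept labels.
probe, *_ = np.linalg.lstsq(H, c - c.mean(), rcond=None)
cosine = probe @ d_true / (np.linalg.norm(probe) * np.linalg.norm(d_true))
print("probe / planted-direction alignment:", round(float(cosine), 3))
```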

♾️ Open Direction 5: Are finite neural networks properly understood as approximations to infinite limits?

Is this the right way to understand width, depth, learning rate, and other finite hyperparameters in deep learning? What does the limiting continuum system look like?
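
One precise sense in which a finite network approximates an infinite limit is sketched below in numpy, under simplifying assumptions (a single tanh hidden layer, 1/sqrt(fan-in) initialization, one fixed input): over random initializations, the output distribution becomes increasingly Gaussian as the width grows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10)                           # one fixed input in R^10

def random_net_output(width):
    """Output of a freshly initialized one-hidden-layer tanh network at x."""
    W1 = rng.standard_normal((width, x.size)) / np.sqrt(x.size)
    w2 = rng.standard_normal(width) / np.sqrt(width)
    return w2 @ np.tanh(W1 @ x)

for width in [2, 16, 128, 1024]:
    outs = np.array([random_net_output(width) for _ in range(5000)])
    z = (outs - outs.mean()) / outs.std()
    # Excess kurtosis is zero for a Gaussian; it shrinks toward zero as width grows.
    print(f"width {width:5d}   excess kurtosis {np.mean(z**4) - 3.0:+.3f}")
```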

🧹 Open Direction 6: Can we understand and eliminate all hyperparameters?

There is an ongoing research program in which hyperparameters are systematically analyzed, disentangled, and in some cases removed by taking appropriate limits. How far can this program go? Can we reach zero hyperparameters, or are some hyperparameters irreducible? If we eliminate all hyperparameters, what remains?

๐Ÿ“ Open Direction 7: Can we predict scaling law exponents a priori?

Large models exhibit robust power-law scaling of loss with respect to model size, data, and compute. Can we develop a theory of scaling laws that both explains why power laws arise and predicts their exponents a priori from properties of the dataset, architecture, and optimizer?
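
For contrast with an a priori prediction, here is how the exponent is typically obtained today: a fit in log-log space to measured losses. The sizes and losses below are synthetic, generated from an assumed power law just to show the recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])                         # model sizes (synthetic)
L = 4.2 * N ** -0.35 * np.exp(0.02 * rng.standard_normal(N.size))    # noisy power-law losses

# A power law is a straight line in log-log space; its slope is (minus) the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
print("fitted exponent:", -slope)                                    # recovers roughly 0.35 from the data alone
```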

🎢 Open Direction 8: How does loss curvature interact with architecture, features, and generalization?

Deep learning optimization implicitly regularizes loss curvature (i.e. the Hessian) by steering towards regions of the loss landscape with lower curvature. How do these curvature dynamics relate to other concerns in deep learning theory, such as the learned features and test-time performance?
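
One standard way to make "curvature" operational, sketched below in PyTorch with a tiny regression model and squared loss, is to estimate the top Hessian eigenvalue (the usual sharpness measure) by power iteration on Hessian-vector products.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
X, y = torch.randn(64, 10), torch.randn(64, 1)
loss = torch.nn.functional.mse_loss(model(X), y)

params = list(model.parameters())
grads = torch.autograd.grad(loss, params, create_graph=True)   # keep the graph for second derivatives

v = [torch.randn_like(p) for p in params]
for _ in range(20):                                            # power iteration on the Hessian
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    Hv = torch.autograd.grad(gv, params, retain_graph=True)    # Hessian-vector product
    norm = torch.sqrt(sum((h ** 2).sum() for h in Hv))
    v = [h / norm for h in Hv]

print("estimated top Hessian eigenvalue (sharpness):", norm.item())
```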

๐ŸŽ๏ธ Open Direction 9: What makes for a good optimizer in deep learning?

Why do some optimizers work better than others? Why do adaptive methods like Adam and Muon outperform simpler alternatives like SGD when training large models? Can we identify fundamental principles that explain the success of modern optimizers, predict when they will fail, and guide the design of new ones?
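
To fix what is being compared, the sketch below writes out the textbook SGD and Adam updates in numpy (Muon's per-matrix orthogonalized update is omitted). The puzzle is why Adam's per-coordinate normalization helps so much at scale.

```python
import numpy as np

def sgd_step(w, g, lr=1e-2):
    """Vanilla SGD: move against the raw gradient."""
    return w - lr * g

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: normalize each coordinate by a running estimate of its gradient scale."""
    m = b1 * m + (1 - b1) * g              # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * g ** 2         # second-moment estimate
    m_hat = m / (1 - b1 ** t)              # bias corrections for the zero initialization
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```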

👯 Open Direction 10: In what sense do large models trained differently learn similar representations?

Large models trained from different random seeds (and sometimes even with different widths, architectures, data, or objectives) tend to learn similar internal representations. What is the appropriate metric for assessing this similarity? Can we use it to make a far more robust, precise version of this statement?
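
One commonly used candidate metric is linear centered kernel alignment (CKA), sketched below in numpy on synthetic activations. Here the inputs are examples-by-features activation matrices from two models on the same examples; the feature dimensions need not match, and the score is invariant to rotation of either representation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (num_examples, num_features) activation matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    return np.linalg.norm(Y.T @ X, "fro") ** 2 / (
        np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 32))                  # "model 1" activations (synthetic)
Q, _ = np.linalg.qr(rng.standard_normal((32, 32)))  # a random rotation
B = A @ Q                                           # "model 2": same information, different basis
print(linear_cka(A, B))                             # = 1.0 up to floating point
```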

⚛️ Open Direction 11: Can learning in realistic large models be decomposed into a sequence of fundamental "units"?

In several toy models of learning, we encounter the idea that the learning process naturally decomposes into units learned in sequence. Does a story of this sort describe learning in realistic large models?

๐ŸŽ Open Direction 12: Does training in realistic deep nonlinear networks proceed as a series of approximately low-rank steps?

When a deep neural network is trained, the updates to each weight matrix at any given time are approximately low-rank. Furthermore, as initialization scale decreases, discrete steps appear in the loss trajectory. Can we unify these observations in a saddle-to-saddle picture of neural network training? Can we use this picture to shed light on the training of realistic networks?
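
Here is a PyTorch sketch of the first observation in the simplest possible setting (one linear layer, squared loss): the gradient with respect to a weight matrix is a sum of one outer product per example, so a single update has rank at most the batch size; empirically the spectrum is often far more concentrated still.

```python
import torch

torch.manual_seed(0)
W = torch.randn(64, 32, requires_grad=True)        # one weight matrix
x = torch.randn(8, 32)                             # a batch of 8 inputs
y = torch.randn(8, 64)
loss = torch.nn.functional.mse_loss(x @ W.T, y)
loss.backward()

# The gradient is a sum of 8 outer products (one per example), hence rank <= 8.
s = torch.linalg.svdvals(W.grad)
print("numerical rank of the update:", int((s > 1e-6 * s[0]).sum()))
```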