Large models exhibit robust power-law scaling of loss with respect to model size, data, and compute. Can we develop a theory of scaling laws that both explains why power laws arise and predicts their exponents a priori from properties of the dataset, architecture, and optimizer?
Neural scaling laws tell us that increasing compute and dataset scale reliably yields a power-law improvement in model loss, an empirical regularity that holds over many orders of magnitude. Why power laws, and why these particular exponents?
The observed exponents are nontrivial: they do not appear to be simple fractions which might result from elementary dimensionality or scaling arguments. It is widely believed that these values are driven largely by structure latent in the dataset, but may also depend on details of the architecture and optimizer. While many explanations for scaling laws have been proposed, a decisive test of any such theory is its ability to predict these exponents quantitatively from first principles. At present, no framework can robustly do so across realistic settings.
Some broad lines of work in this direction include:
There is a class of theories which relate power-law neural scaling to some power law in the statistics of the data that networks are trained on. For instance, if the data covariance has a spectrum which decays as a power law, then random feature models trained on this data will exhibit power-law scaling (Maloney et al., 2022). In kernel regression, if the kernel eigenvalues decay as a power law, then the test loss will also decay as a power law as dataset size is increased (Bordelon et al., 2020). Furthermore, if the prediction problem decomposes into a large number of distinct subtasks with power-law occurrence frequencies, then neural networks may solve an increasing number of these subtasks as they are scaled, leading again to power-law scaling (Michaud et al., 2023).
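To make the spectral mechanism concrete, here is a minimal numerical sketch (not taken from the cited works): inputs are drawn with a covariance whose eigenvalues decay as k^(-alpha), the teacher weights are aligned with that spectrum (a "source"-type assumption added here purely for illustration), and ridge regression in feature space stands in for kernel regression. The parameter values are arbitrary choices for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 512                                        # number of features / kernel modes
alpha = 1.5                                    # spectral decay exponent: lambda_k ~ k^(-alpha)
eigs = np.arange(1, d + 1, dtype=float) ** (-alpha)

# Teacher aligned with the spectrum (an illustrative "source" assumption, not from the papers)
w_star = rng.normal(size=d) * np.sqrt(eigs)

def sample(n):
    """Draw n inputs with power-law covariance spectrum and noiseless teacher targets."""
    x = rng.normal(size=(n, d)) * np.sqrt(eigs)    # cov(x) = diag(eigs)
    return x, x @ w_star

x_test, y_test = sample(20_000)

for n in [64, 128, 256, 512, 1024, 2048, 4096]:
    x, y = sample(n)
    ridge = 1e-6
    # Ridge regression in feature space, a stand-in for kernel regression
    w_hat = np.linalg.solve(x.T @ x + ridge * np.eye(d), x.T @ y)
    test_mse = np.mean((x_test @ w_hat - y_test) ** 2)
    print(f"n={n:5d}  test MSE={test_mse:.4e}")
```

Plotted on log-log axes, the printed losses should fall off roughly as a power law in n, with an exponent controlled by alpha, which is the basic prediction of the kernel-regression analyses cited above.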
Recently, it has also been proposed that the phenomenon of feature superposition, from the mechanistic interpretability literature, drives neural scaling laws (Liu et al., 2025; Chen et al., 2026). This line of work has predominantly studied toy setups inspired by Toy Models of Superposition (Elhage et al., 2022).
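For orientation, here is a hedged sketch of the Toy-Models-of-Superposition-style setup this line of work builds on: sparse features with decaying importances are reconstructed through a bottleneck with fewer hidden dimensions than features, forcing the model to store features in superposition. The specific hyperparameter values below are illustrative assumptions; sweeping the hidden width and tracking the loss is the kind of scaling experiment the cited papers study.

```python
import torch

torch.manual_seed(0)

n_features, n_hidden = 64, 16                        # more features than hidden dims -> superposition
p_inactive = 0.95                                    # each feature is zero with this probability
importance = 0.9 ** torch.arange(n_features).float() # geometrically decaying feature importances

W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    x = torch.rand(1024, n_features)                 # feature magnitudes in [0, 1]
    x = x * (torch.rand_like(x) > p_inactive)        # keep each feature with prob 1 - p_inactive
    x_hat = torch.relu(x @ W.T @ W + b)              # bottleneck reconstruction, as in Elhage et al. (2022)
    loss = (importance * (x - x_hat) ** 2).mean()    # importance-weighted reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final importance-weighted loss:", loss.item())
```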
Even more recently, Francesco Cagnetta, Allan Raventós, Surya Ganguli, and Matthieu Wyart developed a theory of neural scaling laws that, perhaps for the first time, allows for the prediction of language model scaling exponents on realistic natural language corpora (Cagnetta et al., 2026). Although their experiments are relatively small-scale, further work testing this theory, and possibly attempting to unify it with the other lines of work listed above, is an exciting open direction.
Discussion