Can we predict scaling law exponents a priori?


Large models exhibit robust power-law scaling of loss with respect to model size, data, and compute. Can we develop a theory of scaling laws that both explains why power laws arise and predicts their exponents a priori from properties of the dataset, architecture, and optimizer?

Neural scaling laws tell us that we can reliably scale up compute and dataset size and get a power-law improvement in model loss, e.g. L(N) ≈ L_0 N^{-α} as a function of parameter count N. This empirical regularity holds over many orders of magnitude. Why power laws, and why these particular exponents?
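
For concreteness, the exponent in such a law is measured by a linear fit in log-log space. Below is a minimal sketch on synthetic data; the ground-truth value 0.076 is chosen only for illustration (it is roughly the parameter-count exponent reported by Kaplan et al., 2020), and all variable names are hypothetical:

```python
import numpy as np

# Synthetic loss curve following L(N) = L0 * N^(-alpha), with small
# multiplicative noise. alpha_true is an illustrative ground truth.
rng = np.random.default_rng(0)
alpha_true, L0 = 0.076, 6.0
N = np.logspace(6, 11, 20)                 # model sizes, 1e6 .. 1e11 params
loss = L0 * N ** (-alpha_true) * np.exp(rng.normal(0.0, 0.01, N.shape))

# A power law is a straight line in log-log coordinates:
#   log L = log L0 - alpha * log N,
# so the exponent is minus the slope of a degree-1 polynomial fit.
slope, intercept = np.polyfit(np.log(N), np.log(loss), 1)
alpha_hat = -slope
print(f"fitted exponent: {alpha_hat:.3f}")
```

The open problem is to predict the number this fit produces before running the experiment, rather than extracting it after the fact.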

The observed exponents are nontrivial: they do not appear to be simple fractions of the kind that might fall out of elementary dimensional-analysis or scaling arguments. It is widely believed that these values are driven largely by structure latent in the dataset, but they may also depend on details of the architecture and optimizer. While many explanations for scaling laws have been proposed, a decisive test of any such theory is its ability to predict these exponents quantitatively from first principles. At present, no framework can do so robustly across realistic settings.
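
To illustrate the kind of a-priori prediction being asked for, here is a standard "spectrum truncation" toy model (a generic textbook-style construction, not any specific paper's result): if the target's squared mode coefficients decay as i^(-(1+2s)) and a model of capacity N fits the top N modes exactly, the remaining loss is the tail sum, which scales as N^(-2s). The exponent is then predicted from the spectral decay rate s of the data alone:

```python
import numpy as np
from scipy.special import zeta

# Squared mode coefficients decay as c_i^2 = i^(-(1 + 2s)).
# Loss of a capacity-N model = sum over modes i > N of c_i^2,
# computed exactly as the Hurwitz zeta function zeta(1 + 2s, N + 1).
s = 0.3                                   # assumed spectral decay rate
Ns = np.array([1e3, 1e4, 1e5, 1e6])       # model capacities
tail_loss = np.array([zeta(1 + 2 * s, n + 1) for n in Ns])

# Measure the loss exponent from the scaling of the tail sum:
# theory predicts loss ~ N^(-2s), i.e. a log-log slope of -2s.
slope, _ = np.polyfit(np.log(Ns), np.log(tail_loss), 1)
print(f"measured loss exponent: {-slope:.3f}  (predicted: {2 * s})")
```

In this toy setting the exponent follows exactly from a property of the data; the open problem is whether anything analogous can be made to work for real datasets, architectures, and optimizers, where no such clean spectral assumption is available.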

Some broad lines of work in this direction include:

Discussion