Are finite neural networks properly understood as approximations to infinite limits?


As stated in the Learning Mechanics perspective paper, the Discretization Hypothesis holds that finite neural networks are simply discretized approximations to infinite networks, analogous to how a spatiotemporal discretization is used to numerically approximate the solution of a differential equation. For width, the limiting continuous object is a measure over the neurons of each hidden layer; for depth, a finite residual network can be viewed as a discretization of a neural ODE or SDE. Likewise, in the limit of small step sizes, stochastic optimization algorithms become approximately equivalent to a continuous-time flow, such as gradient flow. In this view, increasing model size (and decreasing the learning rate while commensurately increasing the step count) improves performance essentially by reducing discretization error, at the price of additional computation.
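
To make the discretization analogy concrete, here is a minimal sketch, not taken from the perspective paper; the toy vector field, dimensions, and loss below are illustrative assumptions. It treats a residual network's forward pass as an explicit Euler discretization of an ODE with step size 1/depth, and plain gradient descent as an Euler discretization of gradient flow with step size equal to the learning rate.

# Illustrative sketch (assumed example, not from the source):
# a residual forward pass as an Euler discretization of dx/dt = f(x, t),
# and gradient descent as an Euler discretization of d(theta)/dt = -grad L(theta).

import numpy as np

def f(x, t):
    # Toy depth-dependent "layer" vector field; in a real residual network this
    # would be a parameterized block whose weights vary with depth t.
    return np.tanh(x) * np.cos(t)

def resnet_forward(x0, depth):
    # x_{l+1} = x_l + (1/depth) * f(x_l, l/depth): one Euler step of size 1/depth
    # per residual block, so total integration "time" is fixed at 1.
    x, h = x0, 1.0 / depth
    for l in range(depth):
        x = x + h * f(x, l * h)
    return x

def gradient_descent(theta0, grad_L, lr, steps):
    # theta_{k+1} = theta_k - lr * grad_L(theta_k): one Euler step of size lr
    # along the gradient flow; halving lr while doubling steps keeps the total
    # flow time lr * steps fixed and shrinks the discretization error.
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad_L(theta)
    return theta

if __name__ == "__main__":
    x0 = np.ones(4)
    # Deeper networks with commensurately smaller residual steps approach a
    # common continuum trajectory.
    for depth in (4, 16, 64, 256):
        print(depth, resnet_forward(x0, depth))

    grad_L = lambda th: 2.0 * th  # gradient of the toy loss L(theta) = ||theta||^2
    theta0 = np.ones(3)
    for lr, steps in ((0.1, 10), (0.01, 100), (0.001, 1000)):
        print(lr, steps, gradient_descent(theta0, grad_L, lr, steps))

In this toy setting, refining either discretization (more residual blocks with step 1/depth, or smaller learning rate with proportionally more steps) drives the outputs toward a fixed continuum limit at extra computational cost, which is precisely the sense in which the hypothesis says a finite network approximates an infinite one.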

Is this the right way to understand width, depth, learning rate, and other finite hyperparameters in deep learning? What does the limiting continuum system look like?

Discussion