Deep networks are widely believed to have a bias towards learning simple functions, but this bias has only been characterized precisely in highly specific settings, and a general picture has not been found. Do deep neural networks broadly seek to minimize some precise notion of complexity among functions with low loss?
Folklore positively teems with the idea that deep learning has some sort of “simplicity bias,” in several senses. First, “simpler” functions are learned faster in training time. Second, “simpler” functions become learnable from fewer samples. Third, “simpler” functions are representable (to within some small error) with fewer parameters. Putting these together, we arrive at a fourth sense: “simpler” functions require less total compute to learn.
The idea that simpler functions are more likely to be correct, all else being equal, is a very old one, predating even Occam’s Razor. From a modern statistical viewpoint, some sort of simplicity bias in deep learning is a logical necessity: overparameterized neural networks couldn’t generalize well without some implicit rule that selects preferable functions from the vast set of all those they can express.
In fulfillment of this prophecy, every simple learning rule whose learning behavior has been exactly solved displays some sort of bias towards learning simple functions first. This bias has surfaced many times and taken many names in the literature, including implicit regularization, inductive bias, maximum margin bias, and spectral bias. However, these biases seem to look different in different settings, and the right unifying principles — and a description of this bias for deep neural networks — have not yet been found.
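One of the simplest exactly-solved cases of implicit regularization is linear regression: gradient descent on an overparameterized least-squares problem, initialized at zero, converges not to an arbitrary interpolant but to the minimum-ℓ2-norm one. A minimal numpy sketch of this fact (the problem sizes, learning rate, and step count are illustrative choices, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 50  # 10 equations, 50 unknowns: infinitely many interpolating solutions
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on the squared loss, initialized at zero.
# The iterates never leave the row space of X, which pins down the limit.
w = np.zeros(d)
lr = 0.005
for _ in range(20000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-l2-norm interpolant, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print(np.allclose(X @ w, y, atol=1e-6))        # GD fits the data exactly...
print(np.allclose(w, w_min_norm, atol=1e-6))   # ...and selects the min-norm solution
```

Nothing in the loss function mentions the norm of `w`; the bias toward the “simplest” (smallest-norm) interpolant comes entirely from the optimization dynamics. The open question above is what the analogous statement is for deep networks.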
So: can we give a precise characterization of the simplicity bias of deep learning? What notion of complexity is the right one: Kolmogorov complexity, circuit complexity, a measure of total parameter norm, none of these, or several of them? What are the underlying mechanisms by which this bias acts during the process of training?
Discussion