How do we formally define the features learned by neural networks?


Mechanistic interpretability seeks to identify and disentangle the features, circuits, and mechanisms learned by neural networks. Can these concepts be given precise mathematical definitions grounded in first principles? What formal structures naturally emerge from such a definition? Can we use these notions to evaluate and formalize central assumptions of mechanistic interpretability, including linear representability, locality, sparsity, and compositionality?

What does it mean to understand the internal mechanisms learned by neural networks? Does neural network computation even admit an interpretable description in the first place?

While many in deep learning are skeptical that such a description exists, mechanistic interpretability researchers, following in the footsteps of cognitive scientists, take these questions seriously, and base their work on the assumption that, with great effort, positive and satisfying answers can be found.

Currently, the field operates under the following paradigm: the internal activations of a network at any layer are assumed to encode the states of a variety of computational variables. These variables are called “features”. Through the layers of the network, new features are computed from earlier ones, forming “circuits”, and together these circuits implement intelligible algorithms. When researchers have a hypothesis about how some circuit works, they can intervene on the activations at a given layer to change the states of the relevant features represented there, and then observe the effect these interventions have on features at later layers and on the network’s output.
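The intervention loop described above can be sketched concretely. The following is a minimal illustration, not a real interpretability pipeline: the toy two-layer network, the hypothesized feature direction `v`, and the `patch` helper are all assumptions introduced here for exposition.

```python
# Toy sketch of intervening on a hypothesized feature direction in a
# network's hidden activations and observing the downstream effect.
# The network, weights, and feature direction are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))   # layer-1 weights (toy)
W2 = rng.normal(size=(2, 8))   # output weights (toy)

def layer1(x):
    return np.maximum(W1 @ x, 0.0)   # hidden activations (ReLU)

def output(h):
    return W2 @ h                    # network output from the hidden state

x = rng.normal(size=4)
h = layer1(x)

# Hypothesized feature direction in layer-1 activation space (assumed).
v = np.zeros(8)
v[3] = 1.0

# Intervention: set the feature's "state" along v to a chosen value,
# leaving the component of h orthogonal to v untouched.
def patch(h, v, value):
    return h - (h @ v) * v + value * v

y_clean   = output(h)
y_patched = output(patch(h, v, 5.0))

# The difference isolates the causal effect of this feature on the output.
effect = y_patched - y_clean
```

Because the intervention only moves the activation along `v`, any change in the output can be attributed to the feature hypothesized to live along that direction.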

This picture depends crucially on the concept of the “feature” as the “atomic” functional unit of analysis in interpretability. However, the field does not yet have a formal definition of “feature”. Instead, we find empirically that a huge variety of properties can be linearly probed from the activations of neural networks: along the right directions in activation space, activation vectors separate according to whether or not some property of the network’s input holds. When this is the case, the network is said to linearly represent that “feature” along that direction. The Linear Representation Hypothesis asserts that, in some sense, all features computed by the network are linearly represented in some layer. It is also commonly assumed, under what is called the Superposition Hypothesis (Elhage et al., 2022), that most features are sparse, represented by the network on only a small subset of inputs across natural data distributions, and that their corresponding feature “directions” form an overcomplete basis of activation space. A neural network’s activation on any input can then be written as a weighted sum over the feature directions, for the sparse subset of features the network computes on that input. This picture motivates the use of sparse dictionary learning to recover this overcomplete basis of feature directions from observations of the network’s activations (Cunningham et al., 2023; Bricken et al., 2023).
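The superposition picture above can be made concrete with a small synthetic example: an activation vector built as a sparse weighted sum over more feature directions than there are dimensions. The dictionary `D`, the sparse code `c`, and the encoder weights below are synthetic assumptions, not a fitted sparse autoencoder.

```python
# Minimal sketch of superposition: activations are sparse combinations of
# an overcomplete set of feature directions. All quantities are synthetic.
import numpy as np

rng = np.random.default_rng(0)
d_act, n_feat = 16, 64            # activation dimension << number of features

# Overcomplete dictionary: one unit-norm direction per feature.
D = rng.normal(size=(d_act, n_feat))
D /= np.linalg.norm(D, axis=0)

# A sparse, nonnegative code: only a few features are active on this "input".
c = np.zeros(n_feat)
active = rng.choice(n_feat, size=3, replace=False)
c[active] = rng.uniform(1.0, 2.0, size=3)

# The activation vector is the weighted sum over the active feature directions.
a = D @ c

# Sparse-autoencoder-style read-out. Here the encoder is idealized as D^T
# with a fixed bias; in practice these weights are learned by minimizing a
# reconstruction loss plus an L1 sparsity penalty on the recovered code.
c_hat = np.maximum(D.T @ a - 0.5, 0.0)   # ReLU(W_enc a + b), assumed weights
a_hat = D @ c_hat                         # reconstruction from recovered code
```

Because `D` is overcomplete rather than orthogonal, the idealized encoder recovers the code only approximately; interference between nearby directions is exactly the difficulty that learned sparse dictionaries aim to manage.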

There are many unanswered questions about this picture and its conception of features, including:

(1) Is there any notion of a single “correct” decomposition of neural network activations into features? Do some decompositions “slice the network at the joints” better than others? By what metrics would one decomposition be better than another? Does the “correct” decomposition of the network only become apparent when one attempts a circuit analysis of the network using these features (Marks et al., 2024)?

(2) Is the Linear Representation Hypothesis correct? What about the multi-dimensional Linear Representation Hypothesis (Engels et al., 2024)?

(3) Why do known multi-dimensional features have the geometry they do (Modell et al., 2025; Gurnee & Ameisen et al., 2025; Karkada et al., 2026)?

(4) Is there a connection between the notion of “feature” from mechanistic interpretability, discussed above, and the notion of “feature learning” in deep learning theory?

Discussion