Linear Representation Hypothesis


The hypothesis that monosemantic concepts are represented in neural networks as directions (vectors) in activation space, that conjunctions of concepts are represented as linear combinations of these vectors, and that the computations models perform between layers can be understood as linear operations on these vectors.
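As a toy illustration (not from the original note), the hypothesis can be sketched in a few lines of NumPy. The concept names, the dimensionality, and the random directions below are all hypothetical; the point is only that a linear combination of nearly orthogonal directions can be read back out with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # hypothetical activation-space dimensionality

# Hypothetical "concept directions": random unit vectors in activation space.
f_paris = rng.standard_normal(d)
f_paris /= np.linalg.norm(f_paris)
f_city = rng.standard_normal(d)
f_city /= np.linalg.norm(f_city)

# A conjunction of concepts as a linear combination of their directions.
activation = 0.8 * f_paris + 0.5 * f_city

# Under the hypothesis, a concept's presence can be read out linearly,
# e.g. with a dot product. Random high-dimensional directions are nearly
# orthogonal, so each readout approximately recovers its own coefficient.
print(activation @ f_paris)  # ~0.8, up to small interference
print(activation @ f_city)   # ~0.5, up to small interference
```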

One intuition for this is that the vast majority of the computation inside neural networks consists of linear operations (matrix multiplications), so linear feature representations would be the most natural and efficient encoding for those operations to act on.
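A minimal sketch of why, with a hypothetical random weight matrix `W` standing in for a linear layer: a linear map sends a combination of feature directions to the same combination of the mapped directions, so linearly encoded features pass through linear layers intact.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 512
W = rng.standard_normal((d, d)) / np.sqrt(d)  # hypothetical linear layer

f1, f2 = rng.standard_normal(d), rng.standard_normal(d)

# Linearity: the layer maps a combination of feature directions to the
# same combination of the mapped directions.
lhs = W @ (0.8 * f1 + 0.5 * f2)
rhs = 0.8 * (W @ f1) + 0.5 * (W @ f2)
print(np.allclose(lhs, rhs))  # True
```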

It’s important to note that the term “monosemantic concepts” is not clearly defined, so this hypothesis is substantially less formal than it might initially appear.

