Interpreting Neural Networks through the Polytope Lens
2022 #Content/Paper by Sid Black, Lee Sharkey, Leo Grinsztajn, Eric Winsor, Dan Braun, Jacob Merizian, Kip Parker, Carlos Ramón Guevara, Beren Millidge, Gabriel Alfour, and Connor Leahy.
This paper argues that the features-as-directions perspective of the Linear Representation Hypothesis is insufficiently expressive, and that features should instead be thought of as regions of representation space.
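A ReLU network is piecewise linear: its weights and activation functions partition its input space into convex polytopes, and on each polytope the network computes a single affine map. The more specific hypothesis is that these polytopes are monosemantic, and that the boundaries between them reflect semantic boundaries.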
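To make "polytopes induced by the weights and activation functions" concrete, here is a minimal NumPy sketch, not code from the paper. It assumes a small, randomly initialised ReLU MLP; the helper names (`init_mlp`, `forward_with_pattern`, `local_affine_map`) are hypothetical and chosen only for illustration. Each input is labelled by the on/off pattern of its hidden ReLUs, and inputs sharing a pattern lie in the same polytope, where the network reduces to one affine map.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Random (W, b) pairs for an MLP with the given layer sizes (illustrative only)."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)), rng.standard_normal(m))
            for n, m in zip(sizes[:-1], sizes[1:])]

def forward_with_pattern(params, x):
    """Return the output and the boolean on/off pattern of every hidden ReLU."""
    pattern = []
    h = x
    for W, b in params[:-1]:
        pre = W @ h + b
        mask = pre > 0                 # which ReLUs are active at this input
        pattern.append(mask)
        h = np.where(mask, pre, 0.0)   # ReLU
    W, b = params[-1]                  # final layer is linear (no ReLU)
    return W @ h + b, np.concatenate(pattern)

def local_affine_map(params, pattern):
    """Compose the affine map (A, c) the network applies inside one polytope."""
    n_in = params[0][0].shape[1]
    A, c = np.eye(n_in), np.zeros(n_in)
    start = 0
    for W, b in params[:-1]:
        mask = pattern[start:start + W.shape[0]].astype(float)
        start += W.shape[0]
        # With the pattern fixed, ReLU is just multiplication by a 0/1 mask.
        A = mask[:, None] * (W @ A)
        c = mask * (W @ c + b)
    W, b = params[-1]
    return W @ A, W @ c + b

params = init_mlp([2, 8, 8, 1])
x = np.array([0.3, -0.7])
y, pattern = forward_with_pattern(params, x)
A, c = local_affine_map(params, pattern)

# Count how many distinct polytopes a coarse grid over [-2, 2]^2 touches.
grid = np.linspace(-2.0, 2.0, 50)
codes = {forward_with_pattern(params, np.array([u, v]))[1].tobytes()
         for u in grid for v in grid}
n_polytopes_hit = len(codes)
```

Here `A @ x + c` reproduces `y`, because inside the polytope containing `x` the network is exactly that affine map. `n_polytopes_hit` is only a lower bound on the number of regions in the square, since a grid can miss small polytopes.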