Codebook Features - Sparse and Discrete Interpretability for Neural Networks
2024 ICML #Content/Paper by Alex Tamkin, Mohammad Taufeeque, and Noah D. Goodman.
This paper proposes an alternative to Dictionary Learning for addressing the superposition problem: rather than learning a post hoc sparse representation of a model's activations, enforce a sparse, discrete representation inside the model itself.
The basic idea is to replace the activations at chosen layers with a codebook bottleneck: the activation vector is swapped for the sum of the k codebook vectors most similar to it (by cosine similarity), so each hidden state becomes a sparse combination of discrete, inspectable codes.
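A minimal sketch of such a bottleneck layer, assuming PyTorch. This is my own illustration rather than the authors' code; the codebook size, k, and the straight-through gradient trick are assumptions based on the paper's description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookBottleneck(nn.Module):
    """Replace an activation vector with the sum of its top-k most
    cosine-similar codebook vectors. Sizes here are illustrative,
    not the authors' exact settings."""

    def __init__(self, dim: int, num_codes: int = 1024, k: int = 8):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) * 0.02)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each activation and every code.
        sims = F.normalize(x, dim=-1) @ F.normalize(self.codebook, dim=-1).T
        idx = sims.topk(self.k, dim=-1).indices      # (..., k) chosen codes
        quantized = self.codebook[idx].sum(dim=-2)   # sum of k code vectors
        # Straight-through estimator (an assumption here): the forward pass
        # emits the quantized value, while gradients flow back to x unchanged.
        return x + (quantized - x).detach()
```

Usage would look something like inserting the layer after a transformer block's output:

```python
bottleneck = CodebookBottleneck(dim=512)
h = torch.randn(2, 16, 512)      # (batch, seq, hidden)
h_quantized = bottleneck(h)      # same shape; now a sum of 8 discrete codes
```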
The paper notes that codebook features embody a features-as-points view, in contrast to the features-as-directions perspective of the Linear Representation Hypothesis.
The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited seems to suggest that the method doesn't work very well for interpreting RL agent representations.