Codebook Features - Sparse and Discrete Interpretability for Neural Networks

1 minute read

2024 ICML #Content/Paper by Alex Tamkin, Mohammad Taufeeque, and Noah D. Goodman.

This paper proposes a solution to the superposition problem that is an alternative to the one embodied in Dictionary Learning: rather than learning a sparse representation post hoc, enforce one in the model itself.

The basic idea is to replace the activations $a\in\mathbb{R}^D$ in a chosen hidden layer of a pre-trained model with a sum of $k$ code vectors $\sum_{i=1}^k c_i$, with each $c_i\in\mathbb{R}^D$ taken from a finite codebook of size $F\gg k$ (and, I think, $F \gg D$). The chosen $c_i$ are the $k$ members of the codebook with the highest cosine similarity to the original activation, $\frac{a\cdot c_i}{\lVert a\rVert\,\lVert c_i\rVert}$. The model is then fine-tuned with the original training loss, plus a stabilising term that prevents the code vectors from growing in magnitude. Straight-through estimation is used to enable gradient-based optimisation despite the discrete choice of codes.
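A minimal PyTorch sketch of such a codebook bottleneck, to make the selection and straight-through step concrete. The module name, initialisation, and shapes are my own assumptions rather than the paper's code, and the stabilising term and fine-tuning loop are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookBottleneck(nn.Module):
    """Sketch of a codebook layer: replace an activation a in R^D with the sum
    of the k codebook vectors most cosine-similar to it (illustrative only)."""

    def __init__(self, num_codes: int, dim: int, k: int):
        super().__init__()
        # Codebook of num_codes vectors in R^dim (initialisation is a guess).
        self.codebook = nn.Parameter(torch.randn(num_codes, dim) / dim**0.5)
        self.k = k

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between each activation and every code vector.
        sims = F.normalize(a, dim=-1) @ F.normalize(self.codebook, dim=-1).T  # (..., num_codes)
        top_idx = sims.topk(self.k, dim=-1).indices                           # (..., k)
        codes = self.codebook[top_idx]                                        # (..., k, dim)
        out = codes.sum(dim=-2)                                               # (..., dim)
        # Straight-through estimator: the forward value is the code sum
        # (out + a - a == out), the selected code vectors receive gradients
        # through `out`, and gradients w.r.t. the input activation pass
        # through as the identity via the `+ a` term.
        return out + a - a.detach()
```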

It is noted that codebook features embody a view of features-as-points rather than the features-as-directions perspective of the Linear Representation Hypothesis.

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited seems to suggest that the method doesn't work very well for interpreting RL agent representations.