Codebook Features - Sparse and Discrete Interpretability for Neural Networks

1 minute read

2024 ICML #Content/Paper by Alex Tamkin, Mohammad Taufeeque, and Noah D. Goodman.

This paper proposes an alternative solution to the superposition problem from the one embodied in Dictionary Learning: rather than learning a sparse representation post hoc, enforce one in the model itself.

The basic idea is to replace the activations $a \in \mathbb{R}^D$ in a chosen hidden layer of a pre-trained model with the sum of $k$ code vectors $\sum_{i=1}^{k} c_i$, $c_i \in \mathbb{R}^D$, taken from a finite codebook of size $F \gg k$ (and I think $F \gg D$). The chosen $c_i$ are the $k$ members of the codebook with the highest cosine similarities to the original activation, $\frac{a \cdot c_i}{\|a\| \|c_i\|}$. The model is then fine-tuned with the original training loss, plus a stabilising term that prevents the code vectors from growing in magnitude. Straight-through estimation is used to enable gradient-based optimisation despite the discrete choice of codes.
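
To make the mechanics concrete, here is a minimal PyTorch sketch of what such a codebook bottleneck might look like. The module name, shapes, and the exact straight-through formulation are my assumptions rather than the authors' released code, and the stabilising norm penalty on the codes is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CodebookBottleneck(nn.Module):
    """Replace a layer's activation with the sum of its k most cosine-similar
    codes from a finite codebook. Hypothetical sketch, not the paper's code."""

    def __init__(self, dim: int, num_codes: int, k: int):
        super().__init__()
        self.k = k
        # Codebook of F code vectors, each of dimension D.
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        # a: (..., D) activations from the chosen hidden layer.
        # Cosine similarity of each activation to every code: (..., F).
        sims = F.cosine_similarity(a.unsqueeze(-2), self.codebook, dim=-1)
        # Indices of the k most similar codes: (..., k).
        topk = sims.topk(self.k, dim=-1).indices
        # Gather and sum the selected code vectors: (..., D).
        quantised = self.codebook[topk].sum(dim=-2)
        # Straight-through estimator: the forward value is the quantised sum
        # (so the codebook receives gradients through it), while the gradient
        # with respect to `a` passes through the bottleneck as if it were the
        # identity, bypassing the non-differentiable top-k selection.
        return quantised + a - a.detach()
```

In use, a module like this would be spliced into the chosen layer of the pre-trained network and the whole model fine-tuned on its original objective, so that downstream computation comes to rely on the sparse, discrete codes.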

It is noted that codebook features embody a view of features-as-points rather than the features-as-directions perspective of the Linear Representation Hypothesis.

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited seems to suggest that the method doesn’t work very well for interpreting RL agent representations.