# 📝 Notes

## Full List
- A Mathematical Framework for Transformer Circuits
- Attention
- Codebook Features - Sparse and Discrete Interpretability for Neural Networks
- Dictionary Learning
- Interpreting Neural Networks through the Polytope Lens
- Linear Representation Hypothesis
- Mechanistic Interpretability
- Not All Language Model Features Are Linear
- Scaling Monosemanticity - Extracting Interpretable Features from Claude 3 Sonnet
- Toy Models of Superposition
- Transformer