Mechanistic Interpretability
Embodies a fundamental belief that neural networks are black boxes only by default, and that with enough effort we can understand their function in terms of interpretable concepts and algorithms.
This post draws an analogy to the three stages of software reverse engineering (see the sketch after this list):
- Choose a mathematical description among (probably) many equivalent ones (e.g. neurons, polytopes, directions).
- Assign approximate semantic labels to elements of that description.
- Validate the semantic description by prediction and intervention.
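
A minimal sketch of the three stages on synthetic data, assuming a toy setting in which a single direction in activation space carries a binary concept. All names and data below are illustrative, not from the post:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: hidden activations for inputs that do or do not contain a
# binary concept, with the concept written along one direction.
n, d = 2000, 64
concept = rng.random(n) < 0.5
signal = rng.normal(size=d)
signal /= np.linalg.norm(signal)
acts = rng.normal(size=(n, d)) + 2.0 * np.outer(concept, signal)

# Stage 1: choose a mathematical description. Here: a single direction
# (difference of class means); neurons or polytopes would be alternative,
# roughly equivalent units of description.
direction = acts[concept].mean(axis=0) - acts[~concept].mean(axis=0)
direction /= np.linalg.norm(direction)

# Stage 2: assign an approximate semantic label ("concept present") by
# checking that projections onto the direction track the concept.
proj = acts @ direction
accuracy = ((proj > proj.mean()) == concept).mean()
print(f"semantic label accuracy: {accuracy:.2f}")  # ~1.0 if the label fits

# Stage 3: validate by intervention: project the direction out and check
# that the two classes are no longer separated in the edited activations.
ablated = acts - np.outer(acts @ direction, direction)
gap_before = np.linalg.norm(acts[concept].mean(axis=0) - acts[~concept].mean(axis=0))
gap_after = np.linalg.norm(ablated[concept].mean(axis=0) - ablated[~concept].mean(axis=0))
print(f"class separation before/after ablation: {gap_before:.2f} / {gap_after:.2f}")
```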
The goal of mechanistic interpretability research is to push out the Pareto front spanned by the faithfulness of the semantic description (established via validation) and its length.
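
To make that framing concrete, a small sketch that filters hypothetical candidate descriptions (the names and scores are made up for illustration) down to the Pareto front over (description length, faithfulness):

```python
# Hypothetical candidate descriptions of the same network, scored as
# (description length, validated faithfulness). Numbers are made up.
candidates = {
    "full weight matrices": (10_000, 0.99),
    "sparse circuit": (120, 0.95),
    "single direction": (1, 0.70),
    "long but unfaithful story": (5_000, 0.60),
}

def pareto_front(scored):
    """Keep entries not dominated by a shorter-or-equal, at-least-as-faithful one."""
    front = []
    for name, (length, faith) in scored.items():
        dominated = any(
            l <= length and f >= faith and (l, f) != (length, faith)
            for l, f in scored.values()
        )
        if not dominated:
            front.append(name)
    return front

print(pareto_front(candidates))
# ['full weight matrices', 'sparse circuit', 'single direction']
```

Under this view, progress means adding points that dominate existing ones: descriptions that are simultaneously shorter and more faithful.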