Week 2 - Mechanistic Interpretability
This week goes deep into mechanistic interpretability, probably the most active subfield of AI alignment at the moment. Mechanistic interpretability is the subfield of interpretability that aims to understand networks at the level of individual neurons and the circuits they form.
The readings build upon the interpretability-related readings in the Alignment 101 curriculum. The first three readings focus on reasoning about the internals of transformers. A phenomenon that will get a lot of attention is superposition: models appear to store more features than they have neurons, spreading human-interpretable features across many neurons so that individual neurons respond to several unrelated features. This seems to be one of the hardest challenges on the way to a complete mechanistic understanding of what goes on inside large transformer models. The third reading focuses on dictionary learning, which has been identified as a potential solution to this problem.
After completing the first three readings, you will have the opportunity to choose one reading from a list of papers that cover some of the other prominent research directions in the field of mechanistic interpretability and to make a short presentation to your groupmates about it.
Core readings:
A mathematical framework for transformer circuits (Elhage et al., 2021) (read as much as you have time for)
Elhage et al. build on previous circuits work that you learned about during the Alignment 101 course to provide a mathematical framework for thinking about the computations performed by a transformer model. For a deeper dive into the topic, see the associated videos.
Toy models of superposition (Elhage et al., 2022) (only sections 3 and 9. Also sections 1-2 if you haven’t read them yet) (20 mins)
Working towards understanding why some neurons respond to multiple unrelated features ('polysemanticity'), Elhage et al. discover that toy models use superposition to store more features than they have dimensions.
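To get a concrete feel for this result, here is a minimal sketch of a superposition toy model in the spirit of Elhage et al.'s setup. The dimensions, sparsity level, and training details below are illustrative assumptions, not the paper's exact configuration (which, among other things, weights features by importance).

```python
# Minimal superposition toy model sketch: reconstruct sparse features through a
# bottleneck with fewer dimensions than features, x_hat = ReLU(W^T W x + b).
# All sizes and hyperparameters here are illustrative assumptions.
import torch

n_features, n_dims = 5, 2          # more features than hidden dimensions
sparsity = 0.9                     # each feature is zero 90% of the time
W = torch.randn(n_dims, n_features, requires_grad=True)
b = torch.zeros(n_features, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(3000):
    # Sparse synthetic features in [0, 1]
    x = torch.rand(1024, n_features)
    x = x * (torch.rand(1024, n_features) > sparsity).float()
    h = x @ W.T                     # project down to n_dims
    x_hat = torch.relu(h @ W + b)   # reconstruct all n_features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# With sufficiently sparse inputs, the columns of W typically arrange so that
# more than n_dims features are represented, i.e. features end up in superposition.
print(W.detach())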
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (Bricken et al., 2023)
The problem of superposition is widely regarded as one of the biggest barriers to a mechanistic understanding of what goes on inside transformers. This work explores tackling superposition by learning the features that models represent using dictionary learning, currently viewed as the most promising approach to the problem. The authors provide an interactive overview of thousands of features extracted from a small language model with this method, as well as a detailed analysis of the extracted features and the theory behind dictionary learning. A simplified code sketch of this approach appears after the list of optional papers below.
Choose one of the following papers and make a short presentation of the most important takeaways to the rest of your group:
In-context Learning and Induction Heads (Olsson et al., 2022)
This paper sets out to explain language models’ ability to learn in context: for example, to learn a new classification task unseen during pre-training from just a couple of demonstrations provided in the prompt. In doing so, the authors find that induction head circuits emerge at the same point in training as a marked improvement in in-context learning ability, leading to the hypothesis that induction heads are an important component of the mechanism behind in-context learning.
Discovering Latent Knowledge in Language Models Without Supervision (Burns et al., 2022)
This paper applies concept-based interpretability techniques. It explores a method for automatically identifying whether a model believes statements are true or false, without requiring ground-truth labels. This helps uncover a model's internally represented beliefs, which would otherwise remain opaque. A simplified sketch of the method appears after this list.
Locating and Editing Factual Associations in GPT (Meng et al., 2022)
Meng et al. demonstrate how concept-based interpretability can be used to locate factual associations in a model's weights and edit them in semantically meaningful ways, illustrating how such techniques may directly help with elements of alignment.
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022)
Wang et al. provide an excellent example of how we might go about interpreting language model circuits from a zoomed-out viewpoint instead of zooming in on every individual neuron.
Causal Scrubbing: a method for rigorously testing interpretability hypotheses (Chan et al., 2022)
How do we know whether an explanation of a model's behaviour actually matches the mechanistic process taking place inside it? This post introduces causal scrubbing, a principled approach to evaluating the quality of mechanistic interpretations.
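As referenced in the Towards Monosemanticity description above, the dictionary learning in that work is implemented as a sparse autoencoder trained on a model's activations. The following is a simplified sketch: the sizes, the L1 coefficient, and the use of random vectors in place of real MLP activations are assumptions for illustration, not the authors' configuration.

```python
# Sketch of dictionary learning via a sparse autoencoder, in the spirit of
# Bricken et al. (2023). Sizes, the L1 coefficient, and the random stand-in
# "activations" are illustrative assumptions.
import torch
import torch.nn as nn

d_model, d_dict = 512, 4096        # dictionary is much wider than the activation space

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(feats)              # reconstruction of the original activations
        return recon, feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                   # trades reconstruction against sparsity

for step in range(1000):
    acts = torch.randn(256, d_model)              # stand-in for real MLP activations
    recon, feats = sae(acts)
    loss = ((acts - recon) ** 2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Each column of the decoder weight acts as one dictionary element ("feature"
# direction); the L1 penalty keeps most feature activations at zero per input.
```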
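Similarly, the core of Burns et al.'s contrast-consistent search (CCS) can be sketched as training a probe whose answers on a statement and its negation are consistent and confident, with no labels involved. The hidden states below are random stand-ins, and the dimensions and hyperparameters are assumptions; the original method also normalises the hidden states before fitting the probe.

```python
# Sketch of the CCS objective from Burns et al. (2022), using random vectors as
# stand-ins for the hidden states of contrast pairs. Sizes and hyperparameters
# are illustrative assumptions.
import torch
import torch.nn as nn

d_hidden, n_pairs = 256, 512
# hs_pos[i] / hs_neg[i] would come from the model on "X is true" / "X is false" prompts.
hs_pos = torch.randn(n_pairs, d_hidden)
hs_neg = torch.randn(n_pairs, d_hidden)

probe = nn.Sequential(nn.Linear(d_hidden, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(hs_pos).squeeze(-1)
    p_neg = probe(hs_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2        # the two answers should be negations
    confidence = torch.minimum(p_pos, p_neg) ** 2   # discourage the degenerate answer p = 0.5
    loss = (consistency + confidence).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```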
Further readings:
Value loading in the human brain: a worked example (Byrnes, 2021) (25 mins)
In addition to interpretability research on neural networks, another approach to developing more interpretable AI involves studying human and animal brains. Byrnes gives an example of applying ideas from neuroscience to better understand AI.
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) (Kim et al., 2018)
Kim et al. introduce a technique for interpreting a neural net's internal state in terms of human concepts.
The following two posts give a good overview of the arguments for and against interpretability research being useful for AI alignment:
A Longlist of Theories of Impact for Interpretability (Nanda, 2022) (10 mins)
Against Almost Every Theory of Impact of Interpretability (Segerie, 2023) (45 mins)
A Mechanistic Interpretability Analysis of Grokking (Nanda, 2022) (60 mins)
Podcast spotlight:
For an in-depth overview of the field of mechanistic interpretability and the ways in which it can lead to safer AI systems, you can listen to the 80,000 Hours podcast episode with Chris Olah, the research lead of Anthropic’s mechanistic interpretability team and one of the earliest mechanistic interpretability researchers. For a different perspective on similar topics, you can listen to the AXRP podcast episode with Neel Nanda, the research lead of DeepMind’s mechanistic interpretability team.