Week 2 - Mechanistic Interpretability

This week goes deep into mechanistic interpretability, probably the most active subfield of AI alignment at the moment. Mechanistic interpretability is the subfield of interpretability that aims to understand networks at the level of individual components, such as neurons and the circuits they form.

The readings build upon the interpretability-related readings in the Alignment 101 curriculum. The first two core readings focus on reasoning about the internals of transformers. A phenomenon that will get a lot of attention is superposition: models appear to store more features than they have neurons, which means that individual human-interpretable features are spread across many neurons rather than assigned to a single neuron each. Superposition seems to be one of the hardest challenges on the way to gaining a complete mechanistic picture of what goes on inside large transformer models. Dictionary learning, which decomposes a model's activations into a larger set of sparsely activating and more interpretable directions, has been identified as a potential solution to this problem.
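
To make the idea concrete, here is a minimal, self-contained sketch (in PyTorch) loosely inspired by the toy setting of the second core reading: a small model is forced to represent 20 sparse features in only 5 hidden dimensions, and a sparse autoencoder is then trained on those hidden activations as a toy version of dictionary learning. All dimensions, sparsity levels, and training settings below are illustrative choices, not taken from any of the readings.

```python
# Minimal sketch (PyTorch) of superposition and dictionary learning on synthetic data.
# A tiny model must reconstruct 20 sparse features through a 5-dimensional bottleneck,
# so feature directions are forced to overlap (superposition). A sparse autoencoder is
# then trained on the bottleneck activations as a toy form of dictionary learning.
# All sizes, sparsity levels, and training settings here are illustrative assumptions.
import torch

torch.manual_seed(0)
n_features, n_hidden, n_dict = 20, 5, 40   # features > hidden dims forces superposition
sparsity, batch = 0.05, 4096               # each feature is active in ~5% of samples

def sample_features():
    """Sparse, non-negative synthetic feature vectors."""
    active = (torch.rand(batch, n_features) < sparsity).float()
    return active * torch.rand(batch, n_features)

# Toy model: x_hat = ReLU(W^T W x + b), with W mapping features into the bottleneck.
W = torch.nn.Parameter(0.1 * torch.randn(n_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)
for _ in range(5000):
    x = sample_features()
    loss = ((torch.relu(x @ W.T @ W + b) - x) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("toy-model reconstruction loss:", loss.item())

# Dictionary learning: a sparse autoencoder with an overcomplete set of directions
# (n_dict > n_hidden) reconstructs the bottleneck activations, with an L1 penalty
# encouraging each activation to be explained by only a few dictionary elements.
enc = torch.nn.Linear(n_hidden, n_dict)
dec = torch.nn.Linear(n_dict, n_hidden, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
for _ in range(5000):
    with torch.no_grad():
        h = sample_features() @ W.T        # bottleneck activations of the toy model
    codes = torch.relu(enc(h))             # sparse codes over the dictionary
    loss = ((dec(codes) - h) ** 2).mean() + 1e-3 * codes.abs().mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("sparse-autoencoder loss:", loss.item())
```

If the sketch behaves like the toy models in the reading, the bottleneck reconstructs most features even though it has too few dimensions, and the autoencoder's dictionary elements tend to line up with the original, overlapping feature directions.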

After completing the core readings, you will choose papers from a list covering some of the other prominent research directions in mechanistic interpretability and make a short presentation to your groupmates about what you read.

Core readings:

  1. A mathematical framework for transformer circuits (Elhage et al., 2021) (read as much as you have time for)

    • Elhage et al. build on previous circuits work that you learned about during the Alignment 101 course to provide a mathematical framework for thinking about the computations performed by a transformer model; a brief numerical sketch of the attention-head decomposition they introduce appears after this list. For a deeper dive into the topic, see the associated videos.

  2. Toy models of superposition (Elhage et al., 2022) (only sections 3 and 9; also sections 1-2 if you haven’t read them yet) (20 mins)

    • Working towards understanding why some neurons respond to multiple unrelated features ('polysemanticity'), Elhage et al. discover that toy models use superposition to store more features than they have dimensions.

  3. Choose two of the papers from the further readings list below and make a short presentation of the most important takeaways to the rest of your group.
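
As referenced under the first core reading, here is a brief numerical sketch (in Python/NumPy, not taken from the paper itself) of the decomposition at the heart of the transformer circuits framework: a single attention head's computation depends on its weights only through the combined matrices W_QK = W_Q^T W_K (the "QK circuit", which determines where the head attends) and W_OV = W_O W_V (the "OV circuit", which determines what information it moves). The shapes and conventions below are illustrative assumptions; biases, scaling, and causal masking are omitted.

```python
# Numerical check (NumPy) that a single attention head's output only depends on the
# combined matrices W_QK = W_Q^T W_K and W_OV = W_O W_V, as emphasised in the
# transformer circuits framework. Shapes are illustrative; biases, 1/sqrt(d) scaling,
# and causal masking are omitted for clarity.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_pos = 16, 4, 6

x = rng.normal(size=(n_pos, d_model))      # one residual-stream vector per position
W_Q = rng.normal(size=(d_head, d_model))
W_K = rng.normal(size=(d_head, d_model))
W_V = rng.normal(size=(d_head, d_model))
W_O = rng.normal(size=(d_model, d_head))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Standard formulation: project to queries/keys/values, attend, project back.
q, k, v = x @ W_Q.T, x @ W_K.T, x @ W_V.T
A = softmax(q @ k.T)                       # attention pattern (n_pos x n_pos)
head_out = (A @ v) @ W_O.T

# Circuit formulation: the QK circuit decides where to attend, the OV circuit what to move.
W_QK = W_Q.T @ W_K                         # (d_model x d_model), low rank
W_OV = W_O @ W_V                           # (d_model x d_model), low rank
head_out_circuits = softmax(x @ W_QK @ x.T) @ x @ W_OV.T

assert np.allclose(head_out, head_out_circuits)
print("Both formulations agree: the head is characterised by W_QK and W_OV.")
```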

Further readings:

  1. Value loading in the human brain: a worked example (Byrnes, 2021) (25 mins)

    • In addition to interpretability research on neural networks, another approach to developing more interpretable AI involves studying human and animal brains. Byrnes gives an example of applying ideas from neuroscience to better understand AI.

  2. Interpretability beyond feature attribution: quantitative testing with concept activation vectors (Kim et al., 2018) (35 mins)

    • Kim et al. introduce TCAV, a technique for interpreting a neural net's internal state in terms of human-understandable concepts; a minimal sketch of the idea appears after this list.

  3. The following two posts give a good overview of arguments in favour of and against interpretability research being useful for AI alignment:

  4. A Mechanistic Interpretability Analysis of Grokking (Nanda, 2022) (60 mins)
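
As referenced under the second further reading, here is a minimal sketch of the TCAV idea using a synthetic stand-in for a real network: fit a linear classifier that separates activations of concept examples from activations of random examples, take the vector normal to its decision boundary as the concept activation vector (CAV), and report the fraction of class examples whose logit gradient has a positive component along the CAV. The toy "head", the synthetic activations, and all sizes below are assumptions for illustration; the real method uses activations from a trained model.

```python
# Minimal sketch of TCAV on synthetic data: the "activations" and the toy classifier
# head below are stand-ins for a real model's internal layer and output logit.
# Requires numpy, torch, and scikit-learn; all names and sizes are illustrative.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

torch.manual_seed(0)
rng = np.random.default_rng(0)
d_act = 32                                  # width of the inspected layer

# Stand-in for the rest of the network: maps layer activations to a class logit.
head = torch.nn.Sequential(torch.nn.Linear(d_act, 16), torch.nn.ReLU(),
                           torch.nn.Linear(16, 1))

# Synthetic activations: concept examples share a direction, random examples do not.
concept_dir = rng.normal(size=d_act)
concept_dir /= np.linalg.norm(concept_dir)
concept_acts = rng.normal(size=(100, d_act)) + 2.0 * concept_dir
random_acts = rng.normal(size=(100, d_act))

# Step 1: the CAV is the normal of a linear boundary separating concept from random.
X = np.vstack([concept_acts, random_acts])
y = np.array([1] * 100 + [0] * 100)
cav = LogisticRegression(max_iter=1000).fit(X, y).coef_[0]
cav /= np.linalg.norm(cav)

# Step 2: TCAV score = fraction of class examples whose logit increases when their
# activation is nudged along the CAV (positive directional derivative).
class_acts = torch.tensor(rng.normal(size=(100, d_act)) + concept_dir,
                          dtype=torch.float32, requires_grad=True)
head(class_acts).sum().backward()           # d(logit)/d(activation) for each example
directional_derivs = class_acts.grad.numpy() @ cav
print("TCAV score:", float((directional_derivs > 0).mean()))
```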

Podcast spotlight:

For an in-depth overview of the field of mechanistic interpretability and the ways in which it can lead to safer AI systems, you can listen to the 80,000 Hours podcast episode with Chris Olah, the research lead of Anthropic’s mechanistic interpretability team and one of the earliest mechanistic interpretability researchers. For a different perspective on similar topics, you can listen to the AXRP podcast episode with Neel Nanda, the research lead of DeepMind’s mechanistic interpretability team.