Week 5 - Interpretability and Governance

By studying the connections between neurons, we can find meaningful algorithms in the weights of neural networks.

Our current methods of training capable neural networks give us little insight into how or why they function. This week we cover the field of interpretability, which aims to change this by developing methods for understanding how neural networks think.

In some sense, the core alignment problem stems from the fact that we don't know what our networks actually learn. If interpretability research succeeds, we'll have a better understanding of what our networks are doing and how to change it.

This week’s curriculum starts with readings on mechanistic interpretability, a subfield of interpretability that aims to understand networks at the level of individual neurons. After understanding individual neurons, we can identify how they construct increasingly complex representations, and develop a bottom-up understanding of how neural networks work.

It then moves on to an area we refer to as concept-based interpretability, which focuses on techniques for automatically probing (and potentially modifying) human-interpretable concepts stored in representations within neural networks.

Finally, we cover AI governance. So far in this course we have focused on technical alignment solutions. We also require strategic and political solutions to manage the safe deployment of potential AGI systems, as well as to deal with new questions and risks arising from the existence of AGI.

We'll focus specifically on new considerations arising from the development of advanced AI systems. Existing AI systems already pose tangible harms (which may well be exacerbated by more powerful systems) that also imply a need for AI governance, but we won't cover these explicitly.

Core readings:

  1. Zoom In: an introduction to circuits (Olah et al., 2020) (35 mins)

    • Olah et al. (2020) explore how neural circuits build up representations of high-level features out of lower-level features. This work contributes towards reverse-engineering neural networks to understand what representations they have internalised, and is a cornerstone of the 'mechanistic interpretability' field. 'Mechanistic' refers to the highly methodical, 'mechanical' way of examining representations from different layers in the neural network.

  2. Toy models of superposition (Elhage et al., 2022) (only sections 1 and 2) (30 mins)

    • Working towards understanding why some neurons respond to multiple unrelated features ('polysemanticity'), Elhage et al. discover that toy models use 'superposition' to store more features than they have dimensions. For a minimal illustration of this toy setup, see the code sketch after this list.

  3. Locating and Editing Factual Associations in GPT (Meng et al., 2022) (10 mins)

    • Meng et al. show how concept-based interpretability can be used to modify network weights in semantically meaningful ways, demonstrating how it may directly help with elements of alignment.

  4. Chris Olah’s views on AGI safety (Hubinger, 2019) (20 mins)

    • Chris Olah is the researcher many consider the founder of the field of mechanistic interpretability; he currently works on interpretability research at Anthropic. This blog post gives a big-picture overview of the field by explaining Olah’s views on topics such as why interpretability is useful for AI alignment and which kinds of interpretability research are most useful for gaining a better understanding of AI models.

  5. AI Governance: Opportunity and Theory of Impact (Dafoe, 2020) (20 mins)

    • Dafoe gives a thorough overview of AI governance and why it might be important, particularly framing AI governance as field-building. The piece covers three perspectives on how AI might develop, some concrete pathways to risk, and a framework for thinking about how to reduce those risks from a governance perspective.

  6. Racing through a minefield: the AI deployment problem (Karnofsky, 2022) (15 mins)

    • Karnofsky introduces the problem of AI deployment and makes the case for caution in deploying AI over the next century. He discusses several ways we could exercise this caution, namely: alignment research, threat assessment, avoiding races, selective information sharing, global monitoring, and defensive deployment.
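
As referenced in reading 2 above, here is a minimal sketch of the "toy models of superposition" setup. It assumes PyTorch; the feature counts, sparsity level, and training details are our own illustrative choices rather than the paper's exact configuration.

```python
# Toy superposition sketch: n sparse features are compressed into m < n hidden
# dimensions and then reconstructed. Illustrative, not the paper's exact setup.
import torch

n_features, n_hidden = 20, 5           # more features than dimensions
sparsity = 0.05                        # each feature active ~5% of the time

W = torch.nn.Parameter(torch.randn(n_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Sparse inputs: each feature is present with low probability.
    mask = (torch.rand(256, n_features) < sparsity).float()
    x = mask * torch.rand(256, n_features)

    h = x @ W.T                        # compress into n_hidden dimensions
    x_hat = torch.relu(h @ W + b)      # reconstruct all n_features
    loss = ((x - x_hat) ** 2).mean()

    opt.zero_grad(); loss.backward(); opt.step()

# When features are sparse, the learned feature directions (columns of W) are
# typically non-orthogonal: the model stores more features than it has
# dimensions by letting them interfere slightly (superposition).
print(torch.nn.functional.cosine_similarity(W[:, 0], W[:, 1], dim=0))
```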

Further readings on interpretability:

  1. A mathematical framework for transformer circuits (Elhage et al., 2021) (90 mins)

    • Elhage et al. build on previous circuits work to analyze transformers, the neural network architecture used by most of today’s cutting-edge models. For a deeper dive into the topic, see the associated videos.

  2. Rewriting a deep generative model (Bau et al., 2020) (20 mins)

    • Bau et al. find a way to change individual associations within a neural network, which allows them to replace specific components of an image. For work along similar lines in language models, see here.

  3. A Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022) (50 mins)

    • Wang et al. provide an excellent example of how we might go about interpreting language models from a zoomed-out viewpoint instead of zooming in on every individual neuron.

  4. Language models can explain neurons in language models (Bills et al., 2023) (15 mins)

    • OpenAI researchers provide an example of how language models can already help us do better interpretability research.

  5. Understanding intermediate layers using linear classifier probes (Alain and Bengio, 2016) (only sections 1 and 3) (15 mins)

    • This paper introduces linear probing, a crucial tool in concept-based interpretability: training a simple linear classifier on a frozen network's intermediate activations to test what information they encode. A minimal sketch appears after this list.

  6. Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021) (only sections 1 and 2) (20 mins)

    • This paper provides a case study using concept-based interpretability techniques to understand AlphaZero’s development of human chess concepts.

  7. Eliciting Latent Knowledge (Christiano, Cotra and Xu, 2021) (only read the sections that look interesting to you)

    • This reading outlines the research agenda of Paul Christiano’s Alignment Research Center. The problem of eliciting latent knowledge can be seen as a long-term goal for interpretability research.

  8. Towards Developmental Interpretability (Hoogland et al., 2023) (10 mins)

    • This article gives an overview of a very recent interpretability approach: attempting to understand the emergence of complex abilities in neural networks by studying phase transitions during training, through the lens of Singular Learning Theory.
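
As mentioned in reading 5 above, linear probing can be illustrated with a short sketch. This assumes PyTorch; the network architecture, layer choice, and data are illustrative stand-ins rather than the paper's experimental setup.

```python
# Linear probing sketch: freeze a trained network, extract activations from an
# intermediate layer, and fit a linear classifier on them to test what that
# layer linearly encodes.
import torch
import torch.nn as nn

# Stand-in for a pretrained, frozen network.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),    # we probe the 128-d output of this block
    nn.Linear(128, 10),
)
for p in model.parameters():
    p.requires_grad_(False)

def intermediate_activations(x):
    # Run only the first four modules to obtain the hidden representation.
    for layer in list(model)[:4]:
        x = layer(x)
    return x

probe = nn.Linear(128, 10)             # the linear probe itself
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Dummy data standing in for a real labelled dataset.
inputs = torch.randn(512, 784)
labels = torch.randint(0, 10, (512,))

for step in range(200):
    loss = loss_fn(probe(intermediate_activations(inputs)), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# High probe accuracy suggests the probed concept (here, the class label)
# is linearly decodable from that layer's representation.
acc = (probe(intermediate_activations(inputs)).argmax(-1) == labels).float().mean()
print(f"probe accuracy: {acc:.2f}")
```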

Further readings on governance:

  1. Verifying Agreements on Large Scale ML Training via Compute Monitoring (Shavit, 2023) (only sections 1, 2 and 3.1) (20 mins)

    • This article introduces a strategic plan for regulating large-scale training runs via compute monitoring.

  2. Why and How Governments Should Monitor AI Development (Whittlestone and Clarke, 2021) (15 mins)

    • Whittlestone and Clarke advocate for the monitoring of AI development by governments. Evaluation and monitoring capabilities are likely necessary for being able to check that AI systems conform to standards before deployment.

  3. AI Policy Levers (Fischer et al., 2021) (10 mins)

    • The Centre for the Governance of AI (GovAI) lay out several 'policy levers' available to the US Government for regulating the development and deployment of AI systems. Many of these levers generalise to other governments, too.

  4. Information security considerations for AI and the long term future (Ladish and Heim, 2022) (15 mins)

    • Model weights and algorithms relating to AGI are likely to be highly economically valuable, meaning there will likely be substantial competitive pressure to gain these resources, sometimes illegitimately. This article discusses the information security considerations around this issue.

  5. Exploration of secure hardware solutions for safe AI deployment (Future of Life Institute) (10 mins)

Further readings that give a holistic overview of the field of AI safety:

  1. Current work in AI alignment | Paul Christiano | EA Global: San Francisco 2019 (finish at 25:45) (25 mins)

    • We've covered a lot of topics in this course. This video puts them into perspective, showing how the solutions we’ve discussed over the last few weeks relate to the problems introduced in the first weeks.

  2. An overview of 11 proposals for building safe advanced AI (Hubinger, 2020) (25 mins)

  3. Alignment Careers Guide (Rogers-Smith, 2022) (60 mins)

    • This article is long, but full of action-guiding advice that can help you narrow down which skills you want to build, or what sort of long-term path in technical alignment you might want to pursue.

  4. Careers in Alignment (Ngo, 2022) (30 mins)

    • Ngo compiles a number of resources for thinking about careers in alignment research. Use this resource to get a sense of the career types that exist in technical alignment research, and to consider which paths suit and excite you.

  5. Levelling Up in AI Safety Research Engineering (Mukobi, 2022)

    • A helpful guide laying out suggested steps for building the skills needed for an eventual role as a machine learning research engineer. These skills are highly applicable to many roles at alignment organisations.

Resource spotlight: 80,000 Hours

80,000 Hours is a nonprofit organization that provides research and guidance to help individuals make high-impact career choices. It offers many useful resources for people interested in AI alignment, such as in-depth career guides for working in AI safety and AI policy, individual coaching, and podcast episodes with experienced AI safety researchers. Beyond AI alignment, it also provides in-depth overviews of several other fields that might have an outsized impact on the long-term future, such as preventing nuclear war and catastrophic pandemics.

Resource spotlight: aisafety.training and aisafety.events

AI Safety Training and AI Safety Events are databases of training programs, courses, conferences, and other events for AI safety. They provide an overview of events all over the world and are excellent resources for finding follow-up courses and research programs to take your next steps in the field of AI safety after completing this course.

Exercises:

  1. Interpretability work on artificial neural networks is closely related to neuroscience work on human brains. Describe two ways in which the former is easier than the latter, and two ways in which it’s harder. Optionally, read this article by Chris Olah comparing the two.

  2. What are the individual tasks involved in machine learning research (or some other type of research important for technological progress)? Identify the parts of the process which have already been automated, the parts of the process which seem like they could plausibly soon be automated, and the parts of the process which seem hardest to automate.

  3. In what ways has humanity’s response to large-scale risks other than AI (e.g. nuclear weapons, pandemics) been better than we would have expected beforehand? In what ways has it been worse? What can we learn from this?