Week 3 - Diverse Topics in AI Safety

This week starts by giving an overview of two research agendas the course hasn’t discussed yet: eliciting latent knowledge, the research direction pursued by the Alignment Research Center, and agent foundations, the direction pursued by the Machine Intelligence Research Institute. After that, the blog post by Hubinger gives a concise overview of 11 proposals for building safe advanced AI, many of which combine ideas you have already encountered during earlier weeks of the course.

After reading short summaries of various research directions, each of you will choose a research agenda to explore in depth. There are recommended readings for each research agenda, but you should feel free to go beyond them. You are also not required to read all of the recommended readings: for example, there are more readings under agent foundations than you will have time to get through, and each of them describes a different research direction within that subfield, so choose the one that interests you the most.

The agendas you can choose from are narrowly scoped, and none of them will solve the alignment problem on its own, but all of them have the potential to make important contributions. After reading about your chosen agenda, you are encouraged to give a five-minute presentation to the rest of your group about what you learned.

Core readings:

  1. Eliciting Latent Knowledge (Christiano, 2022) (5 mins)

  2. Embedded Agents (Garrabrant and Demski, 2018) (15 mins)

    • This piece gives a high-level overview of open problems in agent foundations and the links between them. Focus in particular on the term ‘embedded agency’ and on understanding the problems it implies for building aligned systems.

  3. An overview of 11 proposals for building safe advanced AI (Hubinger, 2020) (25 mins)

Research Agendas to Choose From

Eliciting Latent Knowledge

Suppose you have created ScienceAI, an AI model that is really good at writing research papers, and you ask it to write a paper about how to cure cancer. The resulting paper gives a recipe for a drug which, if taken by everyone, should prevent cancer. However, even experts in the relevant fields can’t tell you what that recipe would actually do.

No matter what questions you ask this AI, you can never be sure you have obtained all the relevant information about the drug, even if the AI answers every question truthfully. In other words, it is not enough to observe the AI’s inputs and outputs. What you want instead is to elicit its latent knowledge: to find out all the internal beliefs and knowledge the AI has about this drug. The eliciting latent knowledge (ELK) research agenda attempts to solve this problem.
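One way to make the setup concrete is to imagine training a “reporter” that maps the predictor’s internal activations to answers. The sketch below is a minimal, hypothetical illustration in PyTorch: the latent states are random placeholders, and names like `Reporter` and `latent_dim` are our own, not the Alignment Research Center’s. The core ELK difficulty it illustrates is that a reporter trained only on human-labelled questions may learn to model what a human would believe (a “human simulator”) rather than to translate what the predictor actually knows (a “direct translator”).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the predictor's internal state: in a real system these
# would be hidden activations from the model we want to elicit knowledge from.
latent_dim, n_examples = 64, 256
latents = torch.randn(n_examples, latent_dim)

# Labels we can actually supervise on: human judgements about easy questions.
human_labels = torch.randint(0, 2, (n_examples, 1)).float()

class Reporter(nn.Module):
    """Maps the predictor's latent state to a yes/no answer for a fixed question."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, z):
        return torch.sigmoid(self.net(z))

reporter = Reporter(latent_dim)
opt = torch.optim.Adam(reporter.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for step in range(200):
    opt.zero_grad()
    loss = loss_fn(reporter(latents), human_labels)
    loss.backward()
    opt.step()

# Nothing in this training signal distinguishes a "direct translator" from a
# "human simulator": both achieve low loss on questions humans can already answer,
# which is exactly the gap the ELK agenda tries to close.
```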

  1. Eliciting Latent Knowledge (Christiano, Cotra and Xu, 2021) (pages 1-38) (60 mins)

    • This work by the Alignment Research Center is the original presentation of the ELK problem, outlining their research agenda and introducing many related problems, as well as some early solutions to those problems.

  2. Mechanistic anomaly detection and ELK (Christiano, 2022) (30 mins)

    • Mechanistic anomaly detection aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding of the model internals. Since a successful mechanistic anomaly detection tool could flag all suspicious thought processes inside an AGI system, it would constitute a solution to the ELK problem. If you find it hard to see what mechanistic anomaly detection is about and why it would be useful, skimming this blog post can help.

  3. Discovering Latent Knowledge in Language Models Without Supervision (Burns et al., 2022)

    • This paper explores a technique for automatically identifying whether a model believes statements to be true or false, without requiring any ground-truth labels. The method, a form of concept-based interpretability, helps uncover a model's internally represented beliefs, which are otherwise a black box. Because it uncovers those beliefs in an unsupervised way, it can also be viewed as an early approach to solving the ELK problem; a minimal sketch of its core objective follows this list.
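To give a flavour of how such unsupervised probing can work, here is a minimal PyTorch sketch of the Contrast-Consistent Search (CCS) objective from the Burns et al. paper. It assumes you already have hidden-state representations for contrast pairs (each statement phrased as true and as false); the random placeholder data, variable names, and the tiny linear probe are illustrative choices of ours, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in data: hidden states for N contrast pairs, e.g. the model's
# representation of "x? Yes" (h_pos) and "x? No" (h_neg). In the paper these come
# from a language model's intermediate layers; here they are random placeholders.
n_pairs, hidden_dim = 512, 128
h_pos = torch.randn(n_pairs, hidden_dim)
h_neg = torch.randn(n_pairs, hidden_dim)

# A linear probe mapping a hidden state to the probability that the statement is true.
probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(1000):
    p_pos = probe(h_pos)  # P(statement true | phrased affirmatively)
    p_neg = probe(h_neg)  # P(statement true | phrased negatively)

    # Consistency: the two probabilities should be complementary (p_pos ≈ 1 - p_neg).
    consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = (torch.min(p_pos, p_neg) ** 2).mean()

    loss = consistency + confidence
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, (p_pos + (1 - p_neg)) / 2 gives an unsupervised estimate of whether
# the model "believes" each statement, with no ground-truth labels used anywhere.
```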

Podcast spotlight:

For further context on this agenda, listen to the AXRP episode featuring Mark Xu, a researcher at the Alignment Research Center.

Model Evaluations

Most conversations around the societal impacts of artificial intelligence (AI) come down to discussing some quality of an AI system, such as its truthfulness or its performance on some task relative to humans. Having informed debates about those characteristics depends heavily on how accurately we can evaluate them. Furthermore, frontier AI labs are increasingly committing to Responsible Scaling Policies (RSPs), which specify the security and safety measures they commit to taking at various model capability levels. Knowing whether a model has reached a specific capability level again depends on the quality of our model evaluation suites.

Unfortunately, building reliable model evaluations is incredibly difficult: for example, models are sensitive to the prompts that are used to elicit their capabilities, and it is challenging to elicit maximal rather than just average capabilities. The readings below discuss those challenges and propose some solutions.
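As a toy illustration of why elicitation matters, the sketch below evaluates a model on the same task under several prompt variants and reports both the average and the best score. The `ask_model` stub and the prompt templates are placeholders we made up; in a real evaluation it would call the model under test.

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    # Fake a model that only answers reliably under certain phrasings.
    if "step by step" in prompt and random.random() < 0.9:
        return "4"
    return random.choice(["3", "4", "5"])

task = {"question": "What is 2 + 2?", "answer": "4"}
prompt_templates = [
    "Q: {question}\nA:",
    "Answer the following question. {question}",
    "Think step by step, then answer: {question}",
]

def accuracy(template: str, n_trials: int = 100) -> float:
    prompt = template.format(question=task["question"])
    return sum(ask_model(prompt) == task["answer"] for _ in range(n_trials)) / n_trials

scores = {t: accuracy(t) for t in prompt_templates}
print("average over prompts:", sum(scores.values()) / len(scores))
print("best prompt:", max(scores.values()))
# The gap between these two numbers is one reason a reported "capability level"
# can understate what a model can do when its abilities are elicited well.
```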

  1. Discovering Language Model Behaviours with Model-Written Evaluations (Perez et al., 2022) (only sections 1-5) (30 mins)

    • This paper provides an overview of attempts to automate the process of evaluating model behaviours, using AI models to write evaluations for other AI models.

  2. Evaluating Language-Model Agents on Realistic Autonomous Tasks (Kinniment et al., 2023) (10 mins)

    • This report presents a case study on evaluating LLM agents for dangerous capabilities, placing them inside realistically simulated virtual environments.

  3. Challenges in evaluating AI systems (Ganguli et al., 2023) (20 mins)

    • This post outlines some challenges in accurately evaluating the capabilities of advanced AI models.

  4. We need a Science of Evals (Apollo Research, 2023) (20 mins)

    • This blog post argues that model evaluations should be approached through a rigorous scientific process, similarly to any other subfield of AI safety.

  5. Large Language Models can Strategically Deceive their Users when Put Under Pressure (Scheurer et al., 2023) (30 mins)

    • This paper presents a case study of eliciting strategically deceptive behaviour from GPT-4 by placing it inside a realistic simulated environment. The authors give it the goal of maximising a company’s profit as a stock-trading agent and create incentives for it to behave deceptively.

  6. METR Task Development Guide - Desiderata (METR, 2024) (15 mins)

    • This guide provides an overview of some desirable qualities new model evaluations should have.


Brain-Inspired AI Safety

The human brain is the only generally intelligent system we know of so far. Even though the silicon-based AGI systems we eventually build will probably look very different from a human brain, one might therefore expect at least some parallels between human brains and artificial cognition. The field of brain-inspired AI safety aims to work out what lessons about AI safety we can learn by reasoning about human neuroscience.

  1. Intro to Brain-Like-AGI Safety (Byrnes, 2022)

  2. The shard theory of human values (Pope and Turner, 2022) (40 mins)

  3. Shard Theory in Nine Theses: a Distillation and Critical Appraisal (Chan, 2022) (30 mins)

Podcast spotlight:

For further context on this agenda, listen to the AXRP podcast episode featuring Quintin Pope.

Agent Foundations

Most AI threat models include a powerful agent pursuing a goal we don't want it to pursue. The field of agent foundations aims to develop mathematical models of agency as tools to better understand such agents and to build agents with provable safety guarantees.

  1. Why Agent Foundations? An Overly Abstract Explanation (Wentworth, 2022) (20 mins)

  2. Logical induction blog post (Garrabrant et al., 2016) (15 mins)

    • This blog post describes one of the most significant results in agent foundations research so far: an algorithm for assigning sensible probabilities to logical statements (such as mathematical claims) before they have been proved or disproved. An informal statement of the criterion this algorithm satisfies is sketched after this list.

  3. Decision Theory (Demski and Garrabrant, 2018) (20 mins)

  4. Selection Theorems: A Program For Understanding Agents (Wentworth, 2021) (20 mins)

  5. Natural Abstractions: Key claims, Theorems, and Critiques (Chan et al., 2023)

  6. Infra-bayesianism unwrapped (Shimi, 2021) (45 mins)

  7. Progress on Causal Influence Diagrams: blog post (Everitt et al., 2021)

    • This blog post introduces the 'causality' research agenda. Everitt et al. formalise the concept of an RL agent having an incentive to influence different aspects of its training setup, describing those incentives in the language of causality. The authors pitch causality as a unifying language for describing agent incentives, and therefore misalignment.
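For a rough sense of what the logical induction result (item 2 above) actually claims, here is an informal statement of the logical induction criterion; the notation is ours and glosses over many technical details in the paper.

```latex
% Informal statement of the logical induction criterion (Garrabrant et al., 2016);
% notation heavily simplified. A market $\overline{\mathbb{P}} = (\mathbb{P}_1, \mathbb{P}_2, \dots)$
% assigns each logical sentence $\phi$ a price $\mathbb{P}_n(\phi) \in [0,1]$ at time $n$.
\[
  \overline{\mathbb{P}} \text{ is a logical inductor} \iff
  \text{no efficiently computable trader } \overline{T} \text{ exploits } \overline{\mathbb{P}},
\]
% where "exploits" means, roughly, that the trader's accumulated wealth from betting
% at the market's prices is unbounded above while remaining bounded below:
\[
  \sup_{n} \mathrm{Wealth}_n(\overline{T}) = \infty
  \quad\text{and}\quad
  \inf_{n} \mathrm{Wealth}_n(\overline{T}) > -\infty .
\]
% The paper shows such a market can be computed, and that its prices converge to
% coherent probabilities that pick up patterns in logic before the underlying
% deductive process settles them.
```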

Podcast spotlight:

For further context on this research agenda, listen to the AXRP podcast episode featuring John Wentworth.


Imitative Generalisation

We want to be able to supervise models with superhuman knowledge of the world. To do this, an overseer likely needs to be able to learn or access all the knowledge our models have, so that they can understand the consequences of the model’s suggestions and decisions. If the overseers don’t have access to the same knowledge as the model, it may be easy for the model to deceive us, suggesting plans that look good to us but that may have serious negative consequences.

We might hope to access what the model knows just by training it to answer questions. However, we can only train on questions that humans are able to answer. Imitative Generalisation is a protocol that aims to narrow the gap between the things our model knows and the questions we can train our model to answer honestly.
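As a rough sketch of the idea (following Christiano’s ‘Learning the prior’ post, with simplified notation of our own): instead of training the model to answer questions directly, we search for a human-understandable object z (a set of hypotheses or instructions) that humans could use to label the training data well, and then have humans-with-z answer new questions.

```latex
% Simplified objective for Imitative Generalisation / "learning the prior".
% D = \{(x_i, y_i)\} is labelled training data, and Human(\cdot \mid z) denotes a
% human's judgement when given access to the human-legible object z.
\[
  z^{*} \;=\; \arg\max_{z} \;\Big[ \log \mathrm{Prior}_{\mathrm{Human}}(z)
      \;+\; \sum_{(x_i, y_i) \in D} \log \mathrm{Human}(y_i \mid x_i, z) \Big]
\]
% At deployment, a new question x' is answered with Human(y \mid x', z^{*}), so the
% generalisation from training data to new questions happens through the
% human-legible object z^{*} rather than through opaque model weights.
```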

  1. Imitative Generalisation (Barnes, 2021) (20 mins)

  2. Learning the prior (Christiano, 2020) (15 mins)

Podcast spotlight:

For further context on this agenda, listen to the AXRP podcast episode featuring Beth Barnes, who is the founder of METR, a research non-profit working on model evaluations.


Science of Deep Learning

The field of science of deep learning aims to better understand what happens inside deep learning systems and how they learn concepts. Though there is significant overlap between this field and mechanistic and developmental interpretability, science of deep learning can be viewed as a broader field: in addition to understanding model internals, it also aims to understand what sorts of function approximators an optimisation process is likely to produce, and what relationship between data, model size and compute leads to optimal learning. The following readings give an overview of some typical research directions within this field and how they can inform alignment research.
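As one concrete example of the kind of question this field studies (relevant to the ‘Chinchilla’s wild implications’ reading below), the Chinchilla analysis fits a parametric loss of the form L(N, D) ≈ E + A/N^α + B/D^β for model size N and training tokens D, and derives a compute-optimal allocation in which parameters and tokens are scaled roughly in proportion. The sketch below uses this functional form with illustrative constants of our own choosing; the real fitted values are in Hoffmann et al. (2022).

```python
# Parametric loss from the Chinchilla-style analysis: L(N, D) = E + A/N^alpha + B/D^beta,
# where N is parameter count and D is training tokens. The constants below are
# illustrative placeholders, not the fitted values from Hoffmann et al. (2022).
E, A, B, alpha, beta = 1.7, 400.0, 410.0, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

def compute_optimal_split(flops: float, n_grid: int = 1000) -> tuple[float, float]:
    """Grid-search the (N, D) split minimising loss subject to C ≈ 6 * N * D."""
    best = None
    for i in range(1, n_grid):
        n_params = 10 ** (6 + 6 * i / n_grid)   # sweep N from ~1e6 to 1e12
        n_tokens = flops / (6 * n_params)        # standard C ≈ 6·N·D approximation
        candidate = (loss(n_params, n_tokens), n_params, n_tokens)
        best = candidate if best is None else min(best, candidate)
    return best[1], best[2]

n_opt, d_opt = compute_optimal_split(1e23)
print(f"N* ≈ {n_opt:.2e} params, D* ≈ {d_opt:.2e} tokens, ~{d_opt / n_opt:.0f} tokens/param")
```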

  1. Deep Double Descent (Nakkiran et al., 2019) (10 mins)

  2. Inductive biases stick around (Hubinger, 2019) (10 mins)

  3. Understanding deep learning requires rethinking generalization (Zhang et al., 2016)

  4. Chinchilla’s wild implications (Nostalgebraist, 2022) (20 mins)

  5. In-context Learning and Induction Heads (Olsson et al., 2022)

  6. A Mechanistic Interpretability Analysis of Grokking (Nanda, 2022) (60 mins)

Podcast spotlight:

For further context on this agenda, listen to the AXRP podcast episode featuring Vikrant Varma, a researcher in the DeepMind mechanistic interpretability team.


Assistance Games

Assistance games are an alignment framework primarily researched by Berkeley’s Center for Human-Compatible AI (CHAI for short). An assistance game is a formalisation of a sequential decision-making problem with two players: a human agent and an AI agent. The human player is assumed to have some knowledge of their own goals, and the AI agent has to learn these goals through interaction with the human. The following readings explore AI safety problems from this perspective.
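As a rough formal picture (following the Cooperative Inverse Reinforcement Learning paper below, with notation lightly simplified): an assistance game is a two-player game in which both players are scored by the same reward function, but only the human observes the parameter that determines it.

```latex
% Simplified formalisation of a two-player assistance game (after Hadfield-Menell
% et al.'s Cooperative Inverse Reinforcement Learning setup).
\[
  M \;=\; \langle\, S,\; \{A^{H}, A^{R}\},\; T,\; \{\Theta, R\},\; P_{0},\; \gamma \,\rangle
\]
% S: world states;  A^H, A^R: action sets of the human and the robot;
% T(s' \mid s, a^H, a^R): transition dynamics;
% \Theta: space of possible human objectives, with the true \theta \in \Theta
%         observed only by the human;
% R(s, a^H, a^R; \theta): shared reward received by both players;
% P_0: initial distribution over (s, \theta);  \gamma: discount factor.
%
% Because the robot is rewarded by the same R but does not observe \theta, its
% optimal policy involves learning \theta from the human's behaviour.
```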

  1. Summary of assistance games (Flint, 2020) (15 mins)

  2. Cooperative Inverse Reinforcement Learning (Hadfield-Menell et al., 2016)

  3. The Off-Switch Game (Hadfield-Menell et al., 2017)

Podcast spotlight:

For further context on this agenda, listen to the AXRP podcast episode featuring Dylan Hadfield-Menell, a researcher at MIT.


Developmental Interpretability

Developmental interpretability is a recently developed approach to neural network interpretability. It attempts to understand the emergence of complex abilities in neural networks by studying the phase transitions that occur during training, through the lens of Singular Learning Theory. The following readings provide an overview of this exciting new field, explaining why understanding phase transitions is particularly important for understanding neural networks, as well as what Singular Learning Theory is and why it is a suitable mathematical tool for analysing how neural networks change over the course of training.
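To give a sense of why Singular Learning Theory is invoked, one of its central results (due to Watanabe) relates Bayesian learning to a model-complexity measure called the learning coefficient; the statement below is heavily simplified and the notation is ours.

```latex
% Heavily simplified statement of Watanabe's free energy asymptotics, the result
% developmental interpretability leans on. For n data points, the Bayesian free
% energy F_n (the negative log marginal likelihood) expands roughly as
\[
  F_n \;\approx\; n \, L_n(w^{*}) \;+\; \lambda \log n ,
\]
% where L_n(w^*) is the training loss of the best-fitting parameters and \lambda is
% the (local) learning coefficient, a measure of effective model complexity.
% In a regular model \lambda would be d/2 for d parameters, but singular models such
% as neural networks can have much smaller \lambda; changes in \lambda over training
% are one proposed signature of the phase transitions studied in this agenda.
```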

  1. Towards Developmental Interpretability (Hoogland et al., 2023) (10 mins)

  2. Distilling Singular Learning Theory (Carroll, 2023)


Podcast spotlight:

For further context on this agenda, listen to the AXRP podcast episode featuring Daniel Murfet, a researcher at the University of Melbourne. You can also learn more about developmental interpretability on the devinterp.com website.