Week 4 - Scalable Oversight and Model Evaluations
This week introduces scalable oversight as an approach to preventing reward misspecification, discusses two scalable oversight proposals, iterated amplification and debate, and closes with an overview of model evaluations.
Scalable oversight refers to methods that enable humans to oversee AI systems solving tasks too complicated for a single human to evaluate. This week begins by motivating the problem of scalable oversight; we then examine iterated amplification and debate as potential solutions to the problem.
Iterated amplification is built around task decomposition: the strategy of training agents to perform well on complex tasks by breaking them down into smaller subtasks that can be more easily evaluated, and then combining the subtask solutions into an answer to the full task. Iterated amplification repeatedly applies this decomposition to train increasingly powerful agents.
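To make the decomposition idea concrete, here is a minimal sketch in Python. Everything in it (the model object and its decompose, is_simple, solve_directly and combine methods) is a hypothetical placeholder rather than part of Christiano et al.'s implementation; a real system would also distill the amplified behaviour back into the model.

```python
# Hedged sketch of task decomposition, the core loop behind iterated
# amplification. `model` and all of its methods are hypothetical stand-ins.

def amplified_solve(task, model, depth=0, max_depth=3):
    """Answer a task by recursively decomposing it into easier subtasks."""
    if depth >= max_depth or model.is_simple(task):
        # Base case: the task is easy enough to answer (and evaluate) directly.
        return model.solve_directly(task)
    # Amplification step: split the task, solve the pieces, then combine them.
    subtasks = model.decompose(task)
    sub_answers = [amplified_solve(sub, model, depth + 1, max_depth) for sub in subtasks]
    return model.combine(task, sub_answers)
```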
We then discuss another potential solution to the problem of scalable oversight: debate. Compared to iterated amplification, debate has the advantage of not relying on the task decomposability assumption. However, a different strong assumption is required for the debate protocol to work: that truthful arguments are more persuasive.
Finally, we discuss model evaluations: the problem of accurately assessing the capabilities of the models we build, both to determine whether they are safe to use and, if they are deemed safe, to gauge how far away they are from capabilities that could threaten humans.
Core readings:
Learning complex goals with iterated amplification (Christiano, Shlegeris and Amodei, 2018) (10 mins)
Christiano et al. describe the iterated amplification algorithm and demonstrate it using toy experiments. Iterated amplification is a proposal for scaling aligned subsystems to solve complex tasks. If you’re interested in this approach, you can also read the full paper corresponding to this blog post.
AI safety via debate (Irving and Amodei, 2018)
This blog post introduces another approach to scalable oversight: debate. Debate involves back-and-forth natural-language exchanges between multiple AIs, intended to make it easier for humans to judge which AI is being more truthful. If interested, you can also read the full paper corresponding to this blog post.
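As a rough illustration, the sketch below implements a two-player debate loop under the assumption described above; the debater and judge objects and their methods are hypothetical stand-ins for language-model-backed components, not an API from the paper.

```python
# Hedged sketch of a two-player debate protocol. The debater and judge
# objects, and their `argue` / `pick_winner` methods, are hypothetical.

def run_debate(question, debater_a, debater_b, judge, n_rounds=4):
    """Alternate arguments between two debaters, then ask a judge to decide."""
    transcript = [f"Question: {question}"]
    debaters = [("A", debater_a), ("B", debater_b)]
    for turn in range(n_rounds):
        name, speaker = debaters[turn % 2]
        argument = speaker.argue(question, transcript)
        transcript.append(f"Debater {name}: {argument}")
    # Key assumption of the protocol: truthful arguments are more persuasive
    # to the judge, so the winning debater's answer can (hopefully) be trusted.
    return judge.pick_winner(question, transcript)
```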
Scalable agent alignment via reward modelling (Leike, 2018) (10 mins)
This blog post describes a third approach to scalable oversight: recursive reward modeling.
OpenAI’s approach to alignment research (Leike, Schulman and Wu, 2022) (15 mins)
This blog post discusses the three main pillars of OpenAI’s approach to alignment: training AI systems using human feedback, training AI systems to assist human evaluation, and training AI systems to do alignment research. It also discusses how useful the three scalable oversight protocols you just learned about are for this approach.
Red Teaming Language Models with Language Models (Perez et al., 2022) (10 mins)
Perez et al. use a language model to automatically generate test cases that elicit misbehavior from a target language model, without access to the target's weights, making this a 'black-box' attack. This is an example of generating 'unrestricted' adversarial examples: 'unrestricted' refers to the fact that the attacking language model can generate any example, whereas (restricted) adversarial examples are usually closely related to existing training datapoints.
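The loop below is a simplified sketch of this kind of black-box red teaming; the attacker, target, and harm-classifier wrappers and their methods are assumptions for illustration, not the exact setup from the paper.

```python
# Hedged sketch of black-box red teaming with a language model.
# `attacker`, `target`, and `harm_classifier` are hypothetical model wrappers.

def red_team(attacker, target, harm_classifier, n_cases=1000):
    """Generate test prompts with one LM and keep those that make the target misbehave."""
    failures = []
    for _ in range(n_cases):
        prompt = attacker.generate_test_case()   # attacker LM proposes a test input
        reply = target.respond(prompt)           # black-box query: no gradients or weights needed
        if harm_classifier.is_harmful(prompt, reply):
            failures.append((prompt, reply))     # record the failure for later analysis
    return failures
```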
Challenges in evaluating AI systems (Ganguli et al., 2023) (20 mins)
This post outlines some challenges in accurately evaluating the capabilities of an advanced AI model.
Safety evaluations and standards for AI | Beth Barnes | EAG Bay Area 23 (30 mins)
This video provides an overview of some current approaches to designing model evaluations and a theory of impact for how model evaluations can be useful for AI alignment.
Further readings:
Weak-to-strong generalization (Burns et al., 2023) (10 mins)
This blog post discusses some promising initial results on the supervision of more capable models with less capable ones.
Measuring Progress on Scalable Oversight for Large Language Models (Bowman et al., 2022) (only the introduction) (5 mins)
The paper introduces the experimental setup of 'sandwiching': an AI system is 'sandwiched' between human domain experts and laypeople when it is more capable than the laypeople but less capable than the experts in a given domain. Sandwiching lets us test today whether proposed scalable oversight techniques will work on the systems of tomorrow, for which no human experts will be capable enough to act as supervisors.
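The sketch below shows one way a sandwiching experiment could be scored, assuming hypothetical lay-judge, model, and protocol components; it illustrates the idea rather than reproducing the paper's setup.

```python
# Hedged sketch of scoring a 'sandwiching' experiment. The lay judge, model,
# and oversight protocol are hypothetical; expert labels act as ground truth.

def sandwiching_accuracy(questions, lay_judge, model, expert_labels, protocol):
    """Fraction of questions on which the assisted lay judge matches the experts."""
    correct = 0
    for q in questions:
        # The protocol (e.g. debate, critiques, or plain Q&A) mediates how the
        # lay judge uses the more-capable model to reach an answer.
        answer = protocol(lay_judge, model, q)
        if answer == expert_labels[q]:
            correct += 1
    return correct / len(questions)
```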
Supervising strong learners by amplifying weak experts (Christiano, Shlegeris and Amodei, 2018) (45 mins)
This is the full paper accompanying the blog post about iterated amplification in the core readings. It provides a significantly more in-depth overview of the iterated amplification protocol.
Iterated Distillation and Amplification (Cotra, 2018) (20 mins)
This reading and the next video explain the iterated amplification protocol from a different perspective.
How to Keep Improving When You're Better Than Any Teacher - Iterated Distillation and Amplification (10 mins)
Language Models Perform Reasoning via Chain of Thought (Wei and Zhou, 2022) (10 mins)
Chain of thought is a technique for prompting large language models to give better answers by working through a sequence of intermediate reasoning steps. It is an example of breaking complex problems down into simpler steps, in a similar vein to the task decomposition used in the iterated amplification protocol.
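For illustration, here is a minimal chain-of-thought prompt in the style of the examples from the reading; `complete` is a hypothetical stand-in for any text-completion call.

```python
# Minimal chain-of-thought prompt: the worked example shows the model the
# intermediate reasoning steps we want it to imitate before giving an answer.
# `complete` is a hypothetical stand-in for a text-completion API call.

FEW_SHOT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.

Q: A baker had 23 apples. She used 20 of them for pies and then bought 6 more. How many apples does she have now?
A:"""

def answer_with_chain_of_thought(complete):
    # The model is expected to continue with its own reasoning steps,
    # e.g. "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
    return complete(FEW_SHOT_PROMPT)
```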
AI-written critiques help humans notice flaws (Leike et al., 2022) (10 mins)
The authors train a language model to critique the outputs of another language model, helping humans evaluate it. This is a simple example of the debate protocol. Note in particular the gap between discrimination and critique ability, which is an important gap to close.
Debate update: obfuscated arguments problem (Barnes and Christiano, 2020) (20 mins)
Barnes and Christiano describe some theoretical problems with the debate protocol which may need to be surmounted before it can be used to evaluate more complex tasks.
Progress on AI safety via debate (Barnes et al., 2020) (60 mins)
Barnes et al. discuss progress on developing the debate protocol in the two years after it was first introduced.
Robust Feature-Level Adversaries are Interpretability Tools (Casper, 2021) (40 mins)
While the core reading by Perez et al. described black-box adversarial attacks, Casper et al. construct attacks by manipulating high-level features of inputs using access to network weights, making this a 'white-box' attack. This is another case of generating unrestricted adversarial examples.
Discovering Language Model Behaviors with Model-Written Evaluations (Perez et al., 2022) (only read the sections that look interesting to you)
This paper provides an overview of attempts to automate the process of evaluating model behaviors by evaluating AI models using other AI models.
Resource spotlight: The AXRP podcast
AXRP, the AI X-risk Research Podcast, hosts in-depth conversations with leading AI safety researchers about their work and the ways in which it will hopefully reduce risks from advanced AI systems. Past guests include Jan Leike, the former research lead of the OpenAI Superalignment team, and Paul Christiano, a researcher at the Alignment Research Center and one of the inventors of the RLHF fine-tuning procedure.
Exercises:
A complex task like running a factory can be broken down into subtasks in a fairly straightforward way, allowing a large team of workers to perform much better than even an exceptionally talented individual. Describe a task where teams have much less of an advantage over the best individuals. Why doesn’t your task benefit as much from being broken down into subtasks? How might we change that?
This week’s readings introduced two methods for scalable oversight: iterated amplification and debate. Which of those methods seems more promising to you? Why?
Think of one threat model for advanced AI; you can refer back to the Week 2 readings to revisit the main risks of advanced AI systems. Now, think of a way we might evaluate whether a language model is capable of posing this threat.