Week 4 - Aligning Large Language Models

This week will summarise the entire AI Safety Fundamentals course by giving an overview of how the techniques you have learned about over the past eight weeks can be applied on large language models.

The first reading about Constitutional AI will build upon concepts from weeks 2 and 3 of Alignment 101, proposing the use of a list of ethical principles and AI feedback about the model’s ability to adhere to those principles as a scalable solution to the reward misspecification problem.

The second reading about Sleeper Agents puts the goal misgeneralisation problem into the context of state-of-the-art models, demonstrating the feasibility of deceptive language models that persist through safety training.

The third and fourth reading discuss scalable oversight - supervising larger language models with smaller ones. The fourth reading by Bills et al. demonstrates the potential use of such oversight in the field of interpretability, showing how language models can explain the functions of neurons in other language models.

The final reading by Perez et al. discusses model evaluations, again in the scalable setting: it discusses the ability of language models to evaluate the capabilities of other language models.

Core readings:

  1. Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) (read as much as you have time for. If you are in a rush, focus on sections 3.1, 3.4, 4.1, 6.1, 6.2.)

    • This paper explains Anthropic’s constitutional AI approach, which is largely an extension on RLHF but with AIs replacing human demonstrators and human evaluators. It summarises limitations with conventional RLHF, explains the constitutional AI approach, shows how it performs, and where future research might be directed.

  2. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (Hubinger et al., 2024) (only sections 1, 2, 4) (40 mins)

    • Humans are capable of strategically deceptive behaviour: behaving helpfully in most situations, but then behaving very differently in order to pursue alternative objectives when given the opportunity. This paper deliberately builds such deceptive behaviour into an LLM, and finds that such backdoored behaviour can be made persistent, so that it is not removed by standard safety training techniques.

  3. Weak-to-strong generalisation (Burns et al., 2023) (5 mins)

    • This blog post discusses some promising initial results on the supervision of more capable models with less capable ones.

  4. Language models can explain neurons in language models (Bills et al., 2023) (10 mins)

    • OpenAI researchers provide an example of how language models can already help us do better interpretability research.

  5. Discovering Language Model Behaviours with Model-Written Evaluations (Perez et al., 2022) (only sections 1-5) (30 mins)

    • This paper provides an overview of attempts to automate the process of evaluating model behaviours by evaluating AI models using other AI models. One interesting observation the authors make is that the phenomenon of sycophancy - the tendency of LLMs to repeat back a user’s preferred answer - scales inversely with model size, getting worse as models become bigger.

Further readings:

  1. Adversarial Training for High-Stakes Reliability (Ziegler et al., 2022)

  2. Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research (Hubinger et al., 2023) (30 mins)

  3. Training language models with language feedback (Scheurer et al., 2022)

  4. Red Teaming Language Models with Language Models (Perez et al., 2022) (10 mins)

    • Perez et al. use a language model to automatically generate test cases that give rise to misbehaviour without access to network weights, making this a 'black-box attack'. This is an example of generating 'unrestricted adversarial examples'. 'Unrestricted' refers to the fact that the language model can generate any example, whereas (restricted) adversarial examples are usually closely related to training datapoints.

  5. Steering GPT-2-XL by adding an activation vector (Turner et al., 2023) (60 mins)

  6. A General Language Assistant as a Laboratory for Alignment (Askell et al., 2021)

  7. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)

  8. Simulators (janus, 2022) (60 mins)

  9. Mysteries of mode collapse (janus, 2022)

  10. Why Not Just Outsource Alignment Research To An AI? (Wentworth, 2023) (15 mins)

Podcast spotlight:

For additional context on the promises and pitfalls of scalable oversight, as well as on OpenAI’s alignment plan in general, listen to the AXRP podcast episode with Jan Leike, the former research lead of the OpenAI Superalignment team.