Week 3 - Goal Misgeneralisation and Learning from Humans

Last week, we discussed the reward misspecification problem: the phenomenon where our default techniques for training ML models often unintentionally assign high rewards to undesirable behavior. This week, we'll first look at two techniques that engineers use to overcome reward misspecification when training advanced systems such as foundation models: Reinforcement Learning from Human Feedback (RLHF) and Inverse Reinforcement Learning (IRL).
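
To make the first of these concrete before you start the readings: the heart of RLHF is replacing a hand-written reward function with a reward model trained on human preference judgements. The sketch below is a minimal illustration of that reward-modelling step only; the dataset of pairwise comparisons and the linear model are invented for illustration (they do not come from the readings), and the later step of fine-tuning a policy against the learned reward is only noted in a comment.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy preference data (invented for illustration): each "response" is
    # summarised by a small feature vector, and a human labeller has compared
    # pairs of responses, marking which one they preferred.
    n_pairs, n_features = 500, 4
    chosen = rng.normal(size=(n_pairs, n_features))          # preferred responses
    rejected = rng.normal(size=(n_pairs, n_features)) - 0.5  # dispreferred responses

    # Linear reward model r(x) = w . x, trained so that preferred responses score
    # higher than rejected ones (a Bradley-Terry / pairwise logistic loss).
    w = np.zeros(n_features)
    learning_rate = 0.1
    for _ in range(200):
        margin = chosen @ w - rejected @ w          # r(chosen) - r(rejected)
        p_agree = 1.0 / (1.0 + np.exp(-margin))     # modelled P(labeller prefers "chosen")
        grad = ((p_agree - 1.0)[:, None] * (chosen - rejected)).mean(axis=0)
        w -= learning_rate * grad

    accuracy = (chosen @ w > rejected @ w).mean()
    print(f"reward model agrees with the human labels on {accuracy:.0%} of pairs")

    # In full RLHF, this learned reward model (rather than a hand-written reward
    # function) would then supply the reward signal for fine-tuning the policy
    # with RL, which is not shown here.

The readings below explain how the same idea is applied at scale, with a neural network as the reward model and a language model as the policy.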

Even with a correctly chosen reward function, however, the rewards used during training don’t fully determine how agents generalize to new situations. We’ll then cover the problem of goal misgeneralization: scenarios in which an agent, having learned the wrong goal during training, behaves in competent yet undesirable ways when placed in a new situation.
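
As a preview of what that can look like, here is a deliberately simple toy example. It is not taken from any of the readings; the gridworld and the nearest-neighbour "policy" are invented purely for illustration, loosely echoing the coin-collecting example discussed in the goal misgeneralization readings. The training data is generated by a perfectly correct specification ("step towards the coin"), but because the coin always sits in the same corner during training, the policy that gets learned generalizes as "go to that corner" rather than "go to the coin":

    import numpy as np

    SIZE = 5                                                   # 5x5 gridworld
    ACTIONS = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}   # right, left, up, down

    def correct_action(agent, coin):
        """The specification is correct: always step towards the coin."""
        if coin[0] > agent[0]: return 0
        if coin[0] < agent[0]: return 1
        if coin[1] > agent[1]: return 2
        return 3

    # Training data: the coin happens to sit in the top-right corner in every episode.
    train_coin = (SIZE - 1, SIZE - 1)
    train_X, train_y = [], []
    for x in range(SIZE):
        for y in range(SIZE):
            if (x, y) == train_coin:
                continue
            train_X.append((x, y, *train_coin))                 # input: agent pos + coin pos
            train_y.append(correct_action((x, y), train_coin))  # label: the correct move
    train_X = np.array(train_X, dtype=float)

    def learned_policy(agent, coin):
        """1-nearest-neighbour policy fit to the training data."""
        query = np.array([*agent, *coin], dtype=float)
        distances = np.abs(train_X - query).sum(axis=1)
        return train_y[int(np.argmin(distances))]

    # Deployment: the coin is now in the bottom-left corner.
    agent, coin = (2, 2), (0, 0)
    for _ in range(20):
        if agent == coin:
            break
        dx, dy = ACTIONS[learned_policy(agent, coin)]
        agent = (min(max(agent[0] + dx, 0), SIZE - 1), min(max(agent[1] + dy, 0), SIZE - 1))

    print("agent ended at", agent, "but the coin was at", coin)
    # The policy still behaves competently - it navigates somewhere specific - but
    # it has learned the goal "go to the top-right corner", not "get the coin".

In this toy run the agent walks straight to the top-right corner while the coin sits untouched at (0, 0): its capabilities generalized to the new situation, but its goal did not.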


Core readings:

  1. Illustrating Reinforcement Learning from Human Feedback (Lambert et al., 2022) (25 mins)

    • This article explains the motivations for training models with a system like RLHF, and gives concrete details of how the RLHF approach is applied to neural networks.

  2. Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022) (only section 1) (10 mins)

    • This paper introduces you to Anthropic’s constitutional AI approach, which is largely an extension of RLHF, but with AIs replacing the human demonstrators and human evaluators (a rough sketch of the critique-and-revision loop appears after this reading list).

  3. Inverse reinforcement learning example (Udacity, 2016) (5 mins)

    • This video introduces a different approach to learning from humans: inverse reinforcement learning (IRL).

  4. The easy goal inference problem is still hard (Christiano, 2018) (5 mins)

    • Christiano makes the case that IRL is not a perfect solution to the 'goal inference problem': to infer someone's preferences from their behavior, we need to model the ways in which humans are biased and flawed, which seems very hard to do from observation alone.

  5. Goal Misgeneralisation: Why Correct Specifications Aren’t Enough For Correct Goals (Shah, 2022) (10 mins)

    • Shah et al. argue that even an agent trained with the 'right' reward function might learn goals that generalize in undesirable ways, and they provide both concrete and hypothetical illustrations of the phenomenon.

  6. The alignment problem from a deep learning perspective (Ngo, Mindermann and Chan, 2022) (only sections 2 and 3) (30 mins)

    • This paper gives a clear presentation of the differences between reward misspecification, the topic of last week, and goal misgeneralization, the topic of this week. It also provides a different framing of goal misgeneralization: while Shah et al. define it in terms of undesirable behavior, this reading approaches the same idea by reasoning about goals in terms of agents' internal representations.
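
Before the further readings, here is the rough sketch referred to in core reading 2: a minimal, hypothetical outline of the supervised "critique and revision" phase of constitutional AI. The generate() helper below is a stand-in for a call to a language model, not a real API, and the two-principle constitution is invented for illustration; the second, RL phase, in which AI-generated comparisons replace human preference labels, is only noted in a comment.

    # Hypothetical stand-in for a language model call; not a real library function.
    def generate(prompt: str) -> str:
        raise NotImplementedError("replace with a call to a language model of your choice")

    # A tiny, invented "constitution": principles the AI checks its own outputs against.
    CONSTITUTION = [
        "Choose the response that is least likely to be harmful.",
        "Choose the response that is most helpful and honest.",
    ]

    def critique_and_revise(user_prompt: str, draft: str) -> str:
        """Supervised phase of constitutional AI: the model critiques and then revises
        its own draft against each principle, with no human feedback in the loop."""
        revised = draft
        for principle in CONSTITUTION:
            critique = generate(
                f"Principle: {principle}\n"
                f"Prompt: {user_prompt}\n"
                f"Response: {revised}\n"
                "Critique the response with respect to the principle."
            )
            revised = generate(
                f"Prompt: {user_prompt}\n"
                f"Response: {revised}\n"
                f"Critique: {critique}\n"
                "Rewrite the response so that it addresses the critique."
            )
        return revised

    # The revised responses are used for supervised fine-tuning. In the second, RL
    # phase, the model itself compares pairs of responses against the constitution,
    # and those AI-generated preference labels replace the human labels used in RLHF.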


Further readings:

  1. The OTHER AI Alignment Problem: Mesa-Optimizers and Inner Alignment (25 mins)

    • This video presents a different framing for the goal misgeneralization issue discussed in the last two core readings of this week, naming it the inner alignment problem. ‘Inner misalignment’ and ‘goal misgeneralization’ can be seen as roughly equivalent concepts, except that goal misgeneralization is typically defined in terms of behavior, while inner misalignment is typically defined in terms of internal, learned representations.

  2. Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al., 2019) (only sections 1, 3 and 4) (45 mins)

    • This is the paper that originally introduced the inner alignment problem. It explains why inner alignment should be viewed as a separate problem from finding the correct reward function, and why inner misalignment might lead to systems that are deceptive.

  3. ML Systems Will Have Weird Failure Modes (Steinhardt, 2022) (15 mins)

    • In this post, Steinhardt describes a hypothesized phenomenon called 'deceptive alignment', which was first introduced in the above paper by Hubinger et al. He assumes that during the training process, a neural network develops an internally-represented objective which diverges from the training objective. He then argues that the network will be incentivized to behave in ways that cause this misaligned internal objective to be preserved during training, and that this could lead to sudden misbehavior during deployment.

  4. Goal misgeneralization in deep reinforcement learning (Koch et al., 2021) (30 mins)

    • This paper provides an overview of reinforcement learning settings where we can already observe goal misgeneralization.

  5. Learning from Human Preferences (Christiano, Ray and Amodei, 2017) (5 mins)

    • This blog post summarizes the paper that originally introduced the RLHF protocol.

  6. Learning to Summarize With Human Feedback (Stiennon et al., 2020) (10 mins)

    • This blog post describes a case study of applying RLHF fine-tuning to train models that are better at summarization.

  7. Training AI Without Writing A Reward Function, with Reward Modelling (20 mins)

    • If you’re still unsure whether you fully understand how RLHF works, this video provides further explanations of the protocol.

  8. Models Don’t “Get Reward” (Ringer, 2022) (5 mins)

    • This reading discusses an important misconception people often have when reasoning about reinforcement learning.

  9. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)

    • This paper describes how RLHF is used in practice on Anthropic’s state-of-the-art language models.

  10. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback (Casper et al., 2023) (only read the sections that look interesting to you)


Resource spotlight: The Alignment Forum

The AI Alignment Forum is an online forum for researchers working in the field of AI alignment. It encourages researchers to publicly share their work on reliable and trustworthy AI and to foster discussion of that research, even if it isn’t yet polished or comprehensive enough to be written up as a conference paper. Though participation on the forum is restricted to researchers, posts are always cross-posted to LessWrong, where everyone is able to discuss them.


Exercises:

  1. Why is it not appropriate to describe the specification gaming agents from last week’s Krakovna et al. (2020) reading as displaying goal misgeneralization?

  2. By some definitions, a chess AI has the goal of winning. When is it useful to describe it that way? What are the key differences between human goals and the “goals” of a chess AI?