Week 1 - Further Understanding the Risks from AI
In the Alignment 101 course, you learned about different views on how AI might transform our society over the coming decades. You also learned about various threat models and risks associated with advanced AI systems.
This week will build upon that knowledge. In the first reading, Christiano presents a detailed formulation of what it should mean for an advanced AI model to be aligned. In the second reading, Hendrycks provides additional breadth to the threat models you have learned about thus far, covering risks that weren’t discussed during Alignment 101.
In the third and fourth readings, you will go in-depth into arguments about how far away AGI is, how quickly an AGI system might reach superhuman intelligence, how difficult it will be to align an AGI system, and how optimistic or pessimistic all of this should make us about our prospects of aligning advanced AIs. Finally, Hubinger et al. go deep into one specific threat model which might turn out to be particularly important: that of a deceptive, inner-misaligned AGI.
Core readings:
Christiano provides a precise formulation, along with an analogy, of what he means when he describes an AI system as aligned.
An Overview of Catastrophic AI Risks (Hendrycks, 2023) (20 mins)
Hendrycks provides a broad overview of possible ways AI systems could pose catastrophic risks. The risks discussed go beyond the threat models covered during the Alignment 101 course: for example, the author discusses malicious use, arms race dynamics, and accidents.
AGI ruin: a list of lethalities (Yudkowsky, 2022) (35 mins) (only sections A and B)
Eliezer Yudkowsky, one of the earliest researchers in the AI alignment field and a co-founder of the Machine Intelligence Research Institute, provides a comprehensive list of reasons why he is pessimistic about humanity’s chances of building an aligned AGI system.
Where I agree and disagree with Eliezer (Christiano, 2022) (25 mins) (only sections Agreements and Disagreements)
Another prominent researcher in the field, Paul Christiano, responds to the above post by listing the agreements and disagreements with Yudkowsky that make him more optimistic about the probability that the first AGI system will be aligned.
Risks from Learned Optimization in Advanced Machine Learning Systems (Hubinger et al., 2019) (only sections 1, 3 and 4) (45 mins)
This is the paper that originally introduced the inner alignment problem. It explains why inner alignment should be viewed as a problem separate from finding the correct reward function, and why inner misalignment might lead to systems that are deceptive.
Further readings:
Some of my disagreements with List of Lethalities (Turner, 2023) (20 mins)
Worst-case guarantees (Christiano, 2019) (15 mins)
The alignment problem from a deep learning perspective (Ngo, Mindermann and Chan, 2022) (only sections 4 and 5) (20 mins)
Biological Anchors: A Trick That Might Or Might Not Work (Alexander, 2022) (30 mins)
This reading provides an in-depth critical review of one popular method for forecasting progress in AI that you learned about in the first part of the course: biological anchors.
The Bitter Lesson (Sutton, 2019) (10 mins)
This article argues that in the field of AI, general methods that scale with increased compute ultimately outperform approaches built on expert human knowledge.
Worlds Where Iterative Design Fails (Wentworth, 2022) (15 mins)
Alignment By Default (Wentworth, 2020) (20 mins)
The case for ensuring that powerful AIs are controlled (Greenblatt and Shlegeris, 2024) (45 mins)
Counterarguments to the basic AI x-risk case (Grace, 2022) (45 mins)
Low-stakes alignment (Christiano, 2021) (10 mins)
Reward is not the optimization target (Turner, 2022) (15 mins)
Podcast spotlight:
For an in-depth explanation of why advanced AI systems pose an existential risk and what it would look like to develop safer systems, you can listen to episode 12 of the AI X-risk Research Podcast (AXRP) with Paul Christiano, a researcher at the Alignment Research Center and one of the inventors of the RLHF fine-tuning protocol. For a long discussion with Yudkowsky on the reasons behind his pessimism about the feasibility of AGI alignment, check out his appearance on the Dwarkesh Podcast. For further context on the Risks from Learned Optimization paper, you can listen to the AXRP episode with Evan Hubinger. Forecasting AI progress is discussed in the 80,000 Hours podcast episode featuring Danny Hernandez from Anthropic.