Week 2 - AGI Risks

This week starts off by focusing on reward misspecification: the phenomenon where our default techniques for training ML models often unintentionally assign high rewards to undesirable behaviour. Behaviour which exploits reward misspecification to get a high reward is known as reward hacking.

This type of alignment failure arises because the reward or loss functions we use to train machine learning systems rarely capture exactly what we want the resulting systems to do. Failure to solve this problem is an important source of danger from advanced systems built using current-day techniques, which is why we're spending a week on it.
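
For intuition, here is a minimal code sketch (a hypothetical toy example, not drawn from this week's readings) of how a misspecified reward can be hacked: the proxy reward counts "deposit" actions, so an agent that repeatedly dumps and re-deposits the same piece of trash earns more proxy reward than one that actually cleans the room.

```python
# Hypothetical toy example of reward hacking (not from the readings).
# The proxy reward counts deposit actions; the intended objective is the
# number of distinct pieces of trash that end up in the bin.

def proxy_reward(actions):
    """Reward actually used for training: +1 per 'deposit' action."""
    return sum(1 for a in actions if a == "deposit")

def intended_value(actions):
    """What we really wanted: distinct pieces of trash left in the bin."""
    in_bin = set()
    held = None
    for a in actions:
        if a.startswith("pickup:"):        # e.g. "pickup:item0"
            held = a.split(":", 1)[1]
            in_bin.discard(held)           # picking it back out of the bin
        elif a == "deposit" and held is not None:
            in_bin.add(held)
            held = None
    return len(in_bin)

honest = ["pickup:item0", "deposit", "pickup:item1", "deposit"]
hacked = ["pickup:item0", "deposit"] * 4   # recycle the same item over and over

print(proxy_reward(honest), intended_value(honest))  # -> 2 2
print(proxy_reward(hacked), intended_value(hacked))  # -> 4 1
```

The "hacked" behaviour scores twice as much proxy reward while achieving less of the intended goal; reward hacking in real systems follows the same pattern, just in environments too complex for us to anticipate every loophole.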

The second key topic this week is instrumental convergence: the idea that AIs pursuing a range of different rewards or goals will tend to converge on a set of similar instrumental strategies. Broadly speaking, we can summarize these strategies as aiming to gain power over the world. Instrumental convergence therefore provides a bridge between the narrow examples of reward misspecification we see today and the possibility of large-scale disempowerment from AI; a reading by Christiano provides an illustration of how this might occur.

Core readings:

  1. Specification gaming: the flip side of AI ingenuity (Krakovna et al., 2020) (10 mins)

    • This reading demonstrates that default techniques for training RL agents often lead to bad behaviour due to the difficulty of designing reward functions which correctly specify desirable behaviour. Krakovna et al. showcase examples of agents exploiting misspecified rewards in simple environments (behaviour known as reward hacking). Try to get an intuitive understanding from this article of why reward hacking is hard to prevent.

  2. AGI Safety from First Principles (Ngo, 2020) (sections 3.1 and 3.2) (20 mins)

    • This reading discusses the relationship between goal-directedness and agency, as well as the likelihood of developing highly agentic AI models in the future.

  3. Intelligence and Stupidity: The Orthogonality Thesis (15 mins) 

  4. Why Would AI Want to do Bad Things? Instrumental Convergence (10 mins)

    • An alternative for those who prefer text to video:
      Superintelligence, Chapter 7: The superintelligent will (Bostrom, 2014) (35 mins)

    • These videos/readings outline two theses on the relationship between intelligence and motivation in an artificial agent: the orthogonality thesis and instrumental convergence. The orthogonality thesis argues that intelligence and goals are orthogonal: an agent can have any combination of intelligence level and final goal. Instrumental convergence argues that some intermediate goals, such as self-preservation and resource acquisition, are useful for any final goal an agent might have. If true, these theses imply that it will be difficult to align future highly intelligent AI systems with human values.

  5. What failure looks like (Christiano, 2019) (20 mins)

    • Christiano outlines two potential scenarios where humanity fails to align highly powerful AI systems with human values and loses control to these systems as a result.

  6. Intelligence Explosion: Evidence and Import (Muehlhauser and Salamon, 2012) (15 mins)

    • This reading outlines a core intuition for why we might be unable to prevent risks from advanced AI systems: that progress in AI will at some point speed up dramatically.

Further readings:

  1. AI "Stop Button" Problem - Computerphile (20 mins)

    • This video discusses the problem of implementing an off switch on a generally intelligent agent, which turns out to be much trickier than it seems at first glance.

  2. Corrigibility (Soares et al., 2015) (10 mins, skim everything up to section 2)

    • The authors introduce the concept of corrigibility, a formalization of the notion of an AI system that always abides by its creators’ interventions, including the intervention of shutting it off.

  3. There's No Rule That Says We'll Make It (10 mins)

    • This video argues that though failure in aligning advanced AI systems is not inevitable, it is a real possibility that needs to be taken seriously.

  4. Future ML Systems Will Be Qualitatively Different (Steinhardt, 2022) (10 mins)

    • This article argues that in machine learning, novel behaviours tend to emerge unpredictably as models get larger.

  5. Thought Experiments Provide a Third Anchor (Steinhardt, 2022) (10 mins)

    • Steinhardt gives some reasons for expecting thought experiments to be useful for thinking about how future machine learning systems will behave.

  6. Takeoff speeds (Christiano, 2018) (35 mins)

    • In response to Yudkowsky’s (2015) argument that there will be a sharp “intelligence explosion”, Christiano argues that the rate of progress will instead increase continuously over time. However, there is less distance between these positions than there may seem: Christiano still expects self-improving AI to eventually cause incredibly rapid growth.

  7. When discussing AI risks, talk about capabilities, not intelligence (Krakovna, 2023) (5 mins)

Resource spotlight: LessWrong
LessWrong is the online forum of the rationality community, dedicated to improving human reasoning and decision-making. The community has a strong interest in AI, and specifically in making powerful AI systems safe and beneficial.

Exercises:

  1. What are the most plausible ways for the hypothesis “we will eventually build AGIs which have transformative impacts on the world” to be false? How likely are they?

  2. In 5 minutes, try to explain to someone why generally intelligent AI systems will bring risks. Alternatively, you could explain this to yourself out loud, or create a document where you write down the main arguments on AGI risks, optionally also adding counterarguments as nested bullet points.

  3. Think of an objective function that would be safe for current AI systems to optimize. Then try to come up with ways in which strong optimization of that objective could end in catastrophe.