18 - Concept Extrapolation with Stuart Armstrong
Description
Concept extrapolation is the idea of taking concepts an AI has about the world - say, "mass" or "does this picture contain a hot dog" - and extending them sensibly to situations where things are different - like learning that the world works via special relativity, or seeing a picture of a novel sausage-bread combination. For a while, Stuart Armstrong has been thinking about concept extrapolation and how it relates to AI alignment. In this episode, we discuss where his thoughts are at on this topic, what the relationship to AI alignment is, and what the open questions are.

Topics we discuss, and timestamps:
00:00:44 - What is concept extrapolation
00:15:25 - When is concept extrapolation possible
00:30:44 - A toy formalism
00:37:25 - Uniqueness of extrapolations
00:48:34 - Unity of concept extrapolation methods
00:53:25 - Concept extrapolation and corrigibility
00:59:51 - Is concept extrapolation possible?
01:37:05 - Misunderstandings of Stuart's approach
01:44:13 - Following Stuart's work

The transcript
Stuart's startup, Aligned AI

Research we discuss:
The Concept Extrapolation sequence
The HappyFaces benchmark
Goal Misgeneralization in Deep Reinforcement Learning
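To make the ambiguity behind concept extrapolation concrete, here is a minimal toy sketch (not from the episode, and not Aligned AI's method or the HappyFaces benchmark): during training, two cues always co-occur, so the data alone cannot say which cue defines the concept, and the two extrapolations come apart on a novel input. The feature names and setup are hypothetical illustrations.

```python
# Toy illustration of concept-extrapolation ambiguity (hypothetical example).
import numpy as np

rng = np.random.default_rng(0)

# Training data: feature 0 ("sausage-shaped") and feature 1 ("in a bun") always agree.
n = 200
cue = rng.integers(0, 2, size=n)
X_train = np.column_stack([cue, cue]).astype(float)
y_train = cue  # the "hot dog" label matches both cues

# Two candidate extrapolations of the concept learned from this data.
hypotheses = {
    "hot dog = sausage-shaped": lambda X: X[:, 0],
    "hot dog = in a bun":       lambda X: X[:, 1],
}

# Both hypotheses fit the training data perfectly...
for name, h in hypotheses.items():
    acc = np.mean(h(X_train) == y_train)
    print(f"{name}: train accuracy {acc:.2f}")

# ...but they disagree on a novel sausage-bread combination
# (sausage-shaped, not in a bun), so training data alone does not
# pin down a unique extrapolation of the concept.
x_novel = np.array([[1.0, 0.0]])
preds = {name: int(h(x_novel)[0]) for name, h in hypotheses.items()}
print("Predictions on the novel input:", preds)
print("Ambiguous extrapolation detected:", len(set(preds.values())) > 1)
```

Detecting that the candidate extensions disagree, rather than silently committing to one of them, is the kind of situation the episode's discussion of uniqueness of extrapolations is about.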
More Episodes
Reinforcement Learning from Human Feedback, or RLHF, is one of the main ways that makers of large language models make them 'aligned'. But people have long noted that there are difficulties with this approach when the models are smarter than the humans providing feedback. In this episode, I talk...
Published 06/12/24
What's the difference between a large language model and the human brain? And what's wrong with our theories of agency? In this episode, I chat about these questions with Jan Kulveit, who leads the Alignment of Complex Systems research group. Patreon: patreon.com/axrpodcast Ko-fi:...
Published 05/30/24