16 - Preparing for Debate AI with Geoffrey Irving - AXRP

Description

Many people in the AI alignment space have heard of AI safety via debate - check out AXRP episode 6 if you need a primer. But how do we get language models to the stage where they can usefully implement debate? In this episode, I talk to Geoffrey Irving about the role of language models in AI safety, as well as three projects he's done that get us closer to making debate happen: using language models to find flaws in themselves, getting language models to back up claims they make with citations, and figuring out how uncertain language models should be about the quality of various answers.

Topics we discuss, and timestamps:
00:00:48 - Status update on AI safety via debate
00:10:24 - Language models and AI safety
00:19:34 - Red teaming language models with language models
00:35:31 - GopherCite
00:49:10 - Uncertainty Estimation for Language Reward Models
01:00:26 - Following Geoffrey's work, and working with him

The transcript
Geoffrey's Twitter

Research we discuss:
Red Teaming Language Models With Language Models
Teaching Language Models to Support Answers with Verified Quotes, aka GopherCite
Uncertainty Estimation for Language Reward Models
AI Safety via Debate
Writeup: progress on AI safety via debate
Eliciting Latent Knowledge
Training Compute-Optimal Large Language Models, aka Chinchilla
