17 - Training for Very High Reliability with Daniel Ziegler
Description
Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to try doing this with present-day language models, and what they learned.

Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).

Topics we discuss, and timestamps:
00:00:40 - Summary of the paper
00:02:23 - Alignment as scalable oversight and catastrophe minimization
00:08:06 - Novel contributions
00:14:20 - Evaluating adversarial robustness
00:20:26 - Adversary construction
00:35:14 - The task
00:38:23 - Fanfiction
00:42:15 - Estimators to reduce labelling burden
00:45:39 - Future work
00:50:12 - About Redwood Research

The transcript
Daniel Ziegler on Google Scholar

Research we discuss:
Daniel's paper, Adversarial Training for High-Stakes Reliability
Low-stakes alignment
Red Teaming Language Models with Language Models
Uncertainty Estimation for Language Reward Models
Eliciting Latent Knowledge
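As a rough illustration of the core idea discussed in the episode ("take examples where the system fails and train it to do well on those"), here is a minimal Python sketch of an adversarial training loop. It is a toy under stated assumptions, not Redwood Research's actual pipeline: the keyword-based 'classifier', the `oracle` stand-in for human labellers, and the fixed search budget are all illustrative.

```python
# Toy sketch of adversarial training for a high-stakes filter (illustrative only,
# not Redwood Research's pipeline). The loop: search for inputs the classifier
# gets wrong, label them, fold them back into the training set, and retrain,
# so that the failures the adversary can find become progressively rarer.

import random

def train(dataset):
    """Fit a toy 'classifier': memorize words that co-occur with violations."""
    bad_words = set()
    for text, is_violation in dataset:
        if is_violation:
            bad_words.update(text.split())
    return lambda text: any(word in bad_words for word in text.split())

def find_failures(classifier, candidate_inputs, oracle, budget=100):
    """Adversary: search a sample of candidate inputs for misclassified cases.
    `oracle` stands in for the human labellers who decide what counts as a violation."""
    failures = []
    for text in random.sample(candidate_inputs, min(budget, len(candidate_inputs))):
        truth = oracle(text)
        if classifier(text) != truth:
            failures.append((text, truth))
    return failures

def adversarial_training(initial_data, candidate_inputs, oracle, rounds=5):
    """Alternate between attacking the classifier and retraining on its failures."""
    data = list(initial_data)
    classifier = train(data)
    for _ in range(rounds):
        failures = find_failures(classifier, candidate_inputs, oracle)
        if not failures:
            break                    # no failures found within the search budget
        data.extend(failures)        # add the newly labelled failures to the data
        classifier = train(data)     # retrain on the augmented dataset
    return classifier
```

The episode covers how the adversaries were actually constructed and how robustness was evaluated in a much richer setting; this sketch only fixes the shape of the loop.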