17 - Training for Very High Reliability with Daniel Ziegler
Sometimes, people talk about making AI systems safe by taking examples where they fail and training them to do well on those. But how can we actually do this well, especially when we can't use a computer program to say what a 'failure' is? In this episode, I speak with Daniel Ziegler about his research group's efforts to do this with present-day language models, and what they learned.

Listeners beware: this episode contains a spoiler for the Animorphs franchise around minute 41 (in the 'Fanfiction' section of the transcript).

Topics we discuss, and timestamps:
00:00:40 - Summary of the paper
00:02:23 - Alignment as scalable oversight and catastrophe minimization
00:08:06 - Novel contributions
00:14:20 - Evaluating adversarial robustness
00:20:26 - Adversary construction
00:35:14 - The task
00:38:23 - Fanfiction
00:42:15 - Estimators to reduce labelling burden
00:45:39 - Future work
00:50:12 - About Redwood Research

The transcript
Daniel Ziegler on Google Scholar

Research we discuss:
Daniel's paper, Adversarial Training for High-Stakes Reliability
Low-stakes alignment
Red Teaming Language Models with Language Models
Uncertainty Estimation for Language Reward Models
Eliciting Latent Knowledge