Episodes
One proposal to train AIs that can be useful is to have ML models debate each other about the answer to a human-provided question, where the human judges which side has won. In this episode, I talk with Beth Barnes about her thoughts on the pros and cons of this strategy, what she learned from seeing how humans behaved in debate protocols, and how a technique called imitative generalization can augment debate. Those who are already quite familiar with the basic proposal might want to skip...
Published 04/08/21
The theory of sequential decision-making has a problem: how can we deal with situations where we have some hypotheses about the environment we're acting in, but its exact form might be outside the range of possibilities we can consider? Relatedly, how do we deal with situations where the environment can simulate what we'll do in the future, and put us in better or worse situations now depending on what we'll do then? Today's episode features Vanessa Kosoy talking about...
Published 03/10/21
In machine learning, optimization is typically done to produce a model that performs well according to some metric. Today's episode features Evan Hubinger talking about what happens when the learned model itself is doing optimization in order to perform well, how the goals of the learned model could differ from the goals we used to select the learned model, and what would happen if they did differ. Link to the paper: Risks from Learned Optimization in Advanced Machine Learning Systems. Link...
Published 02/17/21
Link to the paper: Negotiable Reinforcement Learning for Pareto Optimal Sequential Decision-Making. Link to the transcript. Critch's Google Scholar profile.
Published 12/11/20
Link to the paper: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference. Link to the transcript. The Alignment Newsletter. Rohin's contributions to the AI Alignment Forum. Rohin's website.
Published 12/11/20
Link to the paper: Adversarial Policies: Attacking Deep Reinforcement Learning. Link to the transcript. Adam's website. Adam's Twitter account.
Published 12/11/20