2 - Learning Human Biases with Rohin Shah
Description
Link to the paper: On the Feasibility of Learning, Rather than Assuming, Human Biases for Reward Inference
Link to the transcript
The Alignment Newsletter
Rohin's contributions to the AI Alignment Forum
Rohin's website