7 - Side Effects with Victoria Krakovna
Description
One way of thinking about how AI might pose an existential threat is that it could take drastic actions to maximize its achievement of some objective function, such as taking control of the power supply or the world's computers. This suggests a mitigation strategy: minimizing the degree to which AI systems have effects on the world beyond what is absolutely necessary for achieving their objective. In this episode, Victoria Krakovna talks about her research on quantifying and minimizing side effects. Topics discussed include how one goes about defining side effects and the difficulties in doing so, her work using relative reachability and the ability to achieve future tasks as side effects measures, and what she thinks the open problems and difficulties are.

Link to the transcript
Link to the paper "Penalizing Side Effects Using Stepwise Relative Reachability"
Link to the paper "Avoiding Side Effects by Considering Future Tasks"
Victoria Krakovna's website
Victoria Krakovna's Alignment Forum profile

Work mentioned in the episode:
Rohin Shah on the difficulty of finding a value-agnostic impact measure
Stuart Armstrong's bucket of water example
Attainable Utility Preservation
Low Impact Artificial Intelligences
AI Safety Gridworlds
Test Cases for Impact Regularisation Methods
SafeLife
Avoiding Side Effects in Complex Environments