Constitutional AI: Harmlessness from AI Feedback

Description

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators. Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises the limitations of conventional RLHF, explains the constitutional AI approach, shows how it performs, and suggests where future research might be directed. If you are in a rush, focus on sections 1.2, 3.1, 3.4, 4.1...

More Episodes

AI Safety Fundamentals: Alignment

Published 07/19/24

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Published 07/19/24

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This more technical article explains the motivations for a system like RLHF and adds concrete details on how the RLHF approach is applied to neural networks. While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach'...

Published 07/19/24