AI Safety Fundamentals: Alignment - Listen - AI Safety

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators. Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises...

Published 07/19/24

Constitutional AI: Harmlessness from AI Feedback

This paper explains Anthropic’s constitutional AI approach, which is largely an extension of RLHF but with AIs replacing human demonstrators and human evaluators. Everything in this paper is relevant to this week's learning objectives, and we recommend you read it in its entirety. It summarises...

Published 07/19/24

Illustrating Reinforcement Learning from Human Feedback (RLHF)

This more technical article explains the motivations for a system like RLHF and adds concrete details on how the RLHF approach is applied to neural networks. While reading, consider which parts of the technical implementation correspond to the 'values coach' and 'coherence coach'...

Published 07/19/24