Toy Models of Superposition
Description
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models, where it actually seems rare for neurons to correspond to clean features.
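The setup the paper studies is small enough to sketch in code. Below is a minimal, hypothetical NumPy version of its toy model, x′ = ReLU(WᵀWx + b), using hand-picked pentagon weights rather than trained ones (the pentagon mirrors the geometry trained models tend to converge to): five sparse features are squeezed into two hidden neurons, yet an active feature is still recovered.

```python
import numpy as np

n_features, n_hidden = 5, 2  # more features than neurons forces superposition

# Hypothetical hand-picked weights: five feature directions arranged as a
# pentagon in the 2-D hidden space. No feature gets a neuron to itself;
# each one is a mixture of both hidden dimensions.
angles = np.linspace(0, 2 * np.pi, n_features, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (n_hidden, n_features)

def toy_model(x, b=-0.35):
    """The paper's toy model: reconstruct x as ReLU(W^T W x + b)."""
    h = W @ x                             # compress 5 features into 2 neurons
    return np.maximum(0.0, W.T @ h + b)   # negative bias filters interference

# A sparse input with only feature 3 active is recovered almost exactly,
# even though the hidden layer has fewer neurons than there are features.
x = np.zeros(n_features)
x[3] = 1.0
print(toy_model(x).round(2))  # ~[0. 0. 0. 0.65 0.]
```

The negative bias in the sketch illustrates one of the paper's observations: because features stored in superposition are not orthogonal, they interfere with one another, and the ReLU plus bias can suppress that interference when features are sparse.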