Toy Models of Superposition
Description
It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an “ideal” ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout. Empirically, in models we have studied, some of the neurons do cleanly map to features. But it isn't always the case that features correspond so cleanly to neurons, especially in large language models, where it actually seems rare for neurons to correspond to clean features.
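The setup the paper studies is small enough to sketch in code. Below is a minimal, hypothetical NumPy version of its toy model, x′ = ReLU(WᵀWx + b), using hand-picked pentagon weights rather than trained ones (the pentagon mirrors the geometry trained models tend to converge to): five sparse features are squeezed into two hidden neurons, yet an active feature is still recovered.

```python
import numpy as np

n_features, n_hidden = 5, 2  # more features than neurons forces superposition

# Hypothetical hand-picked weights: five feature directions arranged as a
# pentagon in the 2-D hidden space. No feature gets a neuron to itself;
# each one is a mixture of both hidden dimensions.
angles = np.linspace(0, 2 * np.pi, n_features, endpoint=False)
W = np.stack([np.cos(angles), np.sin(angles)])  # shape (n_hidden, n_features)

def toy_model(x, b=-0.35):
    """The paper's toy model: reconstruct x as ReLU(W^T W x + b)."""
    h = W @ x                             # compress 5 features into 2 neurons
    return np.maximum(0.0, W.T @ h + b)   # negative bias filters interference

# A sparse input with only feature 3 active is recovered almost exactly,
# even though the hidden layer has fewer neurons than there are features.
x = np.zeros(n_features)
x[3] = 1.0
print(toy_model(x).round(2))  # ~[0. 0. 0. 0.65 0.]
```

The negative bias in the sketch illustrates one of the paper's observations: because features stored in superposition are not orthogonal, they interfere with one another, and the ReLU plus bias can suppress that interference when features are sparse.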