19 - Mechanistic Interpretability with Neel Nanda
Description
How good are we at understanding the internal computation of advanced machine learning models, and do we have a hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

Topics we discuss, and timestamps:
00:01:05 - What is mechanistic interpretability?
00:24:16 - Types of AI cognition
00:54:27 - Automating mechanistic interpretability
01:11:57 - Summarizing the papers
01:24:43 - 'A Mathematical Framework for Transformer Circuits'
01:39:31 - How attention works
01:49:26 - Composing attention heads
01:59:42 - Induction heads
02:11:05 - 'In-context Learning and Induction Heads'
02:12:55 - The multiplicity of induction heads
02:30:10 - Lines of evidence
02:38:47 - Evolution in loss-space
02:46:19 - Mysteries of in-context learning
02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
02:50:57 - How neural nets learn modular addition
03:11:37 - The suddenness of grokking
03:34:16 - Relation to other research
03:43:57 - Could mechanistic interpretability possibly work?
03:49:28 - Following Neel's research

The transcript

Links to Neel's things:
Neel on Twitter
Neel on the Alignment Forum
Neel's mechanistic interpretability blog
TransformerLens
Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel on YouTube
200 Concrete Open Problems in Mechanistic Interpretability
Comprehensive mechanistic interpretability explainer

Writings we discuss:
A Mathematical Framework for Transformer Circuits
In-context Learning and Induction Heads
Progress measures for grokking via mechanistic interpretability
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper")
interpreting GPT: the logit lens
Locating and Editing Factual Associations in GPT (aka the ROME paper)
Human-level play in the game of Diplomacy by combining language models with strategic reasoning
Causal Scrubbing
An Interpretability Illusion for BERT
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Collaboration & Credit Principles
Transformer Feed-Forward Layers Are Key-Value Memories
Multi-Component Learning and S-Curves
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Linear Mode Connectivity and the Lottery Ticket Hypothesis
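
Since the episode is about inspecting a transformer's internal computation, here is a minimal sketch (not from the episode; the prompt, layer index, and hook names are illustrative assumptions) of how one might cache and inspect activations with Neel's TransformerLens library:

```python
# Minimal sketch, assuming `transformer_lens` is installed (pip install transformer-lens).
# Loads GPT-2 small and caches its internal activations -- the kind of object
# mechanistic interpretability work inspects when looking for circuits like induction heads.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 small

prompt = "Mr and Mrs Dursley of number four, Privet Drive, were proud to say that Mr and Mrs"
logits, cache = model.run_with_cache(prompt)

# Attention patterns for layer 5: shape [batch, n_heads, query_pos, key_pos].
attn_pattern = cache["pattern", 5]
print(attn_pattern.shape)

# logits has shape [batch, seq_len, d_vocab]; the final position predicts the next token.
next_id = logits[0, -1].argmax().item()
print(model.tokenizer.decode([next_id]))
```

In the episode's terms, an induction head would show up in such a pattern as attention from the current token back to the token that followed an earlier occurrence of the same token, letting the model continue repeated text like the prompt above.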