19 - Mechanistic Interpretability with Neel Nanda
How good are we at understanding the internal computation of advanced machine learning models, and do we have any hope of getting better? In this episode, Neel Nanda talks about the sub-field of mechanistic interpretability research, as well as papers he's contributed to that explore the basics of transformer circuits, induction heads, and grokking.

Topics we discuss, and timestamps:
00:01:05 - What is mechanistic interpretability?
00:24:16 - Types of AI cognition
00:54:27 - Automating mechanistic interpretability
01:11:57 - Summarizing the papers
01:24:43 - 'A Mathematical Framework for Transformer Circuits'
01:39:31 - How attention works
01:49:26 - Composing attention heads
01:59:42 - Induction heads
02:11:05 - 'In-context Learning and Induction Heads'
02:12:55 - The multiplicity of induction heads
02:30:10 - Lines of evidence
02:38:47 - Evolution in loss-space
02:46:19 - Mysteries of in-context learning
02:50:57 - 'Progress measures for grokking via mechanistic interpretability'
02:50:57 - How neural nets learn modular addition
03:11:37 - The suddenness of grokking
03:34:16 - Relation to other research
03:43:57 - Could mechanistic interpretability possibly work?
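One topic above, "How attention works", comes down to scaled dot-product attention: each query position takes a softmax-weighted average of value vectors. A minimal numpy sketch (a generic formulation, not any particular model's implementation, with projection matrices omitted):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query position takes a
    softmax-weighted average of the value vectors, with weights given
    by query-key dot products."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # (n_q, d_v) mixed values

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))   # 5 positions, model dimension 4
out = attention(x, x, x)      # self-attention with identity projections
print(out.shape)              # (5, 4)
```

In a real transformer head, Q, K, and V are separate learned linear projections of the residual stream; the episode's discussion of "composing attention heads" is about how one head's output feeds into another head's Q, K, or V inputs.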
03:49:28 - Following Neel's research

The transcript

Links to Neel's things:
Neel on Twitter
Neel on the Alignment Forum
Neel's mechanistic interpretability blog
TransformerLens
Concrete Steps to Get Started in Transformer Mechanistic Interpretability
Neel on YouTube
200 Concrete Open Problems in Mechanistic Interpretability
Comprehensive mechanistic interpretability explainer

Writings we discuss:
A Mathematical Framework for Transformer Circuits
In-context Learning and Induction Heads
Progress measures for grokking via mechanistic interpretability
Hungry Hungry Hippos: Towards Language Modeling with State Space Models (referred to in this episode as the "S4 paper")
interpreting GPT: the logit lens
Locating and Editing Factual Associations in GPT (aka the ROME paper)
Human-level play in the game of Diplomacy by combining language models with strategic reasoning
Causal Scrubbing
An Interpretability Illusion for BERT
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Collaboration & Credit Principles
Transformer Feed-Forward Layers Are Key-Value Memories
Multi-Component Learning and S-Curves
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Linear Mode Connectivity and the Lottery Ticket Hypothesis
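The grokking paper above reports that small transformers trained on modular addition end up computing it with trigonometric identities: the inputs are embedded as cosines and sines at a few frequencies, combined via angle-addition identities, and read out as logits that peak at (a + b) mod p. A minimal numpy sketch of that algorithm (the modulus p = 113 matches the paper's setup; the specific frequencies here are illustrative, not the ones any trained network chose):

```python
import numpy as np

def modular_addition_logits(a, b, p=113, freqs=(1, 2, 3)):
    """Sketch of the algorithm grokked networks are found to implement:
    represent a and b as cos/sin at a few frequencies, combine them with
    trig identities, and produce logits over c that peak at (a+b) mod p."""
    c = np.arange(p)
    logits = np.zeros(p)
    for k in freqs:
        w = 2 * np.pi * k / p
        # Angle-addition identities build cos(w(a+b)) and sin(w(a+b))
        # out of cos/sin of a and b separately:
        cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
        sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
        # cos(w(a+b))cos(wc) + sin(w(a+b))sin(wc) = cos(w(a+b-c)),
        # which is maximal exactly when c = (a+b) mod p.
        logits += cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)
    return logits

ans = int(np.argmax(modular_addition_logits(50, 70)))
print(ans)  # 7, since (50 + 70) mod 113 = 7
```

Each frequency contributes a term cos(w(a+b−c)), so the summed logits are maximized only at c ≡ a+b (mod p); the paper uses the emergence of this Fourier structure as a progress measure that rises well before test loss suddenly drops.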