“Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models” by Andrew Mack, TurnTrout
Description
Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.
TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding: modelling the effect of causally intervening on the residual stream of a deep (i.e. $\gtrsim 10$-layer) slice of a transformer, using a shallow MLP. I find that the weights of these MLPs are highly interpretable -- input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.
Summary: I consider deep causal transcoders (DCTs) with various activation functions [...]
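To make the setup concrete, here is a minimal, hypothetical sketch (not the authors' code) of the simplest variant described above, a linear DCT: fit a single matrix that predicts how a perturbation added to the residual stream at a source layer changes activations at a later target layer, then read candidate steering and ablation directions off its singular vectors. The data below is synthetic, and names like `probes` and `effects` are illustrative placeholders.

```python
# Toy sketch of a *linear* DCT under assumed data; not the authors' implementation.
import torch

torch.manual_seed(0)
d_model, n_samples, rank = 64, 4096, 8

# Fabricated low-rank "ground truth" causal map from source layer s to target layer t,
# standing in for activations collected from hooked forward passes on a real model.
U_true = torch.randn(d_model, rank)
V_true = torch.randn(d_model, rank)

# Random unit-norm perturbations applied to the residual stream at layer s ...
probes = torch.nn.functional.normalize(torch.randn(n_samples, d_model), dim=-1)
# ... and the (noisy) downstream change they cause at layer t.
effects = probes @ V_true @ U_true.T + 0.01 * torch.randn(n_samples, d_model)

# A linear "MLP" (a single matrix) transcoding source-layer perturbations
# into target-layer effects.
W = torch.zeros(d_model, d_model, requires_grad=True)
opt = torch.optim.Adam([W], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = ((probes @ W.T - effects) ** 2).mean()
    loss.backward()
    opt.step()

# Input directions (right-singular vectors) are candidate steering vectors;
# output directions (left-singular vectors) are candidates for directional
# ablation at the target layer.
U, S, Vh = torch.linalg.svd(W.detach())
steering_vectors = Vh[:rank]       # input directions
ablation_directions = U[:, :rank]  # output directions
```

In the actual method the activations come from the transformer itself, and the quadratic and exponential DCTs discussed in the post replace the single matrix with a shallow nonlinear MLP; this toy only illustrates the linear case.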
---
Outline:
(05:40) Introduction
(07:16) Related work
(09:59) Theory
(18:28) Method
(22:04) Fitting a Linear MLP
(25:08) Fitting a Quadratic MLP
(30:12) Alternative formulation of tensor decomposition objective: causal importance minus similarity penalty
(34:03) Fitting an Exponential MLP
(36:35) On the role of R
(37:37) Relation to original MELBO objective
(38:54) Calibrating R
(41:44) Case Study: Learning Jailbreak Vectors
(41:49) Generalization of linear, quadratic and exponential DCTs
(51:49) Evidence for multiple harmless directions
(56:00) Many loosely-correlated DCT features elicit jailbreaks
(59:44) Averaging doesn't improve generalization when we add features to the residual stream
(01:01:05) Averaging does improve jailbreak scores when we ablate features
(01:03:15) Ablating (averaged) target-layer features also works
(01:04:49) Deeper models: constant depth horizon (t - s) suffices for learning jailbreaks
(01:09:32) Application: Jailbreaking Representation-Rerouted Mistral-7B
(01:17:15) Application: Eliciting Capabilities in Password-Locked Models
(01:18:42) Future Work
(01:19:01) Studying feature multiplicity
(01:20:03) Quantifying a broader range of behaviors
(01:20:24) Effect of pre-training hyper-parameters
(01:22:05) Acknowledgements
(01:22:16) Appendix
(01:22:19) Hessian auto-diff details
The original text contained 26 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
December 3rd, 2024
Source:
https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically
---
Narrated by TYPE III AUDIO.
---