“Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models” by Andrew Mack, TurnTrout
Description
Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.

TL;DR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding: modelling the effect of causally intervening on the residual stream of a deep (i.e. ≳10-layer) slice of a transformer using a shallow MLP. I find that the weights of these MLPs are highly interpretable: input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.

Summary: I consider deep causal transcoders (DCTs) with various activation functions [...]

---

Outline:
(05:40) Introduction
(07:16) Related work
(09:59) Theory
(18:28) Method
(22:04) Fitting a Linear MLP
(25:08) Fitting a Quadratic MLP
(30:12) Alternative formulation of tensor decomposition objective: causal importance minus similarity penalty
(34:03) Fitting an Exponential MLP
(36:35) On the role of R
(37:37) Relation to original MELBO objective
(38:54) Calibrating R
(41:44) Case Study: Learning Jailbreak Vectors
(41:49) Generalization of linear, quadratic and exponential DCTs
(51:49) Evidence for multiple harmless directions
(56:00) Many loosely-correlated DCT features elicit jailbreaks
(59:44) Averaging doesn't improve generalization when we add features to the residual stream
(01:01:05) Averaging does improve jailbreak scores when we ablate features
(01:03:15) Ablating (averaged) target-layer features also works
(01:04:49) Deeper models: constant depth horizon (t − s) suffices for learning jailbreaks
(01:09:32) Application: Jailbreaking Representation-Rerouted Mistral-7B
(01:17:15) Application: Eliciting Capabilities in Password-Locked Models
(01:18:42) Future Work
(01:19:01) Studying feature multiplicity
(01:20:03) Quantifying a broader range of behaviors
(01:20:24) Effect of pre-training hyper-parameters
(01:22:05) Acknowledgements
(01:22:16) Appendix
(01:22:19) Hessian auto-diff details

The original text contained 26 footnotes which were omitted from this narration. The original text contained 2 images which were described by AI.

---

First published: December 3rd, 2024

Source: https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically

---

Narrated by TYPE III AUDIO.
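The TL;DR mentions two interventions: adding a learned input direction to the residual stream as a steering vector, and directionally ablating an output direction. A minimal numpy sketch of what those operations look like on a residual-stream activation matrix — the function names, shapes, and scale parameter R are illustrative assumptions, not the post's actual implementation:

```python
import numpy as np

def steer(resid, v, R):
    """Add a steering vector of norm R along direction v at every
    sequence position. resid: (seq_len, d_model), v: (d_model,).
    (Illustrative: the post learns v as a DCT input direction.)"""
    v_hat = v / np.linalg.norm(v)
    return resid + R * v_hat

def directional_ablation(resid, v):
    """Project out direction v from every sequence position, so the
    activations retain no component along v.
    (Illustrative: the post ablates DCT output directions.)"""
    v_hat = v / np.linalg.norm(v)
    return resid - np.outer(resid @ v_hat, v_hat)

# Toy check: after ablation, activations are orthogonal to v.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))     # stand-in for residual activations
v = rng.normal(size=8)          # stand-in for a learned direction
x_abl = directional_ablation(x, v)
print(np.allclose(x_abl @ v, 0.0))  # prints True
```

In practice these edits would be applied inside a forward pass (e.g. via a hook on a transformer layer) rather than to a standalone array; the sketch only shows the linear algebra of the two interventions.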