“Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models” by Andrew Mack, TurnTrout
Description
Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.
TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding: modelling the effect of causally intervening on the residual stream of a deep (i.e. $\gtrsim 10$-layer) slice of a transformer, using a shallow MLP. I find that the weights of these MLPs are highly interpretable -- input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.
Summary: I consider deep causal transcoders (DCTs) with various activation functions [...]
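To make the setup concrete, here is a minimal, hypothetical sketch (not the authors' code) of the simplest variant described above, a linear DCT: fit a single matrix that predicts how a perturbation added to the residual stream at a source layer changes activations at a later target layer, then read candidate steering and ablation directions off its singular vectors. The data below is synthetic, and names like `probes` and `effects` are illustrative placeholders.

```python
# Toy sketch of a *linear* DCT under assumed data; not the authors' implementation.
import torch

torch.manual_seed(0)
d_model, n_samples, rank = 64, 4096, 8

# Fabricated low-rank "ground truth" causal map from source layer s to target layer t,
# standing in for activations collected from hooked forward passes on a real model.
U_true = torch.randn(d_model, rank)
V_true = torch.randn(d_model, rank)

# Random unit-norm perturbations applied to the residual stream at layer s ...
probes = torch.nn.functional.normalize(torch.randn(n_samples, d_model), dim=-1)
# ... and the (noisy) downstream change they cause at layer t.
effects = probes @ V_true @ U_true.T + 0.01 * torch.randn(n_samples, d_model)

# A linear "MLP" (a single matrix) transcoding source-layer perturbations
# into target-layer effects.
W = torch.zeros(d_model, d_model, requires_grad=True)
opt = torch.optim.Adam([W], lr=1e-2)
for step in range(500):
    opt.zero_grad()
    loss = ((probes @ W.T - effects) ** 2).mean()
    loss.backward()
    opt.step()

# Input directions (right-singular vectors) are candidate steering vectors;
# output directions (left-singular vectors) are candidates for directional
# ablation at the target layer.
U, S, Vh = torch.linalg.svd(W.detach())
steering_vectors = Vh[:rank]       # input directions
ablation_directions = U[:, :rank]  # output directions
```

In the actual method the activations come from the transformer itself, and the quadratic and exponential DCTs discussed in the post replace the single matrix with a shallow nonlinear MLP; this toy only illustrates the linear case.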
---
Outline:
(05:40) Introduction
(07:16) Related work
(09:59) Theory
(18:28) Method
(22:04) Fitting a Linear MLP
(25:08) Fitting a Quadratic MLP
(30:12) Alternative formulation of tensor decomposition objective: causal importance minus similarity penalty
(34:03) Fitting an Exponential MLP
(36:35) On the role of R
(37:37) Relation to original MELBO objective
(38:54) Calibrating R
(41:44) Case Study: Learning Jailbreak Vectors
(41:49) Generalization of linear, quadratic and exponential DCTs
(51:49) Evidence for multiple harmless directions
(56:00) Many loosely-correlated DCT features elicit jailbreaks
(59:44) Averaging doesn't improve generalization when we add features to the residual stream
(01:01:05) Averaging does improve jailbreak scores when we ablate features
(01:03:15) Ablating (averaged) target-layer features also works
(01:04:49) Deeper models: constant depth horizon (t - s) suffices for learning jailbreaks
(01:09:32) Application: Jailbreaking Representation-Rerouted Mistral-7B
(01:17:15) Application: Eliciting Capabilities in Password-Locked Models
(01:18:42) Future Work
(01:19:01) Studying feature multiplicity
(01:20:03) Quantifying a broader range of behaviors
(01:20:24) Effect of pre-training hyper-parameters
(01:22:05) Acknowledgements
(01:22:16) Appendix
(01:22:19) Hessian auto-diff details
The original text contained 26 footnotes which were omitted from this narration.
The original text contained 2 images which were described by AI.
---
First published:
December 3rd, 2024
Source:
https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically
---
Narrated by TYPE III AUDIO.
---