Episodes
Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.
TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding -...
Published 12/04/24
Preface
Several friends have asked me about what psychological effects I think could affect human judgement about x-risk.
This isn't a complete answer, but in 2018 I wrote a draft of "AI Research Considerations for Human Existential Safety" (ARCHES) that included an overview of cognitive biases I thought (and still think) will impair AI risk assessments. Many cognitive bias experiments had already failed to reproduce well in the psychology reproducibility crisis, so I thought it would be a...
Published 12/04/24
In the spirit of the season, you can book a call with me to help w/ your interp project (no large coding though)
Would you like someone to:
Review your paper or code?
Brainstorm ideas on next steps?
How to best communicate your results?
Discuss conceptual problems
Obvious Advice (e.g. being affected by SAD because it's winter, not exercising, not getting enough sleep)
[anything else that would be useful!]
When we're chatting, you can interrupt me to better focus on what you specifically want....
Published 12/03/24
Tom did the original thinking; Rose helped with later thinking, structure and writing.
Some plans for AI governance involve centralising western AGI development.[1] Would this actually be a good idea? We don’t think this question has been analysed in enough detail, given how important it is. In this post, we’re going to:
Explore the strategic implications of having one project instead of several
Discuss what we think the best path forwards is, given that strategic landscape
(If at this point...
Published 12/03/24
MIRI is a nonprofit research organization with a mission of addressing the most serious hazards posed by smarter-than-human artificial intelligence. In our general strategy update and communications strategy update earlier this year, we announced a new strategy that we’re executing on at MIRI, and several new teams we’re spinning up as a result. This post serves as a status update on where things stand at MIRI today.
We originally planned to run a formal fundraiser this year, our first one...
Published 12/03/24
I work on Open Philanthropy's AI Governance and Policy team, but I’m writing this in my personal capacity – several senior employees at Open Phil have argued with me about this!
This is a brief-ish post addressed to people who are interested in making high-impact donations and are already concerned about potential risks from advanced AI. Ideally such a post would include a case that reducing those risks is an especially important (and sufficiently tractable and neglected) cause area, but I’m...
Published 12/02/24
On Carcinogenic Complexity, Software Senescence and Cognitive Provenance: Our roadmap for 2025 and beyond
It is mandatory to start any essay on AI in the post-ChatGPT era with the disclaimer that AI brings huge potential, and great risks. Unfortunately, on the path we are currently on, we will not realize those benefits, but are far more likely to simply drown in terrible AI slop, undermine systemic cybersecurity and blow ourselves up.
We believe AI on its current path will continue to...
Published 12/02/24
The Less Wrong General Census is unofficially here! You can take it at this link.
The oft-interrupted annual tradition of the Less Wrong Census is once more upon us!
If you are reading this post and identify as a LessWronger, then you are the target audience. If you are reading this post and don't identify as a LessWronger, you just read posts here or maybe go to house parties full of rationalists or possibly read rationalist fanfiction and like talking about it on the internet, or you're not...
Published 12/02/24
Two months ago I attended Eric Drexler's launch of MSEP.one. It's open source software, written by people with professional game design experience, intended to catalyze better designs for atomically precise manufacturing (or generative nanotechnology, as he now calls it).
Drexler wants to draw more attention to the benefits of nanotech, which involve large enough exponents that our intuition boggles at handling them. That includes permanent health (Drexler's new framing of life extension and...
Published 12/02/24
Source: https://www.lesswrong.com/posts/85xkq9Go9AAg3raJ8/sorry-for-the-downtime-looks-like-we-got-ddosd
Published 12/02/24
YouTube link
The ‘model organisms of misalignment’ line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: “Sleeper Agents” and “Sycophancy to Subterfuge”.
Topics we discuss:
Model organisms and stress-testing
Sleeper Agents
Do ‘sleeper agents’ properly model...
Published 12/02/24
TLDR: LessWrong + Lighthaven need about $3M for the next 12 months. Donate here, or send me an email, DM or signal message (+1 510 944 3235) if you want to support what we do. Donations are tax-deductible in the US. Reach out for other countries, we can likely figure something out. We have big plans for the next year, and due to a shifting funding landscape we need support from a broader community more than in any previous year.
I've been running LessWrong/Lightcone Infrastructure for the...
Published 11/30/24
TLDR: In this post, I argue that if you are a junior AI safety researcher, you should consider applying to PhD programs in ML soon, especially if you have recently participated in an AI safety upskilling or research program like MATS or ARENA, might be interested in working on AI safety long term, but don't have immediate career plans. It is relatively cheap to apply, and provides good future option value. I don’t argue that you should necessarily do a PhD, but some other posts do. PhD...
Published 11/29/24
Note: This is meant to be an accessible introduction to causal inference. Comments appreciated.
Let's say you buy a basil plant and put it on the counter in your kitchen. Unfortunately, it dies in a week.
So the next week you buy another basil plant and feed it a special powder, Vitality Plus. This second plant lives. Does that mean Vitality Plus worked?
Not necessarily! Maybe the second week was a lot sunnier, you were better about watering, or you didn’t grab a few leaves for a pasta. In...
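The point the excerpt is building toward is confounding: the sunnier week, not the powder, could explain the difference. As a rough illustration only (not taken from the post, and with invented numbers), here is a minimal Python sketch in which the powder does nothing and sunlight does all the work, yet a naive week-to-week comparison still makes the powder look effective:

    import random

    random.seed(0)

    # Hypothetical toy model: sunlight (the confounder) drives survival,
    # and the powder happens to be given only in the sunny week.
    def basil_survives(sunny: bool, powder_given: bool) -> bool:
        survival_prob = 0.8 if sunny else 0.2   # only sunlight matters here
        return random.random() < survival_prob  # powder_given is ignored on purpose

    trials = 10_000
    week1 = sum(basil_survives(sunny=False, powder_given=False) for _ in range(trials)) / trials
    week2 = sum(basil_survives(sunny=True, powder_given=True) for _ in range(trials)) / trials

    print(f"Survival rate, no powder (cloudy week): {week1:.2f}")
    print(f"Survival rate, powder (sunny week):     {week2:.2f}")

Because the powder and the sunshine change together across the two weeks, this comparison cannot tell them apart; that entanglement is exactly the problem causal inference sets out to address.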
Published 11/29/24
There are lots of great charitable giving opportunities out there right now.
The first time that I served as a recommender in the Survival and Flourishing Fund (SFF) was back in 2021. I wrote in detail about my experiences then. At the time, I did not see many great opportunities, and was able to give out as much money as I found good places to do so.
How the world has changed in three years.
I recently had the opportunity to be an SFF recommender for the second time. This time I found...
Published 11/29/24
AI infrastructure numbers are hard to find with any precision. There are many reported numbers of "[company] spending Xbn on infrastructure this quarter" and "[company] has bought 100k H100s" or "has a cluster of 100k H100s", but when I went looking for an estimate of how much compute a given company had access to, I could not find consistent numbers available. Here I've tried to pull together information from a variety of sources to get ballpark estimates of (i) as of EOY 2024, who do we...
Published 11/29/24
If you're interested in helping to run the ARENA program, note that we're currently hiring for an operations lead! For more details, and to apply, see here.
Summary
The purpose of this report is to evaluate ARENA 4.0's impact according to our four success criteria:
Source high-quality participants
Upskill these talented participants in ML skills for AI safety work
Integrate participants with the existing AI safety community and legitimise AI safety as a compelling field to work in
Accelerate...
Published 11/28/24
People don’t give thanks enough, and it's actual Thanksgiving, so here goes.
Thank you for continuing to take this journey with me every week.
It's a lot of words. Even if you pick and choose, and you probably should, it's a lot of words. You don’t have many slots to spend on things like this. I appreciate it.
Thanks in particular for those who are actually thinking about all this, and taking it seriously, and forming their own opinions. It is the only way. To everyone who is standing up,...
Published 11/28/24
This is a link post. Note: The linked site is a personal project, and all views expressed here are my own.
TL;DR
I created an interactive flowchart about various scenarios for how the future of AI might play out.
By setting various conditional probabilities you can see charts of what the resulting estimated probabilities for good, ambiguous, and existentially bad outcomes are.
Use the site as a conversation starter and tool for reflection. You can share your results either as images of the...
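As a rough illustration of what chaining user-set conditional probabilities into outcome estimates looks like, here is a minimal Python sketch; the branch structure and every number below are invented and far simpler than the linked site:

    # Hypothetical two-branch scenario tree: multiply conditional probabilities
    # along each path and sum over paths to get an overall outcome estimate.
    p_agi_soon = 0.5            # P(AGI within a decade) -- illustrative only
    p_good_given_soon = 0.3     # P(good outcome | AGI soon)
    p_good_given_late = 0.7     # P(good outcome | AGI later)

    p_good = p_agi_soon * p_good_given_soon + (1 - p_agi_soon) * p_good_given_late
    print(f"P(good outcome) = {p_good:.2f}")
    print(f"P(not good)     = {1 - p_good:.2f}")

The actual site distinguishes good, ambiguous, and existentially bad outcomes and uses a larger tree, but the arithmetic behind the charts is this kind of path-wise multiplication.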
Published 11/28/24
(Epistemic status: Very loosely held and generated in a 90-minute workshop led by @Daniel Kokotajlo, @Thomas Larsen, @elifland, and Jonas Vollmer at The Curve Conference; explores how it might happen, if it happens soon. I expect there to be at least one "duh, that makes no sense" discovered with any significant level of attention that would require me to rethink some of this.)
Recently, at The Curve conference, I participated in a session that helped facilitate us writing AGI vignettes --...
Published 11/28/24
This is a link post. A new o1-like model based on Qwen-2.5-32B reportedly beats Claude 3.5 Sonnet[1] on a bunch of difficult reasoning benchmarks. A new regime dawns.
The blog post reveals nothing but the most inane slop ever sampled:
What does it mean to think, to question, to understand? These are the deep waters that QwQ (Qwen with Questions) wades into. Like an eternal student of wisdom, it approaches every problem - be it mathematics, code, or knowledge of our world - with genuine wonder...
Published 11/27/24
YouTube link
You may have heard of singular learning theory, and its “local learning coefficient”, or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
Topics we discuss:
About Jesse
The Alignment Workshop
About Timaeus
SLT that isn’t developmental interpretability
The refined local learning coefficient
Finding the multigram circuit
Daniel Filan...
Published 11/27/24
I am sharing this call from the EU AI Office for organizations involved in evaluation. Please take a close look: among the selection criteria, organizations must be based in Europe, or their leader must be European. If these criteria pose challenges for some of you, feel free to reach out to me at [email protected]. We can explore potential ways to collaborate through PRISM Eval. I believe it's crucial that we support one another on these complex and impactful issues.
The AI office is...
Published 11/27/24
The most ambitious direction I've worked on in the past few years is a theory of hierarchical agency (as mentioned in the ACS announcement). Recently, I've become worried that the inferential distance between "the median of current AI safety discourse" and "what I and my colleagues at ACS work on" has become too large. After multiple attempts to write a good introduction to hierarchical agency ended up in perpetual draft form, I decided to try something else: explain it to Claude. This is a...
Published 11/27/24