Why you should write your own LLM benchmarks — with Nicholas

Latent Space: Founders, Engineers, and News on...

Why you should write your own LLM benchmarks — with Nicholas Carlini, Google DeepMind

Listen now

Description

Today's guest, Nicholas Carlini, a research scientist at DeepMind, argues that we should be focusing more on what AI can do for us individually, rather than trying to have an answer for everyone. "How I Use AI" - A Pragmatic Approach Carlini's blog post "How I Use AI" went viral for good reason. Instead of giving a personal opinion about AI's potential, he simply laid out how he, as a security researcher, uses AI tools in his daily work. He divided it in 12 sections: * To make applications * As a tutor * To get started * To simplify code * For boring tasks * To automate tasks * As an API reference * As a search engine * To solve one-offs * To teach me * Solving solved problems * To fix errors Each of the sections has specific examples, so we recommend going through it. It also includes all prompts used for it; in the "make applications" case, it's 30,000 words total! My personal takeaway is that the majority of the work AI can do successfully is what humans dislike doing. Writing boilerplate code, looking up docs, taking repetitive actions, etc. These are usually boring tasks with little creativity, but with a lot of structure. This is the strongest arguments as to why LLMs, especially for code, are more beneficial to senior employees: if you can get the boring stuff out of the way, there's a lot more value you can generate. This is less and less true as you go entry level jobs which are mostly boring and repetitive tasks. Nicholas argues both sides ~21:34 in the pod. A New Approach to LLM Benchmarks We recently did a Benchmarks 201 episode, a follow up to our original Benchmarks 101, and some of the issues have stayed the same. Notably, there's a big discrepancy between what benchmarks like MMLU test, and what the models are used for. Carlini created his own domain-specific language for writing personalized LLM benchmarks. The idea is simple but powerful: * Take tasks you've actually needed AI for in the past. * Turn them into benchmark tests. * Use these to evaluate new models based on your specific needs. It can represent very complex tasks, from a single code generation to drawing a US flag using C: "Write hello world in python" >> LLMRun() >> PythonRun() >> SubstringEvaluator("hello world") "Write a C program that draws an american flag to stdout." >> LLMRun() >> CRun() >> \ VisionLLMRun("What flag is shown in this image?") >> \ (SubstringEvaluator("United States") | SubstringEvaluator("USA"))) This approach solves a few problems: * It measures what's actually useful to you, not abstract capabilities. * It's harder for model creators to "game" your specific benchmark, a problem that has plagued standardized tests. * It gives you a concrete way to decide if a new model is worth switching to, similar to how developers might run benchmarks before adopting a new library or framework. Carlini argues that if even a small percentage of AI users created personal benchmarks, we'd have a much better picture of model capabilities in practice. AI Security While much of the AI security discussion focuses on either jailbreaks or existential risks, Carlini's research targets the space in between. Some highlights from his recent work: * LAION 400M data poisoning: By buying expired domains referenced in the dataset, Carlini's team could inject arbitrary images into models trained on LAION 400M. You can read the paper "Poisoning Web-Scale Training Datasets is Practical", for all the details. This is a great example of expanding the scope beyond the model itself, and looking at the whole system and how ti can become vulnerable. * Stealing model weights: They demonstrated how to extract parts of production language models (like OpenAI's) through careful API queries. This research, "Extracting Training Data from Large Language Models", shows that even black-box access can leak sensitive information. * Extracting training data: In some cases, they found ways to make models regurgitate verbatim snippets from their training data. Him and Mila

More Episodes

See all »

Agents @ Work: Lindy.ai

Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here...

Published 11/15/24

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and al

Published 11/15/24

Agents @ Work: Dust.tt

We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! In our first ever episode with Logan Kilpatrick we called out...

Published 11/11/24