Truly Serverless Infra for AI Engineers - with Erik Bernhardsson of Modal
We’re writing this one day after the monster release of OpenAI’s Sora and Gemini 1.5. We covered this on the ThursdAI space, so head over there for our takes.

IRL: We’re ONE WEEK away from Latent Space: Final Frontiers, the second edition and anniversary of our first ever Latent Space event! Also: join us on June 25-27 for the biggest AI Engineer conference of the year! Online: All three Discord clubs are thriving. Join us every Wednesday/Friday!

Almost 12 years ago, while working at Spotify, Erik Bernhardsson built one of the first open source vector databases, Annoy, based on approximate nearest neighbor (ANN) search. He also built Luigi, one of the predecessors to Airflow, which helps data teams orchestrate and execute data-intensive, long-running jobs. Surprisingly, he didn’t start yet another vector database company; instead, in 2021 he founded Modal, the “high-performance cloud for developers”. In 2022 they opened their doors to developers after their seed round, and in 2023 they announced GA with a $16m Series A. More importantly, they have won fans among household names like Ramp, Scale AI, Substack, and Cohere, as well as newer startups like (upcoming guest!) Suno.ai and individual hackers (Modal was the top tool of choice in the Vercel AI Accelerator).

We've covered the nuances of GPU workloads and why they need new developer tooling and runtimes (see our episodes with Chris Lattner of Modular and George Hotz of tiny to start). In this episode, we run through the major limitations of the actual infrastructure behind the clouds that run these models, and how Erik envisions the “postmodern data stack”.

In his 2021 blog post “Software infrastructure 2.0: a wishlist”, Erik listed “Truly serverless” as one of his points:

* The word “cluster” is an anachronism to an end-user in the cloud! I'm already running things in the cloud, where elastic resources are available at any time. Why do I have to think about the underlying pool of resources? Just maintain it for me.
* I don't ever want to provision anything in advance of load.
* I don't want to pay for idle resources. Just let me pay for whatever resources I'm actually using.
* Serverless doesn't mean it's a burstable VM that saves its instance state to disk during periods of idle.

Swyx called this Self-Provisioning Runtimes back in the day. Modal doesn’t put you in YAML hell, preferring to colocate infra provisioning right next to the code that utilizes it, so you can just add a GPU (and disk, and retries…).

After 3 years, we finally have a big market push for this: running inference on generative models is going to be the killer app for serverless, for a few reasons:

* AI models are stateless: even in conversational interfaces, each message generation is a fully contained request to the LLM. No knowledge is stored in the model itself between messages, so tearing down and spinning up resources creates no headaches around maintaining state.
* Token-based pricing aligns better with serverless infrastructure than the fixed monthly costs of traditional software.
* GPU scarcity makes it really expensive to have reserved instances available to you 24/7. It’s much more convenient to build on serverless-like infrastructure.

In the episode we covered a lot more topics, like maximizing GPU utilization, why Oracle Cloud rocks, and how Erik has never owned a TV in his life. Enjoy!

Show Notes

* Modal
* ErikBot
* Erik’s Blog
* Software Infra 2.0 Wishlist
* Luigi
* Annoy
* Hetzner
* CoreWeave
* Cloudflare FaaS
* Poolside AI
* Modular Inference Engine

Chapters

* [00:00:00] Introductions
* [00:02:00] Erik's OSS work at Spotify: Annoy and Luigi
* [00:06:22] Starting Modal
* [00:07:54] Vision for a "postmodern data stack"
* [00:10:43] Solving container cold start problems
* [00:12:57] Designing Modal's Python SDK
* [00:15:18] Self-Provisioning Runtime
* [00:19:14] Truly Serverless Infrastructure
* [00:20:52] Beyond model inference
* [00:22:
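The "AI models are stateless" argument above can be made concrete in a few lines of Python. This is a hedged sketch, not Modal's or any provider's API: `handle_request` is a hypothetical stand-in for an LLM inference call. The point is that the client owns the conversation history and resends it on every turn, so no state lives on the worker between requests and any freshly spun-up container can serve the next message.

```python
def handle_request(messages: list[dict]) -> dict:
    """Hypothetical stand-in for an LLM call: the reply depends only on
    the messages in this one request -- no worker-local state survives
    between calls."""
    last = messages[-1]["content"]
    return {"role": "assistant", "content": f"echo: {last}"}

# The client keeps the full history and ships it with every turn.
history = [{"role": "user", "content": "hello"}]
history.append(handle_request(history))             # turn 1
history.append({"role": "user", "content": "bye"})
history.append(handle_request(history))             # turn 2, possibly served
                                                    # by a different worker
```

Because nothing persists server-side between turns, tearing a container down during idle periods loses nothing, which is exactly what makes scale-to-zero serverless viable for inference.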