Latent Space: Founders, Engineers, and News on...

Production AI Engineering starts with Evals — with Ankur Goyal of Braintrust

Description

We are in 🗽 NYC this Monday! Join the AI Eng NYC meetup, bring demos and vibes!

It is a bit of a meme that the first thing developer tooling founders think to build in AI is all the non-AI operational stuff outside the AI. There are well over 60 funded LLM Ops startups, all hoping to solve the new observability, cost tracking, security, and reliability problems that come with putting LLMs in production, not to mention new LLM-oriented products from incumbent, established ops/o11y players like Datadog and Weights & Biases.

Two years into the current hype cycle, the early winners have tended to be people with practical/research AI backgrounds rather than MLOps heavyweights or SWE tourists:

* LangSmith: We covered how Harrison Chase worked on AI at Robust Intelligence and Kensho, the alma maters of many great AI founders
* Humanloop: We covered how Raza Habib worked at Google AI during his PhD
* Braintrust: Today’s guest Ankur Goyal founded Impira pre-Transformers and was acquihired to run Figma AI before realizing how to solve the Ops problem.

There have been many VC think pieces and market maps describing what people thought were the essential pieces of the AI Engineering stack, but what was true for 2022-2023 has aged poorly. The basic insight that Ankur had is the same thesis that Hamel Husain is pushing in his World’s Fair talk and podcast with Raza and swyx: Evals are the centerpiece of systematic AI Engineering.

REALLY believing in this is harder than it looks with the benefit of hindsight. It’s not like people didn’t know evals were important. Basically every LLM Ops feature list has them. It’s an obvious next step AFTER managing your prompts and logging your LLM calls.
In fact, until we met Braintrust, we were working on an expanded version of the Impossible Triangle Theory of the LLM Ops War that we first articulated in the Humanloop writeup.

The single biggest criticism of the Rise of the AI Engineer piece is that we neglected to split out the role of product evals (as opposed to model evals) in the now infamous “API line” chart. With hindsight, we were very focused on the differentiating 0 to 1 phase that AI Engineers can bring to an existing team of ML engineers. As swyx says in the Day 2 keynote of AI Engineer, 2024 added a whole new set of concerns as AI Engineering grew up.

A closer examination of Hamel’s product-oriented virtuous cycle and this infra-oriented SDLC would have eventually revealed that evals, even more than logging, are the first point where teams start to get really serious about shipping to production, and therefore a great place to enter the market, which is exactly what Braintrust did.

Also notice what’s NOT on this chart: shifting to shadow open source models, and fine-tuning them… per Ankur, fine-tuning is not a viable standalone product:

“The thing I would say is not debatable is whether or not fine-tuning is a business outcome or not. So let's think about the other components of your triangle. Ops/observability, that is a business… Frameworks, evals, databases [are a business, but] Fine-tuning is a very compelling method that achieves an outcome. The outcome is not fine-tuning, it is can I automatically optimize my use case to perform better if I throw data at the problem? And fine-tuning is one of multiple ways to achieve that.”

OpenAI vs Open AI Market Share

We last speculated about the market shifts in the End of OpenAI Hegemony and the Winds of AI Winter, and Ankur’s perspective is super valuable given his customer list. Some surprises based on what he is seeing:

* Prior to Claude 3, OpenAI had near 100% market share. This tracks with what Harrison told us last year.
* Claude 3.5 Sonnet and also notably Haiku have made serious dents
* Open source model adoption is . Contrary to Eugene Cheah’s ideal marketing pitch, virtually none of Braintrust’s customers are really fine-tuning open source models for cost, control, or privacy.
