Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and

Latent Space: Founders, Engineers, and News on...

Cloud Intelligence at the speed of 5000 tok/s - with Ce Zhang and Vipul Ved Prakash of Together AI

Listen now

Description

Our first ever demo day aimed for 15-20 people and ended up ballooning to >200 and covered in the news. We are now running the 2024 edition in SF on Feb 23: Latent Space Final Frontiers, a startup and research competition in “The Autonomous Workforce”, ”Beyond Transformers & GPUs”, and “Embodied AI”. RSVP here! You can find all LS online/IRL events on our new calendar. Super Early Bird tickets have just gone on sale for AI Engineer World’s Fair, June 25-27! Today we have the honor of hosting two of Together AI’s co-founders: Ce Zhang (CTO) and Vipul Ved Prakash (CEO). This is a rare opportunity to recap the history of the company since our last check-in with Tri Dao (Chief Scientist), some of their big releases, and do a deep dive into the state of the AI inference market. Together has emerged as one of the most consequential new startups in the new AI summer, last announcing a ~$100m Series A raise in November (at a ~$360-565m valuation). But there are at least three Togethers - Together the Research Lab, Together the Fine Tuning & Inference platform, and Together the custom models service. As we clarify on the pod, the overarching philosophy of Together is the ability to improve on all these fronts simultaneously by being “full stack”, from the lowest level kernel and systems programming to the highest level mathematical abstractions driving new model architectures and inference algorithms. Bringing Research and Industry Together In just one year, Together has been behind some of the most exciting research in AI: * RedPajama, a fully open source dataset for model pre-training which mirrored the Llama1 recipe. Then followed by RedPajama2, a 30T tokens dataset of filtered and de-duplicated tokens. * RedPajama-INCITE-3B and 7B, which were SOTA in a few benchmarks at the time of release. * FlashAttention-2, developed by Together’s Chief Scientist Tri Dao. We covered FA-2 in a previous episode with him. * Mamba-3B, the most promising transformer-alternative model that they released in collaboration with Cartesia. * StripedHyena, a SOTA graft of Hyena state space models and transformer models together * Medusa, an alternative to speculative decoding that lets you use multiple decoding heads instead of a draft model. * MonarchMixer, which was one of the most popular orals at NeurIPS 2023. It’s an approach to transformers that replaces many of its core parts with Monarch matrices for better computational efficiency. And I’m sure we missed something! As Vipul reveals, almost 50% of Together staff is researchers, and two of their co-founders (Chris Ré and Percy Liang) are professors at Stanford, so we can expect a lot more here. Bringing “Disaggregated” GPUs Together On their cloud, they offer inference as a service, fine-tuning, pre-training, etc, but unlike other providers they think of themselves as a disaggregated cloud. Today, they have ~8,000 A100 and H100 GPUs on their platform (an exclusive revealed on the pod!) totaling over 20 exaflops of compute, but instead of just buying more and putting them in a cluster and then exposing a `us-east-1` option for customers, they are taking heterogenous compute sources and adding a unified layer on top of it for developers to consume. Building on Ce’s research, Together’s GPU Clusters are taking on comparable AWS and GCP offerings in both cost and speed: Take the Hessian AI center in Germany or the DoE’s INCITE; they have GPUs that they want to share with researchers, but they lack the cloud layer over it. Similarly, there’s starting to be more and more differentiation amongst types of GPUs: H100s, A100s, MI3000s, etc. Each of them has different availability and performance based on task, and the end user shouldn’t have to be an hardware expert to run inference on a model, so Together abstracts a lot of that away. A big theme of the Together inference stack, a “bag of 50 tricks” that we discuss on the pod, is also “hardware-

More Episodes

See all »

Agents @ Work: Lindy.ai

Alessio will be at AWS re:Invent next week and hosting a casual coffee meetup on Wednesday, RSVP here! And subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! We are still taking questions for our next big recap episode! Submit questions and messages on Speakpipe here...

Published 11/15/24

Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and al

Published 11/15/24

Agents @ Work: Dust.tt

We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show! Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups! In our first ever episode with Logan Kilpatrick we called out...

Published 11/11/24