📅 ThursdAI - June 20th - 👑 Claude Sonnet 3.5 new LLM king, DeepSeek new OSS code king, Runway Gen-3 SORA competitor, Ilya's back & more AI news from this crazy week
Description
Hey, this is Alex. Don't you just love when assumptions about LLMs hitting a wall just get shattered left and right and we get new incredible tools released that leapfrog previous state of the art models, that we barely got used to, from just a few months ago? I SURE DO!
Today is one such day. This week was already busy enough: I had a whole 2-hour show packed with releases, and then Anthropic decided to give me a reason to use the #breakingNews button (the one that plays the news-show-style sound on the live show, you should join next time!) and announced Claude Sonnet 3.5, their best model yet, beating Opus while being 2x faster and 5x cheaper! (Also beating GPT-4o and GPT-4 Turbo, so... new king! For how long? ¯\_(ツ)_/¯)
Critics are already raving, it's been half a day and they are raving! Ok, let's get to the TL;DR and then dive into Claude 3.5 and a few other incredible things that happened this week in AI! 👇
TL;DR of all topics covered:
* Open Source LLMs
  * NVIDIA - Nemotron 340B - Base, Instruct and Reward model (X)
  * DeepSeek Coder V2 (236B MoE, 16B) (X, HF)
  * Meta FAIR - Chameleon MMIO models (X)
  * HF + BigCodeProject are replacing HumanEval with BigCodeBench (X, Bench)
  * NousResearch - Hermes 2 Llama 3 Theta 70B - GPT-4 level OSS on MT-Bench (X, HF)
* Big CO LLMs + APIs
  * Gemini Context Caching is available
  * Anthropic releases Sonnet 3.5 - beating GPT-4o (X, Claude.ai)
  * Ilya Sutskever starting SSI.inc - safe super intelligence (X)
  * NVIDIA is the biggest company in the world by market cap
* This week's Buzz
  * Alex in SF next week for AIQCon, AI Engineer. ThursdAI will be sporadic but will happen!
  * W&B Weave now has support for tokens and cost + Anthropic SDK out of the box (Weave Docs)
* Vision & Video
  * Microsoft open sources Florence 230M & 800M Vision Models (X, HF)
  * Runway Gen-3 - (t2v, i2v, v2v) Video Model (X)
* Voice & Audio
  * Google DeepMind teases V2A video-to-audio model (Blog)
* AI Art & Diffusion & 3D
  * Flash Diffusion for SD3 is out - Stable Diffusion 3 in 4 steps! (X)
🦀 New king of LLMs in town - Claude 3.5 Sonnet 👑
Ok so first things first, Claude Sonnet, the previously forgotten middle child of the Claude 3 family, has now received a brain upgrade!
Achieving incredible performance on many benchmarks, this new model is 5 times cheaper than Opus, at $3 per 1M input tokens and $15 per 1M output tokens. It's also competitive against GPT-4o and GPT-4 Turbo on the standard benchmarks, achieving incredible scores on MMLU, HumanEval, etc., but we know that those are already behind us.
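To make the pricing concrete, here's a quick back-of-the-napkin sketch of what a single request costs. The Sonnet 3.5 prices are the ones quoted above; the Opus prices are an assumption derived purely from the "5 times cheaper" claim, and the token counts are made up for illustration.

```python
# Prices in USD per 1M tokens.
# Sonnet 3.5 numbers come from the post; the Opus numbers are an ASSUMPTION
# derived from the "5 times cheaper" claim, not quoted from a price sheet.
SONNET_35 = {"input": 3.00, "output": 15.00}
OPUS = {"input": 5 * 3.00, "output": 5 * 15.00}


def request_cost(prices, input_tokens, output_tokens):
    """Return the USD cost of one request at the given per-1M-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical request: a 10k-token prompt with a 1k-token answer.
sonnet = request_cost(SONNET_35, 10_000, 1_000)
opus = request_cost(OPUS, 10_000, 1_000)

print(f"Sonnet 3.5: ${sonnet:.3f} | Opus: ${opus:.3f} | ratio: {opus / sonnet:.1f}x")
```

Note that output tokens cost 5x more than input tokens on either model, so long generations dominate the bill much faster than long prompts do.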
Sonnet 3.5, aka Claw'd (which is a great marketing push by the Anthropic folks, I love to see it), is beating all other models on the Aider.chat code editing leaderboard, winning on the new livebench.ai leaderboard, and getting top scores on MixEval Hard, which has a 96% correlation with the LMSYS Chatbot Arena.
While benchmarks are great and all, real folks are reporting real findings of their own. Here's what Friend of the Pod Pietro Schirano had to say after playing with it:
there's like a lot of things that I saw that I had never seen before in terms of like creativity and like how much of the model, you know, actually put some of his own understanding into your request
-@Skirano
What's notable as a capability boost is this quote from the Anthropic release blog:
In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, outperforming Claude 3 Opus which solved 38%.
One detail that Alex Albert from Anthropic pointed out about this release: on the GPQA (Graduate-Level Google-Proof Q&A) benchmark, they achieved 67% with various prompting techniques, beating PhD experts in their respective fields, who average 65% on this benchmark. This... this is crazy
Beyond just the benchmarks
This to me is a ridiculous jump bec