📖 ThursdAI - Sunday special on datasets classification & alternative transformer architectures
Hello hello everyone, welcome to another special episode (some podcasts call them just... episodes, I guess, but here you get AI news every ThursdAI, and on Sunday you get the deeper dives). BTW, I'm writing these words looking at a 300-inch monitor that's hovering above my usual workstation in the Apple Vision Pro, and while this is an AI newsletter and I've yet to find a connecting link (there are like 3 AI apps in there right now, one fairly boring chatbot, and Siri... don't get me started on Siri), I'll definitely be covering my experience in the next ThursdAI, because, well, I love everything new and technological. AI is a huge part of it, but not the ONLY part!

📖 It's all about the (big) Datasets

Ok, back to the matter at hand. If you've used, finetuned, trained, or even just heard about an AI model, you may not realize how important the dataset it was trained on is. We often talk about this model or that model, and often the only difference is additional data that folks (whom I sometimes refer to as alchemists) have collected, curated, and structured. Creating, curating, and editing those datasets is an art and a science.

For example, three friends of the pod, namely LDJ with Capybara, Austin with OpenChat, and Teknium with Hermes, have been consistently taking off-the-shelf open source models and making them smarter, better instruction-tuned, and better suited for specific purposes.

These datasets are paired with different training techniques as well. For example, the so-called DPO (Direct Preference Optimization) technique has lately shown promise: it not only shows a model which answer is correct for a specific query, it shows an incorrect answer as well, and trains the model to prefer one over the other (see the recent Capybara DPO improvement by Argilla, which improved model metrics across every evaluation). There's a small illustrative sketch of what such a preference pair looks like a bit further down.

These datasets can range from super high quality sets of 16K rows to millions of rows (Teknium's recently released Hermes, one of the higher quality datasets, comes in at just a tad over 1 million rows), and often they are an amalgamation of several other datasets into one. In the case of Hermes, Teknium compiled this 1 million chats from at least 15 different datasets, some his own, some by folks like Jon Durbin, Garage bAInd, and shareGPT from LMsys.org, which was compiled by scraping the very popular sharegpt.com website, where folks used the shareGPT extension to share their GPT-4 conversations. It's quite remarkable how much of these datasets are just conversations that users had with GPT-4!

Lilac brings Garden

With that backdrop of information, today on the pod we've got the co-founders of Lilac, Nikhil Thorat and Daniel Smilkov, who came on to chat about the new thing they just released, called Lilac Garden. Lilac is an open source tool (you can find it RIGHT HERE) built to help make dataset creation, curation, and classification more science than art, and to help visualize the data, cluster it, and make it easily available. In the case of Hermes, that's over a million rows of data.

On the pod, I talk with Nikhil and Daniel about the origin of what they both did at Google, working on Tensorflow.js and then something called "know your data", and how eventually they realized that in this era of LLMs, open sourcing a tool that can understand huge datasets, run LLM-based classifiers on top of them, or even train specific ones, is important and needed!
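Circling back to DPO for a second: here's the promised sketch of what a single preference pair looks like and how the DPO loss uses it. The example row and the toy numbers are made up for illustration, and this is a bare-bones version of the loss rather than anyone's actual training code.

```python
import torch
import torch.nn.functional as F

# One row of a preference dataset, as used by DPO-style training.
# (Illustrative example, not an actual row from the Capybara DPO set.)
preference_pair = {
    "prompt":   "Explain why the sky is blue in one sentence.",
    "chosen":   "Sunlight scatters off air molecules, and shorter blue "
                "wavelengths scatter the most, so the sky looks blue.",
    "rejected": "The sky is blue because the ocean reflects onto it.",
}

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Bare-bones DPO loss (Rafailov et al., 2023).

    Each argument is the summed log-probability that the policy model /
    frozen reference model assigns to the chosen / rejected completion.
    """
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to widen the margin between chosen and rejected.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# Toy log-probabilities, just to show the shape of the computation.
loss = dpo_loss(torch.tensor(-12.0), torch.tensor(-15.0),
                torch.tensor(-13.0), torch.tensor(-14.5))
print(loss)  # smaller when the policy prefers "chosen" more than the reference does
```

The key point is that the dataset has to carry both a good and a bad answer per prompt, which is exactly the kind of curation work the alchemists above spend their time on.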
To strengthen the point, two friends of the pod (Teknium was in the crowd sending us 👍), LDJ and Austin (aka Alignment Lab), were on stage with us and basically said that "it was pretty much the dark ages before Lilac", since something like the OpenOrca dataset is a whopping 4M rows of text.

Visualizations in the Garden

So what does Lilac actually look like? Here's a quick visualization of the top categories of text from OpenOrca's 4 million rows, grouped by category title and showing each cluster. So you can see here, Translation...
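To give a rough feel for what "cluster 4M rows and title the groups" means in practice, here's a minimal sketch of the general idea: embed each row, cluster the embeddings, and then title each cluster (by hand or with an LLM classifier). This is not Lilac's actual implementation, and the embedding model name is just a common small default I've picked for the example.

```python
# Minimal sketch of dataset clustering: embed rows, cluster, inspect groups.
# NOT Lilac's implementation; "all-MiniLM-L6-v2" is an assumed example model.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

rows = [
    "Translate the following sentence to French: 'The weather is nice today.'",
    "Translate 'good morning' into Spanish.",
    "Write a Python function that reverses a string.",
    "Fix the bug in this JavaScript snippet: console.log(x.lenght)",
    "Summarize the plot of Romeo and Juliet in two sentences.",
    "Summarize this news article about interest rates.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(rows)                     # one vector per row
labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)

# Group rows by cluster so a human (or an LLM classifier) can give each
# group a title like "Translation", "Coding", or "Summarization".
for cluster_id, count in Counter(labels).most_common():
    print(f"cluster {cluster_id} ({count} rows)")
    for text, label in zip(rows, labels):
        if label == cluster_id:
            print("   ", text[:60])
```

Scale that up from 6 rows to OpenOrca's 4 million and you see why a purpose-built, visual tool beats doing this ad hoc in a notebook.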