RAG at Scale: The problems you will encounter and how to prevent (or fix) them | S2 E4

Description

Hey! Welcome back. Today we look at how to get a RAG system ready for scale. We discuss the common problems that appear when you introduce more users and more requests to your system, and how to solve them. For this we are joined by Nirant Kasliwal, the author of fastembed. Nirant shares practical insights on metadata extraction, evaluation strategies, and emerging technologies like ColPali. This episode is a must-listen for anyone looking to level up their RAG implementations.

"Naive RAG has a lot of problems on the retrieval end and then there's a lot of problems on how LLMs look at these data points as well."

"The first 30 to 50% of gains are relatively quick. The rest 50% takes forever."

"You do not want to give the same answer about company's history to the co-founding CEO and the intern who has just joined."

"Embedding similarity is the signal on which you want to build your entire search is just not quite complete."

Key insights:

- Naive RAG often fails due to the limitations of embeddings and LLMs' sensitivity to input ordering.
- Query profiling and expansion:
  - Use clustering and tools like Latent Scope to identify problematic query types.
  - Expand queries offline and use parallel searches for better results.
- Metadata extraction:
  - Extract temporal, entity, and other relevant information from queries.
  - Use LLMs for extraction, with checks against libraries like Stanford NLP.
- User personalization:
  - Include user role, access privileges, and conversation history.
  - Adapt responses based on user expertise and readability scores.
- Evaluation and improvement:
  - Create synthetic datasets and use real user feedback.
  - Employ tools like DSPy for prompt engineering.
- Advanced techniques:
  - Route queries based on type and urgency.
  - Use smaller models (1-3B parameters) for easier iteration and error spotting.
  - Implement error handling and cross-validation for extracted metadata.

Nirant Kasliwal: X (Twitter), LinkedIn, Search in the LLM Era for AI Engineers (course)

Nicolay Gerold: LinkedIn, X (Twitter)

Keywords: query understanding, AI-powered search, LambdaMART, e-commerce ranking, networking, experts, recommendation, search
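The key insights mention using an LLM to extract metadata from queries, checking it against an NLP library, and cross-validating the result. The following is a minimal Python sketch of that cross-check for a single field (years), not the method discussed in the episode. It assumes the LLM call happens elsewhere and returns a dict with a hypothetical `years` key, and it uses a plain regex as a stand-in for a library such as Stanford NLP. All function and class names are illustrative.

```python
import re
from dataclasses import dataclass, field

# Stand-in for an NLP-library check (assumption): a regex that finds
# four-digit years between 1900 and 2099 in the raw query text.
YEAR_PATTERN = re.compile(r"\b(19\d{2}|20\d{2})\b")


def extract_years_rule_based(query: str) -> set[int]:
    """Deterministic extractor used to cross-check the LLM output."""
    return {int(match) for match in YEAR_PATTERN.findall(query)}


@dataclass
class ValidatedMetadata:
    years: set[int] = field(default_factory=set)       # years both extractors agree on
    conflicts: list[str] = field(default_factory=list)  # disagreements and errors to log or review


def validate_years(query: str, llm_metadata: dict) -> ValidatedMetadata:
    """Keep LLM-extracted years only when the rule-based check confirms them."""
    result = ValidatedMetadata()

    # Error handling: the LLM may return a malformed field (hypothetical "years" key).
    raw = llm_metadata.get("years", [])
    if not isinstance(raw, list):
        result.conflicts.append(f"malformed 'years' field: {raw!r}")
        raw = []

    llm_years: set[int] = set()
    for value in raw:
        try:
            llm_years.add(int(value))
        except (TypeError, ValueError):
            result.conflicts.append(f"non-numeric year from LLM: {value!r}")

    # Cross-validation: accept the intersection, record both kinds of disagreement.
    rule_years = extract_years_rule_based(query)
    result.years = llm_years & rule_years
    for year in sorted(llm_years - rule_years):
        result.conflicts.append(f"LLM year {year} not found in query")
    for year in sorted(rule_years - llm_years):
        result.conflicts.append(f"year {year} in query missed by LLM")
    return result


if __name__ == "__main__":
    query = "revenue reports from 2021 and 2022"
    llm_output = {"years": [2021, "2023"]}  # hypothetical LLM response with one wrong year
    meta = validate_years(query, llm_output)
    print(sorted(meta.years))  # [2021]
    print(meta.conflicts)      # ['LLM year 2023 not found in query', 'year 2022 in query missed by LLM']
```

Only values confirmed by both extractors are used as filters; disagreements are recorded instead of silently dropped, so they can be logged or routed for review. Relative expressions such as "last year" would surface as conflicts in this sketch; a fuller version would normalize them before comparing.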

More Episodes

From Ambiguous to AI-Ready: Improving Documentation Quality for RAG Systems | S2 E15

Documentation quality is the silent killer of RAG systems. A single ambiguous sentence might corrupt an entire set of responses. But the hardest part isn't fixing errors - it's finding them. Today we are talking to Max Buckley on how to find and fix these errors. Max works at Google and has built...

Published 11/21/24

BM25 is the workhorse of search; vectors are its visionary cousin | S2 E14

Ever wondered why vector search isn't always the best path for information retrieval? Join us as we dive deep into BM25 and its unmatched efficiency in our latest podcast episode with David Tippett from GitHub. Discover how BM25 transforms search efficiency, even at GitHub's immense scale. BM25,...

Published 11/15/24

How AI Is Built

Published 11/15/24