Limits of Embeddings: Out-of-Domain Data, Long Context, Finetuning

Limits of Embeddings: Out-of-Domain Data, Long Context, Finetuning (and How We're Fixing It) | S2 E5

Description

Text embeddings have limitations when it comes to handling long documents and out-of-domain data.

Today, we are talking to Nils Reimers. He is one of the researchers who kickstarted the field of dense embeddings, developed sentence transformers, started HuggingFace’s Neural Search team and now leads the development of search foundational models at Cohere. Tbh, he has too many accolades to list here.

We talk about the main limitations of embeddings:
- Failing out of domain
- Struggling with long documents
- Very hard to debug
- Hard to formalize what actually is similar

Are you still not sure whether to listen? Here are some teasers:
- Interpreting embeddings can be challenging, and current models are not easily explainable.
- Fine-tuning is necessary to adapt embeddings to specific domains, but it requires careful consideration of the data and objectives.
- Re-ranking is an effective approach to handle long documents and incorporate additional factors like recency and trustworthiness.
- The future of embeddings lies in addressing scalability issues and exploring new research directions.

Nils Reimers: LinkedIn, X (Twitter), Website, Cohere
Nicolay Gerold: LinkedIn, X (Twitter)

text embeddings, limitations, long documents, interpretation, fine-tuning, re-ranking, future research

00:00 Introduction and Guest Introduction
00:43 Early Work with BERT and Argument Mining
02:24 Evolution and Innovations in Embeddings
03:39 Contrastive Learning and Hard Negatives
05:17 Training and Fine-Tuning Embedding Models
12:48 Challenges and Limitations of Embeddings
18:16 Adapting Embeddings to New Domains
22:41 Handling Long Documents and Re-Ranking
31:08 Combining Embeddings with Traditional ML
45:16 Conclusion and Upcoming Episodes

How AI Is Built

Published 11/15/24