Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7

Description

From Problem to Requirements to Architecture.

In this episode, Nicolay Gerold and Jon Erik Kemi Warghed discuss the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with the ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.

"Don't overcomplicate what you're actually doing."

"Getting your basic programming software development skills down is super important to becoming a good data engineer."

"Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."

Key Takeaways:

Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.

Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.

Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.

Software-Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.

The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.

The Importance of Versioning Data: Data has to be versioned so you can easily track changes, revert to previous states if needed, and ensure reproducibility in your data pipelines.
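
To make the software-defined assets takeaway above a little more concrete, here is a minimal sketch of the declarative idea in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, the `materialize` helper, and the example assets `raw_orders` and `order_total` are all hypothetical names invented for illustration. The point it tries to show is that you declare which data should exist and what it depends on, and the execution order falls out of those declarations.

```python
import inspect
from typing import Callable, Dict, Optional

# Hypothetical registry (not Dagster): asset name -> function that produces it.
ASSETS: Dict[str, Callable] = {}


def asset(fn: Callable) -> Callable:
    """Register a function as the definition of the asset named after it."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize(name: str, cache: Optional[Dict[str, object]] = None) -> object:
    """Build an asset, first building the upstream assets named by its parameters."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        # Parameter names double as dependency declarations.
        upstream = [materialize(p, cache) for p in inspect.signature(fn).parameters]
        cache[name] = fn(*upstream)
    return cache[name]


@asset
def raw_orders():
    # Illustrative in-memory data standing in for a source table.
    return [{"id": 1, "amount": 20.0}, {"id": 2, "amount": 5.5}]


@asset
def order_total(raw_orders):
    # Depends on raw_orders simply by naming it as a parameter.
    return sum(order["amount"] for order in raw_orders)
```

Calling `materialize("order_total")` builds `raw_orders` first and then `order_total`, without the caller spelling out the steps. This toy version has no cycle detection, scheduling, or persistence, which are among the things a real orchestrator adds.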
lakeFS applies the concepts of Git to your data lake. This gives you the ability to create branches for different development environments, commit changes to specific versions, and merge branches together once changes have been tested and validated.

Jon Erik Kemi Warghed: LinkedIn
Nicolay Gerold: LinkedIn, X (Twitter)

Chapters

00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords
00:57 How to Choose the Right Tools: Considerations for startups and large companies
03:13 Evaluating Open Source Tools: Background checks and due diligence
07:52 Defining Data Governance: Transparency and understanding of data
10:15 The Importance of Data Governance: Challenges and solutions
12:21 Data Governance Tools: dbt and Dagster
17:05 The Impact of Dagster: Software-defined assets and declarative thinking
19:31 The Power of Software-Defined Assets: How Dagster differs from Airflow and Mage
21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management
26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines
28:47 The Importance of Tool Selection: Thinking about long-term sustainability
31:10 When to Adopt Orchestration: Identifying the need for orchestration tools

Send in a voice message: https://podcasters.spotify.com/pod/show/nicolaygerold/message

How AI Is Built

Published 11/15/24