Navigating the Modern Data Stack, Choosing the Right OSS Tools, From Problem to Requirements to Architecture | ep 7
Description
From Problem to Requirements to Architecture.
In this episode, Nicolay Gerold and Jon Erik Kemi Warghed survey the landscape of data engineering, sharing insights on selecting the right tools, implementing effective data governance, and leveraging powerful concepts like software-defined assets. They discuss the challenges of keeping up with an ever-evolving tech landscape and offer practical advice for building sustainable data platforms. Tune in to discover how to simplify complex data pipelines, unlock the power of orchestration tools, and ultimately create more value from your data.
"Don't overcomplicate what you're actually doing."
"Getting your basic programming software development skills down is super important to becoming a good data engineer."
"Who has time to learn 500 new tools? It's like, this is not humanly possible anymore."
Key Takeaways:
Data Governance: Data governance is about transparency and understanding the data you have. It's crucial for organizations as they scale and data becomes more complex. Tools like dbt and Dagster can help achieve this.
Open Source Tooling: When choosing open source tools, assess their backing, commit frequency, community support, and ease of use.
Agile Data Platforms: Focus on the capabilities you want to enable and prioritize solving the core problems of your data engineers and analysts.
Software Defined Assets: This concept, exemplified by Dagster, shifts the focus from how data is processed to what data should exist. This change in mindset can greatly simplify data orchestration and management.
The Importance of Fundamentals: Strong programming and software development skills are crucial for data engineers, and understanding the basics of data management and orchestration is essential for success.
The Importance of Versioning Data: Versioning your data lets you track changes, revert to previous states when needed, and ensure reproducibility in your pipelines. lakeFS applies the concepts of Git to your data lake: you can create branches for different development environments, commit changes as specific versions, and merge branches back together once changes have been tested and validated.
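The software-defined-assets idea from the takeaways above can be sketched in a few lines of plain Python: instead of scripting *how* data is processed step by step, you declare *what* data should exist and what it depends on, and an orchestrator derives the execution order. This is a conceptual sketch only, not Dagster's actual API; all names here are illustrative.

```python
# Conceptual sketch of software-defined assets: each asset declares
# what it produces and which upstream assets it needs, and a tiny
# "orchestrator" materializes dependencies first. Illustrative only,
# not Dagster's real API.
ASSETS = {}

def asset(deps=()):
    """Register a function as an asset with named dependencies."""
    def register(fn):
        ASSETS[fn.__name__] = (fn, tuple(deps))
        return fn
    return register

@asset()
def raw_orders():
    # In a real pipeline this would load from a source system.
    return [{"id": 1, "amount": 30}, {"id": 2, "amount": 70}]

@asset(deps=("raw_orders",))
def order_totals(raw_orders):
    # Depends on raw_orders; the orchestrator supplies its value.
    return sum(order["amount"] for order in raw_orders)

def materialize(name, cache=None):
    """Build an asset, recursively materializing its dependencies first."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, deps = ASSETS[name]
        cache[name] = fn(*(materialize(dep, cache) for dep in deps))
    return cache[name]

print(materialize("order_totals"))  # prints 100
```

The point of the declarative shift: you never write "run step A, then step B"; you only state that `order_totals` should exist and depends on `raw_orders`, and the run order falls out of the dependency graph.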
Jon Erik Kemi Warghed:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
Chapters
00:00 The Problem with the Modern Data Stack: Too many tools and buzzwords
00:57 How to Choose the Right Tools: Considerations for startups and large companies
03:13 Evaluating Open Source Tools: Background checks and due diligence
07:52 Defining Data Governance: Transparency and understanding of data
10:15 The Importance of Data Governance: Challenges and solutions
12:21 Data Governance Tools: dbt and Dagster
17:05 The Impact of Dagster: Software-defined assets and declarative thinking
19:31 The Power of Software Defined Assets: How Dagster differs from Airflow and Mage
21:52 State Management and Orchestration in Dagster: Real-time updates and dependency management
26:24 Why Use Orchestration Tools?: The role of orchestration in complex data pipelines
28:47 The Importance of Tool Selection: Thinking about long-term sustainability
31:10 When to Adopt Orchestration: Identifying the need for orchestration tools
---
Send in a voice message: https://podcasters.spotify.com/pod/show/nicolaygerold/message