Building Robust AI and Data Systems, Data Architecture, Data Quality, Data Storage | ep 10
Description
In this episode of "How AI is Built", data architect Anjan Banerjee provides an in-depth look at the world of data architecture and building complex AI and data systems. Anjan breaks down the basics using simple analogies, explaining how data architecture involves sorting, cleaning, and painting a picture with data, much like organizing Lego bricks to build a structure.
Summary by Section
Introduction
Anjan Banerjee, a data architect, discusses building complex AI and data systems
Explains the basics of data architecture using Lego and chat app examples
Sources and Tools
Identifying data sources is the first step in designing a data architecture
Pick the right tools to extract data based on use cases (block storage for images, time series DB, etc.)
Use one tool for most activities if possible, but specialized tools offer benefits
Multi-modal storage engines are gaining popularity (Snowflake, Databricks, BigQuery)
Airflow and Orchestration
Airflow is versatile but has a learning curve; good for orgs with Python/data engineering skills
For less technical orgs, GUI-based tools like Talend, Alteryx may be better
AWS Step Functions and managed Airflow are improving native orchestration capabilities
For multi-cloud, prefer platform-agnostic tools like Astronomer, Prefect, Airbyte
AI and Data Processing
ML at the point of data acquisition is key for data-intensive use cases: filtering at the source avoids storing and processing petabytes in the cloud
TinyML and edge computing enable ML inference on device (drones, manufacturing)
Cloud batch processing still dominates for user targeting, recommendations
Data Lakes and Storage
Storage choice depends on data types, use cases, cloud ecosystem
Delta Lake excels at data versioning and consistency; Iceberg at partitioning and metadata
Pulling data into a separate system is often needed for advanced analytics beyond what the source system supports
Data Quality and Standardization
"Poka-yoke" error-proofing of input screens is vital for downstream data quality
Impose data quality rules and unified schemas (e.g. UTC timestamps) during ingestion
Complexity arises with multi-region compliance (GDPR, CCPA) requiring encryption, sanitization
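As a rough illustration of imposing quality rules and a unified schema at ingestion, here is a minimal Python sketch. It is not from the episode: the field names (user_id, event, created_at), the REQUIRED_FIELDS set, and the to_utc and ingest helpers are all hypothetical, and a real pipeline would likely use a schema or validation library instead.

```python
from datetime import datetime, timezone

# Hypothetical schema: the fields every incoming record must carry.
REQUIRED_FIELDS = {"user_id", "event", "created_at"}


def to_utc(ts: str) -> str:
    """Parse an ISO 8601 timestamp with an explicit offset and return it in UTC."""
    parsed = datetime.fromisoformat(ts)
    if parsed.tzinfo is None:
        # Reject ambiguous input instead of guessing the source time zone.
        raise ValueError(f"timestamp has no UTC offset: {ts!r}")
    return parsed.astimezone(timezone.utc).isoformat()


def ingest(record: dict) -> dict:
    """Validate one record and return a copy with a UTC-normalized timestamp."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    cleaned = dict(record)
    cleaned["created_at"] = to_utc(record["created_at"])
    return cleaned


raw = {"user_id": "u1", "event": "login", "created_at": "2024-05-01T12:00:00+02:00"}
clean = ingest(raw)
print(clean["created_at"])
```

Rejecting timestamps without an offset, rather than assuming a zone, follows the same error-proofing idea as poka-yoke: the bad value is stopped at the boundary instead of being repaired downstream.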
Hot Takes and Wishes
Snowflake is overhyped; great UX but costly at scale. Databricks is preferred.
Automated data set joining and entity resolution across systems would be a game-changer
Anjan Banerjee:
LinkedIn
Nicolay Gerold:
LinkedIn
X (Twitter)
00:00 Understanding Data Architecture
12:36 Choosing the Right Tools
20:36 The Benefits of Serverless Functions
21:34 Integrating AI in Data Acquisition
24:31 The Trend Towards Single Node Engines
26:51 Choosing the Right Database Management System and Storage
29:45 Adding Additional Storage Components
32:35 Reducing Human Errors for Better Data Quality
39:07 Overhyped and Underutilized Tools
Data architecture, AI, data systems, data sources, data extraction, data storage, multi-modal storage engines, data orchestration, Airflow, edge computing, batch processing, data lakes, Delta Lake, Iceberg, data quality, standardization, poka-yoke, compliance, entity resolution