26: Building Data Engineering Pipelines at Scale (with Data Warehouse, Spark and Airflow)
Description
Imagine you are hanging out at a beach, watching the waves come and go and looking at all the shells on the sand. And you get an idea: how about collecting these shells and making necklaces to sell? Well, how would you go about doing this? Maybe you'd collect a few shells, make a small necklace, and show it to a friend. This is where we begin our journey of learning about data engineering pipelines.

Using the example of running a necklace business built on shells, we learn about the following data engineering concepts:

1. ETL (Extract, Transform, Load) vs. ELT (Extract, Load, Transform), and why data warehouses are great for analytics.
2. Spark for large-scale data processing, and where to host and run it.
3. Data orchestration using Airflow.

My blog post on Towards Data Science about moving from Pandas to Spark: https://towardsdatascience.com/moving-from-pandas-to-spark-7b0b7d956adb

Great book to learn about Spark: https://www.amazon.com/dp/1492050040/?tag=omnilence-20

Tools covered in the episode:

dbt: https://www.getdbt.com/
Databricks: https://databricks.com/
EMR: https://aws.amazon.com/emr/
AWS Redshift: https://aws.amazon.com/redshift/
Snowflake: https://www.snowflake.com/
Delta Lake: https://databricks.com/product/delta-lake-on-databricks
More Episodes
We talk with Michel Tricot, the founder and CEO of Airbyte, a Y Combinator-backed open source data integration startup. It has raised over $30M in capital and has been growing quite fast. It was a great conversation and I think you will enjoy it too. 🎉 We cover lots of things in the...
Published 10/11/21
In this episode, I'm excited to be talking with Jeff Bermant, the founder and CEO of Cocoon Mydata Rewards, a browser based on Chrome that pays people to use it! ✨  We talk about data ethics and privacy, and how Jeff believes that users should be paid for...
Published 08/04/21