All episodes of 10-Minute System Design

In this episode, we'll take a look at Meta’s ambitious approach to scaling large language models. We'll explore the shift from handling many smaller models for recommendation engines to building colossal generative AI models, and the immense challenges that come with it. From hardware and software optimizations to managing power and dealing with inevitable hardware failures, we'll break down the critical pieces that make Meta's infrastructure tick. What does it take to run systems this large...

Published 10/24/24

How Netflix Streams High-Quality Video

In this episode, let's explore how Netflix revamped their video processing pipeline, moving from a monolithic system to a microservices architecture. What drove such a major shift? You'll hear how their original platform, Reloaded, couldn’t keep up with Netflix’s rapid pace of innovation, and why Cosmos, their new system, is now the backbone of everything from streaming to studio operations. But what challenges did they face along the way? And is Cosmos truly the future-proof solution it...

Published 10/23/24

10-Minute System Design

Published 10/23/24

How Apple Stores Billions of Data in iCloud

In this episode, we'll explore the intricate system and architecture design behind Apple's iCloud. We'll break down how Apple seamlessly handles billions of users by combining Cassandra and FoundationDB to power iCloud's backbone. What prompted Apple to shift from Cassandra to FoundationDB, and how does this choice impact scalability and performance? Get a closer look at the architecture that makes iCloud tick, and discover how it enables such a smooth user experience. The surprising reason...

Published 10/22/24

How Uber Shows Nearby Drivers Quickly and Reliably

In this episode, we explore the system behind Uber's driver-matching functionality, capable of handling an incredible one million requests per second. We break down the key technologies that make it work, from H3, the hexagonal grid system for location indexing, to Ringpop, which scales services across servers. You'll hear about how GPS data is transformed into road segments, and how databases like Cassandra and Redis power this high-demand platform. Whether you're curious about large-scale...

Published 10/21/24

How Instagram Scaled to 2.5 Billion Users

In this episode, we'll learn how Instagram scaled to 2.5 billion users. We'll discuss the major challenges Instagram faced, from resource constraints to data consistency and performance, and unpack the innovative strategies the team used to tackle them. From replacing Python with more performant languages to leveraging Cassandra for distributed data storage, we'll learn how Instagram managed to keep things running smoothly at such massive scale. Curious how they did it? Tune in to hear how a...

Published 10/14/24

How Facebook Scaled Memcached

In this episode, we explore how Facebook engineers scaled Memcached, the open-source caching system, to handle billions of requests and trillions of items. We’ll break down the challenges they faced and the smart solutions they developed — from reducing latency to optimizing memory usage. Join us as we uncover how they transitioned from a single cluster to a distributed system spread across the globe, tackling data replication, load balancing, and more. If you’re curious about the inner...

Published 10/13/24

Spanner: Google's Globally-Distributed Database

In this episode, we explore another important piece of technology from Google: Spanner — a globally distributed database that reshapes how massive datasets are managed. We’ll talk about its unique architecture, including the TrueTime API, which solves clock uncertainty to ensure consistency across data centers. We’ll also cover Spanner’s concurrency control, two-phase commit, and lock-free read-only transactions. Plus, discover how Google’s ad platform, F1, leverages Spanner to handle...

Published 10/12/24

The Google File System

In this 10-minute episode, we explore the Google File System (GFS), a scalable, fault-tolerant distributed file system designed for Google’s vast data needs. Built on commodity hardware, GFS ensures high performance for many clients. We’ll cover key design principles like handling frequent component failures, large file operations, and atomic appends. We’ll also dive into its architecture—featuring a master server for metadata management and chunkservers for storage—along with data handling,...

Published 10/09/24

Dynamo: Amazon’s Highly Available Key-Value Store

In this episode, our hosts take a closer look at a groundbreaking research paper on Dynamo, Amazon’s innovative distributed data storage system. With a focus on availability over consistency, Dynamo employs cutting-edge techniques like consistent hashing and gossip-based failure detection to deliver high performance. Join us as we unpack the paper’s insights into its design and implementation, its real-world applications within Amazon, and the fascinating trade-offs between performance and...

Published 10/09/24

MapReduce: How Google Simplifies Large-Scale Data Processing

Join us in this episode as we dive into MapReduce. We’ll explore how it revolutionizes the way we process vast datasets on large clusters. With a focus on simplicity, the MapReduce framework abstracts complex tasks like data partitioning and fault tolerance, allowing users to easily define two essential functions: “Map” and “Reduce.” We’ll discuss real-world applications that showcase its power—from distributed grep to web link analysis. If you’re curious about how to harness the...

Published 10/09/24

Chubby: Google's Distributed Lock Service

In this episode, our hosts delve into the legendary research paper detailing the creation and implementation of Chubby, Google's innovative distributed lock service. Designed for large-scale, loosely-coupled systems, Chubby offers a reliable mechanism for synchronization, such as electing primary servers among peers. The paper explores the critical design choices prioritizing availability over raw performance, revealing the system's architecture, implementation intricacies, and essential...

Published 10/09/24

Bigtable: Google's Distributed Storage System

Imagine a revolutionary storage system that can handle petabytes of structured data across thousands of ordinary servers. This is Bigtable, a groundbreaking solution that redefines how structured data is managed at scale, enabling unparalleled scalability and flexibility. Join us as we uncover its real-world applications, from Google Analytics to Personalized Search, and the vital lessons learned in designing...

Published 10/09/24

Cassandra: A Decentralized Structured Storage System from Facebook

In this episode, our hosts delve into Cassandra, the distributed storage system developed at Facebook to tackle the immense challenges of managing structured data. Designed for high availability and scalability, Cassandra emerged from the need to support billions of daily writes for the Inbox Search feature. Join us as we explore this game-changing piece of tech that still influences distributed systems today.

Published 10/09/24

Hadoop: Yahoo's Distributed File System

In this episode, we take a closer look at the Hadoop Distributed File System (HDFS), a key part of the Hadoop framework that helps store and manage huge amounts of data. We’ll explore how HDFS spreads data across many affordable servers, making it both scalable and cost-effective. You’ll learn about its main components, like the NameNode and DataNodes, and how they work together. We’ll also discuss features that keep your data safe and ensure it moves efficiently. Join us as we touch on the...

Published 10/09/24

Kafka: LinkedIn's Distributed Messaging System

This episode focuses on Kafka, the distributed messaging system born at LinkedIn. Learn how Kafka was designed to tackle the massive streams of log data driving personalized recommendations, search algorithms, and real-time security. We'll explore how it outperforms traditional systems like ActiveMQ and RabbitMQ with its streamlined architecture, decentralized coordination, and focus on efficiency. Tune in to explore Kafka's unique design and how it’s becoming essential for modern data...

Published 10/09/24

Redis Distributed Lock

Ever wondered how multiple processes can safely share resources without stepping on each other's toes? In this episode, we'll talk about Redis's distributed lock and discover how it ensures mutual exclusion for shared resources across a network of Redis servers, allowing only one process at a time to gain access. We’ll delve into its safety and liveness properties that guarantee reliable lock management, even amidst failures. Join us as we unpack potential challenges like network partitions...

Published 10/09/24