System Design Class

Learn how to design and build large-scale distributed systems

How do you build real-world distributed systems?

I recall how hard it was to find the right information when I first started out. Some books on distributed systems were too theoretical to help me build real systems. Some felt like marketing material for a technology stack that was going to be obsolete in a few years. All I wanted was something pragmatic yet timeless. Only whitepapers of real systems satisfied me. But without a solid grasp of the fundamentals, I spent hours trying to connect the dots.

This is why I decided to create a video class to teach the fundamentals of large-scale distributed systems design. This class contains knowledge I have used over the years in the field to solve concrete problems, the kind that need to scale to millions of requests per second and billions of devices. But no matter what scale you are working on today, the core principles are universal.

After you complete this class, you are not going to look at a network call the same way. And you will start applying that knowledge from day one at your job and on personal projects. You will start wondering how real systems that you use every day work under the hood, create theories of your own, and compare them to actual designs. Armed with the core fundamentals, you will have the tools to understand technical whitepapers, design systems of your own, and nail interviews.

Class Content

Enough talk. Let’s take a look at the content.

Instability Sources
Integration Points
Unreliable Network
Slow Processes
Unexpected Load
Cascading Failures
01

Instability Sources

We will start our journey with the simple truth that at scale, anything that can go wrong will go wrong. Writing distributed code is different from writing code that runs on a single machine. If you thought multi-threading was hard, think again.

As a systems engineer, it is your job to bring order to the chaos, to build reliable abstractions on top of unreliable components. But, to do that, you first need to understand how systems fail. If you don’t know what can go wrong, how can you be prepared for it?

Networks
Latency and Throughput
Domain Name System
Transmission Control Protocol
Transport Layer Security
HyperText Transfer Protocol
Proxies and Load Balancing
Content Delivery Network
Polling and Streaming
02

Networks

Networking is the mother of all instability sources. It’s what gives a distributed system its name. You can’t build anything distributed until you have a solid understanding of how networks work.

Building fault-tolerant systems is only possible by leveraging general-purpose abstractions with known guarantees. The network stack does precisely that, and we will learn how.

Consistency
Consistency Models
Failure Detection
Leader Election
Consensus: Two Phase Commit
Consensus: Raft
Ordering
03

Consistency

Imagine some code that assigns a value to a variable. Then the same code reads the variable right after only to find out the write had no effect! Madness! But with eventual consistency, this is what can happen when one machine writes a value to a store and another, perhaps the same, reads it.
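To make the stale read concrete, here is a minimal sketch in Python. It assumes a toy store with one primary and one follower, where replication happens only when `replicate()` is called. The class and method names are hypothetical and chosen for illustration; real stores replicate in the background over the network.

```python
class EventuallyConsistentStore:
    """Toy store: writes go to the primary, reads go to a lagging follower."""

    def __init__(self):
        self.primary = {}
        self.follower = {}
        self.pending = []  # writes not yet copied to the follower

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_from_follower(self, key):
        # The follower may not have seen the latest writes yet.
        return self.follower.get(key)

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        for key, value in self.pending:
            self.follower[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 1)
stale = store.read_from_follower("x")  # read happens before replication
store.replicate()
fresh = store.read_from_follower("x")  # read happens after replication
```

The first read misses the write because the follower has not caught up; once replication completes, the same read returns the written value.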

This is where consistency guarantees come in. Guarantees define what can and can’t happen. Strong consistency guarantees make our lives easier. But to provide these guarantees, we need to find a way to make networked machines cooperate in harmony. In this chapter, we will explore how to achieve that by solving consensus.

04

Design Patterns

How do you build a system that is scalable, reliable, and available? “Using more machines,” I hear you say. Now that we know how to make them cooperate, we can take a look at how to use replication and sharding to guarantee the above properties.
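As a rough illustration of how sharding and replication fit together, the sketch below maps a key to a home node with a hash and then places copies on the following nodes. The function names and the modulo placement scheme are assumptions made for this example, not necessarily what the class covers.

```python
import hashlib


def node_for(key: str, num_nodes: int) -> int:
    """Pick the home node for a key using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def nodes_for(key: str, num_nodes: int, replication_factor: int) -> list[int]:
    """Return the home node plus the next nodes that hold replicas."""
    if replication_factor > num_nodes:
        raise ValueError("replication factor cannot exceed the number of nodes")
    first = node_for(key, num_nodes)
    return [(first + i) % num_nodes for i in range(replication_factor)]


placement = nodes_for("user:42", num_nodes=5, replication_factor=3)
```

Plain modulo hashing moves most keys when the number of nodes changes, which is one reason schemes such as consistent hashing are often used instead.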

Storage
Relational
Key-Value
Document
Graph
Configuration
05

Storage

Armed with the knowledge of how sharding and replication work, we can start creating real applications. We will dive into distributed storage engines and discuss how to build relational, key-value, and document engines on top of the concepts we have learned so far.

Stability Patterns
Timeouts
Retry
Circuit Breaker
Load Shedding
Rate Limiting
Health Endpoint
06

Stability Patterns

We are not yet out of the woods. The systems we build need to be robust against failures and unexpected events. Think of spikes of incoming requests and failing downstream dependencies. In this chapter, we will look into self-healing mechanisms that will guard our systems against these agents of chaos.
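One common way to protect a service from request spikes is a token bucket rate limiter. The sketch below is a minimal single-process version; the `TokenBucket` class, its parameters, and the injectable `clock` are assumptions made for this example, and the class may present rate limiting differently.

```python
import time


class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = capacity
        self.last = clock()

    def allow(self) -> bool:
        # Refill tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should reject or delay the request


# Usage with a fake clock: a burst of two is allowed, the third is rejected.
fake_time = [0.0]
bucket = TokenBucket(rate=1.0, capacity=2, clock=lambda: fake_time[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_time[0] = 1.0  # one second later, one token has been refilled
after_refill = [bucket.allow(), bucket.allow()]
```

A distributed rate limiter would also need to share the bucket state across machines, which this sketch does not attempt.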

Operations
Metrics
Logs
Alerts
Watchdog
Chaos Tests
SLOs
07

Operations

Your system stops working in the middle of the night, and you only find out through a Reddit post. Whoops. No matter how elegant your design is, if the system lacks monitoring and logging, it’s doomed to fail. Nobody wants to be responsible for a black box. In this chapter, you will learn best practices for instrumenting and operating large-scale systems.

Design Process
Requirements
Estimation
Interfaces
System Design
Risk Reduction
Bottleneck Optimization
08

Design Process

Congratulations, you have learned the fundamentals of how to build real-world distributed systems! In this final chapter, we are going to put all the pieces together and design a distributed database on top of a blob store.

Teaser Lesson

Check out one of the videos in the Stability Patterns module explaining how rate limiting works.

The Author

Roberto Vitillo

Hi! My name is Roberto Vitillo. I have over 15 years’ experience in the tech industry as a software engineer, tech lead, and manager.

In 2017, I joined Microsoft to work on an internal data platform as a SaaS product. Since then, I have helped launch two public SaaS products, Product Insights and PlayFab. The data pipeline I am responsible for is one of the largest in the world. It processes millions of events per second from billions of devices worldwide.

Before that, I worked for Mozilla, where I wore different hats, from performance engineer to data platform engineer. What I am most proud of is having set the direction of the data platform from its very early days and built a large part of it, including the team.

Interested?

Sign up to be notified when the class is released.

Frequently Asked Questions

Can I use this class to prepare for the system design interview?

The best way to breeze through a system design round is to learn how to design real systems. Who would have thought, right? Unlike algorithmic puzzles, it’s not an “interview-only” skill. It requires hard work and experience. But if you approach it the right way, then you will add a powerful tool to your toolbelt. One that will make you stand out from the crowd.

When I say system design, I mean the distributed kind. It used to be that large-scale system design questions were only asked by the likes of Amazon, Google, and Microsoft. But that’s no longer true. The reality is that nowadays we are all distributed systems engineers. And we need to understand the implications of building complex systems out of a networked mesh of simpler ones.

If you are a junior engineer, then you might be able to wing the design round and still get an offer. If you are a senior engineer, then you need to be able to design complex systems and dive deep into any vertical. You can be a world champion in balancing trees, but if you fail the design round, you are out. If you just meet the bar, then don’t be surprised when your offer is well below what you were expecting.