If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock.

— Kleppmann, pg. 343

As an engineer with a fair amount of professional experience, I found Designing Data-Intensive Applications by Martin Kleppmann to be extremely illuminating. It's a comprehensive overview of modern data storage and data processing technology, with a little bit of history thrown in.

I was already familiar with many of the concepts and tools in the book, but it was interesting to read about them in a broader context. Kleppmann takes time to describe the rationale behind each software framework or tool he introduces, including the problems it solves that existing technologies didn't and the trade-offs it makes. This helped improve my mental model and solidify my overall understanding.

Kleppmann sticks to a particular topic in each chapter — concepts like data encoding, partitioning, or stream processing. As such, in theory this book could be read piecewise; every chapter stands on its own, like a textbook. But Kleppmann did a good job organizing the topics so that reading the book end-to-end is worthwhile too. The lessons of each chapter build upon those that came before in complementary ways.

[Book cover: Designing Data-Intensive Applications]

The book is divided into three parts. Part I starts with some fundamental concepts of data systems as they relate to a single machine — how you model, store, and query data. Part II moves beyond a single computer, introducing a new host of problems: how to encode data, replicate it, partition it, and achieve consistency across a distributed system (one of the scarier words in software engineering). In the final part, Kleppmann focuses on heterogeneous data systems: how data is stored and transferred across disparate datastores, indexes, caches, streams, and more, and some best practices for building such systems.

All of these topics require familiarity with the material covered in earlier chapters to fully appreciate the intricacies of the problems being solved. You continuously zoom out to higher levels of abstraction as the book progresses. That's something I really liked about Designing Data-Intensive Applications.

Next I will share 3 lessons that stood out to me after reading the book. Hopefully you will find something useful in my brief summaries.

1. There's no such thing as a schemaless database

A large portion of this book is dedicated to databases, obviously. Chapter 2 covers how to model and store your data in different types of databases. In chapter 3, Kleppmann dives into how databases are actually implemented, particularly relational / SQL-style databases. But he also introduces several other classes of database, generally grouped under the umbrella term "NoSQL", which just means "not SQL". Document-oriented, wide-column, key-value, and graph databases are all alternatives meant for different use cases.

Several of these databases are quite popular, for example MongoDB, a document database that stores unstructured JSON in named collections. I've always considered this type of database to be schemaless since each JSON document in a collection can have completely different fields. But Kleppmann explains that it should be considered schema-on-read instead. This is in contrast to a SQL database, which is schema-on-write because it enforces a predefined schema when you attempt to add or update records.

Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read)

— Kleppmann, pg. 51

I think this is an important distinction to make, and it makes sense when you think about it. Stored data is of no use if it can't be parsed into some known set of fields, so at some point a schema needs to be "applied". Sure, you can add if-exists checks to every field to avoid making any assumptions, but you could do the same thing in SQL by making every column nullable.

Schema-on-read is analogous to runtime type checking in programming languages, whereas schema-on-write is similar to compile-time type checking.

The schema-on-read approach is advantageous when you have little control over the data your system is storing, or when that data changes frequently over time. It's easier to change your schema if it lives in your application code. This is why MongoDB, and more broadly any schema-on-read approach, is generally considered more flexible and less error-prone as your application changes.
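To make the distinction concrete, here's a minimal sketch (in Python, with a made-up document shape, not any particular MongoDB API) of what schema-on-read looks like: the structure is interpreted by the reading code, not enforced by the database.

```python
# Schema-on-read: the datastore will happily hold documents of any shape;
# the *reading* code applies the implicit schema, including handling older
# documents that predate newer fields. All names here are hypothetical.

documents = [
    {"name": "Ada", "email": "ada@example.com"},                   # older shape
    {"name": "Grace", "contact": {"email": "grace@example.com"}},  # newer shape
]

def read_email(doc):
    """The implicit schema, interpreted at read time."""
    if "contact" in doc:                  # newer documents nest contact info
        return doc["contact"].get("email")
    return doc.get("email")               # fall back to the older flat shape

for doc in documents:
    print(doc["name"], read_email(doc))
```

The schema-on-write equivalent would be a migration (an ALTER TABLE plus a backfill, say) before any new-shaped data is allowed in at all.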

2. Make reads do more work to make writes easier

Read-heavy systems are very common in software, especially web applications. I'm personally more accustomed to optimizing read efficiency, using techniques such as database indexes. A database index makes writes slower (because every write also has to update the index) but reads much faster.
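As a toy illustration of that trade-off (a hypothetical in-memory "table" with a hash index, not how a real database implements it), every write now does extra bookkeeping so that reads can skip the full scan:

```python
table = []   # rows, in insertion order
index = {}   # hypothetical hash index: id -> position in `table`

def insert(row):
    table.append(row)                   # the write itself...
    index[row["id"]] = len(table) - 1   # ...plus the extra index maintenance

def lookup(row_id):
    pos = index.get(row_id)             # O(1) instead of scanning every row
    return table[pos] if pos is not None else None

insert({"id": 42, "name": "Ada"})
print(lookup(42))
```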

But sometimes, the system you're designing needs to handle extremely high write throughput. In this case, we want to shift some of the work back to the read path in order to make the write path faster.

For instance, in chapter 5 Kleppmann covers database replication. One style he describes is Dynamo-style replication, named after a set of techniques developed at Amazon for its in-house Dynamo system (not to be confused with the DynamoDB service, which has a different architecture). Other popular databases like Cassandra and Voldemort were modelled on Dynamo. These databases are leaderless and embrace eventual consistency in order to achieve high write throughput.

One technique that stood out to me is called read repair:

When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica.

— Kleppmann, pg. 199

Because Dynamo-style databases use quorums for consistency, a read queries several of the nodes responsible for a given key, and any outdated values it finds along the way get repaired. The client (or a coordinator node) writes the newer value back to the stale replicas before completing the read. This is an example of doing more work during reads in order to make writes faster.
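Here's a rough sketch of the idea (with replicas modelled as plain dicts and versions as integers; real systems use vector clocks, hinted handoff, and so on): read from several replicas, keep the highest-versioned value, and write it back to any replica that returned something older.

```python
replicas = [
    {"user:2345": (7, "new value")},   # replica 1
    {"user:2345": (7, "new value")},   # replica 2
    {"user:2345": (6, "old value")},   # replica 3 holds a stale version
]

def quorum_read(key, nodes):
    responses = [(i, node.get(key)) for i, node in enumerate(nodes)]
    # Keep the most recent (version, value) pair among the responses.
    latest_version, latest_value = max(v for _, v in responses if v is not None)
    # Read repair: push the newest value back to any stale replica.
    for i, v in responses:
        if v is not None and v[0] < latest_version:
            nodes[i][key] = (latest_version, latest_value)
    return latest_value

print(quorum_read("user:2345", replicas))  # "new value"
print(replicas[2])                         # replica 3 has been repaired
```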

Another, more general principle for increasing write throughput is to store data as an immutable log of append-only records. Appending to a single log is computationally cheap and can be scaled in many ways. This technique comes up countless times throughout the book. Change data capture via the binlog, event sourcing, batch and stream processing — these are all examples of systems that write data as an append-only log, leaving downstream consumers to parse it and derive whatever form of the data their use case needs. More work is required when reading (or by intermediary systems that construct the data in an appropriate form), to the benefit of allowing very high rates of write operations.
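A minimal sketch of the pattern, with a hypothetical account-balance view derived from the log: writers only ever append immutable events, and a downstream consumer does the extra work of folding the log into the shape it needs.

```python
# Writers append immutable events; nothing is ever updated in place.
event_log = []

def append_event(event):
    event_log.append(event)        # cheap, sequential write

append_event({"account": "a1", "type": "deposit",  "amount": 100})
append_event({"account": "a1", "type": "withdraw", "amount": 30})
append_event({"account": "a2", "type": "deposit",  "amount": 50})

# A downstream consumer does the extra work at read time, deriving the
# view it needs (current balances) by replaying the log.
def derive_balances(log):
    balances = {}
    for e in log:
        sign = 1 if e["type"] == "deposit" else -1
        balances[e["account"]] = balances.get(e["account"], 0) + sign * e["amount"]
    return balances

print(derive_balances(event_log))   # {'a1': 70, 'a2': 50}
```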

3. Time is an illusion

One concept that Kleppmann spends a great deal of ...*ahem*... time on is dealing with time in a distributed system. It turns out to be a tricky business; some may even go so far as to call time an illusion.

Having a notion of time is critical for so many algorithms and mechanisms in software applications, so understanding its edge cases and complexities is important. Kleppmann spends a good portion of chapter 8 just discussing the myriad ways that assumptions about time can be wrong.

Even on a single computer, time is an illusion. There are generally two different clocks available for you to use (and mess up). The time-of-day clock represents "real-world" time and is what you use when you need a timestamp, while the monotonic clock counts from an arbitrary starting point but is guaranteed to always move forward (at approximately one second per second). You use the monotonic clock when you want to measure the duration between two events.
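In Python, for example, the two clocks are exposed separately (time.time() versus time.monotonic()); a quick sketch of which to use when:

```python
import time

# Time-of-day (wall-clock) time: a "real world" timestamp, subject to NTP
# adjustments, so it can jump backwards. Use it when you need a timestamp.
timestamp = time.time()

# Monotonic clock: counts from an arbitrary point but never goes backwards.
# Use it to measure the duration between two events.
start = time.monotonic()
time.sleep(0.1)                      # ...do some work...
elapsed = time.monotonic() - start

print("wall-clock timestamp:", timestamp)
print("elapsed:", round(elapsed, 3), "seconds")
```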

Once multiple machines are involved, simple questions like "what time is it?" and "what happened first?" become deceptively hard to answer. For instance, a common way of dealing with write conflicts is the "last write wins" rule: when two machines try to modify the same record at the same time, just keep the write that happened later. The problem is, how do you determine which write happened last? If you use the time you received the writes, you risk violating causality, since a write could have been delayed for any number of reasons. If the clients generate timestamps themselves, you suddenly need to deal with differences in their local clocks. If a node has a lagging clock, all of its writes might silently be overwritten for a while before anyone notices.
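A tiny sketch of that last failure mode, assuming last-write-wins is resolved purely by client-supplied timestamps: node B's write is causally newer, but its lagging clock makes it lose.

```python
# Last write wins, resolved purely by client-supplied timestamps.
def lww_merge(writes):
    # writes: list of (timestamp, value); the "latest" timestamp wins.
    return max(writes)[1]

# Node B actually wrote *after* node A, but node B's clock lags by 3 seconds.
write_a = (1700000010.0, "value from A")
write_b = (1700000007.0, "value from B (causally newer)")

print(lww_merge([write_a, write_b]))  # "value from A": causality violated
```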

So, to make sure all the nodes in our system agree on the time, we use the Network Time Protocol (NTP) to periodically synchronize their clocks against a highly accurate time source. But, like any network communication, NTP is susceptible to a number of failure modes of its own. I won't detail them here.

Leap seconds are another good example of time's illusory nature; they have crashed entire systems before. Nowadays leap seconds are typically handled by having the NTP server "lie" via a process called smearing, which gradually applies the extra second over the course of a day. If you can't trust your NTP server, who can you really trust?
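As a back-of-the-envelope sketch (the 24-hour window and linear ramp are illustrative; real smearing implementations vary), the extra second gets doled out a few microseconds at a time:

```python
# Spread one extra leap second linearly over a 24-hour window, so clients
# never observe a 61-second minute or a repeated second.
WINDOW = 24 * 60 * 60        # 86,400 real seconds

def smeared_offset(seconds_into_window):
    """Fraction of the leap second that has been applied so far (0.0 to 1.0)."""
    return min(max(seconds_into_window / WINDOW, 0.0), 1.0)

print(smeared_offset(WINDOW / 2))   # 0.5 -> the server is "lying" by half a second
print(1 / WINDOW)                   # each second is stretched by ~11.6 microseconds
```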

I think the complexity of time is emblematic of distributed systems as a whole. You can only reason so much about how a system of computers will behave, and about every possible way things can go wrong, before you start running into the limits of quantum physics and the notion of truth itself! It can be overwhelming!

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

— Kleppmann, pg. 330

Comforting words, Martin.

Last Words

The last chapter of the book is a departure from the rest. It's still technical, but it's a forward-looking take on how data system technology might evolve. Kleppmann shares his personal views on what makes good system design and how best to leverage different tools and paradigms for today's data needs.

In particular, Kleppmann argues that keeping data in sync across systems is best achieved through log-based derived data rather than by trying to implement distributed transactions:

In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.

— Kleppmann, pg. 542

In other words, datastores should broadcast changes in a durable and persistent way for other systems to consume at their own pace. This naturally entails dealing with eventual consistency when reading from derived sources, but Kleppmann believes that's a more manageable problem to solve compared to the performance and complexity concerns of coordinating synchronous data updates. From everything I learned by reading this book, I understand why he believes that.
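In code, the shape of that idea might look something like this (a hypothetical change log and search index, not any particular tool's API): the system of record appends every change to a durable log, and each derived system consumes it at its own pace, catching up whenever it can.

```python
# The system of record emits every change to a durable log; each derived
# system (a cache, a search index, ...) consumes that log at its own pace.
change_log = [
    {"offset": 0, "op": "upsert", "id": 1, "doc": {"title": "DDIA", "author": "Kleppmann"}},
    {"offset": 1, "op": "upsert", "id": 2, "doc": {"title": "SICP", "author": "Abelson"}},
    {"offset": 2, "op": "delete", "id": 2},
]

search_index = {}      # a derived view; eventually consistent with the log
consumer_offset = 0    # how far this consumer has read so far

def catch_up(log):
    global consumer_offset
    for change in log[consumer_offset:]:
        if change["op"] == "upsert":
            search_index[change["id"]] = change["doc"]
        elif change["op"] == "delete":
            search_index.pop(change["id"], None)
        consumer_offset = change["offset"] + 1

catch_up(change_log)
print(search_index)    # {1: {'title': 'DDIA', 'author': 'Kleppmann'}}
```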

For example, some "classic" data constraints, such as "two people can't book the same hotel room", might not need to be so inviolable. Fixing those conflicts asynchronously and issuing a notification to the inconvenienced user (plus a coupon or some other small compensation) is an acceptable trade-off your business could consider.
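A sketch of what that could look like (entirely hypothetical booking records): accept writes optimistically, detect the conflict after the fact, and compensate the unlucky customer instead of coordinating every write up front.

```python
# Accept bookings optimistically; detect double-bookings later and
# compensate the unlucky customer instead of coordinating every write.
bookings = [
    {"room": 101, "night": "2024-05-01", "user": "alice", "ts": 1},
    {"room": 101, "night": "2024-05-01", "user": "bob",   "ts": 2},  # conflict
]

def resolve_conflicts(all_bookings):
    winners, apologies = {}, []
    for b in sorted(all_bookings, key=lambda b: b["ts"]):
        key = (b["room"], b["night"])
        if key in winners:
            apologies.append(b["user"])   # notify + coupon, handled asynchronously
        else:
            winners[key] = b
    return winners, apologies

confirmed, to_compensate = resolve_conflicts(bookings)
print(to_compensate)   # ['bob'] gets an apology email and a coupon
```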

The book ends with Kleppmann examining, from a broader societal perspective, the consequences of the enormous data processing systems that have been built over the last 20 years. The data involved in the majority of these systems is data about us, the users. Privacy and data consent are paramount concerns to wrangle with as these systems get better and more accurate. I've previously written about the current state of consumer privacy in my review of The Age of Surveillance Capitalism, so I won't go into more detail here.

Kleppmann also talks about predictive analytics and the world of data science. These days, a machine learning model is almost always going to be one of the consumers of a data-intensive application. Machine learning models usually provide automated decision making or predictions. Kleppmann ponders how accountability will be assessed in this world. For instance, he shares his thoughts on bias within data:

it seems ridiculous to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it. Yet this belief often seems to be implied by proponents of data-driven decision making, an attitude that has been satirized as “machine learning is like money laundering for bias”.

— Kleppmann, pg. 590

Part of our role as engineers is to have "moral imagination", as Kleppmann puts it, and a desire for the systems we build to improve upon the past. These are all novel issues we are encountering in the information age, and they have broad societal implications. Engineers have a big role to play in helping to improve the technology and algorithms underpinning the software that runs our lives.