Sunday, June 23 2024

Last Updated: 24-06-2024

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock.

— Kleppmann, pg. 343

As an engineer with a fair amount of professional experience, I found Designing Data-Intensive Applications by Martin Kleppmann to be extremely illuminating. It's a comprehensive overview of modern data storage and data processing technology, with a little bit of history thrown in.

I was already familiar with many of the concepts and tools in the book but it was interesting to read about them in a broader context. Kleppmann takes time to describe the rationale behind each software framework or tool he introduces, including the problems it solved that existing technologies didn't and what trade-offs it makes. This helped improve my mental model and solidify my overall understanding.

Kleppmann sticks to a particular topic in each chapter: concepts like data encoding, partitioning, or stream processing. As such, in theory this book could be read piecewise; every chapter stands on its own like a textbook. But Kleppmann did a good job organizing the topics so that reading it end-to-end is worthwhile too. Each chapter builds on the lessons of the previous ones in complementary ways.

Designing Data-Intensive Applications Book Cover

The book is divided into three parts. Part I starts with some fundamental concepts of data systems as they relate to a single machine: how you model, store, and query data. Then we move beyond a single computer in Part II and a new host of problems is introduced, like how to encode data, replicate it, partition it, and achieve consistency across a distributed system (one of the scarier words in software engineering). In the final section of the book, Kleppmann focuses on heterogeneous systems of data storage: how data is stored and transferred across disparate datastores, indexes, caches, streams and more, along with some best practices for building such systems.

All these topics require a familiarity with the topics covered in previous chapters to fully understand the intricacies of the problems being solved. You continuously zoom out to higher levels of abstraction as the book progresses. That's something I really liked about Designing Data-Intensive Applications.

Next I will share 3 lessons that stood out to me after reading the book. Hopefully you will find something useful in my brief summaries.

1. There's no such thing as a schemaless database

A large portion of this book is dedicated to databases, obviously. Chapter 2 covers how to model and store your data in different types of databases. In chapter 3, Kleppmann dives into how databases are actually implemented, particularly relational / SQL-style databases. But he also introduces several other database classes, generally grouped under the umbrella term "NoSQL", which just means not SQL. Document-oriented, wide-column, key-value, and graph databases are all alternatives meant for different use cases.

Several of these databases are quite popular, for example MongoDB, a document database that stores unstructured JSON in named collections. I've always considered this type of database to be schemaless since each JSON document in a collection can have completely different fields. But Kleppmann explains that it should be considered schema-on-read instead. This is in contrast to a SQL database, which is schema-on-write because it enforces a predefined schema when you attempt to add or update records.

Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read)

— Kleppmann, pg. 51

I think this is an important distinction to make, and makes sense when you think about it. Stored data has no use if it can't be parsed into some known set of fields, so of course at some point a schema needs to be "applied". Sure, you can add if-exists checks to every field to avoid making any assumptions, but the same thing could be done with SQL by making every column nullable.

Schema-on-read is analogous to runtime type checking in programming languages, whereas schema-on-write is similar to compile-time type checking.
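To make the schema-on-read idea concrete, here is a minimal Python sketch. The document shapes, the `User` type, and `parse_user` are hypothetical names made up for illustration; nothing here is MongoDB's API. The stored documents disagree about their fields, and the reading code is where the implicit schema gets applied.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class User:
    name: str
    email: Optional[str] = None


def parse_user(raw: str) -> User:
    """Apply the implicit schema at read time (hypothetical document shapes)."""
    doc = json.loads(raw)
    if "name" in doc:
        name = doc["name"]
    else:
        # Assumed older document shape that stored the name in two fields.
        name = f"{doc['first_name']} {doc['last_name']}"
    # A missing email is tolerated, much like a nullable column would be.
    return User(name=name, email=doc.get("email"))


# Two documents from the same collection with different fields.
stored = [
    '{"name": "Ada Lovelace", "email": "ada@example.com"}',
    '{"first_name": "Alan", "last_name": "Turing"}',
]
users = [parse_user(raw) for raw in stored]
```

The store accepted both documents without complaint; the cost shows up in the reader, which has to know about every shape that was ever written.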

The schema-on-read approach is advantageous if you have little control over the data your system is storing and it may change frequently over time. It's easier to change your schema if it's part of your application code. This is why MongoDB, and more broadly any schema-on-read approach, is generally considered more flexible and less error-prone as your application changes.

2. Make reads do more work to make writes easier

Read-heavy systems are very common in software, especially web applications. I'm personally more accustomed to optimizing read efficiency, using techniques such as database indices. A database index will make writes slower (because they need to update the index) but reads much faster.

But sometimes, the system you're designing needs to handle extremely high write throughput. In this case, we want to shift some of the work back to the read path in order to make the write path faster.

For instance, in chapter 5 Kleppmann covers database replication. One family of databases is known as Dynamo-style, after a set of techniques developed at Amazon for its in-house Dynamo system (which, confusingly, is not the same thing as the AWS DynamoDB service). Other popular databases like Cassandra and Voldemort were modelled on Dynamo. These databases are leaderless and settle for eventual consistency to achieve high write throughput.

One technique that stood out to me is called read repair:

When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica.

— Kleppmann, pg. 199

Because Dynamo-style databases use quorums for consistency between nodes, a read queries several of the nodes responsible for a given key, which gives it the chance to spot outdated values. The client (or a coordinator node acting on its behalf) "repairs" those outdated values by writing the newest version back to the stale replicas. This is an example of doing more work during reads in order to make writes faster.
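Here is a rough Python sketch of read repair, mirroring the example in the quote above. The `Replica` and `Versioned` classes and the single integer version number are simplifying assumptions of mine; real Dynamo-style stores track concurrent writes with richer version metadata.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Versioned:
    version: int
    value: str


class Replica:
    """Hypothetical in-memory replica holding versioned values."""

    def __init__(self) -> None:
        self.store: Dict[str, Versioned] = {}

    def get(self, key: str) -> Optional[Versioned]:
        return self.store.get(key)

    def put(self, key: str, item: Versioned) -> None:
        # Only accept a write that is newer than what is already stored.
        current = self.store.get(key)
        if current is None or item.version > current.version:
            self.store[key] = item


def read_with_repair(replicas: List[Replica], key: str) -> Optional[Versioned]:
    """Read from every replica, return the newest value, fix stale copies."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    found = [item for _, item in responses if item is not None]
    if not found:
        return None
    newest = max(found, key=lambda item: item.version)
    for replica, item in responses:
        if item is None or item.version < newest.version:
            replica.put(key, newest)  # the "repair" step
    return newest


# Replicas 1 and 2 have version 7; replica 3 is stuck on version 6.
r1, r2, r3 = Replica(), Replica(), Replica()
for replica in (r1, r2):
    replica.put("user:2345", Versioned(7, "new"))
r3.put("user:2345", Versioned(6, "old"))

result = read_with_repair([r1, r2, r3], "user:2345")
```

After the read, replica 3 holds version 7 as well. The write path never had to wait for it.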

Another, more general principle for increasing write throughput is to store data as an immutable, append-only log of records. Appending to a single ledger is relatively cheap computationally and can be scaled in many ways. Kleppmann touches on this technique countless times throughout the book. Change data capture via the binlog, event sourcing, batch and stream processing: these are all examples of systems that write data as an append-only log, which downstream consumers parse to derive the data they need for their use case. More work is needed when reading (or by intermediary systems that construct the data in an appropriate form), in exchange for very high rates of write operations.
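A small Python sketch of the idea, assuming a made-up event shape of (account, amount delta). `AppendOnlyLog` and `derive_balances` are hypothetical names, not something from the book. The write path only appends; the reader does the work of folding events into the state it cares about.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int]  # (account, amount delta); hypothetical event shape


class AppendOnlyLog:
    """Writes only ever append; existing records are never modified."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, event: Event) -> int:
        self._events.append(event)
        return len(self._events) - 1  # offset of the new record

    def read_from(self, offset: int = 0) -> List[Event]:
        return list(self._events[offset:])


def derive_balances(events: List[Event]) -> Dict[str, int]:
    """The read side folds the events into the view it needs."""
    balances: Dict[str, int] = {}
    for account, delta in events:
        balances[account] = balances.get(account, 0) + delta
    return balances


log = AppendOnlyLog()
log.append(("alice", 100))
log.append(("bob", 50))
last_offset = log.append(("alice", -30))

balances = derive_balances(log.read_from())
```

Nothing about the write knows what a "balance" is. A different consumer could derive a completely different view from the same log.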

3. Time is an illusion

One concept that Kleppmann spends a great deal of ...*ahem*... time on is dealing with time in a distributed system. It turns out to be a tricky business; some may even go so far as to call time an illusion.

Having a notion of time is critical for so many algorithms and mechanisms in software applications, so understanding its edge cases and complexities is important. Kleppmann spends a good portion of chapter 8 just discussing the myriad ways that assumptions about time can be wrong.

Even for a single computer, time is an illusion. There are generally two different clocks available for you to use (or mess up). The time-of-day clock represents the "real-world" time and is used when you need a timestamp, while the monotonic clock is based on an arbitrary time, but is guaranteed to always increase (at approximately 1 second per second). You use the monotonic clock when you want to measure the duration between two events.
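In Python, the two clocks are exposed as `time.time()` (time-of-day) and `time.monotonic()`. A minimal sketch of using each for its intended job; the `timed` helper is just a name I made up:

```python
import time


def timed(fn):
    """Return (result, elapsed_seconds), measured with the monotonic clock."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


# Time-of-day clock: seconds since the Unix epoch. Fine for a timestamp,
# but it can jump forwards or backwards when the system clock is adjusted.
created_at = time.time()

# Monotonic clock: the absolute value is meaningless, but the difference
# between two readings is a safe way to measure a duration.
_, elapsed = timed(lambda: time.sleep(0.05))
print(f"created_at={created_at:.0f} elapsed={elapsed:.3f}s")
```

Subtracting two `time.time()` readings usually works too, right up until the clock gets adjusted in the middle of your measurement.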

Once multiple machines are involved, simple questions like "what time is it" and "what happened first" become deceptively hard to answer. For instance, a common way of dealing with write conflicts is via the "Last Write Wins" mechanism. When two machines want to modify the same record at the same time, just choose the write that happened later. The problem is, how do you determine which write happened last? If you use the time you received the writes, then you risk violating causality since a write could've been delayed for any number of reasons. If the clients generate timestamps themselves, you suddenly need to deal with differences in their local clocks. If a node has a lagging clock, then all its writes might be overwritten for a while before anyone notices.
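The lagging-clock problem is easy to reproduce. Below is a toy Python sketch with hypothetical names and made-up numbers: node B really writes half a second after node A, but B's clock runs two seconds behind, so Last Write Wins picks the wrong write.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Write:
    node: str
    value: str
    timestamp: float  # taken from the writing node's own clock


def last_write_wins(writes: List[Write]) -> Write:
    """Resolve a conflict by keeping the write with the highest timestamp."""
    return max(writes, key=lambda w: w.timestamp)


B_CLOCK_SKEW = -2.0  # assumption: node B's clock lags by two seconds

# Real order of events: A writes at t=100.0, then B writes at t=100.5.
write_a = Write(node="A", value="first", timestamp=100.0)
write_b = Write(node="B", value="second", timestamp=100.5 + B_CLOCK_SKEW)

winner = last_write_wins([write_a, write_b])
print(winner)
```

The winner is A's write, even though B's happened later. B's update is silently dropped and nothing in the system reports an error.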

So, to make sure all the nodes in our system have the right time, we use the Network Time Protocol (NTP) to periodically synchronize all the clocks using a super accurate time source. But, like any network communication, NTP is also susceptible to a number of fault modes. I won't detail them here.

Leap seconds are another good example of time's illusory nature. Leap seconds have crashed entire systems before. Nowadays leap seconds are often handled by having the NTP server "lie" via a process called smearing, which gradually applies the extra second over an entire day. If you can't trust your NTP server, who can you really trust?
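As a back-of-the-envelope Python sketch, assume the extra second is spread linearly over a 24-hour window (real smearing schemes differ in their exact window and shape, so treat both as assumptions):

```python
SMEAR_WINDOW_SECONDS = 24 * 60 * 60  # assumption: smear over one full day
LEAP_SECOND = 1.0


def smear_offset(seconds_into_window: float) -> float:
    """How much of the leap second has been applied so far (linear smear)."""
    clamped = min(max(seconds_into_window, 0.0), SMEAR_WINDOW_SECONDS)
    return LEAP_SECOND * clamped / SMEAR_WINDOW_SECONDS


halfway = smear_offset(12 * 60 * 60)
per_second = smear_offset(1)
print(f"halfway={halfway}s, applied per second={per_second * 1e6:.2f} microseconds")
```

Under these assumptions every smeared second is off by roughly 11.6 microseconds, which is small enough that most software never notices the lie.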

I think the complexity of time is emblematic of distributed systems as a whole. You can only reason so much about how a system of computers will behave and every possible way things can go wrong before you start diving into the limits of quantum physics and the notion of truth itself! It can be overwhelming!

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

— Kleppmann, pg. 330

Comforting words, Martin.

Last Words

The last chapter of the book is a departure from the rest. It's still technical, but it's a forward-looking view of how data system technology might evolve in the future. Kleppmann shares his personal views on what makes good system design and how to best leverage different tools and paradigms for today's data needs.

In particular, Kleppmann argues that keeping data in sync between systems is best achieved through log-based derived data rather than by trying to implement distributed transactions:

In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.

— Kleppmann, pg. 542

In other words, datastores should broadcast changes in a durable and persistent way for other systems to consume at their own pace. This naturally entails dealing with eventual consistency when reading from derived sources, but Kleppmann believes that's a more manageable problem to solve compared to the performance and complexity concerns of coordinating synchronous data updates. From everything I learned by reading this book, I understand why he believes that.
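A toy Python sketch of what "consume at their own pace" looks like. `DerivedView` and the changelog entries are hypothetical; in practice the log would be something durable like a message broker or a database's change stream. Each consumer tracks its own offset into the shared log, so one can lag behind another without blocking the writer.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str]  # (key, new value); hypothetical change record


class DerivedView:
    """A consumer (cache, search index, ...) that tracks its own log offset."""

    def __init__(self, log: List[Change]) -> None:
        self.log = log
        self.offset = 0
        self.state: Dict[str, str] = {}

    def catch_up(self, max_events: Optional[int] = None) -> None:
        end = len(self.log)
        if max_events is not None:
            end = min(end, self.offset + max_events)
        for key, value in self.log[self.offset:end]:
            self.state[key] = value
        self.offset = end


changelog: List[Change] = []
changelog.append(("room-12", "booked by alice"))
changelog.append(("room-14", "booked by bob"))

cache = DerivedView(changelog)
search_index = DerivedView(changelog)

cache.catch_up()                     # fully up to date
search_index.catch_up(max_events=1)  # a slower consumer, one event behind
stale_snapshot = dict(search_index.state)

search_index.catch_up()              # eventually it converges
```

Between the two `catch_up` calls the search index is missing a booking that the cache already shows. That window is the eventual consistency you sign up for with this approach.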

For example, some "classic" data constraints such as "two people can't book the same hotel room" might not need to be so inviolable. Fixing those conflicts asynchronously and issuing a notification to the inconvenienced user (plus a coupon or some other small compensation) is an acceptable trade-off your business could consider.

The book ends with Kleppmann examining, with a broader societal perspective, the consequences of these enormous data processing systems that have been built over the last 20 years. The data involved in the majority of these systems is data about us, the users. Privacy and data consent concerns are paramount questions to wrangle with as these systems get better and more accurate. I've previously written about the current state of consumer privacy in my review of The Age of Surveillance Capitalism, so I won't go into more detail here.

Kleppmann also talks about predictive analytics and the world of data science. These days, a machine learning model is almost always going to be one of the consumers of a data-intensive application. Machine learning models usually provide automated decision making or predictions. Kleppmann ponders how accountability will be assessed in this world. For instance, he shares his thoughts on bias within data:

it seems ridiculous to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it. Yet this belief often seems to be implied by proponents of data-driven decision making, an attitude that has been satirized as “machine learning is like money laundering for bias”.

— Kleppmann, pg. 590

Part of our role as engineers is to have "moral imagination", as Kleppmann puts it, and a desire for the systems we build to improve upon the past. These are all novel issues we are encountering in the information age and they have broad societal implications. Engineers have a big role to play in helping to improve the technology and algorithms underpinning the software that runs our lives.

Rating

Non-Fiction 

Value

6/7

Interest

3/7

Writing

5/7

🥪️If this book was a sandwich it would be: egg salad on a brioche bun with alfalfa sprouts and spicy dijon mustard

Sunday, January 23 2022

Screamin' at home, and at phones, we all hurtin'
Freakin' ya soul, like the pack, we all herb
Pay me to feel with the funk, we all need ya
Mixin' the feel with the facts, we all hurt
— Isaiah Rashad, All Herb

I like stuff. I decided to recognize the stuff that I believe is the best stuff out of all the stuff last year. I did this in 2020 and I've done it again this year.

This is a list of my favourite music, film, and TV shows from 2021.

Music


Song: slowthai — feel away(feat. James Blake & Mount Kimbie)

I immediately fell in love with this song the first time I heard it.

Slowthai is a new artist for me; he blends hip-hop with genres more familiar to his native Britain, like grime and punk. He experiments with all of these sounds on his 2021 album Tyron, of which feel away is the second-to-last song.

Feel away feels like slowthai's approach to a love song. In this case, it's like the love is lost already. The piano refrain is melancholic and ethereal, echoing away throughout the whole song. It evokes the feeling of losing something, like something is slipping away.

The bridge by James Blake in the 2nd half of the song is beautiful and haunting. As the beat breaks back in, Blake sings:

I'll leave the dent in my car
To remind me what I could have lost

Again, it's about losing something. The aftermath of having something important taken from you. It could be a relationship, a friend, a pet, or your favourite coffee shop closing down due to economic hardship. No matter what it is, songs like this have a way of reminding us about these things.

Even the outro by Mount Kimbie is really cool and fits perfectly with the song. Just an excellent composition all around.

Album: Isaiah Rashad — The House is Burning

Shoutout to Cautious Clay who made this a tough decision with the release of Deadpan Love this year. It's an amazing record and was definitely a close second.

Isaiah Rashad hadn't released an album since 2016. He'd barely released any new music in that time. I first heard him via Cilvia Demo, his EP from 2014. It's an incredible debut. I instantly fell in love with his melodic, laid-back style of hip-hop. His cadence is finely tuned and so easy to listen to. Rashad doesn't place much importance on enunciation...but if this is mumble rap then it's the best mumble rap out there.

The House is Burning is so well done. Rashad has perfected his sound on this album. Each track is different but as a body of work it's got everything. Rashad can carry a song by himself, but he selected some great featured artists for most tracks on the album. A good example is Lil Uzi Vert's verse on From the Garden, one of the album's standout tracks.

My favourite track is RIP Young though. It's got a beat that needs to go on a diet and a really catchy chorus. It also showcases Rashad's impressive lyricism. He might not be easy to understand sometimes but his verses are pure poetry.

Artist: The Blue Stones

The Blue Stones are a two-piece rock band hailing from Windsor, Canada. With only a guitar, drums, and two mouths, The Blue Stones manage to make some really dynamic, catchy rock music.

They released their sophomore album Hidden Gems in 2021 and it was one of my most played albums last year. It's a follow up to their first studio album from 2018, Black Holes. I'm giving them artist of the year because both albums are stellar.

I think it'd be easy to criticize their sound as repetitive. Every song from this new album would've fit perfectly fine on the last one. Even though their sound hasn't evolved significantly, there's something to be said for consistency. I was happy to get 10 more tracks from a band I already loved with the release of Hidden Gems.

If they don't do anything different on their next album though...might be time to add a 3rd band member.

Movies


Film: Sound of Metal

I'm not exactly sure what the rules of my annual awards list should be. I technically watched Sound of Metal for the first time in 2020. But I've watched it a bunch of times since then, including this year. And yesterday.

Sound of Metal is a superb film. It's about a recovering heroin addict named Ruben who plays drums in a metal band with his girlfriend. Ruben, played by Riz Ahmed, starts to lose his hearing and he has to figure out a way to cope with this burgeoning disability.

Director Darius Marder did an amazing job bringing this original screenplay to life. Sound of Metal presents a realistic and eye-opening view of what deafness is like. Ahmed's performance was outstanding; he's become one of my favourite actors in recent years.

Sound of Metal
MFW I literally go deaf

The sound editing is incredible and adds so much to the experience. In fact, Sound of Metal won the Oscars for Best Editing and Best Sound last year. It was also nominated for Best Picture, which it definitely could've won.

Documentary: Coded Bias

Coded Bias is a documentary about modern technology and the biases that are imbued within it. In particular, it looks at the racial bias present in facial recognition. It also explores how software and algorithms are being used more and more to make decisions about us across all aspects of life.

Coded Bias

It was really eye-opening. The stark difference in accuracy that popular facial recognition services showed for women and people of colour was astounding. It's so surprising that major tech companies like Google and Amazon would release these services without checking for such obvious biases.

The film was really well done. It explores a range of topics and has interviews with experts who are active in these debates about facial recognition, widespread surveillance, and algorithmic bias.

Coded Bias was an illuminating film that I learned a lot from. It definitely made me think about how much control we are ceding to these algorithms and the people who develop them.

Director: Denis Villeneuve

I've loved every movie by Denis Villeneuve I've seen. Arrival, Sicario, and Blade Runner 2049 are all fantastic...great stories with great cinematography.

His latest movie, Dune, was released back in October. It's the latest attempt to make a film adaptation for a book that's been notoriously hard to adapt. I think Villeneuve did a decent job all things considered. Dune was a visually stunning movie, especially in IMAX.

Dune meme
Hans Zimmer also awed audiences with his goose use

Television


Limited Series: The Serpent

Netflix has produced some pretty fantastic shows and movies the past few years. The Serpent was a co-production between Netflix and the BBC; it's an 8 part limited series released last April.

It's based on the actual story of serial killer Charles Sobhraj, who drugged and killed tourists in Thailand during the 1970s. It's suspenseful, engaging and super creepy. Extra creepy when you remember that it actually happened.

It's a must-watch for anyone that's not planning a trip to Southeast Asia anytime soon.

Show: The Sopranos

I'm still not done The Sopranos but I'd be kidding myself to say this wasn't the best TV I've watched all year. I'm a little late to the party with this one (the final episode aired in June 2007) but it was better late than never.

Despite the show's age, the themes and storylines hold up surprisingly well today. It's a classic mob drama told through a modern lens, aware of the Godfathers and Goodfellas that came before it. Mixing mafia crime and family drama, The Sopranos is a show that finds deadpan humour embedded in its realism.

But the show is propelled by the excellent casting and the performances of every lead character. James Gandolfini is Tony Soprano—he commands the role of a mafia boss succumbing to the pressures of his responsibilities. It's really entertaining TV.

Episode: High Maintenance — Cruise (S03E09)

High Maintenance is a really unique show. It's kinda an anthology—each episode explores different lives of people living in New York City. It's a very modern, very progressive look at life today.

The only thing that loosely ties the stories together is The Guy...the nameless protagonist who bikes around the city delivering weed to the people in the show. Some episodes feature him more than others, but in general his life is not really the focus of the show.

I loved the final episode of season 3, Cruise. It wasn't especially better than any of the other episodes, but it had a bit of everything. My favourite part is the last 10 minutes of the episode, which felt like an homage to bicycling in the city. It ends with The Guy biking home at night overlaid with a monologue from the famous poet and NYC tour guide, Speed Levitch. Levitch is also featured in a few different scenes in the episode.

High Maintenance

As an avid city biker myself, I appreciated the tribute. Biking through a busy downtown is an immersive mix of chaos and order. All your senses are saturated by the buzz of the city as you cruise through it.

I suppose if I had an essential goal on the cruise right now, it would be to exhibit the fact that I'm thrilled to be alive and to still be respected. I suppose the soulful or the Buddhist out there might ask, 'Why do you need respect from others? The thrill to be alive, that's your own business. You can do that in your living room.' But that's not what the cruise is for me. The cruise is about the searching for everything worthwhile in existence.

I mean, I will appreciate the beauty of a flower, and then likewise, I will stand exhibitionistic and have the flower appreciate the beauty of me. Well, that's how I feel about cruising right now. And I would say having a quote, unquote, 'intimate love affair' with a flower is far more psychotic and riveting than having an 'intimate love affair', quote, unquote, with some of the banal creatures of the human race. Although I'd be into that too.
— Speed Levitch, The Cruise

Cruising through 2021 was a ride in itself. But the cruise is about searching for everything worthwhile in existence. Let's keep searching.

Monday, November 22 2021

Every year, I discover more and more, that I'm the same as everyone else. Which is kind of great, because it means that life is not so mysterious. You just do what other people do. Say please. Floss. When you're making scrambled eggs, stir them really fast so they don't get crusty. Find a few good people and try to hang on to them. Don't lose all your pieces.
— Sasha Chapin, All the Wrong Moves, pg. 70

You know how some people hate movie trailers? Like, they'd rather watch movies without seeing the trailer first because it'll usually spoil some things about the plot. I'm the same way with books.

I like starting a book without knowing too much about it; pretty much for the same reason as those trailer-haters. I enjoy figuring out the story as I read it so I can make my own judgments first. It's far more engaging, especially with fiction novels.

The "to-read" list on my iPhone is compiled from multiple sources (the internet, friends' recommendations, Oprah, etc.) and it's getting pretty long. As a consequence, the time between adding a book and actually reading it is usually enough time to forget why I added it in the first place. I just trust that past Joe added it for a good reason. Usually works out for future Joe (who becomes present Joe at the time of reading).

That's exactly what happened with All the Wrong Moves by Sasha Chapin. I forgot what it was about. I pictured it being some dramatic tale about a chess player that went crazy or something like 100 years ago. It wasn't that at all. It was much better and far funnier than I expected.


Just for Laughs


All the Wrong Moves is all about chess, kinda. Sasha Chapin is obsessed with, but also repulsed by, the main subject of his memoir. And just like all good toxic relationships, he ends up with some wild stories from his time spent playing this game.

I thoroughly enjoyed the book. Chapin's sharp wit shines through on each page. Most of the humour stems from Chapin's acute perspective on life and how it's often stranger than fiction when you view things through the right lens.

In terms of subject matter, Chapin walks a fine line between funny, self-deprecating cynicism and profound observations about the human experience. Like describing his personal preference between an aggressive or defensive play style:

I was like a child who couldn't draw a house with crayons deciding whether to be more like Jackson Pollock or Francis Bacon.
— Chapin, pg. 63

And then, in the same breath (or whatever the written version of breathing is), Chapin will expound on the mystery of determinism:

But it's so hard to tell, from the inside of a life, whether we can control our fate, or whether consciousness is merely the ability to observe ourselves obeying our irrevocable course, as if we were all self-aware pinballs
— Chapin, pg. 100

This self-aware pinball found the writing to be absolutely hilarious. Chapin embeds humour into every subject in the book. At a rapid fire pace too—I would audibly laugh several times in between two page turns. His timing and rhythm reminded me of stand-up comedy.

I love how Chapin portrays chess as a character in his memoir. It was such an integral part of his life for so long that it felt like a person. But it wasn't his friend. Chess was essentially the antagonist of All the Wrong Moves.

All the Wrong Moves Book Cover

It lures him in with its abstract beauty and illustrious history. Chess also feels like a world separate from our own—it occupies a higher plane within our minds. Chess can feel like an escape from the viscerality of life.

Yet the world of chess is its own special form of hell for Chapin. He becomes consumed by his drive to conquer the game, to understand its inner workings and secret rhythms better than his opponents. This obsession takes him around the world; he sacrifices relationships, sleep, and his own health.

Specifically, he wants to beat someone with an Elo rating above 2000 at a tournament in Los Angeles. Mostly because it's a nice round number.

All the Wrong Moves is all about chess, but it's also not. Chess could be substituted by almost anything in this story, because Chapin isn't writing about it, he's writing about his relationship with it.

I learned a lot of things from reading All the Wrong Moves, or at least it expanded my views on many important things (funny how a book can do that): obsession, conformity, following your passions, and dealing with the gradual realization that you aren't that special, at least from any reasonably zoomed-out perspective. We all know your mom thinks you're special.


Leaning In


Chapin fully admits, from the start, that his obsession with chess was unhealthy. It was absolutely not good for him and his well-being. From sleepless nights playing online chess with strangers, to the anxiety and stress he dealt with during tournaments, Chapin wasn't in control of this hobby. Chess was in control.

What I found interesting was how self-aware Chapin was. Trying to resist the urge to play was futile, and he accepted that.

He explains how chess entered his life in high school, when he joined the chess club. Despite some brief breaks from it, Chapin was consumed by the game for most of his 20s.

Was this unrelenting pursuit of chess mastery Chapin's choice? It doesn't sound like it:

Frankly, I didn't feel like I was doing much until chess came along. [...] it felt like a possession---like a spirit had slipped a long finger up through my spine, making me a marionette, pausing only briefly to ask, "you weren't doing anything with this, were you?"
— Chapin, pg. 4

This fact, that Chapin never really had a choice about devoting himself to this game—it feels like the central point he was trying to address in this memoir.

I really appreciate how Chapin leaned in to his obsession. He fueled his passion for chess for years until he could feel satisfied. Maybe not satisfied with the outcome, but satisfied with the effort he put in.

So much of our lives is determined by what we're exposed to: the ebbs and flows of life around us. These are the tides that can push us out to sea. The question is whether you choose to sink, or learn how to swim.

Ultimately, we hold on to the belief that we control what we want to pursue in life. What we want to give ourselves to and become passionate about; the mountains we choose to climb. But maybe which mountain we choose isn't that important. It's about deciding to climb.

Nature analogies aside—chess playing could be seen as one of the least useful skills to devote time to. It's just a game after all. It's a great example of how applying yourself will change you, no matter what the application is. Chapin believes that no matter how it works out, you'll be a better person at the end of the day.

Life often contains the discovery that your place in humanity isn't quite what you thought it was. You find out that you weren't meant to be the lover of the thing you first loved. But it's not so bad. If you're lucky, you end up loving something else. When failure removes you from the wrong path, as wrenching as that feels, you ought to be grateful. You're a little closer to where you should be, even if you don't know where that is yet.
— Chapin, pg. 121

Rating

Non-Fiction 

Value

3/7

Interest

6/7

Writing

5/7

🥪️If this book was a sandwich it would be: bacon, melted cheese & curly fries in a gluten-free sourdough bun
