Sunday, October 6 2024

It's time to update my website.

Over the last couple of years I took a bit of a hiatus from posting new content, but this year I've rediscovered the motivation for writing. I'm also interested in writing about more subjects, not just book reviews. In particular, I'm going to start writing about software engineering and coding more, which will require some changes to the formatting of posts.

Because of these new requirements, I've decided to rebuild my writing "stack" and the platform that powers my blog.

The main things I don't like about my current writing flow and site:

  • The site's UI is outdated
    • While I'm proud of the handcrafted *artisanal* HTML I wrote for the original site, I want to redesign it to match my current tastes
  • There's too much manual HTML editing required for new posts
    • I write posts in markdown and then convert them to HTML programmatically (a sketch of one such pipeline follows this list). But then I usually have to modify the HTML output to finalize the formatting.
  • My deployment infrastructure is not cloud optimized or properly decoupled.
    • Sure, hosting your Spring Boot app, MongoDB server, media assets, and Jenkins server on a single EC2 instance is possible. Is it a good idea? No.
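
For the curious, the markdown-to-HTML conversion itself is the easy part; the manual work comes from touching up the HTML afterwards. Here's a minimal sketch using the commonmark-java library (just one possible converter, not necessarily what my current pipeline uses):

```kotlin
import org.commonmark.parser.Parser
import org.commonmark.renderer.html.HtmlRenderer

// Render a markdown post to HTML in one shot, so the markdown file
// can stay the single source of truth.
fun markdownToHtml(markdown: String): String {
    val parser = Parser.builder().build()
    val renderer = HtmlRenderer.builder().build()
    return renderer.render(parser.parse(markdown))
}

fun main() {
    println(markdownToHtml("# My Post\n\nSome *artisanal* content."))
}
```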

So there's nothing horribly wrong with the current site. It works, which isn't surprising given it's a static blog that changes infrequently. But I think rebuilding the site will help invigorate my writing and improve my efficiency for generating new content. Besides, any project is an opportunity to learn, so I'm excited to work with some stuff I don't use often and try out some new technology.

My goals for this new site (code-named flow2) are the following:

  • Refresh the UI
  • Containerize and use better cloud tooling for the infrastructure
  • Markdown files as the single source of truth for content. No manual HTML editing required
  • Streamline the entire process between writing and posting
  • Overclock the Lighthouse scores as much as possible for fun
  • Try out Ktor — an async web framework built in Kotlin with coroutines (see the sketch below)
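
For a taste of what that last bullet looks like, here's the canonical Ktor "hello world" (a minimal sketch using Ktor 2.x with the Netty engine; the route and message are placeholders):

```kotlin
import io.ktor.server.application.*
import io.ktor.server.engine.*
import io.ktor.server.netty.*
import io.ktor.server.response.*
import io.ktor.server.routing.*

fun main() {
    // Coroutine-based request handling behind a simple routing DSL.
    embeddedServer(Netty, port = 8080) {
        routing {
            get("/") {
                call.respondText("Hello from flow2!")
            }
        }
    }.start(wait = true)
}
```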

And that's it! I'll probably get a new domain name too.

Looking forward to building. You can check out my progress on GitHub if you like.

Monday, August 19 2024

When awake, we see only a narrow set of all possible memory interrelationships. The opposite is true, however, when we enter the dream state and start looking through the other end of the memory-surveying telescope. Using that wide-angle dream lens, we can apprehend the full constellation of stored information and their diverse combinatorial possibilities.

— Matthew Walker, pg. 203

I have a hunch that most people who identify as "book readers" like to read before bed. In fact, I'm confident that's where a lot of readers get most of their reading done. However, I admit I have no evidence or statistics to back this up, mostly because I'm not a scientist and haven't done any research. Welcome to my blog.

Fortunately, there are people like Dr. Matthew Walker, who IS a scientist and DOES do research before sharing his theories with the world. Walker is a professor of neuroscience at UC Berkeley and has spent his career studying sleep. This book, Why We Sleep, is his magnum opus: a distillation of decades spent revealing the secrets of the strange, yet essential, nocturnal phase of our existence.

Why We Sleep book cover

Y'all Sleeping on Sleep

It feels like an injustice to this book to sum it up by saying "sleep is good for you". I was astounded by the breadth of topics covered by Dr. Walker regarding sleep and its effects on our body and health. Part 2 of the book is entitled "Why Should You Sleep?" and I kid you not, the following is an incomplete list of the benefits that adequate sleep has been shown to promote:

  • creativity
  • emotional regulation
  • learning efficacy
  • memory retention
  • expected lifespan
  • decreased psychiatric disorder risk
  • decreased injury risk
  • decreased cancer risk
  • decreased Alzheimer's disease risk
  • decreased type-2 diabetes risk
  • decreased car crash risk
  • increased testosterone levels
  • increased testicle size

Yes, even that last one. After reading the details of all the studies and methodologies behind proving these correlations, I felt the emotional impact of each lesson began to dull after a while. Like hearing that the world's on fire every day in the news, being told that sleep is really good for you starts to get repetitive. So I think Part 2 sort of drags on a bit.

Thankfully, the remainder of Why We Sleep was more focused on how to get better sleep — and what the hell is going on when we dream! I found both of these topics to be much more interesting.

Like I mentioned before, I do a lot of reading in bed. This was both a good and bad book to read before sleeping. On the one hand, learning about all the enumerable ways that sleep is good for me was a great headspace to end my day in, allowing my mind to drift off and start to ride those rejuvenating REM and NREM brainwaves.

On the other hand, on the days when I couldn't sleep, or was going to sleep late, or was in an uncomfortable environment where I wasn't in control of my sleeping space — knowing the exact reasons why I couldn't sleep ("ugh, my hands and feet are too hot") almost made things worse ("agh, I looked at a screen too recently") and tended to exacerbate my stress ("shouldn't have had that green tea at 4 PM").

But alas, I truly believe that knowledge is powerful and ignorance is not blissful. I chose to read this book because I wanted to know more about sleep and now I do. I took away a lot of interesting information and useful tips. I think this knowledge will help me sleep better, but more importantly I now have random factoids to drop into conversations whenever they turn to sleep.

In particular, I learned that humans naturally have a biphasic circadian rhythm, which is a really cool term to break out at parties. In English, it means we're biologically hardwired to nap once a day. That mid-afternoon lull you feel every day turns out to be totally natural and not just because your lunch was a big bowl of cheesy gnocchi.

Well, the gnocchi could be part of it honestly (reminder: not a scientist).

Not only is it natural, but a short afternoon nap is apparently healthy for you too. Walker points to several studies, some from his own lab, that have illuminated the subtle but measurable ways that a short afternoon nap is beneficial.

Those who were awake throughout the day became progressively worse at learning, even though their ability to concentrate remained stable (determined by separate attention and response time tests). In contrast, those who napped did markedly better, and actually improved in their capacity to memorize facts.

— Walker, pg. 102

From a longitudinal study of Greece and the decline of its siesta culture over the late 20th century, there was a clear relationship between reduced naps and the risk of heart disease:

However, those that abandoned regular siestas went on to suffer a 37 percent increased risk of death from heart disease across the six-year period, relative to those who maintained regular daytime naps. The effect was especially strong in workingmen, where the ensuing mortality risk of not napping increased by well over 60 percent.

— Walker, pg. 69

Leading to the natural conclusion, in Walker's own words:

From a prescription written long ago in our ancestral genetic code, the practice of natural biphasic sleep, and a healthy diet, appear to be the keys to a long-sustained life.

— Walker, pg. 70

While reading about napping and all its great benefits in Why We Sleep, I was reminded of a headline I'd seen somewhere years ago that said something along the lines of:

BREAKING: New Scientific Study by Scientists Shows Napping Causes Bad Health Things to Happen and You Might Die Sooner, According to Science

That might not be verbatim, but you get the point.

Now I'm not one to believe everything I read on the internet. But I do like to base my entire worldview on a subject according to a single headline I skimmed over once and never looked into further. So needless to say, finding out there was conflicting evidence on the benefits of napping was rather shocking. I was losing sleep over it.

With such uncertainty circling in my head, I felt the need to do more investigation. So I decided to finally put on my scientist hat and do some good old fashioned research on the matter.

Naps: A Literature Review

There have indeed been several studies published over the last couple of decades that associated napping with increased risk of hypertension, cardiovascular disease and diabetes, to name a few. According to that last one, the correlation with diabetes risk was "partly explained by adiposity". Adiposity is a really technical term for being fat; and being fat, interestingly enough, was also found to be correlated with regular daytime napping in a separate study! This raises the question — what came first, the couch or the glucose intolerance?

It's hard to say since these long-term epidemiological studies are observational and can only suggest potential causes of a disease based on statistical evidence. The risk of confounding factors makes these studies hard to trust entirely, which is why the conclusions drawn must use wording like:

"increased daytime nap frequency may represent a potential causal risk factor for essential hypertension."

For us normies, these studies lead to clickbait headlines and articles touting the risks of this seemingly benign activity:

Long Naps May Be Bad For Your Health | Forbes
Napping regularly linked to high blood pressure and stroke, study finds | CNN
You snooze, you lose: why long naps can be bad for your health | The Guardian

With all these morbidities being linked to napping, it's easy to see how one could conclude that naps are indeed bad for them. I haven't found any explanations offered as to why napping might be bad for your health, so my research, along with the entire state of nap science, is apparently stalled for now.

But what does our sleep expert, Dr. Walker, say about all this? Walker makes no mention of any of these studies in Why We Sleep (granted, many of them were published after his book was written). Walker's only words of warning come in the "Twelve Tips for Healthy Sleep" appendix:

  1. Don’t take naps after 3 p.m. Naps can help make up for lost sleep, but late afternoon naps can make it harder to fall asleep at night.

— Walker, pg. 325

So, according to Walker, naps are good for you as long as you take them early enough in the day.

Most studies I read as part of my research mentioned nap length as an important factor in these negative health correlations. For example, this study showed that those who regularly take short naps (< 30 minutes) were less likely to have high blood pressure, while those who take long naps were more likely to have it.

Putting it all together, I've decided my new worldview on naps boils down to the following:

  • Napping early in the afternoon for no more than 30 minutes is:
    • probably fine
    • possibly beneficial for you
  • Napping for longer than 30 minutes regularly is:
    • possibly bad for you
    • more likely a symptom of some other underlying health issue(s) or poor lifestyle habits.

And of course, my final research source was asking ChatGPT, which obviously told me exactly the same thing:

ChatGPT output
dude nobody likes a know-it-all...

Why bother researching the internet when ChatGPT has already read the whole thing? Oh well, now I know for next time.

Reading books is probably outdated by this point too, but it still helps me get to bed at night. Especially a book like Why We Sleep that evangelises the life-changing power of slumber. So until AI can start singing me lullabies*, I'm clinging to my books!

*editor's note: it turns out AI can already do this

Rating

Non-Fiction

Value: 4/7
Interest: 5/7
Writing: 4/7

🥪 If this book were a sandwich, it would be: sliced turkey with mashed potatoes and gravy on buttered white bread

Sunday, June 23 2024

Last Updated: 24-06-2024

If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock.

— Kleppmann, pg. 343

As an engineer with a fair amount of professional experience, I found Designing Data-Intensive Applications by Martin Kleppmann to be extremely illuminating. It's a comprehensive overview of modern data storage and data processing technology, with a little bit of history thrown in.

I was already familiar with many of the concepts and tools in the book but it was interesting to read about them in a broader context. Kleppmann takes time to describe the rationale behind each software framework or tool he introduces, including the problems it solved that existing technologies didn't and what trade-offs it makes. This helped improve my mental model and solidify my overall understanding.

Kleppmann sticks to a particular topic in each chapter — concepts like data encoding, partitioning, or stream processing. As such, in theory this book could be read piecewise; every chapter stands on its own like a textbook. But Kleppmann did a good job organizing the topics so that reading the book end-to-end is worthwhile too. The lessons of each chapter build upon the previous ones in complementary ways.

Designing Data-Intensive Applications Book Cover

The book is divided into three parts. Part I starts with some fundamental concepts of data systems as they relate to a single machine — how you model, store, and query for data. Then we move beyond a single computer in Part II, and a host of new problems is introduced, like how to encode data, replicate it, partition it, and achieve consistency across a distributed system (one of the scarier terms in software engineering). In the final section of the book, Kleppmann focuses on heterogeneous data systems: how data is stored and transferred across disparate datastores, indexes, caches, streams and more, and some best practices for building such systems.

All these topics require familiarity with the concepts covered in previous chapters to fully understand the intricacies of the problems being solved. You continuously zoom out to higher levels of abstraction as the book progresses. That's something I really liked about Designing Data-Intensive Applications.

Next I will share three lessons that stood out to me after reading the book. Hopefully you will find something useful in my brief summaries.

1. There's no such thing as a schemaless database

A large portion of this book is dedicated to databases, obviously. Chapter 2 covers how to model and store your data in different types of databases. In chapter 3, Kleppmann dives into how databases are actually implemented, particularly relational / SQL-style databases. But he also introduces several other database classes, generally grouped under the umbrella term "NoSQL". Which just means "not SQL". Document-oriented, wide-column, key-value, and graph databases are all alternatives meant for different use cases.

Several of these databases are quite popular, for example MongoDB, a document database that stores unstructured JSON in named collections. I've always considered this type of database to be schemaless since each JSON document in a collection can have completely different fields. But Kleppmann explains that it should be considered schema-on-read instead. This is in contrast to a SQL database, which is schema-on-write because it enforces a predefined schema when you attempt to add or update records.

Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database [20]. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read)

— Kleppmann, pg. 51

I think this is an important distinction to make, and it makes sense when you think about it. Stored data has no use if it can't be parsed into some known set of fields, so of course at some point a schema needs to be "applied". Sure, you can add if-exists checks to every field to avoid making any assumptions, but the same thing could be done with SQL by making every column nullable.

Schema-on-read is analogous to runtime type checking in programming languages, whereas schema-on-write is similar to compile-time type checking.

The schema-on-read approach is advantageous if you have little control over the data your system is storing and it may change frequently over time. It's easier to change your schema if it's part of your application code. This is why MongoDB, and more broadly any schema-on-read approach, is generally considered more flexible and less error-prone as your application changes.
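
To make the analogy concrete, here's a minimal Kotlin sketch of schema-on-read (the collection, fields, and User type are all hypothetical, not MongoDB's API): the "collection" happily stores documents of different shapes, and it's the reading code that imposes the structure.

```kotlin
// The implicit schema the application expects when it reads.
data class User(val name: String, val email: String?)

// Two documents with different shapes living in the same "collection";
// nothing is validated at write time.
val collection = listOf(
    mapOf("name" to "Ada", "email" to "ada@example.com"),
    mapOf("name" to "Linus") // no email field, and the store doesn't care
)

// Schema-on-read: structure is interpreted only when the data is read.
fun readUser(doc: Map<String, Any?>): User = User(
    name = doc["name"] as? String ?: error("document missing 'name'"),
    email = doc["email"] as? String // optional field, may be absent
)

fun main() {
    collection.map(::readUser).forEach(::println)
}
```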

2. Make reads do more work to make writes easier

Read-heavy systems are very common in software, especially web applications. I'm personally more accustomed to optimizing read efficiency, using techniques such as database indices. A database index makes writes slower (because every write also has to update the index) but reads much faster.
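
Here's a toy sketch of that trade-off (a hypothetical in-memory store, not how any real database implements indexing): every write does a little extra bookkeeping to maintain a hash index so that reads can skip a full scan.

```kotlin
class UserStore {
    private val rows = mutableListOf<Pair<Int, String>>()            // (id, name)
    private val indexByName = mutableMapOf<String, MutableList<Int>>()

    fun insert(id: Int, name: String) {
        rows += id to name                                           // the write itself
        indexByName.getOrPut(name) { mutableListOf() }.add(id)       // extra work: update the index
    }

    // Without the index this would be an O(n) scan over `rows`.
    fun findIdsByName(name: String): List<Int> = indexByName[name] ?: emptyList()
}

fun main() {
    val store = UserStore()
    store.insert(1, "Ada")
    store.insert(2, "Ada")
    println(store.findIdsByName("Ada")) // [1, 2]
}
```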

But sometimes, the system you're designing needs to handle extremely high write throughput. In this case, we want to shift some of the work back to the read path in order to make the write path faster.

For instance, in chapter 5 Kleppmann covers database replication. There is a style of database called Dynamo, a set of techniques developed at Amazon for its in-house storage system (not to be confused with the later DynamoDB service). Other popular databases like Cassandra and Voldemort were modelled on Dynamo. These databases are leaderless and replicate asynchronously to achieve high write throughput.

One technique that stood out to me is called read repair:

When a client makes a read from several nodes in parallel, it can detect any stale responses. For example, in Figure 5-10, user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica.

— Kleppmann, pg. 199

Because Dynamo-style databases use quorums for consistency between nodes, the read operation also updates any outdated values found when those nodes responsible for a given key are queried. The coordinator node will "repair" those outdated values synchronously before completing the read operation. This is an example of doing more work during reads in order to make writes faster.
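
Here's a minimal sketch of the read-repair idea (a simplification, not Dynamo's actual protocol): values carry versions, a read consults every replica, returns the newest version it sees, and writes that version back to any replica that was stale.

```kotlin
data class Versioned(val version: Long, val value: String)

class Replica {
    private val store = mutableMapOf<String, Versioned>()
    fun read(key: String): Versioned? = store[key]
    fun write(key: String, v: Versioned) {
        val current = store[key]
        if (current == null || v.version > current.version) store[key] = v
    }
}

fun readWithRepair(replicas: List<Replica>, key: String): Versioned? {
    val responses = replicas.map { it to it.read(key) }
    val newest = responses.mapNotNull { it.second }.maxByOrNull { it.version } ?: return null
    // Read repair: push the newest value back to any stale replica before returning.
    responses
        .filter { (_, v) -> v == null || v.version < newest.version }
        .forEach { (replica, _) -> replica.write(key, newest) }
    return newest
}

fun main() {
    val replicas = List(3) { Replica() }
    replicas.forEach { it.write("user:2345", Versioned(6, "old")) }
    // Replicas 0 and 1 saw a newer write; replica 2 missed it.
    replicas[0].write("user:2345", Versioned(7, "new"))
    replicas[1].write("user:2345", Versioned(7, "new"))
    println(readWithRepair(replicas, "user:2345")) // newest value wins
    println(replicas[2].read("user:2345"))         // repaired to version 7
}
```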

Another, more general principle for increasing write throughput is to store data as an immutable log of append-only records. Writing to a single ledger is computationally cheap and can be scaled in many ways. This technique is touched on countless times by Kleppmann throughout the book. Change data capture via the binlog, event sourcing, batch and stream processing — these are all examples of systems that write data as an append-only log, available for downstream consumers to parse and derive the relevant data needed for their use case. More work is needed when reading (or by intermediary systems that construct the data in an appropriate form), to the benefit of allowing very high rates of write operations.
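
As a tiny illustration of the pattern (the event types and derived view here are made up): writers only ever append, and a downstream consumer replays the log to derive whatever view it needs.

```kotlin
sealed interface Event
data class Deposited(val account: String, val amount: Long) : Event
data class Withdrew(val account: String, val amount: Long) : Event

class EventLog {
    private val events = mutableListOf<Event>()
    fun append(e: Event) { events += e }          // writes are cheap: just append
    fun replay(): List<Event> = events.toList()   // consumers read the whole log
}

// A downstream consumer derives current balances by replaying the log;
// the extra work happens on the read side, not the write side.
fun balances(log: EventLog): Map<String, Long> =
    log.replay().fold(mutableMapOf<String, Long>()) { acc, e ->
        when (e) {
            is Deposited -> acc.merge(e.account, e.amount) { a, b -> a + b }
            is Withdrew -> acc.merge(e.account, -e.amount) { a, b -> a + b }
        }
        acc
    }

fun main() {
    val log = EventLog()
    log.append(Deposited("alice", 100))
    log.append(Withdrew("alice", 30))
    println(balances(log)) // {alice=70}
}
```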

3. Time is an illusion

One concept that Kleppmann spends a great deal of ...*ahem*... time on is dealing with time in a distributed system. It turns out to be a tricky business; some may even go so far as to call time an illusion.

Having a notion of time is critical for so many algorithms and mechanisms in software applications, so understanding its edge cases and complexities is important. Kleppmann spends a good portion of chapter 8 just discussing the myriad ways that assumptions about time can be wrong.

Even for a single computer, time is an illusion. There are generally two different clocks available for you to use (and mess up). The time-of-day clock represents the "real-world" time and is used when you need a timestamp, while the monotonic clock starts from an arbitrary point but is guaranteed to always increase (at approximately 1 second per second). You use the monotonic clock when you want to measure the duration between two events.
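
Here's what those two clocks look like on the JVM, for instance (a small Kotlin sketch):

```kotlin
fun main() {
    // Time-of-day clock: a meaningful timestamp, but it can jump
    // forward or backward when NTP adjusts it.
    val wallClockMs = System.currentTimeMillis()

    // Monotonic clock: arbitrary origin, but guaranteed to only move forward.
    val start = System.nanoTime()

    Thread.sleep(250)

    // Durations should always come from the monotonic clock.
    val elapsedMs = (System.nanoTime() - start) / 1_000_000
    println("wall clock at start: $wallClockMs")
    println("elapsed: ~$elapsedMs ms (correct even if the wall clock jumped mid-sleep)")
}
```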

Once multiple machines are involved, simple questions like "what time is it?" and "what happened first?" become deceptively hard to answer. For instance, a common way of dealing with write conflicts is the "last write wins" mechanism: when two machines want to modify the same record at the same time, just choose the write that happened later. The problem is, how do you determine which write happened last? If you use the time you received the writes, then you risk violating causality, since a write could have been delayed for any number of reasons. If the clients generate timestamps themselves, you suddenly need to deal with differences in their local clocks. If a node has a lagging clock, then all its writes might be overwritten for a while before anyone notices.
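
A minimal sketch of last write wins with client-generated timestamps (hypothetical types, not any particular database's implementation) shows exactly how a lagging clock loses writes:

```kotlin
data class Write(val value: String, val timestampMs: Long)

// The replica keeps whichever write carries the highest timestamp.
class LwwRegister {
    private var current: Write? = null
    fun apply(w: Write) {
        val c = current
        if (c == null || w.timestampMs > c.timestampMs) current = w
    }
    fun read(): String? = current?.value
}

fun main() {
    val register = LwwRegister()
    register.apply(Write("from node A", timestampMs = 1_000))
    // Node B actually wrote *later* in real time, but its clock lags,
    // so its write silently loses and causality is violated.
    register.apply(Write("from node B", timestampMs = 800))
    println(register.read()) // from node A
}
```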

So, to make sure all the nodes in our system have the right time, we use the Network Time Protocol (NTP) to periodically synchronize all the clocks using a super accurate time source. But, like any network communication, NTP is also susceptible to a number of fault modes. I won't detail them here.

Leap seconds are another good example of time's illusory nature. Leap seconds have crashed entire systems before. Nowadays leap seconds are handled by having the NTP server "lie" via a process called smearing, which gradually applies the extra second over an entire day. If you can't trust your NTP server, who can you really trust?

I think the complexity of time is emblematic of distributed systems as a whole. You can only reason so far about how a system of computers will behave, and about every possible way things can go wrong, before you start diving into the limits of quantum physics and the notion of truth itself! It can be overwhelming!

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed system, we can state the assumptions we are making about the behavior (the system model) and design the actual system in such a way that it meets those assumptions. Algorithms can be proved to function correctly within a certain system model. This means that reliable behavior is achievable, even if the underlying system model provides very few guarantees.

— Kleppmann, pg. 330

Comforting words, Martin.

Last Words

The last chapter of the book is a divergence from the rest. It's still technical, but it's a forward-facing look at how data system technology might evolve in the future. Kleppmann shares his personal views on what makes good system design and how to best leverage different tools and paradigms for today's data needs.

In particular, Kleppmann argues that keeping data synchronized between systems is best achieved through log-based derived data rather than distributed transactions:

In the absence of widespread support for a good distributed transaction protocol, I believe that log-based derived data is the most promising approach for integrating different data systems.

— Kleppmann, pg. 542

In other words, datastores should broadcast changes in a durable and persistent way for other systems to consume at their own pace. This naturally entails dealing with eventual consistency when reading from derived sources, but Kleppmann believes that's a more manageable problem to solve compared to the performance and complexity concerns of coordinating synchronous data updates. From everything I learned by reading this book, I understand why he believes that.

For example, some "classic" data constraints such as "two people can't book the same hotel room" might not need to be so inviolable. Fixing those conflicts asynchronously and issuing a notification to the inconvenienced user (plus a coupon or some other small compensation) is an acceptable trade-off your business could consider.

The book ends with Kleppmann examining, from a broader societal perspective, the consequences of the enormous data processing systems that have been built over the last 20 years. The data involved in the majority of these systems is data about us, the users. Privacy and data consent are paramount concerns to wrangle with as these systems get better and more accurate. I've previously written about the current state of consumer privacy in my review of The Age of Surveillance Capitalism, so I won't go into more detail here.

Kleppmann also talks about predictive analytics and the world of data science. These days, a machine learning model is almost always going to be one of the consumers of a data-intensive application. Machine learning models usually provide automated decision making or predictions. Kleppmann ponders how accountability will be assessed in this world. For instance, he shares his thoughts on bias within data:

it seems ridiculous to believe that an algorithm could somehow take biased data as input and produce fair and impartial output from it. Yet this belief often seems to be implied by proponents of data-driven decision making, an attitude that has been satirized as “machine learning is like money laundering for bias”.

— Kleppmann, pg. 590

Part of our role as engineers is to have "moral imagination", as Kleppmann puts it, and a desire for the systems we build to improve upon the past. These are all novel issues we are encountering in the information age, and they have broad societal implications. Engineers have a big role to play in improving the technology and algorithms underpinning the software that runs our lives.

Rating

Non-Fiction

Value: 6/7
Interest: 3/7
Writing: 5/7

🥪 If this book were a sandwich, it would be: egg salad on a brioche bun with alfalfa sprouts and spicy dijon mustard
