Wednesday, December 11 2024

I mostly stay away from individual stock investing. It's too risky, and I don't pretend to know better than the rest of the financial market. My "dumb money" mostly goes into broad market ETFs with low management fees. However, sometimes you just want to try your luck at the casino.

That's how I felt last November when I had a lump sum of money I needed to invest. Instead of putting all of it into a low-risk ETF, I decided to do a bit of research and choose some other investment options. Now, I'll admit I didn't pull up any of these companies' financial records. I didn't do fundamental analysis. But I read some things—not just /r/wallstreetbets—and chose some stocks. So my choices were based on some data, but mostly vibes.

It's been exactly 12 months since I invested this money. I wanted to look back and see how these investment choices panned out. Of course, 1 year isn't super long in the grand scheme of market cycles, but it's a useful exercise nonetheless.

This also gives me a chance to do some data visualization, which I haven't done much of before. I like Kotlin, so I decided to try out Kotlin Notebooks combined with Kandy, a graphing library natively supported by Kotlin Notebooks and JetBrains IDEs. The complete code is on my GitHub if you're interested.

The actual dollar amount isn't relevant for this discussion, so let's just refer to it as D here.

Here's the breakdown of how I invested the D dollars.

// Ticker symbols; assumed here to be simple string constants for readability
val DIS = "DIS"; val MCD = "MCD"; val VGIT = "VGIT"; val TAN = "TAN"; val QQQ = "QQQ"

val relativeAmounts = mapOf(
    // Disney
    DIS to 5.17,
    // McDonald's
    MCD to 6.24,
    // intermediate-term Treasury bond ETF
    VGIT to 19.52,
    // solar technology ETF
    TAN to 25.78,
    // Nasdaq-100 tracking ETF
    QQQ to 100 - 5.17 - 6.24 - 19.52 - 25.78 // about 43%
)

Let's visualize this breakdown:

So, I still put most of the money in ETFs, but only QQQ has broad market exposure. And VGIT is a bond ETF, which is generally considered a hedge against market downturns.

It's pretty straightforward to get historical price data for all these securities from the [Nasdaq website](nasdaq url). I downloaded the last 5 years of data for each of them, imported them into my Notebook, sanitized the data a bit, and then plotted the last 1.5 years on a graph:

import java.text.SimpleDateFormat

// take only the last 1.5 years (252 trading days in a year)
val NUM_DAYS: Int = 252 + 126

fun AnyFrame.formatStonks() = this
    .remove("Volume", "Open", "High", "Low")
    // the CSVs list newest dates first, so this keeps the most recent NUM_DAYS rows
    .take(NUM_DAYS)
    .add("DateFmt") {
        SimpleDateFormat("MM/dd/yyyy").parse(it["Date"] as String)
    }
    .add("Price") {
        val price = it["Close/Last"]
        // Some of the prices have a preceding '$', some don't...
        if (price is String) {
            price.removePrefix("$").toFloat()
        } else {
            price
        }
    }
    .sortBy("DateFmt") // sort in ascending order by date
    .remove("Close/Last", "DateFmt")

val dis = DataFrame.read("historical-price-data/disney.csv").formatStonks()
val mcd = DataFrame.read("historical-price-data/mcd.csv").formatStonks()
val tan = DataFrame.read("historical-price-data/tan.csv").formatStonks()
val vgit = DataFrame.read("historical-price-data/vgit.csv").formatStonks()
val qqq = DataFrame.read("historical-price-data/qqq.csv").formatStonks()

The result is shown below, with a dashed line indicating the 11/30/2023 purchase date:

This gives a basic indication of how things went. QQQ did well. TAN did not. But a better comparison is, of course, the relative change rather than the absolute price of each stock. In other words, I want to visualize my return on each investment instead.

// November 30th, 2023 is the 126th day in the data
val NOV_30 = 126

fun List<Number>.getRelativeReturnAsPercent(): List<Float> =
    this.map { (it.toFloat() - this[NOV_30].toFloat()) / this[NOV_30].toFloat() * 100 }

// disPrice, mcdPrice, etc. are the "Price" columns of each DataFrame pulled out as lists
// (extraction omitted here). datesX5 is the shared date column repeated five times and
// symbols is a matching list of ticker labels -- the list-concatenation workaround
// mentioned in the side-note below.
val disReturn = disPrice.getRelativeReturnAsPercent()
val mcdReturn = mcdPrice.getRelativeReturnAsPercent()
val tanReturn = tanPrice.getRelativeReturnAsPercent()
val vgitReturn = vgitPrice.getRelativeReturnAsPercent()
val qqqReturn = qqqPrice.getRelativeReturnAsPercent()

val allReturns = dataFrameOf(
    "Date" to datesX5,
    "Return (%)" to disReturn + mcdReturn + tanReturn + vgitReturn + qqqReturn,
    "Symbol" to symbols
)

allReturns.groupBy("Symbol")
    .plot {
        line {
            x("Date")
            y("Return (%)")
            color("Symbol")
        }
        vLine {
            xIntercept(listOf("11/30/2023"))
            type = LineType.DASHED
        }
    }
    

Side-note: I found out halfway through this exercise that the graphing library, Kandy, is still very much in development. The current version is 0.7, and the developer documentation is incomplete. Plotting multiple series of data on the same graph isn't well supported; you have to do some hacky list concatenation, which was really lame. I don't think I'll use Kandy again until they've improved this.

This gives us a consistent comparison of each stock's performance since the buy date:

Here we see QQQ is up 31% right now, while TAN is down 20%. Everything else is somewhere in the middle.

I want to analyze how my investment choices fared—not just what I bought but also how much I bought relative to the total D dollars I had. This is called the portfolio return. The formula is straightforward since I'm only considering a single purchase date: multiply each investment's return by the relative amount invested in it, then sum everything up. Essentially, you're "weighting" each return by the percentage of your portfolio exposed to it. For example, QQQ's ~31% return weighted by its ~43% share of the portfolio contributes roughly 13 percentage points to the total.

val disWReturn = disReturn.map { it * relativeAmounts[DIS]!! / 100 }  
val mcdWReturn = mcdReturn.map { it * relativeAmounts[MCD]!! / 100 }  
val tanWReturn = tanReturn.map { it * relativeAmounts[TAN]!! / 100 }  
val vgitWReturn = vgitReturn.map { it * relativeAmounts[VGIT]!! / 100 }  
val qqqWReturn = qqqReturn.map { it * relativeAmounts[QQQ]!! / 100 }  
  
// grab one of the lists' indices to get the range of day ordinals
val days = disWReturn.indices  
  
val all = listOf(disWReturn, mcdWReturn, tanWReturn, vgitWReturn, qqqWReturn)  
val portfolioWReturn = days.map { index ->
    all.map { it[index] }.sum()
}

I plotted the overall portfolio return alongside each investment's portfolio weighted return:

This shows me how most of the portfolio's gains were due to QQQ, which is not surprising. It also shows how DIS and MCD barely affected my overall portfolio since I only invested about 5% of D in each.

Let's also plot the portfolio return alongside the absolute returns of each stock. I think this is more illuminating:

Here's where we really see how my investment choices balanced each other out in the aggregate. Fortunately, I still have a positive return after a year, mostly because the Nasdaq-100 (mostly tech) had a great year. But investing a quarter of my money in solar didn't pan out so well. It's not always sunny on Wall St.

Finally, just to drive the stake into my wallet further, here's my portfolio return plotted against just QQQ:

Overall, I missed out on 31.1 - 10.4 = 20.7 percentage points because of my decision to cosplay as a day trader for fun.

To put that in perspective, if D had been $10,000 last year, I would've had an additional $2,070 in my pocket today. That's like, half a Taylor Swift ticket.

Oh well, lesson learned: Don't use this unfinished graphing library for data viz anymore. And probably stick to ETFs.

Saturday, November 30 2024

After all, people regularly use www.google.com to check if their Internet connection is set up correctly.

— JC van Winkel, Site Reliability Engineering pg. 25

Site Reliability Engineering is a collection of essays written by senior engineers at Google describing how they run their production systems. It's mostly written from within the Site Reliability Engineering (SRE) organization; SRE is an actual job title there. However, I found the subject matter to be quite wide-ranging, covering everything from people management to distributed consensus algorithms. It didn't focus strictly on the SRE discipline, which partly explains why it's 500 pages long.

The whole book is actually available for free online if you're interested in reading it. Or just parts of it, since each chapter is a separate topic and there's not much overlap between them.

In essence, the SRE organization is a specialized discipline within Google meant to promote and maintain system-wide reliability for their services and infrastructure.

book cover

Reliability is such a multi-faceted objective that the expertise and responsibilities required are wide-ranging. The end goal seems simple to explain: Ensure systems are operating as intended. But reaching that goal requires a combination of technical, operational, and organizational objectives. As a result, this book touches on basically every topic of interest for a software company.

I spent a couple of years working in a nuclear power plant, so I've seen what peak reliability and safety culture looks like. The consequences of errors there are so much higher than at most other companies, including Google. So it's no surprise that reliability and safety are the paramount objectives there, taking priority over everything else.

This safety culture permeated everything we did in and around the plant. There were lines painted on each row of the parking lot to mark the path to the entrance; if you didn't follow them, you'd get coached and written up by someone. It was intense. And don't even think about taking the stairs without holding the railing...

Any change you want to make to a system within the plant needs extensive documentation, review, and planning before being approved. The turnaround on any change is therefore months, if not longer.

Contrast that with software companies like Google, where thousands of changes are made every day. The consequences of a mistake can still be serious, depending on the application. But instead of aiming for zero errors, errors are managed like a budget, and the rate at which that budget is spent determines how much change can be made in a given period of time:

In effect, the product development team becomes self-policing. They know the budget and can manage their own risk. (Of course, this outcome relies on an SRE team having the authority to actually stop launches if the SLO is broken).

— Marc Alvidrez, pg. 51
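To make the error-budget idea concrete, here's a minimal sketch of how a team might check whether there's budget left to spend on risky launches. This isn't Google's actual tooling; the SLO target and request counts are made-up numbers.

// Minimal error-budget sketch. The SLO target and the request/failure counts below
// are hypothetical; a real system would pull them from monitoring.
data class ErrorBudget(val sloTarget: Double, val totalRequests: Long, val failedRequests: Long) {
    // Allowed failures for the window, e.g. 0.1% of requests for a 99.9% SLO
    val allowedFailures: Double get() = totalRequests * (1 - sloTarget)
    // Fraction of the budget already consumed (1.0 means the budget is gone)
    val consumed: Double get() = if (allowedFailures == 0.0) 1.0 else failedRequests / allowedFailures
    val canLaunchRiskyChange: Boolean get() = consumed < 1.0
}

fun main() {
    // Hypothetical window: 10M requests, 4,200 failures, against a 99.9% SLO
    val budget = ErrorBudget(sloTarget = 0.999, totalRequests = 10_000_000, failedRequests = 4_200)
    println("Error budget consumed: ${"%.0f".format(budget.consumed * 100)}%") // 42%
    println("OK to ship a risky change? ${budget.canLaunchRiskyChange}")       // true
}

Once that fraction hits 100%, launches pause until the budget refills in the next window, which is the self-policing dynamic the quote describes.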

Learning about Google's software development process was interesting. In the first few chapters, there was a lot of useful information on measuring risk, monitoring, alerting, and eliminating toil. These were some of the more insightful chapters in my opinion.

But there were also a few...less insightful chapters. Chapter 17 was about testing code, and it really just stated obvious things about writing tests; it wasn't specific to SRE at all. Then there was a lot of time spent on organizational stuff, like postmortem culture and how to have effective meetings. Much of that writing came off as anecdotal, rather unhelpful advice the authors tried to generalize (or just make up) from past experiences.

So there were good and bad parts of the book. I wouldn't recommend reading it cover to cover like I did. It'd be better to just read a chapter on a topic that's relevant for you.

For instance, I found the section on load balancing to be really informative. Below is a summary of how Google does load balancing.

Balancing Act

Chapters 19 and 20 are about how Google handles their global traffic ingress. Google, by operating one of the largest distributed software systems in the world, definitely knows a thing or two about traffic load balancing. Or to put it in their words:

Google’s production environment is—by some measures—one of the most complex machines humanity has ever built.

— Dave Helstroom, pg. 216

Melodrama aside, I appreciated the clear and concise breakdown of their networking and traffic management in these chapters.

Load balancing needs to consider multiple measures of quality. Latency, throughput, and reliability are all important and are prioritized differently based on the type of request.

1. DNS

Chapter 19 is about load balancing across datacenters. Google runs globally replicated systems, so figuring out which datacenter to send a particular request to is the first step in traffic management. The main mechanism for configuring this is DNS — a.k.a. the phone book of the internet.

The goals of this routing layer are twofold:

  • Balance traffic across servers and deployment regions fairly
  • Provide optimal latency for users

DNS responses can include multiple IP addresses for a single domain name, which is standard practice. This provides a rudimentary way of distributing traffic, as well as increasing service availability for clients. Most clients (i.e. browsers) will automatically retry requests to different records in the DNS response until they successfully connect to something. The downside is that the service provider, Google, has little control over which IP address actually gets chosen in the DNS response. So it can't be solely relied on to distribute traffic.
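As a toy illustration of that client-side behaviour (my own sketch, not something from the book), you can resolve all of the addresses a name returns and try them in order until one accepts a connection:

import java.net.InetAddress
import java.net.InetSocketAddress
import java.net.Socket

// Resolve every address advertised for a name, then try them in order until one
// accepts a TCP connection -- roughly what a browser does when the first record fails.
fun connectToAny(host: String, port: Int = 443, timeoutMs: Int = 2000): Socket? {
    val addresses: Array<InetAddress> = InetAddress.getAllByName(host)
    for (address in addresses) {
        try {
            val socket = Socket()
            socket.connect(InetSocketAddress(address, port), timeoutMs)
            println("Connected to $address")
            return socket
        } catch (e: Exception) {
            println("Failed on $address, trying the next record...")
        }
    }
    return null
}

fun main() {
    connectToAny("www.google.com")?.close()
}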

The second goal of DNS is to provide optimal latency to users, which means trying to route their requests to the geographically closest server available to them. This is accomplished by having different DNS name-servers set up in each region Google operates in, and then using anycast routing to ensure the client connects to the closest one. The DNS server can then serve a response tailored to that region.

This sounds great in theory, but in practice DNS resolution is hairier, and there are lots of issues specifically around the caching introduced by intermediary name-servers. I won't go into those details here.

Despite all of these problems, DNS is still the simplest and most effective way to balance load before the user’s connection even starts. On the other hand, it should be clear that load balancing with DNS on its own is not sufficient.

— Piotr Lewandowski, pg 240

2. Reverse Proxy

The second layer of load balancing happens at the "front door" to the datacenter—using a Network Load Balancer (NLB), also known as a reverse proxy. The NLB handles incoming requests by broadcasting a Virtual IP (VIP) address and can then proxy those requests to any number of actual application servers. To retain the originating client details after proxying a request, Google uses Generic Routing Encapsulation (GRE), which wraps the entire IP packet inside another IP packet.

There's some complexity here, of course, in terms of the actual routing algorithm used by the NLB. Supporting stateful protocols like WebSockets requires the NLB to keep track of connections and forward all requests to the same backend for a given client session.
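Here's a rough sketch of that session-pinning idea (my own simplification, not how Google's NLB is actually implemented): new flows get hashed onto a backend, and traffic from an established flow keeps going to the same one:

// A toy connection-tracking balancer: new flows are assigned by hashing the client's
// address, and existing flows stay pinned to whichever backend they started on.
// This is a simplification for illustration, not Google's actual routing algorithm.
class StickyBalancer(private val backends: List<String>) {
    private val flows = mutableMapOf<String, String>() // client "ip:port" -> backend

    fun backendFor(clientIp: String, clientPort: Int): String {
        val flowKey = "$clientIp:$clientPort"
        return flows.getOrPut(flowKey) {
            // Stable choice for new flows; Math.floorMod keeps the index non-negative
            backends[Math.floorMod(flowKey.hashCode(), backends.size)]
        }
    }
}

fun main() {
    val nlb = StickyBalancer(listOf("app-1", "app-2", "app-3"))
    val first = nlb.backendFor("203.0.113.7", 52144)
    val second = nlb.backendFor("203.0.113.7", 52144)
    println(first == second) // true: the WebSocket-style session stays on one backend
}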

Once the request has reached an application server, there will likely be a multitude of internal requests initiated in order to serve the request.

In order to produce the response payloads, these applications often use these same algorithms in turn, to communicate with the infrastructure or complementary services they depend on. Sometimes the stack of dependencies can get relatively deep, where a single incoming HTTP request can trigger a long transitive chain of dependent requests to several systems, potentially with high fan-out at various points.

— Alejandro Forero Cuervo, pg. 243

And besides that, there's plenty of traffic and computational work that doesn't originate from end users at all. Cron jobs, batch processes, queue workers, internal tooling, machine learning pipelines, and more are all different forms of load that must be balanced within the network. That's what Chapter 20 covers.

3. Connection Pool Subsets

The goal of internal load balancing is mostly the same as for external requests. Latency is still important, but the main focus is on optimizing compute and distributing work as efficiently as possible. Since there's only so much actual CPU capacity available, it's vital to ensure load is distributed as evenly as possible to prevent bottlenecks or the system falling over due to a single overloaded service.

Within Google, SRE has established a distinction between "backend tasks" and "client tasks" in their system architecture:

We call these processes backend tasks (or just backends). Other tasks, known as client tasks, hold connections to the backend tasks. For each incoming query, a client task must decide which backend task should handle the query.

— Cuervo, pg. 243

Each backend service can be composed of hundreds or thousands of these backend tasks (processes) spread across many machines. Ideally, all backend tasks operate at roughly the same capacity, and the total wasted CPU is minimized.

The client tasks will hold persistent connections to the backend tasks in a local connection pool. Due to the scale of these services, it would be inefficient for every single client to hold a connection to every single backend task, because connections cost memory and CPU to maintain.

So Google's job is to optimize an overlapping subset problem—which subset of backend tasks should each client connect to in order to evenly spread out work.

Using random subsetting didn't work well: the graph below shows the least-loaded backend at only 63% utilization while the most-loaded is at 121%.

SRE CPU distribution with random subsetting

Instead, Google uses deterministic subsetting, which balances the connections between clients nearly perfectly. It's an algorithm that shuffles the backend list and deals it out to clients in even, non-overlapping slices. I won't go into all the details, but the rough shape of it is sketched below.
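For the curious, here's a Kotlin sketch of the deterministic subsetting algorithm roughly as the book describes it (the book presents it in Python; the names here are mine):

import kotlin.random.Random

// Deterministic subsetting, adapted from the book's description: clients are grouped
// into "rounds", each round shuffles the backend list using the round number as the
// seed, and the shuffled list is cut into non-overlapping subsets of equal size.
fun deterministicSubset(backends: List<String>, clientId: Int, subsetSize: Int): List<String> {
    val subsetCount = backends.size / subsetSize
    val round = clientId / subsetCount
    val shuffled = backends.shuffled(Random(round))
    val subsetId = clientId % subsetCount
    val start = subsetId * subsetSize
    return shuffled.subList(start, start + subsetSize)
}

fun main() {
    val backends = (0 until 12).map { "backend-$it" }
    // With 12 backends and a subset size of 3, the 4 clients in a round collectively
    // cover every backend exactly once, so connections stay evenly spread.
    for (clientId in 0 until 4) {
        println("client $clientId -> ${deterministicSubset(backends, clientId, subsetSize = 3)}")
    }
}

Clients in the same round see the same shuffle, so together they cover every backend exactly once before the next round starts over with a different shuffle.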

4. Weighted Routing

Once the pool of connections has been established for each client task, the final step is to build an effective load balancing policy for these backends.

Using a simple round-robin algorithm didn't work either, as evidenced by historical operational data. The main reason is that different clients issue requests to the same backends at vastly different rates, since they could be serving completely different downstream applications. There's also variation in the cost of individual queries, in the backend machines themselves, and in unpredictable factors like antagonistic neighbours.

Instead, Google uses weighted round robin, which keeps track of each backend's current load and distributes work based on that. They first built it around the number of active requests to each backend, but that alone doesn't tell the whole story of how healthy a particular backend is. So instead, each backend sends load information to the client in every response: its active request count, CPU, and memory utilization. The client uses this data to distribute the flow of work optimally.
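Here's a small sketch of the client-side half of that idea (my own simplification; the actual policy described in the book is a weighted round robin with more inputs and smoothing): keep the latest load report per backend and prefer the least-loaded one:

// Backends piggyback their current load on every response; the client keeps the most
// recent report per backend and sends the next request to whichever looks least loaded.
// This is a simplified illustration of the weighted-round-robin idea, not Google's code.
data class LoadReport(val activeRequests: Int, val cpuUtilization: Double, val memUtilization: Double) {
    // Collapse the signals into one score; the weights here are arbitrary
    val score: Double get() = activeRequests + 100 * cpuUtilization + 50 * memUtilization
}

class WeightedClient(backends: List<String>) {
    private val lastReport = backends.associateWith { LoadReport(0, 0.0, 0.0) }.toMutableMap()

    fun pickBackend(): String = lastReport.minByOrNull { it.value.score }!!.key

    // Called whenever a response comes back carrying a fresh load report
    fun onResponse(backend: String, report: LoadReport) {
        lastReport[backend] = report
    }
}

fun main() {
    val client = WeightedClient(listOf("task-a", "task-b"))
    client.onResponse("task-a", LoadReport(activeRequests = 40, cpuUtilization = 0.9, memUtilization = 0.6))
    client.onResponse("task-b", LoadReport(activeRequests = 12, cpuUtilization = 0.3, memUtilization = 0.4))
    println(client.pickBackend()) // task-b, the less loaded backend
}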

Here's a diagram I made to visualize all that:

Load Balancing diagram

Conclusions

Site Reliability Engineering offers many insights shared by senior engineers from one of the world's leading software companies. I particularly enjoyed the sections on alerting, load balancing, and distributed computing. But there were some chapters I found boring and without much useful, actionable advice.

Google has been a leader and innovator in tech for many years. They're known for building internal tools for basically every part of the production software stack and development life cycle. A lot of these tools have since been re-released as open-source projects, or even spun into new companies started by ex-Googlers.

For instance, Google has been running containerized applications for over 20 years. As the scale of running services and jobs this way expanded, the manual orchestration and automation scripts used to administer these applications became unwieldy. Thus, around 2004 Google built Borg — a cluster operating system which abstracted these jobs away from physical machines and allowed for remote cluster management via an API. And then 10 years later, Google announced Kubernetes, the open source successor to Borg. Today, Kubernetes is the de-facto standard for container orchestration in the software industry.

All this to say—Google has encountered many unique problems over the years due to its sheer complexity and unprecedented scale, and that has forced the company to develop novel solutions. As such, it's useful to look to them as a benchmark for the entire software industry. Understanding how they maintain their software systems is helpful for anyone looking to improve their own.

Rating

Non-Fiction

  • Value: 4/7
  • Interest: 3/7
  • Writing: 3/7

🥪️ If this book were a sandwich it would be: a California burrito with extra avocado

Sunday, October 6 2024

It's time to update my website.

Over the last couple of years I took a bit of a hiatus from posting new content, but this year I've rediscovered the motivation for writing. I'm also interested in writing about more subjects, not just book reviews. In particular, I'm going to start writing about software engineering and coding more, which will require some changes to how posts are formatted.

Because of these new requirements, I've decided to rebuild my writing "stack" and the platform that powers my blog.

The main things I don't like about my current writing flow and site:

  • The site's UI is outdated
    • While I'm proud of the handcrafted *artisanal* HTML I wrote for the original site, I want to redesign it to match my current tastes
  • There's too much manual HTML editing required for new posts
    • I write posts in markdown and then convert them to HTML programmatically. But then I usually have to modify the HTML output to finalize the formatting.
  • My deployment infrastructure is not cloud-optimized or properly decoupled
    • Sure, hosting your Spring Boot app, MongoDB server, media assets, and Jenkins server on a single EC2 instance is possible. Is it a good idea? No.

So there's nothing horribly wrong with the current site. It works, which isn't surprising given it's a static blog that changes infrequently. But I think rebuilding it will help invigorate my writing and make producing new content more efficient. Besides, any project is an opportunity to learn, so I'm excited to work with some stuff I don't use often and try out some new technology.

My goals for this new site (code named flow2) are the following:

  • Refresh the UI
  • Containerize and use better cloud tooling for the infrastructure
  • Markdown files as the single source of truth for content. No manual HTML editing required
  • Streamline the entire process between writing and posting
  • Overclock the Lighthouse scores as much as possible for fun
  • Try out Ktor — an async web framework built in Kotlin with coroutines

And that's it! I'll probably get a new domain name too.

Looking forward to building. You can check out my progress on GitHub if you like.
