Author: mattchung

  • COVID-19 stunting Elliott’s social skills

    COVID-19 stunting Elliott’s social skills

    Like almost every other parent, my wife and I are doing our best to shelter our 16-month-old daughter, Elliott, in the midst of the COVID-19 pandemic, trying, like so many parents, to fabricate a bubble with some sense of normalcy. Up until recently, I tricked myself into believing that we could mask (or at least minimize) the impact of social distancing on Elliott. Unfortunately, I no longer hold on to that belief.

    On Monday afternoons, Jess attends a (remote) hour-long appointment, and during that time I break away from my office and watch over Elliott. This past Monday, Elliott and I walked (well, I mostly carried her) to the neighborhood park located right around the corner. When we arrived, in the distance (about 50 feet away) were two young girls (around six and two years old) and their nanny, the three of them sitting cross-legged in a circle on the ground, feasting on their homemade picnic.

    Elliott waved her little hand at them and caught the attention of one of the little girls, who raced over and introduced herself, abruptly stopping just about six feet away from us.

    “I’ll keep my distance, because of the coronavirus.”

    After shooting the shit with this 6-year-old for a couple minutes, we parted ways, and Elliott and I continued walking towards the swing area. I noticed that Elliott was still gazing at the two little girls sitting in the distance. When I planted Elliott down on her own two feet, she spun around to face the girls, turned her head towards me, then stretched her arm out towards them, pointing her index finger in their direction, signaling to me that she wanted to go play. But I had to explain to her that she couldn't and that we needed to keep our distance.

    Elliott started bawling. Nonstop.

    I felt so sad for her.

    I squatted down to her eye level, trying (as best as I could) to gently explain to her that she couldn’t go play with them. But it was no use. Nothing I said comforted Elliott.

    God damn this pandemic.

    I hope and pray that this pandemic ends soon and that we can return to our “new normal”, a normal that allows children to run around with each other, play tag, hug one another, without them fearing, or their parents fearing, for their lives.

  • PAXOS (not) made simple

    PAXOS (not) made simple

    Leslie Lamport, a world-renowned computer scientist, first submitted the paper "The Part-Time Parliament" back in 1990. Unfortunately, that paper was not well received, primarily because the consensus protocol was described through the story of an obscure ancient civilization on the Greek island of Paxos. Since the paper failed to land with computer scientists, Lamport followed up by publishing another version titled "Paxos Made Simple."

    Over the last couple days, I’ve been reading and carefully studying the paper, making (so far) three passes. From the title, you would think that the paper is written such that it could be easily understood by us computer scientists. It’s supposed to be simple, right?

    Nope.

    In fact, I’m finding it incredibly difficult to conceptualize the consensus protocol, unable to succinctly describe the protocol in my own words. The paper’s title misleads its readers!

    Frustrated with my own inability to understand the paper, I set it aside and decided to pick up another (optional) reading assignment prescribed by my distributed systems course, the paper titled "In Search of an Understandable Consensus Algorithm." After reading the first couple paragraphs, I felt a sense of relief because, on the second page, the authors share both my frustration and confusion about PAXOS:

    “The first drawback is that Paxos is exceptionally difficult to understand. The full explanation [15] is notoriously opaque; few people succeed in understanding it, and only with great effort. As a result, there have been several attempts to explain Paxos in simpler terms [16, 20, 21]. These explanations focus on the single-decree subset, yet they are still challenging. In an informal survey of attendees at NSDI 2012, we found few people who were comfortable with Paxos, even among seasoned researchers. We struggled with Paxos ourselves; we were not able to understand the complete protocol until after reading several simplified explanations and designing our own alternative protocol, a process that took almost a year”

    After realizing that I'm not alone in failing to grok PAXOS, I started searching the internet for other resources that describe PAXOS better. I found a couple of great write-ups as well as the "de facto" lecture describing PAXOS.

    So, sorry “PAXOS made simple”, it’s not me, it’s you.

  • The FLP theorem: impossibility of achieving consensus within distributed systems

    The FLP theorem: impossibility of achieving consensus within distributed systems

    For this week, my distributed systems course just assigned us students a reading assignment: “Impossibility of distributed consensus with one faulty process“.

    Apparently, this paper is a seminal piece of work that describes and proves that, in an asynchronous system where even a single process may fail, consensus cannot be guaranteed (i.e. the nodes cannot always reach a state where they all agree on some proposed value). That's right: not difficult, but impossible. Assuming this theorem holds true, how the hell do we even build robust distributed systems that can tolerate failure? Surely there must be a way. Otherwise, how the hell do popular cloud services like Amazon S3 and Amazon DynamoDB guarantee high availability and consistency?

    My gut tells me that the authors of the paper (Fischer, Lynch, and Paterson) make strict assumptions about the underlying system. And if we can somehow relax these assumptions, then we can in fact build highly available distributed systems. But I'll find out soon, since I'll be reading both this paper and another seminal piece of work, "Paxos Made Simple" by Leslie Lamport.

    I'll follow up in a separate post once I've read through the paper three times.

  • YouTube Review: “My Guitar Teacher TOMO FUJITA Gives Words of Wisdom”

    YouTube Review: “My Guitar Teacher TOMO FUJITA Gives Words of Wisdom”

    I recently watched a YouTube video titled "My Guitar Teacher TOMO FUJITA Gives Words of Wisdom". In this video (below), YouTuber Mary Spender interviews Tomo Fujita, a guitar instructor who has taught at Berklee College of Music for over 20 years; he takes his years of accumulated knowledge and shares some words of wisdom. From this interview, I took away three lessons: practice music that you are drawn to, identify five inspirational guitarists, and record and analyze yourself playing guitar.


    Practice music that you are drawn to

    Here’s the gist: stop trying to be someone you are not.

    Many of us aspiring guitarists incorrectly believe that in order to be a "good" guitarist, we need to be able to play classical, blues, or jazz, even if we don't actually listen to that type of music.

    Tomo shared an example of one of his students who declared, "I want to get good at Jazz," and then proceeded to enumerate all the ways in which he would become a Jazz player, declaring that he'd begin by learning all the Jazz music theory. Tomo Fujita responded, "What sort of Jazz do you listen to?"

    Radio silence.

    The student admitted that he doesn't listen to any Jazz, and Tomo shared his philosophy: unless you practice the music that you love, you'll never get good at it. At best, you'll nail the mechanics. But ultimately, the playing will feel empty. He's right.

    So stop practicing music you don't enjoy, right now, and just be your authentic self. Enjoy pop music? Play and write pop songs! Like getting down to reggae? Drill Bob Marley songs! And don't feel like you need to restrict yourself to one genre, either. But once you've identified the types of music you want to pursue playing, you should identify five guitarists who inspire you.

    Identify top 5 guitarists who inspire you

    Another useful exercise he mentioned: name the five guitarists who inspire you. Not six. Not ten.

    Then, once you've identified those top five artists, research them. Learn their history and how they became the artists they are today. Identify their inspirations, recursively tracing inspirations back to the root. This exercise helps inform and influence your own style.

    Record yourself for feedback

    If you never record yourself, whether in audio, video, or both, then it's difficult to improve as a guitarist. You miss out on feedback, and you miss out on opportunities to discover your own habits and areas for improvement.

    I recently started recording myself and publishing the videos to YouTube. I upload videos not to generate millions of views, but for several other reasons. First, I'm documenting my progress, something I wish I had done when I first picked up the guitar a couple years ago. Second, recording myself, and more importantly watching and hearing myself back, allows me to analyze my playing and identify the skills that need brushing up. Finally, recording is a forcing function, pushing me to generate a large volume of work, which I think is the only real way to improve one's musical craftsmanship.


  • Week in Review:  2021/01/17 – 2021/01/24

    Week in Review: 2021/01/17 – 2021/01/24

    Not too much to report this week. Not because nothing happened, but because I wasn't as diligent in capturing this week's activities; I was on call this week, and carrying the pager almost always disrupts my flow, this week being no exception. My pager dragged me out of bed several times (e.g. 12:30 AM, 2:30 AM, 4:50 AM), confusing my body's circadian rhythm and making me wake up at unusual times. On top of the disrupted sleep schedule, being on call constantly interrupts my train of thought, and whatever I happen to be doing at the time gets dropped. The result: I forget to write my days down.

    Family

    See you next year Christmas Lights

    I tore down our Christmas lights.

    Our homeowners association requires that we take down all decorations within two weeks after New Year's. Too bad. I wish the decorations could stay up a little longer. Our house would look warmer. Plus, our neighbor's decorations were something Elliott looked forward to walking past every day. She'll just have to wait until next year to see her "No No" (her "snowman").

    Insane child brain development

    This past week, Elliott's motor and linguistic skills grew by leaps and bounds.

    She basically mirrors everything we say. For this reason, we need to be even more careful, since both Jess and I have a tendency to swear like sailors. The last thing I want is a 16-month-old walking around dropping F-bombs.

    Apologies to the wife

    On Thursday evening, I apologized to Jess after snapping at her.

    My short fuse, I think, has to do with a combination of lack of sleep (from being woken up in the middle of the night while on call) and the frustration I often feel from her constantly instructing me on what to do when it comes to Elliott, and on what I consider trivial things: I know how to put a bib on my daughter. The micromanagement can sometimes make me feel incompetent as a father, and I think that's the root of my frustration.

    Home maintenance

    My dad and I rolled up our sleeves and repaired the broken down dryer. About three weeks ago, the unit stopped working, the drum no longer spinning.

    After surfing internet forums and watching about a dozen tutorials on YouTube, I ordered new parts (a rubber drum ring and an idler pulley), and after disassembling the entire dryer and replacing the parts, the repaired dryer not only works, but runs dramatically quieter. Before the fix, you'd press the start button and the dryer would just scream! Not anymore; no sir. The dryer now sings. In the end, repairing the dryer aligns with my philosophy of taking care of the things we own, instead of just chucking things out the window and buying new ones.

    Graduate School

    Graduate school is going really well this semester. I'm learning a lot, and the material piques my interest much more than I had anticipated.

    This past week, I read both the required and optional readings; these readings, combined with watching the lectures, have shifted my perspective on how I approach analyzing and building distributed systems. Normally, when designing systems, I would employ physical clocks for tracking time, using technologies like the network time protocol (NTP). But I've learned a new way to track time: logical clocks.

    They are essentially counters that tick monotonically with each process event, and we can implement them using different techniques: Lamport's scalar clocks, vector clocks, and matrix clocks.
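
    To make the idea concrete, here is a minimal sketch of a Lamport-style scalar clock in Python; the class and method names are my own, purely for illustration:

        class LamportClock:
            """A scalar logical clock: a per-process counter."""

            def __init__(self):
                self.time = 0

            def tick(self):
                """Advance the clock for a local event."""
                self.time += 1
                return self.time

            def send(self):
                """Timestamp an outgoing message."""
                return self.tick()

            def receive(self, message_timestamp):
                """On receipt, jump past the sender's timestamp, then tick."""
                self.time = max(self.time, message_timestamp)
                return self.tick()

        # Example: P2 sends a message to P1.
        p1, p2 = LamportClock(), LamportClock()
        t = p2.send()     # P2's clock becomes 1; the message carries timestamp 1
        p1.receive(t)     # P1's clock jumps to max(0, 1) + 1 = 2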

    Music

    The fingers on my left hand have re-blistered. That's because I've been practicing and playing guitar a lot more than usual. When I have downtime, even for a minute or two, I grab my guitar hanging off the wall. Sometimes I'll run scales; sometimes I'll advance my fretboard knowledge by memorizing the positions of notes; other times I'll record myself practicing a guitar cover (most recently "Jolene" by Dolly Parton and "Blackbird" by the Beatles).

  • Why is Lamport’s Scalar Clock only consistent, not strongly consistent?

    Why is Lamport’s Scalar Clock only consistent, not strongly consistent?

    For the last couple days, I've been watching the distributed systems video lectures and reading the recommended papers that cover logical clocks. Even after making multiple passes over the material, the concepts just hadn't clicked: I could not wrap my mind around why Lamport's clocks satisfy only consistency, not strong consistency. But now I think I have a handle on it, after sketching a few of my own process diagrams (below).

    Before jumping into the examples, let's first define strong consistency. We can categorize a clock as strongly consistent if and only if we can compare any two events' timestamps (and only the timestamps) and determine whether the two events are causally related. In other words, can we say "event A caused event B" just by inspecting their timestamps? In short, the timestamps must carry enough information to determine causality.
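
    Stated a bit more formally (my own paraphrase, using Lamport's happened-before relation → and C(e) for event e's timestamp): a clock is consistent if e1 → e2 implies C(e1) < C(e2), and it is strongly consistent if, in addition, C(e1) < C(e2) implies e1 → e2. Lamport's scalar clocks give us only the first implication.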

    Example: Time Diagram of Distributed Execution

    In my example below, I've limited the diagram to two processes: P1 and P2.

    P1 executes three of its own events (events 1, 2, and 3) and then receives a message from P2 (event 4). P2 executes two events (events 1 and 2), sends a message to P1 (event 3), then executes a fourth and final event. To show the sending of a message, I've drawn an arrow connecting the two events, (P2, 3) to (P1, 4).

    Logical clocks with an arrow from P2 to P1, showing causality between events (P2, 3) and (P1, 4)


    Cool. This diagram makes sense. It's obvious from the arrow that (P2, 3) caused (P1, 4). But for the purpose of understanding why Lamport's clocks are not strongly consistent, let's remove the arrow from the diagram and see if we can reach the same conclusion about causality by inspecting only the timestamps.

    Removing arrow that connects (P2, 3) to (P1, 4)


    Next, let’s turn our attention to the timestamps themselves.

    Can we in any way say that (P2, 3) caused (P1, 4)?


    Just by comparing only the timestamps, what can we determine?

    Just by comparing the timestamps, all we can say is that the event with the higher timestamp did not happen before the event with the lower one. That's it: we cannot guarantee that event 3 caused event 4; the two events might be causally related, or they might be concurrent. So the timestamps by themselves do not carry sufficient information to determine causality.
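
    To see why the implication fails, here is a tiny counterexample (sketched in Python, with made-up variable names): two processes that never exchange messages produce only concurrent events, yet one event still ends up with a smaller scalar timestamp than another.

        # P1 and P2 never communicate, so all of their events are concurrent.
        c1 = 0  # P1's scalar clock
        c2 = 0  # P2's scalar clock

        c1 += 1  # P1, event a: C(a) = 1
        c2 += 1  # P2, event b: C(b) = 1
        c2 += 1  # P2, event c: C(c) = 2

        # C(a) < C(c), yet a did not happen before c; the two events are
        # concurrent. A lower scalar timestamp therefore does not imply
        # causality. (Vector timestamps would expose this: [1, 0] and [0, 2]
        # are incomparable, which is exactly what "concurrent" means.)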

    If Lamport’s scalar clocks do not determine causality, what can we use instead? Vector clocks! I’ll cover these types of clocks in a separate, future blog post.

  • 8 fallacies of distributed computing

    8 fallacies of distributed computing

    Rotem-Gal-Oz, A. (2005). Fallacies of Distributed Computing Explained.

    Cognitive biases (built-in patterns of thinking) and fallacies (errors in reasoning) creep into our everyday lives, sometimes without us even knowing it. For example, ever wonder why you work just a little harder, a little quicker, when you think someone is standing over your shoulder, watching you? This is known as the Hawthorne effect: people tend to change their behavior when they know (or think) they are being watched.

    Or how about the situations where we try to solve every problem by reaching for the same tool (i.e. always grabbing the hammer)? That might be the mere-exposure effect: people prefer the same tools and processes simply because they are familiar.

    Essentially, cognitive biases and fallacies trick us: they throw off our memory, perception, and rationality. So we must surface these blind spots to the forefront of our brains when designing and building distributed systems.

    8 Fallacies

    According to Rotem-Gal-Oz, there are 8 fallacies that bite distributed system designers. These fallacies have stood the test of time: they were drafted over 25 years ago, and designers have been building distributed systems for over 55 years. Despite advances in technology, these fallacious assumptions still trip up designers today.

    1. The network is reliable
    2. Latency is zero
    3. Bandwidth is infinite
    4. The network is secure
    5. Topology doesn’t change
    6. There is one administrator
    7. Transport cost is zero
    8. The network is homogeneous

    Notice anything in common among these fallacies? They all tend to revolve around assumptions we make about the underlying communication, the network.

    Network assumptions

    I wouldn't go so far as to say networks are fragile, but failures (e.g. hardware failures, power outages, faulty cables) happen on a daily basis. Sometimes these failures are never observed by our users: their traffic is automatically routed along an alternative (albeit sometimes slightly slower) path, and their dropped messages are resent thanks to the underlying (transport-layer) communication protocol.

    So, what does this mean for you, the system designer?

    To overcome network hiccups, we should either "use a communication medium that supplies full reliable messaging" or employ the following messaging techniques: retry, acknowledge important messages, identify and ignore duplicates, reorder messages (or do not depend on message order), and verify message integrity.
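
    To make a couple of those techniques concrete, here is a rough Python sketch of retrying until acknowledgment and ignoring duplicates on the receiving side; the message format and function names are mine, not from the paper:

        import uuid

        def make_message(payload):
            """Attach a unique ID so the receiver can recognize duplicates."""
            return {"id": str(uuid.uuid4()), "payload": payload}

        def send_with_retry(send, message, max_attempts=5):
            """Resend until an acknowledgment arrives, up to a retry budget.
            `send` is any callable that returns True once the peer acks."""
            for _ in range(max_attempts):
                if send(message):
                    return True
            return False  # give up and surface the failure to the caller

        class Receiver:
            """Processes each message at most once, despite retransmissions."""

            def __init__(self):
                self.seen = set()

            def handle(self, message):
                if message["id"] in self.seen:
                    return None  # duplicate caused by a retry; ignore it
                self.seen.add(message["id"])
                return f"processed {message['payload']}"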

    In addition to hardening our software, be prepared to thoroughly stress test your system. You’ll want to hamper network performance: drop messages, increase latency, congest the pipes. How well can your system tolerate these conditions? Can your system sustain these performance degradations? Or does your system fail in unexpected ways?
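
    On the testing side, one cheap way to start (a toy sketch, not a replacement for real fault-injection tooling) is to wrap whatever transport you use in a channel that randomly drops and delays messages:

        import random
        import time

        class FlakyChannel:
            """Wraps a real send function and injects drops and latency."""

            def __init__(self, send, drop_rate=0.1, max_delay_s=0.5):
                self.send = send
                self.drop_rate = drop_rate
                self.max_delay_s = max_delay_s

            def deliver(self, message):
                if random.random() < self.drop_rate:
                    return False  # simulate a lost message
                time.sleep(random.uniform(0, self.max_delay_s))  # inject latency
                self.send(message)
                return True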

    Main takeaway

    In short, we should assume that the network is unreliable, and in response we should both employ techniques (e.g. retrying messages) that harden our systems and thoroughly test our systems and software under non-ideal conditions, including device failures, congestion, and increased latency.

    References

    1. Hunt, A. Pragmatic Thinking and Learning: Refactor Your Wetware.
    2. https://fallacyinlogic.com/fallacy-vs-bias/

  • What are good models and what models are good?

    What are good models and what models are good?

    Schneider, F. B. (1993). What good are models and what models are good. Distributed Systems, 2, 17–26.

    Paper Summary

    In his seminal paper on models (as they apply to distributed systems), Schneider describes the two conventional ways we normally develop intuition when studying a new domain: experimental observation, and modeling and analysis. For the remainder of the paper, he focuses on modeling and analysis, directing our attention toward two main attributes of distributed systems: process execution speeds and message delivery delays. With these two attributes at the forefront of our minds, we can build more fault-tolerant distributed systems.

    Main Takeaways

    The biggest "ah hah" moment for me came while reading the section "fault tolerance and distributed systems". In it, Schneider suggests that system architects should, from day one, focus their attention on fault-tolerant system architectures and fault-tolerant protocols, with an eye toward both feasibility and cost. To build fault-tolerant systems, we should physically separate and isolate processors and connect them by a communication network: doing so ensures that components fail independently. That's some seriously good intuition; I hope to carry those words of wisdom with me as I continue to learn about distributed systems and as I design my own.

    Great Quotes

    "In a distributed system, the physical separation and isolation of processors linked by a communications network ensures that components fail independently."

    "One way to regard a model is as an interface definition—a set of assumptions that programmers can make about the behavior of system components. Programs are written to work correctly assuming the actual system behaves as prescribed by the model. And, when system behavior is not consistent with the model, then no guarantees can be made."

    Notes

    Defining Intuition

    Distributed systems are hard to design and understand because we lack intuition for them
    When unfamiliar with a domain, we can develop an intuition for it in one of two ways:
    • Experimental observation – build prototype and observe behavior. Although the behavior in various settings might not be fully understood, we can use the experience to build similar systems within similar settings
    • Modeling and analysis – we formulate a model and then analyze model using either mathematics or logic. If this model reflects reality, we now have a powerful tool (a model should also be both accurate and tractable)

    Good Models

    What is a model?

    Schneider defines a model as a collection of attributes and a set of rules that govern how those attributes interact. Perhaps just as important (if not more important): there is no single correct model for an object; there can be many. Choosing the right model depends on the questions we ask. But regardless of which specific model we choose, we should choose a "good model", one that is both accurate and tractable. Accurate in the sense that it yields truths about the object; tractable in the sense that an analysis is possible.

    What questions should we ask ourselves as we build distributed systems?

    While modeling distributed systems, we should constantly ask ourselves two questions: what is the feasibility (i.e. can this type of problem be solved at all), and what is the cost (i.e. how expensive must the system be)? The reason we ask these questions is threefold. First, we don't want to try to solve an unsolvable problem "lurking beneath a system's requirements", wasting our time on design, implementation, and testing. Second, by taking cost into consideration up front, we can avoid choosing protocols that are too slow or too expensive. Finally, with these two attributes in mind, we can evaluate our solution, since they provide a "yardstick with which we can evaluate any solution that we devise."

    Why is studying algorithms and computational complexity insufficient for understanding distributed systems?

    Although studying algorithms and computational complexity addresses the two questions above (feasibility and cost), it ignores the subtle nuances of distributed systems. Algorithms and computational complexity assume a single process, one that is sequential in nature. But distributed systems are different: multiple processes communicating over channels. Thus, we need to extend our model to address the new concerns.

    Coordination problem

    I'm hoping to come back and update this section of the post because, even after reading this part of the paper twice, I'm still a bit confused. First, I don't understand the problem itself; second, I don't understand why the problem cannot be solved with a protocol.

    Synchronous vs. Asynchronous

    Why should we distinguish synchronous systems from asynchronous ones?

    It's useful to distinguish a synchronous system from an asynchronous one. In the former, we make assumptions: process execution speeds and message delivery delays are bounded. In the latter, we assume nothing: no assumptions about either process execution speeds or message delivery delays.

    The author makes a very strong argument in this section, saying that ALL systems are asynchronous, and that a protocol designed for an asynchronous system can be used in ANY distributed system. A big claim. But is it true? He backs it up by pointing out that even a system whose processes run in lockstep and whose message delivery is instantaneous still falls under the definition of an asynchronous system (as well as a synchronous one).

    Election Protocol

    What can we achieve by viewing a system as asynchronous?

    By viewing a system as asynchronous rather than synchronous, we can employ simpler protocols. The author presents a new problem: N processes, each process Pi with a unique identifier uid(i); design a protocol so that all processes learn which one is the leader.

    According to the author, although we can solve this problem easily in the asynchronous view, the cost is steep: every node broadcasts its own uid.

    So how can we solve the election protocol assuming a synchronous system?

    We define τ to be a constant greater than the sum of 1) the largest message delivery delay and 2) the largest difference that can be observed at any instant by reading any two processes' clocks. Each process i then waits until time τ · uid(i) on its own clock before broadcasting. In other words, the process with the smallest uid broadcasts first, and its message arrives at every other node before any of them reaches its own broadcast deadline, so no second broadcast is ever sent.
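
    Here is how I picture the timing argument, as a toy Python check rather than a real multi-process implementation (the function and parameter names are mine): the process with the smallest uid broadcasts first, and its message reaches everyone before any other process's broadcast deadline.

        def elect_leader(uids, tau, max_delay, clock_skew):
            """Sanity-check the synchronous election argument: process i
            broadcasts at local time tau * uid(i) unless it has already
            heard a broadcast. Assumes integer uids and tau greater than
            max_delay + clock_skew."""
            assert tau > max_delay + clock_skew
            leader = min(uids)
            # Worst case: the leader's broadcast is heard by every process
            # no later than tau * leader + clock_skew + max_delay.
            heard_by = tau * leader + clock_skew + max_delay
            for uid in uids:
                if uid != leader:
                    # Any other process's deadline is at least tau * (leader + 1),
                    # which is strictly later, so that process never broadcasts.
                    assert heard_by < tau * uid
            return leader

        elect_leader([3, 7, 5], tau=2.0, max_delay=1.0, clock_skew=0.5)  # -> 3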

    Failure Models

    When reasoning about failure models, we attribute faulty behavior not by counting occurrences, but by counting the number of faulty components: processors and communication channels. With this component-counting approach, we can say a system is t-fault tolerant if it continues to satisfy its specification (i.e. keeps working) as long as no more than t of its components fail. This component-driven concept is novel to me and forces me to think about failures differently.

    Why is it important to correctly attribute failure?

    The author states that we need to properly attribute failures. He provides an example about message loss. A message can be lost due to electrical issues on the channel (i.e. blame the channel) or due to a buffer overflow on the receiver (i.e. blame the receiver). And since replication is the only way to tolerate faulty components (I'd like to learn more about this particular claim), we need to correctly attribute faults to components, because "incorrect attribution leads to an inaccurate distributed system model; erroneous conclusions about system architecture are sure to follow".

    What are some of the various models?

    He then walks through the various failure models, including failstop, crash, crash+link, and so on. Among these, the least disruptive is "failstop", since no erroneous actions are ever taken.

    Although it may not be possible to distinguish between a process that has crashed and a process that is running very, very slowly, the distinction and its implications are important. When a process has crashed, other processes can assume its role and responsibilities and move forward. But if the process is merely slow and the other processes assume its responsibilities anyway, we are screwed: the state of the world loses its consistency.

    When building failure models, how deep in the stack should we go? Should we go as far as handling failures in the ALU of a processor!? Why or why not?

    We don't want to consider a CPU's ALU failing: it's too expensive for us to solve (which relates to the two attributes above, feasibility and cost). In addition, it's a matter of abstraction:

    "A good model encourages suppression of irrelevant details."

    Fault Tolerance and distributed systems

    "As the size of a distributed system increases, so does the number of its components and, therefore, so does the probability that some component will fail. Thus, designers of distributed systems must be concerned from the outset with implementing fault tolerance. Protocols and system architectures that are not fault tolerant simply are not very useful in this setting."

    From day one, we should think about fault-tolerant system architectures and fault-tolerant protocols. But he goes on to make another very strong claim (presumably also well supported): that the only way to achieve fault tolerance is by building distributed systems (really, is it?!).

    "In a distributed system, the physical separation and isolation of processors linked by a communications network ensures that components fail independently."

    This makes me think about the systems I design. How do I model and design systems such that I encourage process and communication boundaries, so that I clearly understand the failure models? This is one reason separating the control plane from the data plane is so important: it's another way to wrap our minds around failure scenarios and how the system will react to them.

    Which model when?

    Although we should study all the models, each model “idealizes some dimension of real systems” and it’s prudent of us to understand how each system attribute impacts the feasibility or cost — the two important benchmarks described earlier.

  • Distributed Computing – Lesson 1 Summary

    Distributed Computing – Lesson 1 Summary

    Summary

    Distributed systems are everywhere: social media, internet of things, single server systems — all part of larger, distributed systems. But how do you define a distributed system? A distributed system is, according to Leslie Lamport (father of distributed computing), a system in which failure of some component (or node) impacts the larger system, rendering the whole system either unstable or unusable.

    A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable – Leslie Lamport

    Leslie Lamport. Source: https://www.heidelberg-laureate-forum.org/laureate/leslie-lamport.html

    To better understand distributed systems, we can model their behavior using nodes and messages. Nodes send messages to one another over unreliable communication links, in a 1:1 or 1:N fashion. Because the communication is assumed to be flaky, messages may never arrive, whether due to link or node failure. This simple model of nodes and messages, however, does not account for nodes recording events; for that, the model must be extended into a slightly more complex one that introduces the notion of "state" stored at each node.

    When evaluating distributed systems, we can and should leverage models. Without them, we'd have to rely on building prototypes and running experiments to validate our systems; that approach is often untenable due to lack of resources or the sheer scale of the system, which may require hundreds of thousands of nodes. When selecting a model, we want to ensure that it can accurately represent the problem and be used to analyze the solution. In other words, the model should have two qualities: accuracy and tractability.

    Using Models. Source: Georgia Tech

    Building distributed systems can be difficult for three reasons: asynchrony, failures, and consistency. Will the system assume messages are delivered within a fixed amount of time? Or will it be asynchronous, where messages can be arbitrarily delayed or lost? How will the system handle the different types of failures: total, grey, byzantine (misbehaving in ways that are hard to even detect)? How will the system remain consistent if we introduce caching or replication? Many of these concerns tie back to the 8 fallacies of distributed computing.

    CAP Theorem. Source: Georgia Tech


    Ultimately, when it comes to building reliable and robust systems, we cannot have our cake and eat it too: we can choose only two of the three properties of consistency, availability, and partition tolerance. This tradeoff is known as the CAP theorem (originally Brewer's conjecture, later formally proven for a specific model by Gilbert and Lynch). If you want availability and partition tolerance (e.g. Cassandra, DynamoDB), then you sacrifice consistency; similarly, if you want consistency and partition tolerance (e.g. MySQL, Megastore), you sacrifice availability.


  • Spring 2021: Distributed Computing

    Spring 2021: Distributed Computing

    Yes! I'm finally registered for the distributed computing course. This course is hot off the press: it's brand spanking new to the OMSCS program, offered for the first time this (Spring 2021) term. I've been eagerly waiting over two years for a course centered on distributed systems, one that combines both theory and practice.

    The course is taught by Professor Ada, who also teaches Graduate Introduction to Operating Systems, a course that received the highest accolades from previous students in their reviews over at OMSCentral. I bet this class will raise the bar, intellectually challenging and stimulating our minds. According to the syllabus, we'll bridge theory and practice by delivering 5 programming projects based on the University of Washington's Distributed Systems Labs, which are published in this repository: https://github.com/emichael/dslabs. The projects will require us students to:

    1. Build a simple ping/pong protocol
    2. Implement an exactly-once RPC protocol (reusing parts of lab 1)
    3. Design and implement a primary-backup protocol to learn fault tolerance
    4. Implement the famous Paxos
    5. Implement a key-value store using Paxos, plus a two-phase commit protocol

    I'm particularly interested in the last two projects: Paxos and the two-phase commit protocol. I first read the original Paxos paper about four years ago, when I joined Amazon, but never implemented Paxos myself; and I only learned about two-phase commit a month or two ago, while taking advanced operating systems last semester, and it's a topic I want to understand more deeply.

    I'm hoping that during (and after) this class, I'll be able to apply the principles of distributed computing to effectively analyze the reliability of existing and new services powering Amazon Web Services. Moreover, I'll incorporate the lessons learned into my mental model and refer to them as I design and build new systems.

    Let’s get cracking.