Distributed system snapshots: consistent vs inconsistent cuts

February 13, 2021 | minutes read

In “Consistent Global States of Distributed Systems: Fundamental Concepts and Mechanisms”, the authors propose capturing a distributed system’s computation using a time series graph. Each row in the graph represents a process (e.g. P1, P2, P3), and each tick (e.g. e1, e2) within that row represent a an event: a local event, a send message event, a receive message event. For example, looking at the figure above, P1’s event is a local event but P1’s second event represents the reception of the message sent from P2.

Snapshot of distributed system using cuts

Now, say we wanted to take a snapshot in time of all the processes. How do we model the system’s state? Well, the authors describe a technique they call cuts. Essentially, a cut is a tuple, each element signifying each process’s last event (i.e. called a frontier) that should be considered part of the snapshot state.

Cuts of a distributed computation

In the figure above, two cuts exist. The first cut is (5, 2, 4), the second cut (3,2,6). For the first cut, we include P1’s events up to (and including) event 5. For the same cut, we include P2’s events up to (and including) event 2.

Consistent versus Inconsistent

Now, there are two types of cuts. Consistent and inconsistent cuts. Returning back to the figure above, the first cut (C) is considered “consistent” while the latter (C’) “inconsistent”.

Why?

A consistent, according to the authors, can be defined using mathematical notation.

Definition of consistent cut

If that’s not clear, they offer another explanation: “In other words, a consistent cut is left closed under the causal precedence relation.”

Inconsistent cut in layman terms

Okay, what does this really mean? I still find it a bit confusing, so let me try to form it in my own words.

If an event (event_x) within tuple was caused by another event (event_y), then the causal event (event_y) must be included within the tuple. As an example, take a look back at the figure above. P3’s event 6 was caused by P1’s event 5. However, the cut’s frontier event for process 1 is event 3, failing to include P1’s event 5. Therefore, this cut can be classified as inconsistent.

 

 

 

I’m Matt Chung. I’m a software engineer, seasoned technology leader, and father currently based in Seattle and London. I love to share what I know. I write about topic developing scalable & fail-safe software running in the AWS cloud, digital organization as a mechanism for unlocking your creativity, and maximizing our full potentials with personal development habits.

View all articles