Tag: paxos

  • PAXOS made moderately complex – slots

    PAXOS made moderately complex – slots

In the paper “PAXOS made moderately complex”, the authors introduce concepts not mentioned in the original PAXOS paper: slots, slot in, slot out, and WINDOW. I found these concepts difficult to understand despite reading both the accompanying pseudocode and their Python implementation.

    This post aims to shed light on these concepts and to explain how the following invariant holds:

    R5: A replica proposes commands only for slots for which it knows the configuration: slot_in < slot_out + WINDOW

    The rest of the post focuses on the replica only. We will not discuss the actual consensus algorithm, also known as SYNOD; that topic will be covered in a separate post.

    SLOTS Overview

    • Slots are occupied by decisions
    • SLOT_IN points to the next “free” (unoccupied) slot
    • SLOT_OUT points to the next decision to be executed
    • SLOT_IN advances by 1 with every new decision received from the leaders
    • SLOT_OUT advances by 1 with every decision the replica executes
    • WINDOW caps the number of slots that may be buffered: SLOT_IN may run at most WINDOW slots ahead of SLOT_OUT
    • SLOT_OUT trails behind SLOT_IN as more decisions flow into the replica
    • SLOT_OUT == SLOT_IN means the replica has executed all of its commands
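
    The bookkeeping above can be condensed into a minimal sketch, loosely following the paper’s Python pseudocode; the class and method names here are simplified placeholders, not the paper’s actual code:

    ```python
    WINDOW = 5  # max number of outstanding (proposed-but-unexecuted) slots


    class Replica:
        """Minimal sketch of a replica's slot bookkeeping (simplified names)."""

        def __init__(self):
            self.slot_in = 1     # next "free" slot to propose into
            self.slot_out = 1    # next decision to execute
            self.proposals = {}  # slot -> proposed command
            self.decisions = {}  # slot -> decided command
            self.executed = []   # commands applied to the state machine

        def propose(self, command):
            # Invariant R5: propose only while slot_in < slot_out + WINDOW
            if self.slot_in < self.slot_out + WINDOW:
                self.proposals[self.slot_in] = command
                self.slot_in += 1
                return True
            return False  # window is full; the request must wait

        def on_decision(self, slot, command):
            self.decisions[slot] = command
            # Perform phase: execute decisions in slot order, advancing slot_out
            while self.slot_out in self.decisions:
                self.executed.append(self.decisions[self.slot_out])
                self.slot_out += 1
    ```

    Note how the R5 check in propose() is what keeps slot_in from running more than WINDOW slots ahead of slot_out: once five proposals are outstanding, further proposals are refused until a decision is executed.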
    Replica Initial state
    Figure 1 – Imagine a WINDOW of 5. Initially, slot_in and slot_out both point to slot 1


    Then, as the replica receives requests from clients, it sends proposals to the leaders. The leaders, as part of their consensus algorithm, respond with decision messages, each decision tied to a particular slot position. We won’t discuss the consensus algorithm as part of this post (but perhaps in another).

    Figure 2 – Replica sending a proposal for slot 1


    Then, as the replica receives decisions, it will fill the slot accordingly. Every time a slot fills, the slot in pointer advances by one.

    Figure 3 – As replica receives decisions, it inserts into its buffer, advancing slot in by 1

    Next is the perform phase: the replica executes the decision, that is, its associated command, and advances the slot out index by 1.

    But first, after receiving a decision, the replica fills in the slot associated with that particular decision.

    Figure 4 – Replica receives a decision and fills in the particular slot associated with it

    Then, for illustrative purposes, imagine that the replica sends out another proposal (not shown in the figure below), advancing slot in by 1, and then fills in the second slot.

    Figure 5 – Slot IN points to index 3 while Slot OUT still points to slot 1

    Then, during the perform phase, the replica advances the slot out pointer.

    Figure 6 – As the replica executes commands in the slot, the slot OUT index advances. The slot, previously yellow, is now green, symbolizing that the occupying command has been executed


    Finally, in this example, the replica executes the next command, advancing SLOT OUT which now points to the same (unoccupied) slot as SLOT IN.

    Figure 7 – Replica executes its second outstanding decision, advancing SLOT OUT by 1, which now points to the same (unoccupied) slot as SLOT IN
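
    The walkthrough in Figures 1 through 7 can be condensed into a pointer trace, a hypothetical example assuming WINDOW = 5:

    ```python
    # Trace of (slot_in, slot_out) through the figures above
    trace = []
    slot_in, slot_out = 1, 1           # Figure 1: initial state
    trace.append((slot_in, slot_out))

    slot_in += 1                       # Figures 2-4: propose + decide for slot 1
    trace.append((slot_in, slot_out))  # (2, 1)

    slot_in += 1                       # Figure 5: second proposal fills slot 2
    trace.append((slot_in, slot_out))  # (3, 1)

    slot_out += 1                      # Figure 6: execute the decision in slot 1
    trace.append((slot_in, slot_out))  # (3, 2)

    slot_out += 1                      # Figure 7: execute the decision in slot 2
    trace.append((slot_in, slot_out))  # (3, 3): slot_out has caught up to slot_in
    ```

    The final state, slot_out == slot_in, is exactly the “replica has executed all of its commands” condition from the overview.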

    Summary

    WORK IN PROGRESS

  • PAXOS – I’m coming for you!

    PAXOS – I’m coming for you!

    I’m now halfway through the Distributed Computing course at Georgia Tech, and we students are now tackling the penultimate project: building a replicated state machine using PAXOS. This project will be challenging (probably requiring 40+ hours); it’ll put my theoretical knowledge to the test, and in a couple of weeks I’ll reflect back on how much I learned.

    Presently, here’s my little understanding of the consensus algorithm:

    Current PAXOS understanding

    • Servers play different roles – Proposer, Acceptor, Learner
    • Proposers send proposals with monotonically increasing proposal numbers
    • Proposals are accepted if and only if a majority of acceptors accept them
    • The 2PC (two-phase commit) protocol essentially tells us whether a particular transaction is committed or aborted
    • Guaranteeing linearizability means that, from the client’s perspective, real time (i.e. wall-clock time) is respected and the client views the system as if there were a single replica
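
    The majority-quorum rule behind “a majority of acceptors accept them” (floor(n/2) + 1, per the references below) is simple to compute; a quick sketch:

    ```python
    def majority_quorum(n: int) -> int:
        """Smallest number of acceptors forming a majority: floor(n/2) + 1."""
        return n // 2 + 1

    # With 3, 5, and 10 acceptors, a proposal needs 2, 3, and 6 acceptances
    # respectively. Any two majorities always overlap in at least one acceptor,
    # which is what lets a later proposer learn about earlier accepted values.
    ```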

    Future PAXOS understanding

    • How exactly PAXOS guarantees consensus via its two-phase protocol
    • How does a server determine its role (or does it play multiple roles)?
    • How to handle edge cases (say, two proposals arriving at the same time)?
    • What role does a client play? Does it serve as a proposer?
    • How does leader election work in PAXOS?
    • Should I just try to mimic the Python-based code described in “Paxos made moderately complex”?
    • How will replication work as the number of nodes in the system scales (say from 3 to 5 to 10)?
    • How to detect (and perhaps avoid) split brain (i.e. multiple leaders)?

    References

    1. A majority quorum can be defined as floor(n/2) + 1
    2. Python implementation described at https://www.cs.cornell.edu/courses/cs7412/2011sp/paxos.pdf
  • Distributed Computing – Lesson 1 Summary

    Distributed Computing – Lesson 1 Summary

    Summary

    Distributed systems are everywhere: social media, the internet of things, even single-server systems are part of larger, distributed systems. But how do you define a distributed system? According to Leslie Lamport (the father of distributed computing), it is a system in which the failure of some component (or node) impacts the larger system, rendering the whole system unstable or unusable.

    A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable – Leslie Lamport

    Leslie Lamport. Source: https://www.heidelberg-laureate-forum.org/laureate/leslie-lamport.html

    To better understand distributed systems, we can model their behavior using nodes and messages. Nodes send messages to one another over unreliable communication links, in a 1:1 or 1:N fashion. Because communication is assumed to be flaky, messages may never arrive; nodes themselves may also fail. This simple model of nodes and messages, however, does not account for nodes recording events; for this, the model must be extended into a slightly more complex one that introduces the concept of “state” stored at each node.

    When evaluating distributed systems, we can and should leverage models. Without them, we’d have to rely on building prototypes and running experiments to validate our system; this approach is untenable due to issues such as lack of resources or the sheer scale of the system, which may require hundreds of thousands of nodes. When selecting a model, we want to ensure that it can accurately represent the problem and be used to analyze the solution. In other words, the model should hold two qualities: accuracy and tractability.

    Using Models. Source: Georgia Tech

    Building distributed systems is difficult for three reasons: asynchrony, failures, and consistency. Will a system assume messages are delivered within a fixed amount of time? Or will it be asynchronous, potentially losing messages? How will the system handle the different types of failures: total, gray, or byzantine (a misbehaving node that cannot easily be detected)? How will the system remain consistent if we introduce caching or replication? All of these concerns relate to the well-known eight fallacies of distributed computing.

    CAP Theorem. Source: Georgia Tech


    Ultimately, when it comes to building reliable and robust systems, we cannot have our cake and eat it too: choose 2 out of the 3 properties of consistency, availability, and partition tolerance. This tradeoff is known as the CAP theorem (conjectured by Brewer and later formally proven by Gilbert and Lynch). If you want availability and partition tolerance (e.g. Cassandra, DynamoDB), then you sacrifice consistency; similarly, if you want consistency and partition tolerance (e.g. MySQL, Megastore), you sacrifice availability.


  • Spring 2021: Distributed Computing

    Spring 2021: Distributed Computing

    Yes! I’m finally registered for the distributed computing course. This course is hot off the press: brand spanking new to the OMSCS program and offered for the first time this (Spring 2021) term. I’ve been eagerly waiting over two years for a course centered on distributed systems, combining both theory and practice.

    The course is taught by Professor Ada, who also teaches Graduate Introduction to Operating Systems, a course that received the highest accolades in student reviews over at OMSCentral. I bet this class will raise the bar, intellectually challenging and stimulating our minds. According to the course syllabus, we’ll bridge theory and practice by delivering 5 programming projects, with assignments based on the University of Washington’s Distributed Systems Labs, published in this repository: https://github.com/emichael/dslabs . The projects will require us students to:

    1. Build a simple ping/pong protocol
    2. Implement an exactly-once RPC protocol (parts of lab 1 reused)
    3. Design and implement a primary-backup protocol to teach fault-tolerance
    4. Implement the famous Paxos
    5. Implement a key-value store using Paxos and implement a two-phase commit protocol

    I’m particularly interested in the last two projects: Paxos and the two-phase commit protocol. I first read the original Paxos paper about four years ago, when I joined Amazon, but never implemented Paxos myself; and I only learned about two-phase commit a month or two ago, while taking Advanced Operating Systems last semester. It’s a topic I want to understand more deeply.

    I’m hoping that during (and after) this class I’ll be able to apply the principles of distributed computing to effectively analyze the reliability of existing and new services powering Amazon Web Services. Moreover, I’ll incorporate the lessons learned into my mental model and refer to them as I design and build new systems.

    Let’s get cracking.