For this week, my distributed systems course just assigned us students a reading assignment: “Impossibility of distributed consensus with one faulty process“.
Apparently, this paper is a seminal piece of work that goes on to describe and prove that, given a single process failure within a distributed system, the underlying system cannot achieve consensus (i.e. the nodes cannot reach a state where they all agree on some proposed value). That’s right: not difficult, but impossible. Assuming that this theorem holds true, how the hell do we even build robust distributed systems that can tolerate failure? Surely there must be a way. Otherwise, how the hell do popular cloud services like Amazon S3 and Amazon DynamoDB guarantee high availability and consistency.
My gut tells me that the authors of the paper — Fischer, Lynch, Patterson — make strict assumptions about the underlying system. And if we can somehow relax these assumptions, then we can in fact build highly available distributed systems. But, I’ll find out soon since I’ll be reading both this paper as well as another seminal piece of work “Paxos made simple” by Leslie Lamport.
I’ll follow up on a separate post once I’ve read through the paper three times.