Some misconceptions about the CAP Theorem

Posted on September 29, 2010 by Tommy McGuire
Labels: theory, notation, protocols, quote, math
I finally broke down and read Eric Brewer's original "Towards Robust Distributed Systems" slides from PODC 2000 and Seth Gilbert and Nancy Lynch's "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services" from the ACM SIGACT News, 2002. I can now understand some of the hubbub around the use of this theorem. Unfortunately, I believe there are some misconceptions with the content of the paper.

Note: for people reading this blog, I apologize if I drop into full-on academic-ese. It's an occupational hazard; dogs, fleas, that sort of thing. Also, Gilbert and Lynch use the term "algorithm", while I prefer "network protocol". I have used both terms interchangeably.

In their abstract, Gilbert and Lynch describe their contribution:
When designing distributed web services, there are three properties that are commonly desired: consistency, availability, and partition tolerance. It is impossible to achieve all three. In this note, we prove this conjecture in the asynchronous network model, and then discuss solutions to this dilemma in the partially synchronous model.
Ultimately, the conjecture and theorem assert that any two of the properties of consistency, availability, and partition tolerance can be achieved in a system, which leads to the question of what those combinations mean. Daniel Abadi describes one difficulty with this theorem,
...as far as I can tell, there is no practical difference between CA systems [consistent and available, but not tolerant of partitions] and CP systems [consistent and tolerant of network partitions, but not available]. As noted above, CP systems give up availability only when there is a network partition. CA systems are “not tolerant of network partitions”. But what if there is a network partition? What does “not tolerant” mean? In practice, it means that they lose availability if there is a partition. Hence CP and CA are essentially identical. So in reality, there are only two types of systems: CP/CA and AP.

Jeff Darcy presents an interesting analysis of the CAP theorem in terms of quorum-based distributed systems (where node X holds the current version of the data), in which he says,


Unfortunately, in this analysis, CA and CP differ in which of the requests Y or Z is blocked; i.e. how the system becomes unavailable.

Indeed, in their paper Gilbert and Lynch admit to some similarity between CA and CP:
3.2.2 Atomic, Available [G&L's terms for CA]
If there are no partitions, it is clearly possible to provide atomic [consistent], available data. In fact, the centralized algorithm described in Section 3.2.1 [Atomic, Partition Tolerant; i.e. CP] meets these requirements.
Dan Weinreb wrote an excellent discussion of the definitions used by Gilbert and Lynch but, after reading the paper, I think the difficulties go deeper. However, before I can examine those difficulties in the formal contents of the paper, I must start by noting some of the definitions used by Gilbert and Lynch.

You're a daisy if you do

One difficulty many readers seem to have with the paper lies in the specific definitions of atomic, available, and partition tolerant used by Gilbert and Lynch. Consistency, the C in CAP, is defined in section 2.1 by Atomic Data Objects, as
...there must exist a total order on all operations such that each operation looks as if it were completed at a single instant. This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.
One note for anyone applying the CAP theorem to data storage systems, such as relational databases, NoSQL, ACID, BASE, pre- and post-relational data stores, object stores, etc., etc.: this, serial consistency criteria is significantly stronger than SQL's serializable consistency, or the commonly available, weaker criteria like read-committed and so forth. If you are trying to argue that, for example, an off-the-shelf cannot satisfy your availability requirements, then this proof does not directly help your argument.

Availability, the A, is defined in section 2.2 as
...every request received by a non-failing node in the system must result in a response.
Availability, in this paper, has a very specific, very technical meaning. In particular, a request which is not received by a non-failing node does not impact availability.

Finally, a network partition is defined in section 2.3 as
...the network will be allowed to lose arbitrarily many messages sent from one node to another....(And any pattern of message loss can be modeled as a temporary partition separating the communicating nodes at the exact instant the message is lost.)
Partition tolerance, the P, is strangely defined in terms of consistency and availability:
The atomicity requirement therefore implies that every response will be atomic, even though arbitrary messages sent as part of the algorithm might not be delivered. The availability requirement implies that every node receiving a request from a client must respond, even though arbitrary messages that are sent may be lost.
The proofs in the paper are based on two protocol models: asynchronous, in which
...there is no clock, and nodes must make decisions based only on the messages received and local computation;
and partially synchronous, in which
...every node has a clock, and all clocks increase at the same rate....Further, assume that every message is either delivered within a given, known time: tmsg, or it is lost.
Skin that smoke wagon

Before I look at the more important parts of the paper, I would like to examine perhaps the strongest, albeit minor, result from the paper,
Corollary 1.1 It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:
The proof of the corollary proceeds by contradiction and by noting that an asynchronous algorithm has no way of determining whether a message has been lost; therefore an algorithm which made those guarantees under an assumption that no messages were lost would also make those guarantees if messages were lost, contradicting theorem 1 of the paper.

This claim is very surprising in light of the fact that, assuming no messages are lost, it is relatively simple to exhibit a protocol which does guarantee availability and atomic consistency.

The proof of corollary 1.1 begins by assuming an algorithm A which makes those guarantees in all non-partitioned executions (in other words, no messages are lost). By theorem 1, there must be some execution α in which some response is not atomic (and presumably from which some message is lost), from which a prefix α' is extracted, up to the point where the non-atomic response is received (presumably including the message loss, which would cause the error). The execution prefix α' is extended to a complete execution α'' in which all messages are delivered eventually (including the message which was lost in execution α; in α'' this message was merely delayed). The execution α'' therefore contradicts the guarantees.

The problem with this proof, aside from its dizzying structure, seems to be that it is missing an analysis of a case: it proceeds by assuming that if A produces an erroneous execution, it does so by producing a non-atomic response. However, the protocol which I would exhibit (guaranteeing consistency and availability in non-failing executions) would violate availability upon a failing execution, not consistency. It would just halt, precluding the identification of the prefix α' and the construction of the contradictory execution α''.

I'm your huckleberry

The paper has two primary theorems,
Theorem 1 It is impossible in the asynchronous network model to implement a read/write data object that guarantees the following properties:
in all fair executions (including those in which messages are lost).
and a corresponding theorem 2, for partially synchronous networks
Theorem 2 It is impossible in the partially synchronous network model to implement a read/write data object that guarantees the following properties:
in all executions (even those in which messages are lost),
The proofs of both of these theorems presented by Gilbert and Lynch are very similar: an algorithm A is assumed to exist which satisfies all of the properties. Then, an execution of A is constructed that includes an inconsistent response, which contradicts the original assumptions. In both cases, the network consists of two disjoint, non-empty sets of nodes, G1 and G2, which are further partitioned so that no communication is possible between G1 and G2. The constructed execution involves a write to G1 followed by a read from G2, which cannot be consistent with the write.

Unlike corollary 1.1, I do not see any particular problem with these proofs as far as they go, but I do believe the result has been overstated. In the conclusion, Gilbert and Lynch say,
In this note, we have shown that it is impossible to reliably provide atomic, consistent data when there are partitions in the network. It is feasible, however, to achieve any two of the three properties: consistency, availability, and partition tolerance.
The first sentence adequately describes the theorems in the paper, but it does not imply the second. Specifically, both of the proofs assume availability as a priority requirement and make use of a partition that leaves the system unable to maintain consistency.

I find these proofs to be unsatisfactory for a number of reasons. The definitions and proofs, as given, are excessively complex and result in frequent misunderstandings. Further, the goal as stated in the abstract and conclusion is not satisfied. I believe the goal can be satisfied and that doing so is important. Finally, I believe a clearer presentation would open up several further avenues into the capabilities of distributed systems.

You may indeed, if you get lucky.  

I intend to provide an alternative proof (or, to be more correct, a proof sketch) of the CAP theorem. This proof will proceed in three parts: first, a demonstration that a consistent and available protocol cannot withstand partitions; second, that a consistent and partition tolerant protocol cannot guarantee availability; third, that an available and partition tolerant protocol cannot guarantee consistency. Before I can get to the proofs, I must revisit the definitions.

The execution model is the same as Gilbert and Lynch's partially synchronous model: all nodes have a clock which may not show the same time as any other node, but which advances at the same rate. This model is more general than the asynchronous model and is more useful.

A distributed storage system is made up of n nodes, labeled \(M_0\) to \(M_{n-1}\), where \(n > 0\). This storage system maintains a single value, an integer.

There are two possible messages which can be sent to a node \(M_i\): read, requesting the current value, and write, updating the current value. There are two possible response messages of \(M_i\) to a request: value containing the current value in response to a read request, and ack, indicating that the write request has updated the value. I explicitly do not include an error response; such a response would be exactly identical to no response at all in terms of violating availability. (In practice, such a response would be a requirement.)

If a protocol guarantees consistency, then if a request message is received by any node \(M_i\), the result of that operation is based on the latest value immediately prior to that operation. A read request results in the most recent value; a write request overwrites the most recent value with the new value. In order to demonstrate this consistency, I associate a witness with the value maintained by the distributed data store such that the witness begins at 0 with the value at its initial state, and is incremented by 1 every time the value is read or written. As a result, a read request results in the most recent value and a witness value \(w+1\) where \(w\) is the witness associated with the immediately previous operation; likewise, a write request results in an ack and a witness value \(w+1\) where \(w\) is the witness associated with the immediately previous operation. I am maintaining the Gilbert and Lynch's requirement for serial consistency.

If a protocol guarantees availability, then if a node \(M_i\) receives a request, it sends a response within some bounded period of time; a read request results in a value and a write request results in an ack. An unbounded delay results in an availability failure, as would some hypothetical error response. The actual bound must be protocol dependent. What is explicitly disallowed is a delay until the partition is healed.

If a protocol guarantees partition tolerance, then the loss of one or more messages does not invalidate any of the other guarantees of the protocol.

Theorem 1: \(CA \rightarrow \neg P\)

Assume that a distributed storage protocol over nodes \(M_0, \ldots, M_{n-1}\) guarantees consistency and availability. Such a protocol cannot guarantee partition tolerance. The proof of this theorem follows the same lines as the proofs of Gilbert and Lynch.

Assume that the total cluster of nodes is divided into two, non-overlapping, non-empty sets \(M_0,\ldots,M_i\) and \(M_{i+1}, \ldots ,M_{n-1}\), that the current witness value is \(w\), and that all messages from members of the first set do not reach any member of the second set, and vice versa. Further, assume that a request is received by \(M_j\), one of \(M_0, \ldots, M_i\), which responds appropriately with a witness \(w+1\) and that the protocol is otherwise quiescent. If the next request is received by \(M_k\), one of \(M_{i+1}, \ldots, M_{n-1}\), after \(M_j\) responds to the first request, then \(M_k\) should respond appropriately with a witness \(w+2\) by virtue of its availability and consistency. However, it cannot do so consistently since it does not have the most recent version of the value and does not know that the witness has been incremented. As a result, this protocol cannot tolerate the partition without invalidating some other guarantee.

Theorem 2: \(CP \rightarrow \neg A\)

Assume that a distributed storage protocol over nodes \(M_0,\ldots, M_{n-1}\) guarantees consistency and partition tolerance. Such a protocol cannot guarantee availability.

Assume that the total cluster of nodes is divided into two, non-overlapping, non-empty sets \(M_0, \ldots, M_i\) and \(M_{i+1}, \ldots, M_{n-1}\), that this cluster is otherwise quiescent, and that all messages from members of the first set do not reach any member of the second set, and vice versa. If some node \(M_j\) in \(M_0,\ldots,M_i\) receives a request and some node \(M_k\) in \(M_{i+1},\ldots,M_{n-1}\) receives another request then one of the responses should have a witness of \(w+1\) and the other should have a witness of \(w+2\). However, because there is no communication between the two sets, they will not be able to agree on the witnesses; the set which should assign witness \(w+2\) can have no idea that witness \(w+1\) has been assigned. In order to avoid inconsistency, one or the other of \(M_j\) and \(M_k\), or both, must not generate any response, therefore violating availability.

Theorem 3: \(AP \rightarrow \neg C\)

Assume that a distributed storage protocol over nodes \(M_0,\ldots,M_{n-1}\) guarantees availability and partition tolerance. Such a protocol cannot guarantee consistency.

Assume that the total cluster of nodes is divided into two, non-overlapping, non-empty sets \(M_0,\ldots,M_i\) and \(M_{i+1},\ldots,M_{n-1}\), and that all messages from members of the first set do not reach any member of the second set, and vice versa. If some node \(M_j\) in \(M_0,\ldots,M_i\) receives a request and some node \(M_k\) in \(M_{i+1},\ldots,M_{n-1}\) receives another request then one of the responses should have a witness of \(w+1\) and the other should have a witness of \(w+2\). However, because there is no communication between the two sets, they will not be able to agree on the witnesses; the set which should assign witness \(w+2\) can have no idea that witness \(w+1\) has been assigned. In order for the cluster to preserve availability, both \(M_k\) and \(M_j\) must generate responses in some bounded time while the duration of the partition may not be bounded. In these responses, both witnesses will have some arbitrary value (such as both having \(w+1\)). In order to preserve availability, therefore, this protocol must violate consistency.

Theorem 4: CAP

A distributed storage protocol can only guarantee two of consistency, availability, and partition tolerance.

Theorem 4 follows from theorems 1, 2, and 3.

Maybe poker's just not your game. I know: let's have a spelling contest

I feel some trepidation in writing this section, since in the previous I have presented only informal proof sketches myself. However, I feel it needs to be said, even if I do not currently have the tools to follow my own advice.

With the exception of corollary 1.1, I do not have any specific issues with the proofs from Gilbert and Lynch. In fact, the proof sketches I have presented above are not materially different from their proofs. However, the primary general problem I have with their paper, and the causative factor I see with corollary 1.1, is that none of their proofs are in any way formal. Let me emphasize that, for anyone who has ever referred to "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services" as a formal proof of the CAP theorem:


This paper is in no way formal.
There are no formal proofs here.
The parrot is dead.


If Gilbert and Lynch had attempted a formal proof, the weakness of corollary 1.1 would have been evident. In my opinion, the attempt at formalizing the proofs of the two theorems would have resulted in definitions of consistent, available, and partition tolerant much clearer than those actually presented. And further, the overall failure to demonstrate the CAP theorem, rather than a subset of it, should have been clear.


It appears my hypocrisy knows no bounds


Now that I have presented what I believe to be a better proof of the CAP theorem, what have I gained? One question anyone might ask is, if you weaken one or more of the two guarantees, can you get back some of the third? I (and presumably everyone else) believe so. And with theorems 1, 2, and 3, I have some guidance in doing so.


Theorem 5: \(\tilde{C}P \wedge \tilde{A}\)


Assume that a distributed storage protocol over nodes \(M_0,\ldots,M_{n-1}\) guarantees write serial consistency and partition tolerance. Such a protocol can guarantee read availability.


Define write serial consistency by associating a witness with the value maintained by the distributed data store such that the witness begins at 0 with the value at its initial state and is incremented by 1 every time the value is written. A read request results in the most recent value and current witness value. A write request results in an ack and a witness value \(w+1\) where \(w\) is the witness associated with the immediately previous operation.


Define read availability such that if a node \(M_i\) receives a read request, it sends a response within some bounded period of time. No guarantee is made about write requests.


Suppose that the total cluster of nodes is divided into two, non-overlapping, non-empty sets \(M_0,\ldots,M_i\) and \(M_{i+1},\ldots,M_{n-1}\), that the current witness value is \(w\), and that all messages from members of the first set do not reach any member of the second set, and vice versa. Further, suppose that a request is received by \(M_j\), one of \(M_0,\ldots,M_i\).  If the request is a write, assume that an error is generated, it is delayed until the partition is healed, or that it is simply ignored. If the request is a read, assume that \(M_j\) responds with the value and witness \(w\). This protocol can support all three of write serial consistency, read availability, and partition tolerance.

You called down the thunder

In this post I have pointed out some issues that I believe exist in Gilbert and Lynch's paper and presented what I believe to be improvements. I have also demonstrated that, while it is impossible to achieve all three of (serial) consistency, (absolute) availability, and (unconditional) partition tolerance, it is possible to achieve two. Also, by weakening one or both, it is possible to partially approach the third.

Doc Holliday: What did you ever want?
Wyatt Earp: Just to live a normal life.
Doc Holliday: There's no normal life, Wyatt, it's just life. Get on with it.
Wyatt Earp: Don't know how.

Comments



I highly recommend http://danweinreb.org/blog/what-does-the-proof-of-the-cap-theorem-mean which has a similar theme. People think Gilbert and Lynch proved Brewer's theorem, but they really proved something subtly different and generally less applicable to the contexts in which CAP is mentioned.

Jeff Darcy
2010-10-01T09:59:21.725-05:00'

In short, it is stupid to map one model to another model without justification. Mathematics is a great way to do this. What mathematical tools do we have at our disposal?

Likewise, this question can be asked (and has been asked, and is being asked) at the language-level. Hewitt's ActorScript ideas are important here. So is the blame calculus, since static languages interfacing with dynamic ones is really just a way to define partitions. Wadler has linked to an explanation of the CAP Theorem on his blog before, but the description he picked wasn't that great.

I also think most people misunderstand the differences between agent-based computing and Basic Stuff Required For the Critical Path of a Business(TM). Computing with agents is great, and can be used to reduce costs and/or service customers faster/better/cheaper. But it is not a replacement for underlying social traditions and the expectations of members of society. Peter Norvig said in Coders At Work that many of Google's unit tests don't have concrete internal pass/fail checks, and the test is more on a continuum. I can imagine that for a database search engine where half a second increase in latency drops traffic by one-fifth! Better to give suboptimal search results than to have users use Bing instead; all that matters is the results are accurate enough in relation to some function that measures AdWords money generation for Google. The only flaw in such unit tests is they fundamentally can't model second-order effects.

Building programming models based on societal traditios accumulated since the stone age is why Hewitt detoured so heavily into theories of organizations, because he believed mathematical models should reflect the real world -- he was especially critical of petri nets modeling due to its inability to model real-world processes.

Jeff, I had missed that particular post by Dan Weinreb, although I have a browser tab open on another one of his posts. I'll add a link to it.

I am tempted to say, "if I'd seen that, I wouldn't have spent my three-day weekend writing my post," but that is not precisely true. The real irritant for me was corollary 1.1, which was like changing from fifth gear to first at 60mph on my old Kawasaki: I didn't necessarily know what, but I knew something was very wrong. Only after that did I really dig in and notice that the paper did not actually prove what it claims to have proven.

It is true that the CAP theorem is both subtly different from what most people would believe and also not directly applicable to some of the contexts in which it is used, but I would go further and say that Gilbert and Lynch have only partially proved the CAP theorem, even given their somewhat convoluted definitions.

Tommy McGuire
2010-10-05T17:23:02.418-05:00'

Z-Bo, it's good to hear from you!

I'm afraid I do not really understand your comments. Could you elaborate?

Tommy McGuire

1. Phil Wadler has been developing a distributed programming language called Links that allows programmers to treat n-tier applications as a single tier, and then dynamically split the application across those tiers. He has a number of papers on this, coauthored with his Ph.D. students, such as The Essence of Form Abstraction. One of the claims Form Abstraction via applicative functors makes is that if your programs are well-typed then your forms have proper encapsulation. At the same time, he has also developed the Blame Calculus with Jeremy Siek. Traditionally, one might view the problems BC solves strictly in terms of static and dynamic types. However, these are generally a continuum in any real-world system. When you have partition barriers, you have coordination that must be resolved. You can bake this coordination into the type system and create a static type system for distributed computing. You can also create a dynamic type system and provide a well-defined protocol for how types are dynamically sealed. Saying well-typed systems can't be blamed is really just saying that the type system is insulated from the outside world. Many real-world systems need to model situations where the types of entities can change dynamically, and engineers currently have two approaches for dealing with this problem:

(a) Create an auditable document that says what is going to be changed and have everybody sign off on it, to keep track of what is changed. Modern "Fully Integrated Data Environments" can automate the generation of the audit document and handle correctness of the upgrade modulo some ad-hoc scripts, but they cost a lot of money since they're complicated systems that only commercial vendors would build. Changes still have to be made in the application code. For systems in the wild, CRMs and ERPs companies buy might require consultants of the CRM/ERP solution to come in and tell them if a change is even possible (and even then they are just guessing). You can track usage of schema elements via network profiling, but that doesn't explicitly tell you "this code is only called once a year". Some processes can run for over a year without a change.

(b) Develop something like Ruby on Rails with lots of reflection and dynamic typing. This doesn't really address the issues raised by (a) directly since RoR is a not a robust dynamically reflective system. But the spirit of the idea is captured if we decouple the main ideas in RoR. For one, RoR could define schemas external to the application itself rather than directly reflect on classes. Classes could download the schema from an external source, and the applications knowledge of the schema would have to explicitly model the barrier between the application and the database, and the fact the database schema could change. This sort of dynamic dependency management is non-existent in real-world systems today, because we are stuck manipulating symbolic references. RoR has no way to tombstone itself and re-birth itself like a Phoenix.

John "Z-Bo" Zabroski
2010-10-06

2. Hewitt has spent most of his academic career advocating the Actor Model of computation. His argument was the Actor Model was a natural model of computation and was expressive enough to capture all computation. This therefore includes schematic evolution of a system. Actor systems have to be constructible that can survive problems like schema (type) mapping failures. Actor systems must also be able to solve problems like modeling problems with non-serializable subgoals. Hewitt's views on these problems are different from Wadler's.

They are two sides of the same coin. One looks at the problem in terms of type systems, the other in terms of (concurrent, distributed) logic programming.

W.r.t. Google: Google has released reports that half a second response latency drops their traffic by a fifth. If that half a second bottleneck is spent inside the search engine trying to find the "perfect" result set, then it is pointless from a business perspective. It is much better from the perspective of a statistician to create a model for AdWords independent of the model for latency/traffic and try to derive a trade-off function that can dynamically decide how good the results are vs. how much money inferior results might generate via AdWords clicks anyway.

Corporations, on the other hand, have regulatory compliances they must adhere to for certain services. If they provide you with a draft of hospital charges for inpatient surgery, and it is approved by your insurance, they can't just randomly change that. The results of "Why did we charge this patient that?" have to be deterministic and point to a very specific, final answer.

The principle trade-off for large enterprises is that they have many stovepipe applications that are dependent on one another, and enterprises lose money when a point of service or call center person has to look up why a customer's account is not showing the fact they paid a bill already. That 20 minute walk to the filing cabinet or whatever costs money, and it adds up. So decisions regarding CAP have to be driven by business requirements. Languages should allow people to reason about these trade-offs at a high-level.

John "Z-Bo" Zabroski
2010-10-06
active directory applied formal logic ashurbanipal authentication books c c++ comics conference continuations coq data structure digital humanities Dijkstra eclipse virgo electronics emacs goodreads haskell http java job Knuth ldap link linux lisp math naming nimrod notation OpenAM osgi parsing pony programming language protocols python quote R random REST ruby rust SAML scala scheme shell software development system administration theory tip toy problems unix vmware yeti
Member of The Internet Defense League
Site proudly generated by Hakyll.