The Logical Tax — Consistency is a Loan You Repay in Round Trips
The Logical Taxes
The checkout service’s payment path was blocking for 115ms on every transaction. Not a network degradation — a consistency tax. Cross-region strict serializability with a 100ms inter-region RTT requires at minimum two to three synchronous coordination hops. Each hop costs at minimum one quorum write per shard, plus 2PC phases to commit atomically across both. Those hops happen unconditionally, at every offered load, whether the network is healthy or degraded.
The latency floor is not a bug report. It is a receipt.
Consistency is a loan taken out against latency. The guarantee is the principal — linearizability, serializability, or causal. The interest rate is the Round-Trip Time ( RTT ) price: one per shard for single-shard strict serializability, three minimum for cross-shard, zero for eventual. Every operation at a given consistency level accrues that level’s interest on every call — not occasionally, not under load, but unconditionally, because the interest is the definition of the guarantee.
Engineers who borrow more consistency than the application requires pay interest on a loan the application never needed. Engineers who borrow less suffer a different penalty: the debt is deferred, not cancelled. It surfaces as data corruption, stale reads with no staleness indicator, and split-brain states that trigger no alert.
The Physics Tax priced the hardware-determined costs — physical coherency coefficient , scalability bound , tail-latency fan-out. That measured is a single number: the coherency overhead observed during a controlled load test of the full running system, hardware and protocol combined. This post decomposes that number. The physical floor is set by hardware: NIC interrupt latency, memory bus bandwidth, cache coherency protocol. The logical overhead sits above that floor, set entirely by the consensus protocol’s design choices — quorum rounds, leader pipelining, cross-shard coordination. A measured fit from a load test captures both simultaneously; this post gives each component a separate name so they can be changed independently. This post covers a different category: costs determined not by hardware but by protocol design choices about what “agreement” means. Two logical taxes contract the Pareto frontier in the consistency-latency dimension — the round-trip cost each consistency level imposes, and the logical coherency floor each consensus implementation sets above . Let denote the P99 inter-node round-trip latency — the unit in which every guarantee in this post is denominated, and the interest rate on the consistency loan.
| Logical Tax | What You Pay | Design Consequence |
|---|---|---|
| Consistency spectrum | +1 RTT per write for linearizability; metadata for causal; 2PC phases for cross-shard | Each step up the spectrum adds a measurable latency floor denominated in |
| Consensus protocol | Logical Coherency Tax ( ) set by protocol implementation operating above the physical hardware floor ( ); cross-region bounded by speed of light | Protocol choice sets the Protocol Ceiling — and the write throughput limit — via |
The Consistency Tax — What Each Level Costs
Every row in the consistency price list is denominated in multiples of . Borrowing more consistency than the application requires — strict serializability for a display counter, linearizable reads for a cache warm path — is over-paying: the interest accrues on every operation regardless of whether the application ever exercises the guarantee. Borrowing less than required — eventual consistency for a payment debit, causal consistency for a seat reservation — is under-paying: the penalty is deferred until a correctness failure surfaces in production, where it arrives as data corruption, stale reads with no staleness indicator, or split-brain states that trigger no alert.
The consistency spectrum from The Impossibility Tax ( Definition 5 ) is not an aesthetic preference. Each level carries a measurable cost — in write latency, in metadata overhead, or in both. The Pareto frontier in the consistency-latency dimension has a price at every point.
Definition 14 (Consistency Tax Function). For a consistency level in the Viotti-Vukolic partial order, the consistency tax at level is the pair where is the added write latency and is the added per-message metadata overhead, both relative to eventual consistency as baseline .
Definition 14 -- Consistency Tax Function: the pair of added write latency and metadata overhead at each consistency level, relative to eventual consistency as baseline
Axiom: Definition 14: Consistency Tax Function
Formal Constraint: For a consistency level in the Viotti-Vukolic partial order, the consistency tax at level is the pair where is the added write latency and is the added per-message metadata overhead, both relative to eventual consistency as baseline :
Engineering Translation: Every step up the consistency spectrum from eventual to strict serializability adds either latency (synchronous round-trips for coordination) or metadata (vector clocks, version stamps, tombstones for conflict detection) or both. Moving up the spectrum is movement along the frontier in the latency-consistency dimension: you gain ordering guarantees, you pay in round-trips or bandwidth. The tax function makes each cost explicit and measurable rather than implicit in protocol documentation.
Physical translation. The consistency tax function converts the qualitative consistency spectrum into denominated engineering costs: moving from eventual to read-your-writes adds causal dependency tracking (metadata cost); moving to strict serializability adds a quorum round-trip per write (latency cost). Both are measurable in your load test and must appear in the birth certificate’s Assumed Constraints, not in a qualitative architecture comment.
Cross-series numbering reference — Definitions and Propositions from prior posts
Note: the series uses a continuous numbering scheme across posts. Definitions 1–9 and Propositions 1–6 appear in The Impossibility Tax. Propositions 7, 7a (Coherency Domain Decomposition — USL extension for skewed loads), 8, and 9 (Coordinated Omission Bias) and Definitions 10–13 appear in The Physics Tax. This post introduces Proposition 10, Proposition 10a (Metastable Recovery Stability — a corollary of the load-shedding stability analysis), Definition 14 (Consistency Tax Function), Definition 15 (Logical Coherency Tax), and Definition 16 (Protocol Operability Cognitive Load). throughout this post denotes the hardware coherency floor from Definition 11 — an irreducible physical constant, measured in a perf lab load test and fixed by hardware topology. (Definition 15) is the logical protocol overhead above — a distinct, protocol-determined cost layer, not a refinement of the measured value. The combined form throughout this post is the sum of both independent components, and fits of the combined term produce the full coherency overhead of a running protocol implementation.
Proposition 10 (Consistency Ordering). Each step up the consistency partial order from Definition 6 toward strict serializability increases write coordination requirements by at least one synchronous round-trip or increases per-message metadata overhead by at least :
Before applying Proposition 10 to specific protocols, four recurring mechanisms need names — they appear throughout the rest of this post and in the tax table that follows.
Optimistic conflict checking: execute a transaction without locking any rows, then validate at commit that nothing was concurrently modified — low overhead when contention is rare, rising abort rate when multiple writers compete for the same row.
Pessimistic locking: acquire every lock the transaction will need before releasing any — prevents write-skew and phantom-read anomalies, but any two transactions that each hold a lock the other needs create a deadlock that the database must detect and abort.
Versioned snapshots: give each reader a frozen snapshot of the database at transaction start time, so reads never block in-flight writes — write-write conflicts still cause one of the writers to abort.
Conflict-free merge structures: data structures whose merge operation is commutative, associative, and idempotent so that any two replicas can be merged in any order with the same result. The shopping cart intuition: two regions each add items to a cart during a network partition — add(milk) on US-East, add(bread) on EU-West — and merging the two states always produces {milk, bread} regardless of which merge happens first. A likes counter that only increments is a conflict-free merge structure. A list that also supports deletion is not, because concurrent tombstones create a conflict that needs a resolution policy. The zero-RTT write path that conflict-free merges enable defers the cost to the read side and to background compaction — formalized as the read-path merge tax.
The table below applies those mechanisms to the full consistency ordering — pricing each level in extra round-trips per write, added per-message metadata, and the resulting latency floor in multiples of (P99 inter-node RTT):
| Level | Extra RTTs per write | Extra metadata per message | Latency floor |
|---|---|---|---|
| Strict serializability | Single-shard: 1 quorum write. Cross-shard: 1 quorum write per shard + 2 2PC phases (~2 RTTs in integrated databases that pipeline Prepare into the replication log) | None | Single-shard: (1–5ms intra-DC; 100ms cross-region). Cross-shard: – (2–10ms intra-DC; 200–300ms cross-region). Same consistency level; cost varies with transaction scope, not with the partial order position |
| Serializability | 0–1 ( optimistic conflict checking : no RTT until conflict; pessimistic locking: lock acquisition) | Conflict version per transaction | Abort-on-conflict under optimistic checking; deadlock under pessimistic locking; +0–5ms intra-DC |
| Snapshot isolation | 0 ( versioned snapshots reads from snapshot) | versioned snapshots version stamps per object | +0–2ms; write-write conflicts abort; phantom reads possible |
| Sequential consistency | 0 (single leader) | None | Leader bottleneck; stale reads permitted in real time |
| Causal consistency | 0 | vector clock | naive: +8KB/message at 1,000 nodes; compact version tracking (Dotted Version Vectors): 24–40 bytes when active writers are a small subset of replicas — degrades back to naive baseline in active-active deployments where all nodes concurrently accept writes to the same un-sharded causal domain |
| Read-your-writes | 0 | Session token | +session routing overhead |
| Eventual consistency | 0 (write) | Tombstone metadata; 1– state size growth | Read-Path Merge Tax ( ); dominant logical cost when |
Proof sketch -- Consistency ordering (Viotti-Vukolic 2016): why each step toward stronger consistency adds at least one round-trip or replica-proportional metadata overhead
Axiom: Proposition 10: Consistency Ordering (Viotti & Vukolic 2016)
Formal Constraint: Strict serializability requires all operations to appear in a serial order consistent with real time — necessitating at least one synchronous quorum round-trip per shard. Removing the real-time bound yields serializability: optimistic conflict checking adds zero RTTs until commit-time detection; pessimistic locking serializes without a consensus round-trip but introduces deadlock. Versioned-snapshot isolation adds zero RTTs for reads but permits write-skew and phantoms. Both serializability and snapshot isolation are above the HAT boundary — neither is achievable without coordination. Causal consistency and below require no synchronous coordination; they pay in metadata overhead. [2] [1]
Engineering Translation: Each step up the consistency spectrum from causal to strict-serial costs one synchronous RTT or metadata — not both simultaneously. The price list is fixed by physics and protocol structure; the only lever is which level you actually need. Operations that naturally commute (add-to-cart, increment) do not need to buy above the HAT boundary.
Contention multiplier — the variable-rate consistency loan. The pricing table above quotes fixed RTT costs per consistency level. Those prices hold when concurrent write contention is low. Under optimistic concurrency (check-and-set, MVCC), the effective per-operation latency at high contention is not the quoted price — it is the quoted price multiplied by expected attempts before commit. At collision probability per attempt:
As , : the system enters a contention-induced metastable state where transactions queue and retry faster than they complete. Named failure mode: abort cascade — a popular product key receives a traffic spike; all concurrent write transactions compare-and-swap on the same inventory counter; abort rate rises from 2% to 90%; effective write latency spikes while the system looks healthy on P50 metrics. The pricing table quotes the steady-state cost for a well-distributed workload. The contention multiplier is the missing term for any workload with write hotspots — measure it per hot key, not per operation class.
Lab, not production: the abort cascade boundary must be characterized before any hotkey-prone workload ships. Run a contention experiment in staging: synthesize a single-key write workload with a tunable concurrency level using a CO-free, open-loop load generator, ramp concurrent writers from 1 to 50, and record the abort rate and effective latency at each step. The inflection point — where abort rate crosses 10–15% — is the write-concentration ceiling for that consistency level and storage backend. Why this cannot be discovered from production: a production traffic spike arrives without warning, at peak load, and the abort cascade manifests as latency degradation that mimics network issues on P50 dashboards. The staging experiment isolates the contention variable, produces a reproducible measurement, and gives you a ceiling to enforce before you face a checkout-page incident.
When strict serializability spans multiple shards — as in a cross-shard distributed transaction in Spanner or CockroachDB — per-shard consensus is only one component of the coordination cost. Atomic commitment requires a second coordination layer: Two-Phase Commit ( 2PC ), executed by a transaction coordinator. Phase 1 (Prepare): the coordinator sends prepare to all participant shards and waits for all acknowledgments — one synchronous round-trip. Phase 2 (Commit): the coordinator sends commit to all shards — another round-trip. Total: two additional synchronous phases beyond per-shard consensus. Intra-DC, 2PC adds 2–8ms. Cross-region, each phase crosses the speed-of-light bound: 2PC cost is 200ms+ per transaction. 2PC is also a blocking protocol — if the coordinator fails after Phase 1, all participants remain blocked until recovery. Upgrading to Paxos Commit or three-phase commit eliminates the blocking window. This is the Multi-Key Tax absent from single-shard analyses.
Deterministic execution — an escape hatch off the 2PC latency bound. The 2PC blocking window is a property of the execution model, not of cross-shard strict serializability itself. Deterministic databases [17] establish a global total order of input transactions before any execution begins: a sequencing layer broadcasts an ordered transaction batch to all participant shards, which execute that batch in the same order without a commit-phase acknowledgment. Conflicts are resolved by the input ordering, not by runtime coordination — if two conflicting transactions appear in the batch, the ordering determines which executes first, deterministically, without an abort-retry cycle. The result is cross-shard strict serializability without the blocking window: no Prepare round-trip, no coordinator-failure blocking, no 2PC phases during execution.
The Multi-Key Tax becomes a sequencing tax — one additional RTT to establish the input order before execution — rather than a commit-phase tax paid on every transaction during execution. For workloads where input ordering can be batched efficiently, this is a movement on the logical frontier: the 2L–3L bound applies to 2PC execution models; deterministic execution exits the 2PC execution model, replacing commit-phase RTTs with a single sequencing RTT — a movement along the frontier, not beyond it.
This escape hatch assumes transaction read/write sets are known in full before execution begins. Interactive transactions — where an application reads a value, makes a decision based on it, and then writes based on that decision — cannot use deterministic execution because the write set is not known until after the read. These transactions must fall back to standard 2PC-based blocking execution, where the coordinator-failure blocking window reappears. If your workload includes read-modify-write cycles or SELECT-then-INSERT patterns where the INSERT depends on the SELECT result, the Calvin model applies only to the batch-schedulable portion of your transaction mix.
Sequential consistency removes the real-time constraint, with a precise consequence: a read that begins after a write completes in wall-clock time is permitted to return the pre-write value, provided all processes observe the same total order. This is not a corner case — it is the definition. Any protocol assuming “I wrote it, I can read it from another process immediately” needs linearizability, not sequential consistency. Sequential consistency is typically enforced by routing all writes through a single leader, trading the round-trip for a throughput bottleneck and for the stale-read window between write completion and leader propagation.
Causal consistency requires only that causally related operations are seen in order — if you post a reply to a comment, any reader who sees your reply must also see the original comment it responds to. This ordering constraint is enforced without synchronous coordination: each node stamps outgoing messages with a compact record of what it has already seen, and a receiver holds a delivery only until its causal dependencies have arrived. The overhead is metadata, not round-trips. The HAT classification confirms: everything at or above snapshot isolation requires coordination; everything at or below causal consistency does not. The figure in the table is the naive baseline — a full per-node tracking vector grows one entry per node. Modern implementations use compact version tracking (Dotted Version Vectors), which track entries only for active writers rather than every node in the cluster; their size is , which at a typical write fanout of 3–5 nodes stays at 24–40 bytes regardless of cluster size. The 8KB/message cost applies to naive vector clocks at 1,000 nodes — not an architectural constant of causal consistency. Production causal-consistency systems — Cure [12] and Saturn [13] — demonstrate compact-vector causality tracking at geo-distributed scale with throughput competitive with eventually-consistent stores and metadata size constant in the number of clients and partitions, validating the 8KB-to-40-byte reduction in practice.
The partial order — a hierarchy where some levels are strictly stronger than others, and some are incomparable — from strictest to most relaxed, and what changes at each step:
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph COORDINATED["Requires coordination"]
STRICT["Strict Serializability
+1 RTT quorum per shard
+2PC phases cross-shard"]:::branch
SERIAL["Serializability
optimistic: conflict check at commit
pessimistic: lock acquisition"]:::branch
SNAP["Snapshot Isolation
version stamps, write-write conflicts abort"]:::branch
SEQ["Sequential Consistency
single-leader total order, stale reads OK"]:::branch
end
subgraph HAT["Coordination-free -- HAT-compatible"]
CAUSAL["Causal Consistency
vector clock per msg
+8KB naive; 24-40 bytes compact
(compact degrades to naive in active-active)"]:::leaf
RYW["Read-Your-Writes
session token routing"]:::leaf
EVENTUAL["Eventual Consistency
conflict-free merge; tombstone growth"]:::leaf
end
STRICT -->|"drop real-time bound"| SERIAL
SERIAL -->|"allow write skew; versioned"| SNAP
SNAP -->|"drop snapshot; single leader"| SEQ
SEQ -->|"drop total order -- HAT boundary"| CAUSAL
CAUSAL -->|"drop cross-session order"| RYW
RYW -->|"drop session guarantee; merge-based"| EVENTUAL
classDef branch fill:none,stroke:#ca8a04,stroke-width:2px;
classDef leaf fill:none,stroke:#333,stroke-width:1px;
The HAT boundary is the critical threshold in this diagram: cross it upward and every consistency level requires synchronous coordination; stay below it and the cost shifts entirely to metadata and semantic weakening.
Bill of materials — checkout service. A checkout service at 100K operations/sec, 1,000 inventory nodes, two shards (inventory shard, order shard). The service requires cross-shard strict serializability for payment writes (cart debit + inventory reservation + order creation must be atomic) and causal consistency for product catalog reads.
- Payment writes (cross-shard strict serializability): 1 Raft quorum write per shard (two shards = 2 quorum round-trips) + 2 2PC phases (Prepare + Commit across both shards). Intra-DC cost: 4–10ms per checkout transaction. At 100K writes/sec: 400,000–1,000,000 wall-clock ms/sec of blocking (by Little’s Law, to 1,000 concurrent threads held open just to sustain the blocked connections), independent of business logic. This is not CPU burn — it is I/O wait consuming thread-pool capacity.
- Catalog reads (causal consistency): 0 extra RTT s. Vector clock overhead: per message — at 1,000 inventory nodes, 8KB per read response. At 100K reads/sec: 800MB/sec of metadata overhead, competing directly with payload bandwidth.
- Total consistency bill: 4–10ms added write latency per payment (paid even under zero contention) plus 800MB/sec of metadata bandwidth (paid on every catalog read). Neither is an error condition. Both are the definition of the operating point the service chose. Moving from causal to eventual consistency on catalog reads zeroes the metadata cost at the price of stale read windows; moving from cross-shard strict serializability to single-shard linearizability halves the 2PC phases at the price of re-architecting data placement. Both are movements along the frontier. Neither is free.
These are protocol-derived theoretical floors. Before treating them as the birth certificate values, verify them against your actual implementation with a load test: run a CO-free, open-loop load generator at the target RPS with your consistency configuration active, compare observed write P99 against the RTT floor, and record any gap in the Assumed Constraints field. Undocumented coordination overhead above the theoretical floor is the most common source of birth-certificate drift in the first 30 days after deployment.
The following diagram shows the two operation classes and their respective consistency costs, arriving at the total consistency bill for the checkout service.
%%{init: {'theme': 'neutral'}}%%
flowchart TD
CHECKOUT["Checkout service
100K ops/sec, 1000 nodes, 2 shards"]:::root
PAY["Payment writes: 100K/sec
requires cross-shard strict serializability"]:::branch
CAT["Catalog reads: 100K/sec
requires causal consistency"]:::branch
PAY_COST["Cost: strict serial
2 Raft quorum RTTs + 2 2PC phases
4-10ms latency, 400-1000 blocked threads"]:::work
CAT_COST["Cost: causal
0 extra RTTs
8KB vector clock = 800 MB/sec metadata"]:::work
BILL["Total consistency bill
write: 4-10ms unavoidable per txn
read: 800 MB/sec unavoidable metadata"]:::entry
CHECKOUT --> PAY
CHECKOUT --> CAT
PAY --> PAY_COST
CAT --> CAT_COST
PAY_COST --> BILL
CAT_COST --> BILL
classDef root fill:none,stroke:#333,stroke-width:3px;
classDef branch fill:none,stroke:#ca8a04,stroke-width:2px;
classDef work fill:none,stroke:#333,stroke-width:1px;
classDef entry fill:none,stroke:#333,stroke-width:2px;
The diagram makes the operating point concrete: neither cost row is optional given the consistency choices made. The toll road case study that follows shows what happens when you evaluate those choices explicitly.
The Consistency Toll Road — Case Study: RTT -Denominated Pricing for Raft vs. conflict-free merge Counters. The global API quota counter — 1,000 req/min enforced across US-East and EU-West, increment SLA 50ms — faces a different toll at each consistency exit, denominated in .
Exit 1: Global Raft quorum. Every increment waits for a cross-region quorum ACK before returning. RTT US-East to EU: . That single round-trip consumes the 50ms increment SLA twice over. Assuming strictly sequential, unbatched operations on a single contention domain, the write ceiling is committed increments per second — enough for 600 req/min. The 1,000 req/min quota requires 17 increments per second; the global quorum ceiling delivers 10 under those conditions. Group commit recovers throughput — a leader can batch hundreds of concurrent client requests into a single AppendEntries round-trip — but the 100ms latency tax per transaction remains irreducible. The toll is the propagation delay of light in fiber, charged unconditionally in the counter’s increment budget.
Exit 2: Regional Raft with local enforcement. Each region runs an independent Raft group. Increment latency: 1–5ms intra-DC. Write ceiling per region: 200–1,000 increments per second — the quota fits. The toll at this exit is accuracy: during normal operation, neither region has a real-time view of the other’s count. If US-East has admitted 600 req/min and EU-West has admitted 600 req/min, the global counter is at 1,200 req/min — 20% over quota. The Overage Rate (fraction of traffic admitted above the global limit per convergence window) is proportional to cross-region replication lag: at and 1,000 req/min total, maximum bounded over-admission is under 2 requests per convergence window. Bounded over-admission, paid continuously, in exchange for zero cross-region coordination cost.
Exit 3: Conflict-free merge . The write path is immediate: each region increments its local shard and propagates state asynchronously, with no cross-region contact required. Increment latency: sub-millisecond. The mechanics become clear through a simpler system first — a ‘Like’ button on a viral post. US-East records 4,000 likes; EU-West records 3,200 likes. Neither region waits for the other. When they sync, the merge is trivial: sum the two regional counts. The merged total is always correct, regardless of merge order. This is what the API quota counter is doing at this exit: each region increments locally and trusts that the totals will converge.
The toll at this exit does not appear on the write path. It appears on the metadata axis and the enforcement axis. Each sync message must carry a per-region count vector so that receiving nodes can determine which updates are new — at 1,000 nodes, that vector is 8KB per message under a naive scheme. The enforcement decision — ‘has this request exceeded quota?’ — requires merging all regional states: a scatter-gather read whose P99 is bounded by the slowest shard. For the API quota counter, this produces a bounded enforcement gap: during a network partition, US-East reads a merged count of 800 req/min and EU-West reads the same 800 — both independently admit up to 200 more, for a potential 40% Overage Rate with no upper bound for partition duration. The overage does not arise from a bug; it is the structural consequence of deferring coordination from the write path to the read path.
Each exit is a distinct point on the achievable region’s consistency-latency-accuracy axis. The consistency spectrum is a toll road where the exit ramp is a one-way door: choosing an exit is an architectural decision, not a configuration flag. That choice is also the consistency loan’s interest rate — the toll you pay on every transaction from the moment you commit to an exit.
The Consistency Price List: RTTs per Guarantee
The price list makes it quantitative: how many network round trips does each consistency level require, at minimum, as a function of deployment topology? These are lower bounds — achievable with optimal protocols, not typical implementation averages.
Single-datacenter. Let denote the P99 inter-node round-trip latency within the datacenter. All costs are in units of ; to get milliseconds, multiply by your measured .
Write cost is the coordination required before the response can be committed; read cost is the coordination required before a correct value can be returned.
| Consistency Level | Write RTTs | Read RTTs | Notes |
|---|---|---|---|
| Strict Serializability (single-shard) | 1 | 1 (ReadIndex) or 0 (lease) | 1 quorum RTT for Raft commit; linearizable reads via ReadIndex add 1 RTT, lease reads add 0 RTT |
| Strict Serializability (cross-shard) | 3 minimum | 0 ( versioned snapshots ) | 1 per-shard consensus + 2 2PC phases |
| Serializability | Write: 0–1 RTT ( optimistic : conflict check deferred to commit, abort-on-conflict; no RTT until conflict detected) or 2 RTTs ( pessimistic : lock acquire + commit) | Read: 0 (optimistic, from snapshot) or 1 (pessimistic, lock acquire) | Same semantic guarantee — operations appear in a serial order consistent with some valid interleaving — implemented via two distinct execution engines. Optimistic: abort-on-conflict; reads from snapshot. Pessimistic: deadlock risk; reads acquire locks |
| Snapshot Isolation | 1 ( versioned snapshots commit) | 0 | Reads always from snapshot; write-write abort |
| Sequential Consistency | 1 (leader append) | 1 (leader) or 0 (stale) | Leader delivers total order; stale reads possible |
| Causal Consistency | 0 (async propagate) | 0 | Vector clock check is local; no network |
| Read-Your-Writes | 0 | 0 (session-routed) | Routing overhead only; see note on session stickiness |
| Eventual Consistency | 0 | 0 | No coordination; tombstone GC cost is async |
Session stickiness note. The 0 extra RTT pricing for Read-Your-Writes holds only under two conditions: strictly sticky routing (every request in a session reaches the same replica) and no replica failure. Both assumptions are violated in practice. When a load balancer shifts a session to a new replica — due to a health check failure, a rolling deployment, or a region failover — the new replica has no knowledge of the session’s write history. Three recovery paths exist, each with a cost: (1) synchronous quorum read on the new replica (+1 RTT) to catch up to the session’s high-water mark before serving; (2) a wait window during which reads are blocked until async replication delivers the relevant writes; (3) full causal tracking in the session token, where the token carries a vector clock and the receiving replica checks it against its local state before serving. Option 3 converts the session token from an opaque routing hint into a lightweight version vector — recovering the 0 RTT property at the cost of token size and per-request clock comparison. Without one of these three mechanisms, RYW degrades silently to stale reads on every failover event with no observable error signal.
Eventual consistency eliminates synchronous coordination but not asynchronous maintenance overhead. Consider what happens when a shopper removes bread from their cart while the US-East to EU-West link is flapping. Both regions eventually need to learn about that deletion — but a purely additive merge cannot represent ‘remove’: it can only accumulate. So the deletion leaves a deletion marker behind, and that marker must ride along in every sync message until all regions confirm they have seen it. Only then is it safe to discard. At 50 deletions per second across a 1,000-node cluster, those markers accumulate faster than they can be cleaned, and the cleanup itself — compaction — temporarily holds up reads while it runs.
The gossip protocols that drive convergence (Cassandra, Riak) consume sustained bandwidth and CPU for anti-entropy under normal conditions — typically 5–15% of cluster throughput, rising during partition recovery. Deletion markers grow proportionally to delete frequency and divergence depth; compaction requires coordinated cleanup and can temporarily block reads. These are not RTT costs, but they are permanent operating costs that belong on the consistency bill of materials. The table prices the synchronous coordination component; the asynchronous maintenance overhead must be measured separately from your gossip and compaction telemetry.
The cost structures differ by timing, not magnitude. Where consensus pays on the write path — each commit blocks on a synchronous quorum ACK before returning — conflict-free merge structures defer payment to the read path and background processes. Formalizing this distinction makes the trade-off measurable.
Think of it as a tab that grows while the bar is open. Every write that skips the consensus round adds an entry to the divergence history. Every deletion adds a marker that cannot be discarded until all replicas confirm they have seen it. When a read arrives, it must settle the whole tab before returning — walking every unmerged update, stepping over every deletion marker. The user waits for that reconciliation work.
A shopping cart makes the cost concrete. US-East and EU-West diverge for 30 seconds during a link flap: 12 items added across both regions, 3 items removed, 1 promotional price applied only on the US-East side. The first read after reconnection must reconcile all of that before returning a total. At scale — 50 deletions per second across a 1,000-node cluster — the reconciliation work on the read path, and the compaction work in the background, are both real throughput costs. The longer the replicas have been diverging and the more deletion markers have accumulated, the longer every read waits.
Read-Path Merge Tax. The tax has two components: synchronous merge latency charged at read time, and sustained background throughput consumed by compaction. Both grow with divergence: merge latency scales with the number of unmerged update sets since last convergence and the number of accumulated deletion markers; compaction throughput runs at 5–15% of cluster capacity during steady state and rises during recovery. When the write path carries no coordination cost ( ), these two components become the dominant logical cost.
Formal definition -- Read-Path Merge Tax: the throughput fraction consumed by conflict resolution on every read in conflict-free merge eventual consistency
Axiom: Formal Definition: Read-Path Merge Tax
Formal Constraint: For a conflict-free merge deployment, the read-path merge tax is the pair extending the logical tax vector. The merge complexity depends on whether the conflict sets are sorted:
where is divergence depth (number of sorted runs to merge) and is the aggregate total entries examined across all runs within the target key range — the sum across all sorted runs, not the size of a single run (an individual run may contain far fewer than entries in the target range). This is the correct input to the k-way min-heap merge: elements are extracted from the heap in total, each extraction costing , giving overall. The components are live entries plus tombstones encountered in that range. Tombstones inflate above ; the global tombstone count is not the correct input — only tombstones falling within the scanned key range affect merge cost, and . The ordered form applies when conflict sets are sorted and a min-heap k-way merge can be used — the dominant case for LSM-based stores (Cassandra SSTables, RocksDB compaction). The unordered form — which implies a nested-loop reconciliation of each entry against each conflict set — applies to naive implementations or unsorted conflict metadata; its cost represents a Metadata Reconciliation Scan. Production deployments should be operating in the ordered regime; an merge is a compaction implementation defect, not a fundamental property of conflict-free merge structures. is the sustained background throughput consumed by marker compaction — typically 5–15% at steady state, higher during partition recovery. Both are zero for consensus protocols that pay on the write path; they become the dominant logical cost when .
Engineering Translation: A 60-second partition at 50 writes/second and 5 deletions/second accumulates diverged runs and tombstones within the target key range, giving for a tombstone-dominated scan. The first read after recovery must settle the entire tab. Under the ordered form: operations — fast in isolation, but arriving simultaneously across the fleet drives GC contention. Under the unordered form: operations — a read that will time out before it returns. Both forms grow without bound as the partition window extends; the ordered form grows as , the unordered form as . Bound divergence depth before forcing a consistency fallback; rate-limit retries during recovery to allow the backlog to drain before the next retry wave arrives.
Write-path vs. read-path — the deferred tax. Consensus protocols pay per write: each committed operation blocks on a quorum ACK, charging a fixed synchronous RTT. Conflict-free merge structures pay per read: write latency is near-zero ( ) but read-time merge latency grows with replica divergence history and tombstone accumulation, and background GC consumes throughput proportionally to delete frequency and compaction backlog. The two cost structures do not compare on a single axis — the crossover depends on read/write ratio, delete frequency, partition frequency, and merge latency tolerance. Choosing conflict-free merge structures to eliminate write-path coordination does not eliminate the tax; it relocates it.
The metastable read cliff. The read-path merge tax has a non-linear failure mode that the steady-state GC numbers do not capture. During a network partition, every zero-coordination write keeps accumulating: divergence depth and deletion-marker count grow throughout the partition window without bound. The merge tax formula grows without bound as the partition window extends — for ordered stores (superlinear), for unordered metadata scans (quadratic), where is the scan payload within the target key range inflated by accumulated tombstones. A 60-second partition at 50 writes/second and 5 deletions/second produces unmerged update sets and pending tombstones. The first read after partition recovery must settle that entire tab before returning. Under the ordered form that is roughly 3,500 operations — fast in isolation but fleet-wide simultaneous arrival drives GC contention. Under the unordered form it is 900,000 operations — a read that times out before it returns.
The timeout triggers a client retry. The retry re-enters a cluster that is now simultaneously processing recovery merges across every node that was diverged — a fleet-wide scatter-gather triggered at once. Each retry re-triggers the same merge work, which triggers more timeouts, which triggers more retries. The cluster is under its heaviest load at exactly the moment its read path is most expensive. This is the metastable read cliff: the system does not degrade gracefully into slower reads — it collapses, because the retry cascade holds the cluster in the state that causes timeouts rather than allowing it to drain the backlog.
The state machine below traces each transition. The animation that follows plots queue depth over time: watch for the brief apparent recovery before the cliff.
graph TD
HEAL["Partition heals
d = 3000, T_count = 300"]:::warn
MERGE["First post-partition read
O(d * T_count) = 900k ops"]:::warn
TO["TIMEOUT
backlog not drained"]:::cliff
RETRY["Retry re-enters cluster
full merge re-triggered"]:::cliff
DRAIN["Backlog drains
stable fixed point"]:::ok
HEAL --> MERGE
MERGE -->|"exceeds timeout budget"| TO
TO -->|"client retry fires"| RETRY
RETRY -.->|"loop gain above 1
lambda_retry grows, mu_drain fixed"| TO
RETRY -->|"rate-limit retries
lambda_retry below mu_drain"| DRAIN
classDef ok fill:none,stroke:#22c55e,stroke-width:2px
classDef warn fill:none,stroke:#ca8a04,stroke-width:2px
classDef cliff fill:none,stroke:#b71c1c,stroke-width:2px,stroke-dasharray: 4 4
The background GC throughput cost is the normal-operation tax. The metastable cliff is the recovery failure mode. Both are consequences of the same deferred tax, but they require different mitigations. GC pressure yields to compaction scheduling. The metastable cliff requires bounding directly: cap the maximum partition duration before forcing a consistency fallback, or rate-limit read retries during recovery to allow the merge backlog to drain before the next retry wave arrives.
Neither mitigation is free of coordination cost. Bounding divergence depth before forcing a consistency fallback means that once the partition window exceeds the cap, the system must revert to synchronous writes — temporarily reinstating the per-write consensus RTT the conflict-free structure was designed to eliminate. Read repair — actively reconciling diverged state during reads by fetching missing updates from other replicas — requires contacting those replicas synchronously, adding at least one cross-replica round-trip to read latency during recovery. Both are correct engineering responses to the metastable cliff. Both impose a coordination cost structurally similar to the write-path tax the conflict-free approach traded away. The conflict-free structure did not eliminate the coordination tax — it deferred it, and the standard mitigations for the deferred cost partially reinstate it. The engineering choice is which coordination pattern is less damaging for the specific workload: pay per write under consensus, or pay during recovery under conflict-free merge with bounded divergence.
Autonomic load shedding during metastable recovery. Bounding and rate-limiting retries are necessary but not sufficient. The metastable cliff is a classic control-theory boundary: the system is a plant with a positive feedback loop. Timed-out reads generate retries; retries re-trigger merge work proportional to ; that merge work consumes the CPU cycles the cluster needs to drain the backlog, driving more timeouts; those timeouts generate more retries. The loop gain exceeds 1 — each cycle amplifies the error signal rather than damping it. In control-theoretic terms, the cluster has entered an unstable equilibrium: the failure state is self-reinforcing, and the unforced plant cannot return to its stable fixed point.
The stability condition for exiting is the boundary where the loop gain drops below 1 — where the incoming retry rate no longer exceeds the cluster’s recovery throughput:
Under an uncontrolled retry cascade, grows faster than can shrink , because each retry’s merge work is proportional to — which is largest at the start of recovery, precisely when the cluster most needs to drain. No amount of read optimization or hardware capacity changes this: the feedback loop is positive, and a positive feedback loop with loop gain above 1 diverges regardless of the plant’s capacity. The system cannot self-heal without an external intervention that introduces a negative feedback derivative to oppose the error signal.
Autonomic load shedding is that negative feedback term. By rejecting a fraction of incoming read requests, it directly reduces — driving the error signal downward rather than allowing it to compound. The shedding probability is set proportionally to how far exceeds , with a safety margin to ensure the loop gain is driven below 1 with headroom. This is proportional-derivative control applied to the retry rate: the proportional term is the current excess; the derivative term is the direction of the merge queue depth — shedding is removed only when is actively draining, not merely when the instantaneous retry rate looks acceptable. The protocol operates in three phases:
Phase 1 — Detect entry into . Trigger the load shedding protocol when two conditions hold simultaneously: (a) read P99 latency has exceeded the timeout threshold for at least three consecutive measurement windows; and (b) per-node merge queue depth is growing, not draining. Condition (b) distinguishes the metastable cliff from a transient spike — a transient spike sees the merge queue drain between retries; the metastable cliff sees it grow. The baseline queue drain rate is lab-measured at commissioning, not derived from production observation: in staging, run the cluster to 70% of N_max capacity, induce compaction pressure via a synthetic write burst, then release it and record the drain rate. That lab-measured drain rate is the reference against which production queue behavior is compared — “growing, not draining” means growth persisting beyond the lab-characterized drain window.
Phase 2 — Shed and prioritize. Reject all read requests — new and retried alike — at the ingress layer with probability , returning a retriable error to the client (not a timeout):
where (use 0.7) is a safety margin ensuring is driven below with headroom. The stability condition applies to the full retry rate — not to fresh arrivals only. If retried requests bypass the shed, the amplification loop that elevated in the first place remains unbroken; the system stays permanently in regardless of how aggressively new traffic is dropped. Critically, read-repair operations — which are progressing the recovery — are not shed; they occupy a priority queue ahead of all client read requests. This prevents the pathological case where shedding slows recovery by also shedding the merge work.
Phase 3 — Release. Exit shedding mode when drops below , where is the per-update-set merge cost measured at commissioning. At this , a single read’s merge work completes within the timeout budget and the cascade cannot restart.
The shedded requests receive a retriable error, not a silent timeout — clients with correct retry logic back off exponentially and re-enter at reduced rate when the cluster signals recovery. A with is the correct response; a hung connection that times out silently re-enters immediately, defeating the shedding.
Proposition 10a -- Metastable Recovery Stability: load shedding drives R(t) below T_rec, establishing the stability condition for cluster exit from M
Axiom: Proposition 10a: Metastable Recovery Stability
Formal Constraint: A cluster in metastable state converges to stability if and only if there exists a shedding probability such that:
where is the retry rate at cascade onset and are the divergence depth and scan payload at entry into . Under this condition, is monotonically decreasing and the cluster exits in finite time. If no such exists — because even at — the system has exceeded the recovery capacity of its read path and must revert to synchronous writes (bounding at the source) before shedding can be effective.
Engineering Translation: Compute at commissioning: measure merge throughput on a single node under full tombstone load at a representative divergence depth. The ratio is the cascade multiplier — how many times over the cluster’s recovery capacity the retry load is. If the cascade multiplier exceeds 3, shedding alone is insufficient; the divergence bound must be reduced (tighter consistency fallback threshold) to bring into a range where can absorb the load.
Watch out for two distinct implementation errors. First: shedding implementations that apply uniform
to all read types. Read-repair operations and coordinator-driven consistency checks are recovery-progressing reads — shedding them slows the drain. Instrument read requests with a type tag (new_read vs. read_repair) at the ingress layer and exempt read_repair from shedding unconditionally. Second: shedding implementations that only drop new requests while allowing retried requests through on the assumption that retries represent committed client demand. In a metastable cascade driven by a retry storm, retries are the majority of
— they are the amplification mechanism, not a side-effect. An ingress filter that distinguishes X-Retry-Count: 0 from X-Retry-Count: >0 and only sheds the former leaves the loop gain above 1. Both new_read and retry_read must be subject to
; only read_repair is exempt.
Conflict-free merge structures pay at recovery. Consensus-based protocols pay per write — on every operation, regardless of partition state, at the latency floor the RTT sets.
At (same-rack P99): a strict-serial cross-shard write costs 2–3ms depending on whether 2PC is layered or integrated (pipelined Prepare compresses to 2ms; separate steps cost 3ms). At (cross-rack P99): 6–9ms by the same logic. The quorum round trips themselves are irreducible — no software optimization eliminates the requirement to collect acknowledgements from a quorum; it is the definition of the guarantee. What integration can compress is the number of sequentially blocking RTTs before the client receives the commit acknowledgement.
Multi-region. Let denote the P99 inter-region RTT . For US-East to EU: (vacuum lower bound: ~37ms for the NY-London great-circle distance; fiber refractive index and non-great-circle cable routing add ~2.5x, yielding ~100ms in practice). For US-East to APAC: . Cross-region costs by consistency level and quorum topology:
| Consistency Level | Cross-Region Write | Cross-Region Read | Notes |
|---|---|---|---|
| Strict Serial (global quorum) | 1 RTT = | 1 RTT = | Every write blocks on cross-region quorum |
| Strict Serial (regional quorums) | 1 RTT = | Stale across regions | Intra-region Raft; cross-region reads are stale |
| Serializable (global, multi-shard) | – | 0 (versioned snapshot) | 2PC across regions: 200–300ms+ at US-EU; lower bound with pipelined Prepare |
| Snapshot Isolation | per region | 0 ( versioned snapshots ) | Cross-region read is stale by replication lag |
| Causal Consistency | 0 coordination | 0 | Async propagation; up to lag |
| Eventual Consistency | 0 | 0 | — |
Cross-region strict serializability requires the write quorum to span regions. A 5-node Raft group with 3 nodes in US-East and 2 in EU has a US-East quorum majority — cross-region write latency is driven by EU follower lag but the commit does not wait for EU. A 5-node group with a split 2+2+1 across regions has no intra-region majority — every write waits for at least one cross-region ACK. Architecture, not protocol, determines which case you are in. These are write-path bounds. The read path has its own floor — and it cannot be compressed to a single hop.
Why One-Hop Serializable Reads Are Impossible
The SNOW theorem (from The Impossibility Tax) excludes the region from the achievable region. The implication for read paths is concrete.
For a read to be linearizable, it must observe the effect of every write that completed (returned to its caller) before began. If writes are wait-free — no read can block a write from making progress — then when contacts a single node :
- may not yet have applied a write that committed on a quorum not containing .
- To verify that no such is missing from ’s state, must contact at least one node from every possible write quorum — which means contacting a majority of nodes.
This is what ReadIndex does: the leader sends a round of heartbeats to confirm (a) it is still the leader, and (b) its state machine is current through the last committed entry. The quorum contact is not optional — it is the proof that no uncommitted write has bypassed the read. The one-hop escape requires either weakening the guarantee (sequential or causal, served from a potentially-stale follower) or paying the coordination cost somewhere else (lease reads pay it in clock synchronization infrastructure instead of RTT s).
Raft Linearizable Read Methods — Single-Object Scope
All three methods below operate within single-object (single-shard) linearizability — Raft’s domain. Raft guarantees that reads on a single Raft group observe the latest committed write to that group in real time. This is not strict serializability: strict serializability is the combination of linearizability and multi-object serializability — real-time ordering and cross-shard atomicity. Spanner and CockroachDB provide strict serializability; their L cost is higher and priced separately below. Single-shard Raft is the building block; the additional 2PC and clock-coordination layers are what make that building block strictly serializable across objects.
Three methods exist for serving single-shard linearizable reads without appending an entry to the Raft log on every read.
ReadIndex (Ongaro Sec. 6.4): the leader records the current commit index , broadcasts a heartbeat to a quorum confirming it is still leader, then serves the read from the state machine at once quorum ACKs arrive.
- Cost: 1 quorum RTT per heartbeat batch. Multiple reads waiting during the same window share one heartbeat: at 10,000 reads/sec and , one heartbeat serves up to 10,000 reads per millisecond.
- Correctness: always linearizable. No dependency on clock accuracy.
Lease reads (Ongaro Sec. 6.4.1): the leader maintains a lease window where . Within this window it is guaranteed to be the only leader, so it serves reads directly without heartbeating.
- Cost: 0 RTT s per read. Lease renewal is shared with the write heartbeat path — amortized to zero at nonzero write rate.
- Correctness: linearizable if and only if actual clock skew . If this bound is violated, a new leader can be elected while the old leader still serves reads from its lease window — two nodes simultaneously returning potentially different values to clients expecting linearizability.
Follower reads (non-linearizable): any follower serves reads from its local state.
- Cost: 0 RTT s.
- Correctness: sequential consistency at best (all writes eventually reach follower in log order) or causal (with session vector clock pinning). Stale by up to replication lag: intra-DC (e.g., typical); cross-region.
The following diagram shows the three read methods side by side, mapping each to its RTT cost and consistency guarantee.
%%{init: {'theme': 'neutral'}}%%
flowchart TD
subgraph RI["ReadIndex: 1 RTT per batch -- always linearizable"]
RI1["Client read arrives at leader"]:::entry
RI2["Leader records commit index R_idx"]:::work
RI3["Leader sends heartbeat to majority"]:::work
RI4["Majority ACK: still the leader"]:::ok
RI5["Leader serves read at R_idx"]:::work
RI1 --> RI2 --> RI3 --> RI4 --> RI5
end
subgraph LR_["Lease read: 0 RTT -- linearizable if clocks bounded"]
LR1["Client read arrives at leader"]:::entry
LR2{"now less than t0 + T_lease?"}:::decide
LR3["Leader serves read -- no network"]:::ok
LR4["Lease renewal via write heartbeat"]:::work
LR1 --> LR2 --> LR3
LR4 -.->|"amortized"| LR2
end
subgraph FR["Follower stale read: 0 RTT -- sequential or causal"]
FR1["Client read arrives at follower"]:::entry
FR2["Follower serves from local state
stale by up to replication lag"]:::warn
FR1 --> FR2
end
classDef entry fill:none,stroke:#333,stroke-width:2px;
classDef decide fill:none,stroke:#ca8a04,stroke-width:2px;
classDef work fill:none,stroke:#333,stroke-width:1px;
classDef ok fill:none,stroke:#22c55e,stroke-width:2px;
classDef warn fill:none,stroke:#b71c1c,stroke-width:2px,stroke-dasharray: 4 4;
Read the diagram. Three methods for serving reads in Raft, each at a different point on the consistency-latency frontier. ReadIndex pays one RTT per batch for guaranteed linearizability. Lease reads pay zero RTT s but depend on bounded clock skew — a safety assumption, not a tuning knob. Follower reads pay zero RTT s and deliver sequential or causal consistency, stale by replication lag.
The following table summarizes each method’s RTT cost, consistency guarantee, throughput ceiling, and the specific risk that makes each choice non-trivial.
| Method | RTT cost | Linearizable? | Single-DC throughput ceiling | Risk |
|---|---|---|---|---|
| Log append (naive) | 1 per read | Yes | reads/sec | Saturates log at modest read load |
| ReadIndex | 1 per batch | Yes | reads/sec | None beyond leader bottleneck |
| Lease read | 0 | Yes (if clocks bounded) | Leader CPU bound | Clock skew breaks linearizability |
| Follower stale | 0 | No — sequential/causal | Distributed across replicas | Stale reads by replication lag |
Linearizability vs. strict serializability — L cost separation. The table above prices Raft’s domain: single-object (single-shard) linearizability. Strict serializability — Spanner’s and CockroachDB’s domain — adds cross-shard atomicity on top and changes the L cost:
- Raft ReadIndex (single-shard linearizability): for the heartbeat quorum confirming leader validity. Scope: one key, one Raft group. No claim of ordering across shards.
- Spanner (multi-object strict serializability): per-shard Paxos write ( ) + 2PC Prepare across coordinators ( ) + 2PC Commit ( ) + TrueTime commit-wait (0–7ms clock-uncertainty window) = + uncertainty minimum per cross-shard write. The commit-wait is the price of real-time ordering without a shared global lock: Spanner waits until its atomic clock uncertainty resolves before releasing the commit timestamp, guaranteeing no future transaction will receive an earlier timestamp.
- CockroachDB (multi-object strict serializability): per-shard Raft write ( ) + 2PC Prepare ( ) + 2PC Commit ( ) = minimum. CockroachDB replaces TrueTime with a hybrid logical clock, eliminating the commit-wait uncertainty window but retaining the 2PC structure. Transactions that detect a clock skew violation restart rather than wait — shifting the cost from latency to occasional abort rate.
The rule when pricing : every shard boundary crossed adds one for 2PC; real-time ordering across nodes without a shared clock adds one for clock coordination (or a restart cost in its place). Single-object linearizability (Raft) pays neither; multi-object strict serializability (Spanner/CockroachDB) pays both.
Network Budget Worksheet
Consistency level and write latency are not independent choices — they are geometrically coupled. Each coordination protocol consumes RTT hops, and the total hop budget equals the write SLA divided by the RTT floor. Strict cross-shard serializability requires three coordination hops (prepare, commit, ack); if your write SLA is 10ms and your RTT is 3ms, you have exactly 3.3 hops of budget — barely enough. If your write SLA is 2ms globally with 100ms cross-region RTT , no protocol fits: you can have global serializability or the 2ms SLA , not both. The ceiling follows mechanically from three inputs:
— P99 inter-node RTT (set by network topology and geography, not tunable). — write latency SLA (the budget). — read latency SLA . Given these, the maximum achievable consistency levels are:
The 3L threshold is the conservative baseline for systems that layer 2PC on top of consensus as separate steps. Integrated databases that pipeline the 2PC Prepare phase into the consensus replication log (Spanner, CockroachDB) compress the critical path to approximately 2L — so if your database is one of these, the cross-shard boundary shifts to . For planning purposes, use 3L unless you have measured the actual critical path on your specific stack.
Sample calculations at representative deployments:
| Deployment | Write SLA | Read SLA | Max write | Max read | |
|---|---|---|---|---|---|
| Same-rack, single-DC | 0.5ms | 2ms | 1ms | Strict Serial cross-shard ( ) | ReadIndex (0.5ms) |
| Cross-rack, single-DC | 3ms | 2ms | 1ms | Strict Serial single-shard (3ms) — barely | Lease only (ReadIndex needs 3ms) |
| Cross-rack, single-DC | 3ms | 10ms | 5ms | Strict Serial cross-shard (9ms) | ReadIndex (3ms) |
| US-East to EU | 100ms | 50ms | 50ms | Regional strict serial; NOT global | Regional ReadIndex; NOT global |
| US-East to EU | 100ms | 400ms | 200ms | Global serializable (300ms+) | Global ReadIndex (100ms) |
| Global, regional quorums | 1ms intra | 5ms | 5ms | Per-region strict serial | Lease intra; stale cross-region |
The worksheet exposes a common planning error: architects specify a latency SLA and a consistency requirement independently, then discover they are mutually exclusive only after deployment. A 1ms P99 write SLA with cross-shard strict serializability requires inter-node P99 — below what most shared-infrastructure networks deliver without co-location guarantees.
Named failure mode: “latency SLA below RTT floor.” A service commits to P99 write latency of 2ms globally with strict serializability. The cross-region inter-node P99 is 100ms. The SLA requires at minimum 100ms for the quorum round trip; 2ms is physically unreachable at that topology. The options are: relax the SLA to at least per RTT required, relax consistency to causal (0 coordination RTT s), or restructure the deployment so that the write quorum is intra-region and cross-region reads are explicitly stale. There is no fourth option.
Named failure mode: “linearizable read via log append.” A team implements reads by appending a no-op entry to the Raft log and waiting for commit, then returning the state at that log index. This is linearizable — the read is serialized by its log position. It is also expensive: every read consumes a full quorum RTT and a log slot. At 10,000 reads/sec and , the read path saturates the log at 10 entries per millisecond. ReadIndex achieves the same guarantee at 1 RTT per batch of all reads in a 1ms window — the same quorum cost amortized over all concurrent reads.
Watch out for: conflict-free merge structures trade coordination cost for metadata cost, and the trade is not always favorable. An append-only distributed tally at 1,000 nodes requires a 1,000-element vector in every message — 8KB at minimum under a naive vector clock (compact-vector pruning reduces this; see above). A grow-only set with 1M tombstones from deleted elements requires 1M entries in every sync message — unbounded growth. Named failure mode: “tombstone avalanche” — systems using LWW -Element-Sets or OR-Sets without regular garbage collection accumulate tombstone vectors that eventually exceed message size limits or network throughput. The fix is tombstone compaction on a schedule, which itself requires at-least-once delivery guarantees to maintain conflict-free merge invariants [3] . Teams adopting conflict-free merge structures for “zero coordination” discover the coordination was moved to garbage collection. This is not an escape from the frontier — it is a different point on it. Automerge [14] and Yjs [15] reduce per-tombstone storage overhead substantially — Automerge through columnar binary encoding, Yjs through a compact linked-list structure — but delete-heavy workloads still require scheduled compaction on both; the growth is bounded per operation, not eliminated.
The tax also shifts axis. In a linearizable system, the write pays the coordination cost — the consensus round-trip guarantees that all subsequent reads see a consistent value immediately, at no cost to the reader. In a conflict-free merge system, writes are coordination-free but reads must perform the merge that the writer deferred. A read touching replicas must retrieve potentially divergent states and apply the conflict-free merge merge function to produce a single consistent value. The merge cost scales with state divergence and tombstone count: under a tombstone avalanche, a single read’s merge path may scan millions of stale entries before arriving at a current value. This is the read-path merge tax — the work deferred from write-time to read-time reappears as read latency and CPU cost on the user-facing critical path. The total coordination work in the system has not decreased; it has been relocated. “Zero-coordination writes” is an accurate description of the write path. It is a misleading description of the system.
Watch out for: clock comparison is itself a consistency cost, and the cost varies by three orders of magnitude depending on the mechanism.
- TrueTime (Spanner [4] ): GPS + atomic clocks, uncertainty in the 2013 deployment; current production deployments achieve under 1ms P99 through denser GPS/atomic-clock infrastructure. Commit-wait holds a transaction open for before committing. Guarantees external consistency. Pricing in the RTT framework: TrueTime adds to every write’s commit path — a fixed latency addition independent of distance or quorum size. At 1ms uncertainty, the commit-wait is comparable to a fast intra-DC Raft quorum round-trip; at 7ms (2013 original), it exceeds one intra-DC RTT and approaches the cost of a short cross-region operation. This positions TrueTime between strict serializable (1 quorum RTT) and global strict serializable (1 quorum RTT + ): the commit-wait is the price of the “global” qualifier — the guarantee that an external observer always sees writes in their actual commit order. No protocol achieves global strict serializability without paying at least per write, because any shorter wait allows two writes to be ordered differently by clocks that disagree within the uncertainty bound. Cost: commit-wait per transaction + specialized GPS/atomic-clock hardware.
- HLC (Hybrid Logical Clocks — CockroachDB [5] ): 0ms commit-wait. An uncertainty restart occurs when a read encounters a value written within the cluster-wide maximum clock offset window , where is the configured maximum allowed clock skew parameter (500ms default in many deployments; commonly tightened to 250ms in deployments with actively monitored NTP) — not the local drift of a single node but the maximum allowed skew between any two nodes in the cluster. Under NTP, typical node-to-node drift is 50–100ms; the maximum skew parameter must be set conservatively above this. The HLC uncertainty window is therefore approximately wider than TrueTime’s at the default, or when tightened to 250ms. The cost structure also differs fundamentally from TrueTime: TrueTime amortizes the uncertainty cost to writes (commit-wait blocks once per transaction at write time; reads are fast); HLC shifts the cost to reads (any read that encounters a value timestamped within the past triggers an uncertainty restart). For write-heavy workloads with stable key access, restart rates are typically below 0.1%. For hot-key read workloads with recent writes, restart rates are workload-dependent and can be significantly higher.
- NTP alone: uncertainty 100–250ms — too coarse for any distributed protocol requiring causal ordering.
Named failure mode: “NTP causal illusion” — a team believes operations are causally ordered because each server timestamps locally; VM clock drift of 500ms causes apparent time reversals; causal violations appear as “out-of-order” events in application logs. The fix: use HLC (zero hardware cost, implemented in CockroachDB and AntidoteDB) or TrueTime (hardware cost, guaranteed bounds). NTP is not a causal ordering mechanism.
%%{init: {'theme': 'neutral'}}%%
flowchart TD
NTP_NODE["NTP alone
uncertainty 100-250ms
not a causal ordering mechanism"]:::warn
DRIFT["Drift: Server B lags 500ms behind A
A writes at t=100, B reads at t=95
causal order violated"]:::warn
ILLUSION["Out-of-order events in logs
B appears before A
no error thrown: silent failure"]:::warn
HLC_FIX["Fix: HLC (CockroachDB, AntidoteDB)
max_offset=500ms, restarts on hot reads
restart rate under 0.1% on stable workloads"]:::ok
TT_FIX["Fix: TrueTime (Spanner)
epsilon=1-7ms, GPS + atomic clocks
commit-wait blocks write for epsilon"]:::ok
NTP_NODE -->|"allows drift-induced reordering"| DRIFT
DRIFT --> ILLUSION
ILLUSION -.->|"replace NTP with"| HLC_FIX
ILLUSION -.->|"replace NTP with"| TT_FIX
classDef ok fill:none,stroke:#22c55e,stroke-width:2px;
classDef warn fill:none,stroke:#b71c1c,stroke-width:2px,stroke-dasharray: 4 4;
Clock uncertainty as a continuous parameter of the feasibility surface. The three mechanisms above — TrueTime, HLC, NTP — are not discrete binary choices. Clock uncertainty continuously parameterizes the achievable latency at each consistency level, creating a feasibility surface rather than a binary threshold.
For TrueTime (Spanner), every write’s commit path costs . External consistency is jointly feasible with a latency SLA of if and only if , defining a feasibility threshold . Reducing from 7ms to 1ms is not merely a 6ms latency improvement — it expands the feasible region in the (latency, consistency) plane: operating points jointly infeasible at 7ms uncertainty become reachable at 1ms. For a system with a 10ms write SLA and a 5ms intra-DC quorum RTT, ms — TrueTime’s 2013-era 7ms uncertainty made external consistency infeasible within that SLA; current sub-1ms deployments make it feasible with 4ms to spare.
For HLC (CockroachDB), the uncertainty restart rate rises proportionally with and the write rate on hot keys: effective P99 read latency is a continuous function of , not a binary safe/unsafe condition. At no reads restart; at nearly all hot-key reads trigger restarts and P99 read latency exceeds the SLA. The feasibility boundary is the value of at which the restart-inflated P99 crosses the SLA threshold — a workload-specific continuous function of write rate and key access distribution, not a single number.
The lease boundary condition is the same structure. The safe lease window shrinks continuously as clock skew grows. As , the window collapses to zero and the system must fall back to ReadIndex with its full quorum RTT. As , the full election timeout is available for the lease. The condition is not binary — each millisecond of clock skew reduction buys one additional millisecond of safe lease headroom, and therefore one additional millisecond of latency budget before ReadIndex is required.
In all three cases, clock infrastructure investment is a frontier-expansion operation. Reducing or moves the boundary between feasible and infeasible operating points outward along the latency axis — the same geometric action as adopting a lower- consensus protocol expands the throughput frontier. The difference is the instrument: protocol choice changes and ; clock infrastructure changes and the available latency headroom at strict consistency. Both are inputs to the achievable region; neither is fixed by the CAP or FLP results.
Cognitive Map — The Consistency Tax. Every step up the consistency spectrum carries a price in round-trips, metadata overhead, or hardware — no step is free, and the consistency tax function makes each cost explicit and measurable. The three clock mechanisms differ in where they pay the uncertainty cost: NTP pays nothing and provides no guarantee; HLC pays on reads; TrueTime pays on writes via commit-wait. conflict-free merge adoption moves the cost from coordination to metadata: the frontier shifts, it does not disappear. Choosing the wrong mechanism does not raise an exception — it silently reorders events in ways your application was not designed to handle.
The Consensus Protocol Tax
Intra-DC Raft commits in one quorum RTT — one . Cross-region Raft commits in one . The vacuum lower bound ms assumes great-circle distance km in vacuum; production P99 is approximately higher due to fiber refractive index and non-great-circle cable routing. Use ms as a planning estimate, or measure your actual inter-region P99 directly.
The consistency tax tells you what each level costs in the abstract. The consensus protocol tax tells you what it costs in practice — because the protocol that implements your consistency level sets the Coherency Tax ( ) from The Physics Tax and the latency floor from the consistency spectrum above. Protocol choice is where theory meets the load test.
Notation note — three coherency symbols, each measuring a distinct layer. denotes the physical hardware floor: per-node-pair cache invalidation, NIC contention, and memory bus synchronization costs established in The Physics Tax. denotes the logical protocol overhead: consensus rounds, quorum sizes, and dependency resolution mechanics that operate strictly above and that protocol selection can lower. The combined distributed-systems throughput ceiling is ; protocol selection lowers but no software optimization drops it below .
Topology note — star vs. all-to-all, and what each shows up as in the USL fit. The factor in the USL denominator models pairwise coordination — every node communicating with every other. This is exact for all-to-all protocols (gossip replication, EPaxos dependency tracking) and approximate for others. Raft uses a star topology: all write coordination routes through a single leader, which sends and receives messages per round — coordination cost per round, not . In a USL fit of a Raft cluster, this manifests not as a large (small, because pairwise coherency is genuinely low) but as a large : the leader is a serialization bottleneck, and the serial fraction absorbs the scaling cost that a pairwise protocol would express as coherency overhead. A Raft cluster’s canonical signature in the USL — high , low — is correct precisely because the model is fitting the right physical behavior. The bottleneck is leader saturation, not mesh coherency; the USL coefficients reflect the topology. Cross-shard coordination (where any node may coordinate with any other) and all-to-all gossip genuinely produce the scaling the quadratic term assumes — those are the settings where a large is the binding constraint, not a large . Observant readers who notice that Raft’s star topology does not match the USL’s pairwise assumption can verify this by fitting their Raft cluster: they will find , not the reverse.
Definition 15 -- Logical Coherency beta: the protocol-level coordination overhead above the hardware floor, which compounds with kappa to set the true scalability ceiling
Axiom: Definition 15: Logical Coherency
Formal Constraint: The logical coherency coefficient is the per-node-pair overhead imposed by a consensus protocol’s agreement mechanics — quorum sizes, message rounds, dependency resolution — above the physical hardware floor . The observed coherency is additive, and the scalability bound becomes:
Protocol selection is the only lever that lowers ; no software optimization drops the combined floor below .
Topology note: How distributes between the USL terms depends on the consensus topology. Leaderless protocols (EPaxos, Dynamo) generate genuine all-to-all cross-talk; their overhead inflates — pairwise, in the USL denominator. Leader-based protocols (Raft, Multi-Paxos) route all writes through one node; the leader queue is a serialization bottleneck that primarily inflates (the Amdahl contention term), with a smaller contribution from the follower fan-out. The additive form gives the combined ceiling; for leader-based protocols, expect to carry most of the protocol penalty at high write rates.
Composition note: The additive form gives the structural scaling ceiling at the moment of measurement — it describes which part of the coherency cost is hardware-floor and which is protocol-layer. This is the commissioning model. In a live deployment, itself is subject to environmental modifiers — load distribution shifts, co-tenant pressure, deployment-environment jitter — that compound multiplicatively with the baseline value. The additive decomposition remains valid for protocol selection; the multiplicative compounding governs how the realized ceiling evolves over time.
Engineering Translation: Raft: ( ). EPaxos fast path: ( at 0% conflict rate, falling to Raft’s ceiling at 80% conflicts). Conflict-free merge structures: on writes — but the merge cost relocates to the read path as . These are USL-fit estimates from load tests; the papers report throughput curves, not values — you must fit from your own measurements.
The most consequential planning error is treating consistency as a binary setting — safe or unsafe, with no measurable middle ground. The full consistency spectrum is a price list where every level has a denominated cost and every protocol choice stakes a specific position on that list. State-Machine Replication ( SMR ) — Raft, Multi-Paxos, EPaxos — occupies the strict-serial region of the achievable region: writes are ordered through consensus, the log defines truth, and every read reflects an agreed global state. This region is reachable at a price: the RTT floor from the consistency price list, the coefficient that sets , and the implementation complexity that surfaces at 3am. Conflict-free merge structures and HATs occupy a different region entirely: writes are coordination-free, state evolves through lattice merge, and the system makes no total-order claims. The frontier separating these two regions is the HAT boundary — a hard line between what requires consensus and what does not. No protocol straddles it. Choosing between SMR and conflict-free merge structures is not a performance tuning decision — it is a choice of which region of the achievable region to inhabit, with all of that region’s properties, costs, and excluded corners.
Intra-Datacenter Raft
The dominant consensus protocol in production distributed systems [6] carries these measured costs:
- Commit latency: 1–5ms per write (one quorum round-trip within a datacenter)
- Write throughput: bounded by leader throughput — all writes serialize through a single log
- Leader failover unavailability: 300ms–2s (election timeout multiplied by 2–3 election rounds)
- value: approximately 0.002–0.005 for a standard 3-node Raft cluster under transaction workload ( USL fit estimate; measure your cluster — the Raft paper reports throughput curves, not values)
These numbers place a Raft-based system at – write-coordinating nodes. Beyond that, the coherency tax from Proposition 7 dominates.
Cross-Region Raft
- Commit latency: ~100ms per consensus round (speed of light at intercontinental distances plus network RTT )
- Commit latency floor: 100ms minimum per write — irreducible propagation delay, not a protocol limitation. A single sequential client is capped at ~10 writes/sec. Raft pipelining and batching remove this sequential ceiling: if 1,000 concurrent requests arrive at the leader, a single AppendEntries RPC commits them all in that same 100ms window. The toll is not a throughput ceiling — it is a 100ms latency floor on every committed write and a proportional concurrency requirement: sustaining N writes/sec requires roughly N × 0.1s = N/10 concurrent inflight requests to keep the pipeline full.
- This is not a configuration problem. It is a physics problem. The speed of light at 10,000km imposes a minimum one-way propagation time of ~33ms in vacuum; the refractive index of fiber (~1.47) raises this to ~49ms per direction, giving a ~98ms fiber RTT before any routing overhead. Real-world network paths add 10–30ms of routing and queuing, producing the ~100ms observed RTT . No software optimization changes the propagation delay. The consequence is a 100ms latency floor on every committed write — and exhausted thread pools, saturated connection limits, and rigid P50 latency at that floor under any load.
EPaxos , Atlas, and the Implementation Frontier
The literature offers protocols with better theoretical properties than Raft:
- EPaxos (SOSP 2013 [7] ): leaderless; fast-path commits for commuting commands in approximately 70% of cases; on fast path ( USL fit estimate from the paper’s throughput curves)
- Atlas (EuroSys 2020 [8] ): fast-path commit rate approximately 88%; substantially faster post-commit execution than EPaxos on comparable workloads
The EPaxos Illusion
EPaxos commits in one of two ways depending on whether concurrent proposals conflict. When two operations commute — they can be applied in either order with the same result, like two independent counter increments — any replica can commit the operation in a single round-trip without coordinating with other proposers. This is the fast path, and it is where the throughput gain lives. When operations do conflict — two writes to the same key, or operations where order matters — EPaxos must run a slower multi-round resolution to establish an agreed ordering. The entire value proposition rests on the fast path being the common case.
The theoretical case is unambiguous: the fast path commits non-conflicting operations in 1 L with , yielding versus Raft’s . Every one of those numbers is conditional on — the fraction of operations reaching the slow path. Let . The effective coherency coefficient across a mixed workload:
This calculates the effective logical protocol penalty in isolation. The true observed in a load test is bounded by — the hardware floor sets a ceiling no protocol improvement can exceed.
The and throughput ceiling (at ) degrade as follows:
| (non-commuting) | Throughput ceiling | vs. Raft | ||
|---|---|---|---|---|
| 0% (advertised) | 0.0005 | 44 | +96% | |
| 10% | 0.00075 | 36 | +69% | |
| 30% | 0.00125 | 28 | +38% | |
| 50% | 0.00175 | 24 | +18% | |
| 80% | 0.0025 | 20 | +6% |
The throughput advantage disappears above . The P99 latency picture is structurally worse. EPaxos fast path delivers 1 L ; slow path delivers 2 L . Since the slow path claims of all commits, P99 = 2 L whenever . Raft P99 = 1 L always. Every production workload with hot keys, range scans, or cross-shard mutations satisfies . The protocol that advertises 1 L commit latency delivers 2 L P99 in every non-trivial production deployment.
Under a contention storm — hot-key pressure during a flash sale, rate-limiter write bursts, coordinated JVM GC triggering dependency re-resolution across all nodes simultaneously — . The outcome: (Raft-equivalent), (Raft-equivalent), P99 (worse than Raft’s ). The system performs as Raft in throughput, worse than Raft in tail latency, and carries EPaxos ’s full operational complexity. The contention storm is precisely the condition where you need scaling headroom most, and EPaxos provides none.
The non-commuting rate is not a configuration parameter. It is determined by the workload’s conflict graph and key-access skew. In systems with power-law access distributions — standard in payment processing, inventory management, social graph mutation — the top 1% of keys receive 30%+ of writes. On hot keys, regardless of protocol. EPaxos provides maximum benefit for already well-distributed workloads — the easy scaling case. It collapses to Raft-equivalent performance on hot-key-dominated workloads — the hard scaling case. Atlas improves the slow-path execution time but inherits the same structural conditional: at ~88% fast-path rate on a uniform workload, any real key skew pushes well past the break-even point.
Production-Viable Alternatives
Three mechanisms lower without the contention trap.
Flexible Paxos quorum downsizing [16] : the replication safety invariant requires only — write and read quorums must intersect, but neither must be a majority independently. For append-only write paths (event logs, WAL s) where quorum reads are rare: at , , set or reduce to with . Contacting 2 acceptors per write instead of 3 reduces the per-commit message count by 33%, lowering proportionally without any protocol replacement — quorum configuration only. High-contention behavior is structurally identical to standard Raft: one leader, one execution path, no slow-path fallback.
Bounded-staleness follower reads: route reads to the nearest replica under a bounded-staleness contract (data confirmed at most stale, typically 5 seconds). Writes still commit through the Raft leader; drops because reads no longer contend for leader CPU and network capacity. For read-dominant workloads (70%+ reads), the reduction in shifts further than a protocol switch to EPaxos would on a write-dominant workload — without changing write-path protocol or operational runbooks.
Sharded N=3 Raft groups instead of a single large group: at , and per group — well clear of the ceiling. Scale by adding groups, not by raising within a group; the coherency ceiling never binds per-group. The coordination tax moves to cross-shard transactions, which is acceptable when cross-shard operations are below 10% of traffic. This is the production architecture used by etcd, TiKV, and CockroachDB: many N=3 Raft groups, keyspace partitioned across them.
Three additional protocols extend the frontier in different directions, each with explicit trade-offs:
Fast Paxos — the latency optimization and its contention trap. Fast Paxos [10] achieves lower common-case commit latency than Multi-Paxos by allowing clients to send Phase 2 proposals directly to all acceptors, bypassing the leader on the critical path and reducing message delays from 2 to approximately 1.5 — saving one one-way message delay per commit. On WAN topologies where each message delay is 50–100ms intercontinental, this is a 50–100ms per-commit saving. The trade-off appears in two places. First, the fast quorum requires acceptors — larger than the quorum of Classic Paxos — which increases coherency cost and reduces for the same cluster. Second, when two clients propose simultaneously, collision detection and leader-mediated resolution add a third round-trip, pushing high-contention latency to 3 RTT s — worse than Classic Paxos at 2 RTT s. The achievable region expands on the latency axis (common-case WAN commits) and contracts on the throughput axis (collision overhead shrinks ) — a protocol-level trade-off, not a free improvement.
HotStuff — eliminating the BFT quadratic bottleneck. Classic PBFT requires messages per consensus round: every replica broadcasts to every other replica, producing messages before a quorum decision. At 100 replicas under 10Gbps links and 1KB messages, this generates 10 GB/sec of consensus traffic — network-saturated before reaching any application throughput. HotStuff [11] (PODC 2019, adopted in Diem/LibraBFT) reduces message complexity to per round through threshold signature aggregation: replicas send partial signatures to a rotating leader, which aggregates them into a single threshold signature proving quorum agreement. The protocol pipelines three phases (Prepare, Pre-Commit, Commit) so that while round is in its Commit phase, round is already in Prepare — amortizing 3 RTT s to 1 effective RTT per block in steady-state. The consequence: message complexity per node pair drops from (PBFT) to (HotStuff), making viable at 50–200 replicas. The latency penalty is real: 3 RTT s per commit (vs. 1 for Raft) pushes intra-DC latency to 10–20ms and WAN latency to 300ms+ per block — HotStuff is not a Raft replacement for latency-sensitive systems, but a frontier expansion at the BFT -tolerance axis that Raft cannot reach at any configuration.
The protocols above — Raft, EPaxos, HotStuff — all occupy the same region of the achievable region: strict serializability, consensus-based, RTT-priced. A distinct region exists at the opposite end of the consistency spectrum, where writes carry no consensus cost at all.
The zero-coordination frontier — conflict-free merges and HATs. The protocols above all operate on the strict-serializability axis: they commit operations through consensus and pay the RTT price on every write. The zero-coordination frontier is a different region of the achievable region entirely, where writes carry no consensus cost at all. Conflict-free merge structures (Shapiro et al. 2011 [3] ) achieve coordination-free writes by restricting state updates to a merge-compatible lattice: any write can be applied locally and merged with any other replica’s state without conflict resolution. HATs (Bailis et al. 2014 [2] ) extend this to read-your-writes and monotonic read guarantees achievable without consensus. Both approaches place on the write path — coherency overhead effectively zero, effectively unbounded. The trade-off is not on the write side: it is on the read side (merge cost grows with divergence and tombstone accumulation — the read-path merge tax formalized above) and the consistency side (causal or weaker guarantees only; strict serializability is excluded from this region by the HAT boundary). The zero-coordination frontier is not a better Raft. It is a different axis of the achievable region, reachable only by crossing the consistency level from strict-serial to causal.
The connection to the Coherency Tax: every protocol in this table sets a value and thereby a Protocol Ceiling — the maximum throughput the system can reach before the Coherency Tax’s quadratic penalty dominates. The Protocol Ceiling is not a tunable soft limit; it is from Proposition 7 in The Physics Tax, where is the fixed hardware floor and is the protocol-dependent overhead the chosen protocol imposes. Choosing a protocol without measuring its is accepting an unknown Protocol Ceiling. Each row in the protocol comparison that follows is an operating point on the frontier, with columns chosen to answer the topological question: for a given network environment and contention profile, which protocol’s achievable region contains your target operating point?
| Protocol | Fast-path commit | / | High-contention behavior | Optimal topology and workload |
|---|---|---|---|---|
| Raft / Multi-Paxos | 1 RTT | 0.002–0.005 / 14–22 | Stable; leader serializes all writes | Intra-DC; all workloads; production default |
| Fast Paxos | 0.5 RTT (common case) | 0.001–0.003 / 18–31 | Degrades to 3 RTT s on collision; below Raft | WAN, low-contention, single-proposer common case |
| EPaxos (fast path) | 1 RTT | ~0.0005 / ~44 | Holds for commuting operations | High-throughput intra-DC; commuting-dominant workloads |
| EPaxos (slow path) | 2 RTT s | ~0.003 / ~18 | Same as Raft; full complexity cost | Unavoidable fallback; avoid if non-commuting rate exceeds 30% |
| HotStuff ( BFT , pipelined) | 3 RTT s per commit; pipelining overlaps consecutive commits, yielding ~1 RTT sustained at high load (amortizes the 3-RTT critical path, does not reduce it) | ~0.0005 / 50–100 | Linear messaging; scales to 200 replicas | Byzantine-fault required; blockchain; large replica groups |
| HATs / conflict-free merges (zero-coord.) | 0 RTT (write) | ~0 / unbounded | None on writes; read merge cost grows with divergence | WAN geo-distribution; causal or weaker; read-heavy workloads |
Reading the trade-offs. EPaxos fast path reduces from Raft’s 0.003 to ~0.0005, shifting from 18 to 44 nodes — roughly a 2x throughput ceiling expansion — at a commit latency cost of +2ms for fast-path dependency resolution. Fast Paxos moves the latency axis in the opposite direction: common-case commit falls by ~0.5 RTT, but the fast-path quorum is larger and collision-path latency reaches Raft — making it worse than Raft on the contention axis whenever conflict rate is non-trivial. HotStuff opens the BFT dimension: messaging makes 100-node replica groups viable where PBFT’s saturates network bandwidth, at the cost of 3 RTTs per commit. These are the rows in the table above read as frontier movements, not protocol rankings.
The following diagram maps each protocol as a frontier movement — showing which axis each choice expands and what it costs on the adjacent axes.
%%{init: {'theme': 'neutral'}}%%
flowchart LR
CRDT["HATs / conflict-free merges
l=0 write, c=causal, beta~0
N_max=unbounded: zero-coordination frontier"]:::leaf
RAFT["Raft / Multi-Paxos
l=5ms, c=strict, t=8*gamma, N_max=18
beta=0.003: production default"]:::branch
FPAXOS["Fast Paxos
l=3ms common case, c=strict, N_max=25
collision path: l=15ms, N_max=13"]:::branch
EPAXOS_FAST["EPaxos (fast path)
l=7ms, c=strict, t=15.7*gamma, N_max=44
commuting ops: frontier expansion"]:::ok
EPAXOS_SLOW["EPaxos (slow path)
l=10ms, c=strict, t=8*gamma, N_max=18
same ceiling as Raft, full complexity"]:::warn
HOTSTUFF["HotStuff BFT pipelined
l=15ms, c=strict BFT, N_max=50-100
O(N) messaging: BFT-axis expansion"]:::branch
CRDT -->|"add consensus: cross to strict-serial"| RAFT
RAFT -->|"WAN latency opt, low-contention only"| FPAXOS
RAFT -->|"leaderless: 2x throughput ceiling"| EPAXOS_FAST
RAFT -->|"BFT-axis expansion, 100+ replicas"| HOTSTUFF
EPAXOS_FAST -->|"non-commuting path, no throughput benefit"| EPAXOS_SLOW
classDef leaf fill:none,stroke:#333,stroke-width:1px;
classDef branch fill:none,stroke:#ca8a04,stroke-width:2px;
classDef ok fill:none,stroke:#22c55e,stroke-width:2px;
classDef warn fill:none,stroke:#b71c1c,stroke-width:2px,stroke-dasharray: 4 4;
Read the diagram. The slow path is the critical failure mode: when EPaxos cannot find a commuting fast path (estimated 30% of workloads with conflicting operations), it falls back to a slow path with — identical to Raft — while carrying the full implementation complexity of a leaderless protocol. At 3am under an incident, the slow path behavior is indistinguishable from Raft in throughput but far harder to diagnose. The expected value of the protocol switch is positive only if the workload’s commuting rate exceeds ~70%.
Every protocol in this table sets a value, and every value determines through Proposition 7 . The that matters in production is not the protocol’s theoretical minimum — it is the value achievable by the team that has to operate it at 3am. Raft’s simplicity is a different operating point on a broader frontier that includes team cognitive load.
Watch out for: cross-region consensus is bounded by the speed of light. At 10,000km separation (US East to Europe), the fiber RTT floor is approximately 98ms (propagation at ~200,000km/s). Real-world network paths add 10–30ms of routing overhead. No consensus protocol — Raft, Paxos, EPaxos , or anything yet invented — can commit a cross-region write in less than this floor. Multi-region write throughput is fundamentally limited by per leader. The only architectural escape is to weaken the consistency level for cross-region operations (accept eventual or causal consistency across regions, enforce strict serializability only within a region) — which is movement along the frontier, not expansion of it.
The Multi-Region RTT Tax
The speed of light in fiber is approximately 200,000 km/s. For any deployment spanning distance km, the one-way propagation minimum is ms; the round-trip floor is ms. No routing optimization, no protocol choice, and no engineering decision changes this floor — it is set by the refractive index of glass and the geometry of the Earth.
RTT Tax by deployment topology. Minimum achievable strict-serial write latency for each deployment distance, plus maximum consistency achievable within a 10ms write SLA . These are floors, not targets.
| Topology | Distance | RTT floor | Strict serial write floor | Achievable within 10ms write SLA |
|---|---|---|---|---|
| Same rack | 0.01 km | less than 1ms | 0.5–2ms | Strict serial (single or cross-shard) |
| Cross-rack, single DC | 0.5 km | less than 1ms | 1–5ms | Strict serial (single-shard within write SLA ; cross-shard at limit) |
| Cross-AZ, single region | 30 km | 0.3ms | 3–10ms | Strict serial (cross-shard at the edge) |
| US-East to US-West | 4,500 km | 45ms | 90ms+ | Causal or weaker only |
| US-East to EU | 6,500 km | 65ms | 130ms+ | Causal or weaker only |
| US-East to APAC | 12,000 km | 120ms | 240ms+ | Causal or weaker only |
Any system claiming strict serializability globally with write latency below the RTT floor for its topology is not interior waste. It is an excluded corner — a point that the geometry of the achievable region prohibits regardless of protocol, hardware, or engineering effort.
Interior waste vs. excluded corner. These are categorically different failures with different remedies.
Interior waste is a system that achieves its claimed consistency level but pays more than the minimum required. Example: a single-DC service that uses cross-shard 2PC for operations touching only one shard. The consistency is correct; the coordination overhead is reducible. Fix: detect and route single-shard operations to single-shard coordination. The system moves toward the frontier.
Excluded corner is a system claiming coordinates that the achievable region prohibits. No optimization can fix it — the coordinates do not exist. Example: a global service specifying strict serial writes at P99 = 5ms while spanning US-East and EU ( RTT floor: 130ms). The specification is not expensive to meet — it is geometrically impossible to meet. Fix: change the claim (weaken to regional strict serializability with explicitly stale cross-region reads) or change the geometry (ensure write quorums are intra-region only).
Named failure mode: “instant global consistency.” An architecture document specifies: cross-region strict serializability with P99 write latency 10ms globally; regions span US-East and EU. When deployed, the system does one of two things: (a) implements regional strict serializability and silently serves stale reads cross-region — lying about consistency coordinates — or (b) implements global strict serializability and delivers 130ms+ write latency — lying about latency coordinates. Both outcomes place the system in an excluded corner. No performance engineering resolves it. The Pareto Ledger check: given deployment topology, compute the RTT floor first. Any SLA that violates the floor is a specification error, not an optimization target.
RTT Tax formula. Given write SLA ms and inter-region distance km, the maximum achievable global consistency is:
where is the additional routing overhead above the propagation minimum. Causal consistency is always achievable regardless of and . The formula converts any architecture document’s claims into a pass/fail check against geography: if , the strict-serial claim is impossible, not expensive.
Safety, Liveness, and the Timeout Misconception
The election timeout in Raft is widely understood as a safety mechanism — “if the leader does not respond in time, we need a new one or we risk split-brain.” This is wrong in a precise and consequential way. Safety in Raft holds in the fully asynchronous model, where messages may be delayed arbitrarily. Liveness requires bounded delays. The difference determines what happens when you tune the timeout incorrectly.
Safety property (temporal logic notation). Safety is an invariant over all reachable states — expressed with the modal operator (“in every state on every execution”):
Both hold in the asynchronous model. The proof has three components, none of which reference time:
- Vote uniqueness — each node persists its recorded vote for the current term to durable storage and grants at most one vote per term. This is a local state invariant: . No message timing involved.
- Quorum intersection — any two majorities of nodes share at least one member. For : any two sets of 2 share 1. For : any two sets of 3 share at least 1. This is a counting argument, independent of when or whether messages arrive.
- Log completeness on election — RequestVote rejects candidates whose log is less complete than the voter’s last committed entry. A new leader therefore inherits all committed entries regardless of how long the network was partitioned. This depends only on the log comparison rule, not on timing.
If network delay is infinite — messages never arrive — the cluster makes no progress. It does not corrupt state. Safety and “making no progress” are compatible by definition.
Liveness property. Liveness is an eventuality — expressed with the modal operator (“in some future state on every execution”):
This requires the Partial Synchrony assumption (Dwork, Lynch, Stockmeyer 1988 [9] ): there exists a Global Stabilization Time after which all message delays are bounded by , where (at the design boundary where , this simplifies to ). Without this bound, FLP applies — no deterministic algorithm can guarantee that a leader is ever elected after a failure. The liveness proof reduces to: once delays stabilize, (a) a follower detects leader absence within , (b) wins an election if it contacts a majority within , (c) begins committing entries. Step (c) depends on messages arriving. Steps (a) and (b) depend on delays being bounded. None of this touches safety.
The following diagram shows the relationship between the two models: safety holds in the fully asynchronous model; liveness requires adding partial synchrony, where FLP applies only to the liveness guarantee.
%%{init: {'theme': 'neutral'}}%%
flowchart LR
subgraph ASYNC["Asynchronous model -- no timing assumptions"]
SAF["Safety holds
quorum intersection prevents split-brain
vote uniqueness, log completeness"]:::ok
end
subgraph PARTIAL_SYNC["Partial synchrony -- delays bounded after GST"]
LIV["Liveness holds
leader eventually elected
requires: delta less than election_timeout - max_RTT"]:::ok
end
FLP["FLP impossibility
no deterministic algorithm
guarantees liveness in async alone"]:::warn
ASYNC -->|"add partial synchrony"| PARTIAL_SYNC
FLP -.->|"applies to liveness only"| ASYNC
classDef ok fill:none,stroke:#22c55e,stroke-width:2px;
classDef warn fill:none,stroke:#b71c1c,stroke-width:2px,stroke-dasharray: 4 4;
The Safety-Liveness Gap. Safety is a property of quorum arithmetic — it holds even when messages never arrive; FLP impossibility applies only to liveness, never to safety. Wrong timeout tuning produces liveness failures, not safety failures. The distinction has direct operational impact.
Scenario A — timeout too short (150ms, max network RTT = 200ms): the follower starts spurious elections while the leader is alive and processing. Multiple candidates contest the same term; at most one wins — quorum intersection prevents two simultaneous leaders. Write throughput collapses; leader churn increases commit latency during elections. This is a liveness failure: no stable leader means no progress. It is not a safety failure: quorum intersection still prevents two nodes committing conflicting values at the same index.
Scenario B — timeout too long (30 seconds, max network RTT = 5ms): after leader failure, the cluster waits up to 30 seconds before an election fires. Result: 30-second write unavailability per failure. This is a liveness failure: slow recovery. It is not a safety failure: entries committed before the failure are durable; the new leader will have them by log completeness.
Scenario C — quorum count error (4-node cluster requiring only 2 votes instead of 3): two independent 2-node partitions can each elect a leader and independently commit conflicting values at the same log index. This is a safety failure — StateMachineSafety is violated. It has nothing to do with timeouts. It is a quorum arithmetic error introduced at configuration time.
Named failure mode: “timeout equals safety.” A team reduces election timeout from 300ms to 50ms to improve failover speed. They observe frequent spurious elections under normal operations and dismiss them as harmless availability improvements. What they have produced is a liveness degradation: the cluster now spends 10–15% of operating time in election mode, not committing writes. They believe faster failover equals higher availability. In reality, under normal operations (where the leader is healthy), they have reduced availability by injecting write stalls every few seconds. The system is never unsafe. It is frequently unavailable.
Named failure mode: “lease read clock drift.” Raft read leases allow the leader to serve reads without a quorum round-trip. A lease of duration grants permission to serve reads for milliseconds after the last confirmed heartbeat, on the assumption that no other leader can be elected within that window. Safety requires . With NTP at 250ms uncertainty, a 300ms election timeout permits a safe lease of at most 50ms. A team that sets without accounting for clock skew allows a new leader to be elected while the old leader still serves reads from its lease window — two nodes serving reads simultaneously, potentially returning stale values to a client expecting linearizability. This is a safety violation: a linearizable read that returns a stale value violates the definition of linearizability. It is caused by placing a safety-critical invariant on timing rather than quorum arithmetic.
Named failure mode: “Volatile GST — frontier vibration and collapse.” The Partial Synchrony assumption does not claim the network stays synchronous — it claims there exists a GST after which it becomes synchronous. In production, GST resets during incidents: cloud provider brownouts elevate P99.9 inter-region RTT from 2ms to 400ms for 30–120 seconds; BGP re-convergence causes 30–300ms of packet loss; NIC firmware events spike P99.9 to 800ms during maintenance windows. Each reset re-enters the asynchronous model for the duration.
Two regimes, depending on incident duration relative to election timeout:
Vibration — incident duration less than two election timeout cycles: the latency floor rises transiently, triggering spurious elections and elevated commit latency. Safety holds. After stabilization, the frontier returns to baseline. The achievable region temporarily contracted and then recovered — the system “vibrated.”
Collapse — incident sustained beyond multiple election timeout cycles: no candidate can contact a majority within the bounded-delay window. FLP liveness applies: no deterministic algorithm guarantees progress. Writes stop. Safety holds — no corruption — but the liveness dimension of the achievable region has contracted to empty.
Detection: monitor the GST margin — the gap between election timeout and current P99.9 message delivery time: . A cloud brownout that elevates RTT P99.9 from 2ms to 450ms with election timeout 500ms drops the GST margin from 498ms to 50ms — 50ms of headroom before any jitter spike triggers write stalls. The fix is not post-incident timeout tuning; it is monitoring the margin continuously and either increasing election timeout when margin falls below , or adopting an adaptive timeout that tracks observed P99.9 RTT .
Correctness checklist. The following questions establish whether your consensus implementation is grounded in quorum arithmetic (safety-bearing) or timing (liveness-only). Safety questions require a yes; liveness questions require a bounded answer.
Safety (quorum arithmetic — timing-independent):
- Quorum count: is your quorum exactly ? Verify for both read and write quorums under membership changes.
- Vote persistence: does each node write its current-term vote to durable storage before responding to a vote request? A crash-then-recover node that loses this record can vote twice in the same term.
- Log completeness in vote grants: does RequestVote reject candidates whose last log entry term is less than the voter’s, or (same term) whose log is shorter?
- Lease duration: if using read leases, is ? Measure actual inter-node clock skew using your NTP monitoring tooling before setting .
- Membership changes: are configuration changes using single-server changes or joint consensus? Ad-hoc quorum reconfigurations that bypass these mechanisms can violate quorum intersection transiently during the transition.
- Clock isolation: does any code on the commit or vote path use wall-clock time for ordering or correctness decisions? Wall clock time is not monotonic under NTP adjustments. Ordering decisions must use logical time — term number and log index — not wall clock time.
Liveness (timing-dependent — tuning, not correctness):
- Heartbeat interval : ensures heartbeats reach followers before timeouts fire under normal operations. Too large = spurious elections; too small = unnecessary network overhead.
- Election timeout : prevents spurious elections from network delay. Too small = liveness degradation. Too large = slow failover recovery. Neither value makes the system unsafe.
A system that passes items 1–6 has safety grounded in quorum arithmetic. Items 7–8 tune how quickly it makes progress. The two are independent. Setting election timeout to 10 minutes makes the cluster very slow to recover from leader failure; it does not make it possible for two leaders to commit conflicting values.
Physical translation. Raft’s election timeout does one thing: it decides when a follower concludes that the leader is absent and starts an election. It does not prevent two leaders from simultaneously committing conflicting values — quorum intersection does that, and quorum intersection holds whether or not messages ever arrive. Setting the timeout to 10ms does not make your cluster unsafe; it makes your cluster hold elections constantly. Setting it to 10 minutes does not make your cluster safe; it already was. The only way to break safety is to break quorum arithmetic — by miscounting votes, losing durable state, or misconfiguring membership changes.
The checklist above applies to the entire consensus section, not only timeout configuration. Items 1–6 are safety invariants that belong in architecture review and code review. Items 7–8 are operational tuning parameters that belong in capacity planning.
Consensus protocol choice sets , and determines and the shape of the throughput-consistency frontier. The tax is not the protocol’s latency in isolation — it is the latency combined with implementation complexity, failure mode visibility, and recovery cost. Safety is a quorum property, not a timing property; liveness requires partial synchrony, not perfect clocks. Cross-region consensus is a physics problem; geography sets the frontier in the latency-consistency dimension, not architecture.
The Cognitive Tax — Understanding as a Finite Resource
The consistency tax prices guarantees in RTTs. The consensus protocol tax prices protocol choices in . Both are denominated in objective, measurable quantities. The cognitive tax prices a third resource that neither formula captures: the on-call engineer’s operational understanding of the system during an incident. Understanding, like consistency, is borrowed. During normal operation, knowledge of the protocol’s state machine sits idle — it is capital, not expenditure. During an incident, it is loaned: the on-call draws it down as they work through observable states, replication timelines, and failure-mode hypotheses. A protocol whose state space exceeds the team’s available capital at 3am is not diagnosed faster; it is escalated, restarted blind, or resolved by reverting to the previous version. Each of those outcomes is MTTR .
Definition 16 -- Protocol Operability Score O_protocol: the product of diagnostic state count and concurrent transition branching factor, which bounds incident diagnosability under production failure modes
Axiom: Definition 16: Protocol Operability Cognitive Load
Formal Constraint: For a protocol under its most common production failure mode, enumerate: the observable states that require a different on-call action ( ), and the maximum number of transitions that can occur simultaneously during that failure — the branching factor of the incident decision tree ( ). The cognitive load estimate is their product:
Unit: Failure Mode Cardinality — a dimensionless integer count of observable states times concurrent transitions. It is not in engineer-hours and not a ratio. To place it alongside the other taxes: is the absolute complexity score for a protocol; (introduced in The Reality Tax) is the team’s utilization of its cognitive capacity; (from The Governance Tax) is a one-time engineer-hour cost to traverse a decision gate. The three quantities measure different things at different layers — protocol complexity ( ), team headroom ( ), and governance overhead ( ) respectively.
Heuristic scope: This is a counting heuristic, not a theorem. Both and are chosen by the team — different engineers counting the same protocol under the same failure mode will reach different numbers, because they monitor for different observable signals. That disagreement is the metric’s output, not a defect: it surfaces which failure modes one team’s runbooks cover that another’s do not. No empirical validation has established that this product predicts incident MTTR or that the multiplicative form is the uniquely correct aggregation. The calculated values are relative ordering indicators — not precision measurements.
is the absolute protocol complexity score — an integer count of states and transitions. On its own it is not actionable: a score of 18 may be operationally tolerable for a twelve-engineer platform team with deep protocol expertise and may be catastrophic for a three-engineer on-call rotation inheriting an unfamiliar system. What makes actionable is dividing it against the team’s available cognitive capacity. In The Reality Tax, that ratio is formalized as the Operator Tax: , where is the team’s cognitive ceiling measured from runbook coverage and escalation rate. is the numerator you are calculating here. The denominator — and the full tax — belongs to The Reality Tax.
What follows is one defensible enumeration for each protocol under its most common production failure mode.
Raft under leader failure. Observable states requiring distinct on-call actions:
- Heartbeat timeout — no action yet, watching
- Election in progress — term increment visible in logs, wait, do not intervene
- New leader elected — first AppendEntries from new term, confirm cluster health
This enumeration excludes split vote with no leader elected — the state where no candidate wins within the timeout — on the grounds that the on-call response is identical to election-in-progress: wait for the next timeout to trigger re-election. A team that has observed split-vote failures in production may reasonably include it as a fourth state, yielding . Under this enumeration: — three states, two simultaneous transitions (pre-election and election).
EPaxos under network partition with conflicting in-flight operations. Observable states requiring distinct on-call actions:
- Fast-path commit — commuting operations, no action
- Slow-path trigger — non-commuting detected, watch for dependency resolution
- Dependency graph repair — diverged entries, must not intervene prematurely
- Commit barrier — waiting for dependency set to close
- Recovery protocol activation — partition healing, do not interrupt
- Safe recovery completion — all nodes converged, confirm log consistency
Up to three concurrent transitions at any affected node. under this enumeration.
The enumeration above omits a persistent operational obligation that does not appear in any incident title: the dependency graph itself must be garbage-collected, and unlike Raft’s log — truncated at a monotone committed index with zero distributed coordination — the EPaxos dependency DAG has no locally determinable safe compaction boundary. A delayed message from a slow replica can extend live dependencies past the apparent frontier, making premature compaction a silent data-loss path rather than a crash; the GC coordination pass cannot safely execute during the partition that triggered the incident, accumulates debt under load spikes, and mirrors the tombstone-accumulation dynamic discussed for conflict-free merge structures above — the same tax relocated from the write-free merge path to the consensus commit path, where Raft’s linear log imposes none of it. Raft’s cognitive load under its common failure mode is one-third of EPaxos ’s. The difference is not a verdict on protocol quality — EPaxos is the stronger protocol. It is the cost of the trade-off the team accepted: lower at higher .
The Operability axis — conflict-free merge for the rate limiter counter. Teams adopting an append-only distributed tally (conflict-free merge) for the global rate limiter counter typically expect lower operational complexity than Raft: no leader election, no quorum configuration, no split-brain risk. The cognitive reality is different. The following enumeration is specific to an append-only counter deployment with anti-entropy sync, tombstone GC, and a global quota split across two regions; a simpler deployment without GC would have fewer states.
Reference points: a single-region central Redis counter has three failure modes (node failure, network timeout, memory exhaustion) with one concurrent transition at a time — . Regional Raft adds leader election: .
The append-only distributed tally (conflict-free merge) failure-mode space for this deployment:
- Increment acknowledged locally before propagation
- Anti-entropy sync triggered mid-window
- Merge applied to diverged regional counts
- Tombstone GC scheduled while partition is active
- Partition divergence detected — both regions individually below their local limits but combined above the global quota
- GC debt accumulation blocking sync
Up to three transitions simultaneously — merge during active increment traffic, GC during partition recovery, anti-entropy during tombstone growth. under this enumeration: the same figure as EPaxos , for a data structure the team adopted to replace Redis. The Operability axis is not a soft concern. It is a fifth coordinate of the achievable region, and the direction of movement from a central leader to an append-only distributed tally (conflict-free merge) is not toward simplicity.
The loan structure. A consistency guarantee charges its RTT tax on every operation, whether or not the guarantee is exercised. The cognitive tax charges differently: zero during normal operation, spiking during incidents proportional to the failure mode’s state-space depth. But the expected cognitive cost is non-zero: it is the cognitive loan rate times the incident frequency — the expected cognitive expenditure per unit time, paid at irregular intervals at maximum stress.
The Complexity/Safety Trade-Off
Moving along the frontier for 5% more throughput by adopting a lower- protocol buys headroom the team can measure on a load test. It charges a cognitive cost the team cannot see until the incident that exercises the new failure mode. That cost is not incalculable — it is not measured before the trade-off is made.
The net-gain formula. Let be the throughput improvement from the protocol change (ops/sec), be the value per additional operation, be the annual incident rate for the affected failure mode, and be the increase in MTTR caused by the higher cognitive load. The net annual gain of the protocol trade-off is:
where is the cost per incident-minute ( SLA penalty, revenue loss, engineer time). When net gain is negative, the protocol trade-off is a net loss despite its measurable throughput benefit. The throughput gain appears on the benchmark; the cognitive cost appears on the post-mortem.
Numbers. Adopting EPaxos over Raft reduces from approximately to , raising from 18 to 44 nodes. At , that improvement is approximately 23% additional throughput. For a service with ops/sec and conflict-storm events per year, a MTTR increase from Raft’s 45 minutes to EPaxos ’s 180 minutes ( minutes) costs engineer-minutes of additional annual outage plus any SLA penalties. Whether ops/sec at the chosen covers 3.4 hours of additional annual cognitive debt is a calculation specific to the service — but it is a calculation, not a judgment call.
| Protocol change | Throughput gain | MTTR multiplier | Net positive when | ||
|---|---|---|---|---|---|
| Raft to EPaxos | -0.0025 | ~23% at | +12 states | ~4x (conflict storm) | Incident rate < 0.5/yr or SLA cost below throughput value |
| Raft to conflict-free merge (writes) | ~-0.003 | ~30% write throughput | +6 ( read-path merge tax ) | 2x (merge conflict) | Read-path incidents < 1/yr or read consistency relaxable |
| Raft to multi-Raft (sharded) | per-shard identical | Linear in shard count | +4 per shard boundary | 1.5x per involved shard | Cross-shard incident rate proportional to added boundary count |
Watch out for: cognitive load is not fixed across teams. A team that has operated EPaxos for three years has ; a team migrating from a single-leader database starts there. Cognitive load is measured relative to the team’s current operational competency, not to an absolute protocol complexity score. A runbook that covers the EPaxos conflict storm recovery path converts implicit state-machine knowledge into documented procedure — reducing before the incident rather than during it. Runbooks are not documentation hygiene; they are cognitive tax payments made at development time rather than at incident time.
Three positions on the cognitive axis tell the team which way to move:
-
— cognitive interior. The team’s operational understanding exceeds the protocol stack’s failure-mode complexity. Room exists to adopt lower- protocols without exceeding operational capacity. The next move is toward the frontier: reduce , and the cognitive cost is still inside the team’s budget.
-
— cognitive frontier. Every protocol change that reduces must be evaluated against the corresponding increase in . The team is at the operational boundary; adding a failure mode without runbook coverage risks pushing an incident past the TTD budget. The minimum-operable from Step 5 of the Pareto Ledger is here bounded from above by the team’s cognitive ceiling, not only from below by the hardware floor.
-
— cognitive debt. The protocol stack’s failure-mode complexity exceeds the team’s operational capital. Incidents require escalation, extended diagnosis, or blind restart as a substitute for understanding. The remedies are protocol simplification (accept a higher to reduce ), runbook investment (reduce by converting implicit knowledge to documented procedure), or team capability investment (raise through training and incident practice).
Named failure mode: beta arbitrage. A team adopts EPaxos over Raft, observing a 23% throughput improvement on load tests. Six months later, a conflict storm triggers EPaxos ’s dependency graph repair path — a failure mode Raft never exhibits; the runbook covers only Raft recovery. Resolution: restart all nodes. MTTR : 4 hours. The load test showed the gain; the post-mortem showed the cost.
The frontier moved outward on the throughput axis and inward on the operational axis simultaneously. Reducing without a corresponding runbook investment leaves the cognitive ceiling undefended: the protocol complexity rises but the team’s operational capital does not. This is beta arbitrage — borrowing operational safety to pay for throughput headroom, with the loan called during the first incident that exercises the new failure mode.
Synthesis — The Three Logical Taxes on the Achievable Region
Every result in this post is a contraction of the achievable region across three coupled coordinates: consistency, latency, and operability — the cognitive load the protocol imposes on incident diagnosis and recovery. A position Pareto-optimal in the (latency, consistency) plane is a production failure if exceeds the team’s operational ceiling: the operating point is optimal in two coordinates and catastrophic in the third, timed to reveal itself at 3am. The physics taxes from The Physics Tax — the Contention Tax , the Coherency Floor , the Geometric Tax (tail latency fan-out) — contract the throughput-latency axes. The three logical taxes here contract the consistency-latency and operational axes. Together they define the boundary within which every architectural decision must operate.
The consistency spectrum tax prices every level of the consistency partial order in multiples of . Strict serializability costs per cross-shard write. Causal consistency costs in coordination but in metadata per message. Every step between those endpoints has a measurable price — the consistency tax function from Definition 14 . The RTT floor at each level is irreducible: it is the definition of the guarantee, not an implementation artifact.
The consensus protocol tax determines the specific value that your system actually pays. Raft at places . EPaxos fast path at places . HotStuff opens the BFT axis entirely. Conflict-free merge structures set on writes at the cost of causal-only guarantees and read-path merge overhead. The protocol decision matrix maps each commitment to the frontier position it enables.
The cognitive tax prices a fifth axis — operational understanding — that the first two taxes leave unpriced. Reducing by adopting EPaxos over Raft charges cognitive load : the on-call engineer must hold more concurrent state-machine transitions in working memory to diagnose the failure modes that lower- protocols introduce. The minimum-operable is bounded not only from below by the hardware floor but from above by the team’s operational ceiling — the maximum cognitive load serviceable within the incident TTD budget.
The following diagram shows how the three logical taxes contract the achievable region from different axes simultaneously.
%%{init: {'theme': 'neutral'}}%%
flowchart TD
FULL["Achievable region
carved by CAP, FLP, SNOW, HAT"]:::root
CONSISTENCY["Consistency spectrum tax
each level adds RTTs or metadata
prices denominated in L"]:::branch
CONSENSUS["Consensus protocol tax
protocol sets beta and N_max
cross-region bounded by D/100 ms"]:::branch
COGNITIVE["Cognitive tax
protocol failure-mode complexity
O_protocol = states x concurrent transitions"]:::branch
CONTRACTED["Contracted achievable region
consistency-latency-operational axes narrowed
RTT floor and MTTR floor at every point"]:::leaf
FULL -->|"consistency tax applied"| CONSISTENCY
FULL -->|"protocol tax applied"| CONSENSUS
FULL -->|"cognitive tax applied"| COGNITIVE
CONSISTENCY -->|"RTT pricing constrains latency"| CONTRACTED
CONSENSUS -->|"beta sets throughput ceiling"| CONTRACTED
COGNITIVE -->|"O_protocol bounds minimum-operable beta"| CONTRACTED
classDef root fill:none,stroke:#333,stroke-width:3px;
classDef branch fill:none,stroke:#ca8a04,stroke-width:2px;
classDef leaf fill:none,stroke:#333,stroke-width:1px;
Read the diagram. The achievable region from The Impossibility Tax is contracted three times in this post. The consistency spectrum tax constrains how much latency each consistency level costs — priced in . The consensus protocol tax constrains how much throughput each protocol permits — priced in and . The cognitive tax constrains which protocols the team can actually operate — priced in and incident MTTR . All three contractions are independent: no protocol choice eliminates the consistency tax, no consistency level eliminates the protocol tax, and no architectural diagram eliminates the cognitive tax.
The map now has four constraints: impossibility theorems from The Impossibility Tax carve the boundary; the physics taxes from The Physics Tax contract the throughput-latency axes; the three logical taxes from this post contract the consistency-latency and operational axes.
Ledger Update — . This post adds a second component to the cumulative tax vector first assembled in The Physics Tax: — or, for conflict-free merge deployments where , the extended form as defined in the Read-Path Merge Tax above. The protocol’s coherency coefficient sets the throughput ceiling via the same USL formula as the physics — but where physics is hardware-determined, is protocol-selected. The write latency floor prices the consistency guarantee in round-trips at your measured P99 inter-node RTT . The cognitive load from Definition 16 prices the protocol’s 3am debuggability. The Pareto Ledger audit from The Physics Tax extends naturally: every ledger entry that records and must also record which protocol sets those values, what consistency level it enforces, and what the team carries as a consequence.
The partial-partition test reveals whether a consistency level was chosen deliberately or by default. During a partial partition — where some but not all nodes can communicate — each consistency level responds with a different position in the achievable region. A strict-serial system either blocks writes until quorum is re-established or fails them explicitly: availability degrades, but the consistency coordinate is maintained. A causal system continues accepting writes on both sides of the partition and merges divergence on recovery through vector clock reconciliation: availability holds, but the consistency coordinate weakens temporarily. A system that claims strict serializability but silently serves stale reads across a partition has exited the achievable region without announcing it — it occupies an excluded corner at runtime, not through design. The CAP result from The Impossibility Tax names this geometry; the operational consequence is that a consistency level in a configuration is a partition-behavior specification. A team that has never explicitly chosen behavior under partition has implicitly delegated that decision to the defaults of whatever database they deployed. Consistency is not a safety default — it is a deliberate expense with a specific cost in RTT s, metadata overhead, and availability trade-off during the events where the guarantee matters most.
Pareto Ledger — Loan Servicing
Right-sizing the loan is the discipline of matching consistency level to application requirement — no higher, no lower. Overpayment is coordination overhead charged unconditionally on every operation with no correctness benefit: if the application does not require linearizability, there is no engineering justification for paying the RTT tax linearizability charges. That interest is waste — not safety. Underpayment is a deferred correctness debt that accrues silently until a partition or ordering anomaly makes it visible as data corruption, stale reads with no staleness indicator, or split-brain that triggers no alert. Both errors are avoidable once the loan is named and priced.
The Pareto Ledger for this post is the loan-servicing audit that makes overpayment and underpayment visible: for each operation class in the system, determine what consistency loan it has taken, what interest rate it is paying, and whether the loan is necessary. The interest rate is — your measured P99 inter-node RTT . Every consistency level in the price list is denominated in that unit.
Step 1 — Inventory the loans. For each operation class (writes, reads, cross-shard transactions, background reconciliation), identify the current consistency level in the Viotti-Vukolic order. This is often not documented — it is implicit in the protocol configuration, the database client settings, or the default of the framework. An undocumented loan is being paid unconditionally.
Step 2 — Price the interest. Apply the RTT price list from the Consistency Price List section. For each operation class: write RTTs at the current consistency level measured P99 = added latency per operation. For causal-and-below classes: metadata per message message rate = metadata bandwidth cost. Both are the interest rate, paid on every operation, regardless of whether the application exercises the guarantee.
Step 3 — Identify the minimum required loan. For each operation class, state the application-level invariant that requires the current consistency level. “Payment debit must be atomic with inventory reservation” requires cross-shard strict serializability. “Product catalog reads tolerate 5s staleness” requires only read-your-writes. If no application invariant can be stated, the loan was taken by default, not by requirement. Default loans are the primary source of overpayment in production systems.
Step 4 — Compute the overpayment. For each class where the current consistency level exceeds the minimum required: . This is interest paid on a loan the application never needed. At 100K writes/sec, an unnecessary cross-shard 2PC tier (2 extra RTTs at ) costs 600ms of avoidable write latency in cumulative per-second coordination overhead — paid whether the application is under load or not.
Step 5 — Select the minimum- protocol. For each operation class, identify the protocol in the decision matrix ( Definition 15 ) that achieves the minimum required consistency level at the lowest logical coherency coefficient your team can operate. The lowest operable is bounded from below by the hardware floor ( Definition 11 , The Physics Tax) and from above by the team’s cognitive ceiling ( Definition 16 in this post).
Step 6 — Price the cognitive cost. For each protocol under consideration in Step 5, compute the cognitive load using Definition 16 , and apply the net-gain formula: . If the net gain is negative, the throughput improvement from the lower does not cover the expected MTTR cost of the higher cognitive load — the trade-off moves the operational achievable region inward even as it moves the throughput axis outward. Document this explicitly: a protocol change with negative net gain is a movement along one frontier axis and away from another, not a pure improvement.
Consistency Loan Servicing — Case Study: Consistency-Budget Allocation for Rate-Limit Overages. Applying the six-step ledger to the global counter (1,000 req/min, US-East + EU-West, 50ms increment SLA ):
Step 1 — Inventory. Three operation classes: counter increment (write), quota check (read), cross-region reconciliation (background sync). Deployed configuration: global Raft quorum on increment, ReadIndex on quota check, full-state replication on sync.
Step 2 — Price the interest. Increment at global Raft quorum: per write. This is a latency floor, not a throughput ceiling — Raft batches concurrent increments into a single AppendEntries round-trip, so 17 concurrent writes can commit together in one 100ms window. The binding constraint is latency: every increment in every batch waits at minimum 100ms for the cross-region commit acknowledgment. The 50ms increment SLA is violated unconditionally on every write — no amount of parallelism or batching changes the propagation floor.
Step 3 — Minimum required loan. The invariant “never permit more than 1,000 req/min globally” admits a softer reading: over any 1-second enforcement window, total admitted count must not exceed 1,000. That does not require global Raft quorum. It requires bounded drift: additional requests above the limit per convergence window — negligible for an API rate limiter.
Step 4 — Compute the overpayment. Current deployment pays for strict serializability; application requires bounded drift. Overpayment: of avoidable coordination per second — two extra RTT s per increment (cross-region quorum minus intra-region quorum) at the target quota rate.
Step 5 — Minimum- protocol. Regional Raft (intra-DC, ) achieves bounded drift at 1–5ms per increment — dropping the commit latency floor from 100ms to 1–5ms and bringing every write inside the 50ms SLA. The Overage Rate — the fraction of traffic admitted above the global quota per convergence window — becomes the explicit interest payment accepted in exchange for lower latency.
Step 6 — Price the cognitive cost. Using the enumerations from the worked examples above: . Append-only counter (conflict-free merge): — three times the cognitive load of Raft under these enumerations for equivalent quota guarantees, with an unbounded Overage Rate during partitions. At conflict-free merge incidents per year and minutes added MTTR , the cognitive cost is engineer-minutes annually. The write throughput gain from must cover that cost for the conflict-free merge adoption to be net-positive.
| Configuration | Increment latency | Sequential ceiling(†) | Overage Rate | (**) | Net gain positive when |
|---|---|---|---|---|---|
| Global Raft quorum | 100ms | ~10/sec | 0% | 6 | Never — 100ms latency floor violates 50ms SLA |
| Regional Raft | 1–5ms | 200–1,000/sec | less than 2% at 100ms lag | 6 | Always — same , 50ms SLA met |
| Append-only counter (conflict-free merge) | less than 1ms write | Unbounded | ~40%+ during partition(*) | 18 | Never for hard-quota rate limiting |
(**)The values in this table are derived from the specific failure-mode enumerations in the worked examples above. The ordinal relationship (conflict-free merge harder than Raft) is the load-bearing claim; the specific numbers (6, 18) are products of one team’s enumeration. A team that has seen split-vote failures in Raft production would enumerate four Raft states and get ; a simpler conflict-free merge deployment without tombstone GC would enumerate fewer states. Use to make your team’s enumeration explicit and comparable, not to read a protocol’s cognitive complexity off a table.
(†)The sequential ceiling is throughput for a single-threaded client issuing one write at a time: 1,000ms / commit-latency. Raft pipelining and batching lift this ceiling substantially — concurrent requests can be batched into a single AppendEntries round-trip, committing hundreds or thousands of increments per 100ms window. The sequential ceiling is not the binding constraint for Global Raft quorum in this case study; the 100ms latency floor is.
(*)The ~40%+ conflict-free merge overage rate during partition is a workload-dependent illustrative estimate, not a measured value. It is derived from the scenario: US-East and EU-West each read a merged count of 800 req/min at the start of a partition, each admits up to 200 req/min more independently, yielding up to 400 additional requests — approximately 40% of the 1,000 req/min quota. The actual overage depends on partition duration, inter-region traffic mix, merge frequency, and the enforcement window. Longer partitions, higher traffic, and less frequent merges produce higher overage; shorter partitions and more frequent merges produce lower. No citation or simulation is offered for this specific figure; treat it as a plausible order-of-magnitude for planning, not a measured bound.
The Overage Rate is the Consistency Loan’s interest payment: the fraction of requests admitted above the global quota accepted in exchange for lower write latency. For a hard-quota rate limiter, an Overage Rate bounded under 5% is typically acceptable; an unbounded Overage Rate during partitions is not. This places append-only counter conflict-free merges outside the achievable region for hard-quota enforcement, and global Raft outside the achievable region for a 50ms increment SLA . The rate limiter counter has one viable operating point: regional Raft with explicit Overage Rate accounting.
Pareto Ledger — Logical Taxes
| Tax Type | Metric / Notation | Price Paid — Rate Limiter Case Study | Drift Trigger |
|---|---|---|---|
| Logical — Consistency | ( RTT price per write at level ) | (global Raft) moved to (regional Raft); 3,400ms/s avoidable coordination eliminated | — sync interval recalculation within 5 business days |
| Logical — Protocol | (Raft), ; EPaxos fast path: , | — protocol re-evaluation | |
| Logical — Operability | — observable states times concurrent transitions | ; conflict-free merge rejected: — same cognitive load as EPaxos , zero throughput gain | — runbook coverage audit before any protocol migration |
The migration that opened headroom — and what to do with it. The logical tax audit has a concrete operational conclusion. The rate limiter now runs regional Raft at
with
and a bounded Overage Rate under 2%. The 3,400 ms/s of avoidable cross-region coordination has been eliminated. But the operating point is still static: sync_interval is a scalar set at commissioning time, adjusted only when a human engineer reviews the Drift Triggers. At 1,000 req/min quota, the sync interval is a continuous trade-off between Overage Rate (tighten the interval: lower overage, higher bandwidth) and cost (relax the interval: higher overage, lower bandwidth). That trade-off changes with traffic patterns — peak traffic benefits from a tighter interval; off-peak traffic wastes bandwidth enforcing it. A static setting is either over-constraining at off-peak or under-constraining at peak. The headroom the logical tax audit recovered — by moving from global Raft to regional Raft — is real, but a static navigator cannot exploit it continuously.
To reclaim that headroom dynamically, the platform team upgraded the rate limiter with a multi-objective RL navigator: a learning agent that adjusts sync_interval at runtime based on observed traffic rate, overage count, and sync bandwidth — continuously finding the operating point that minimizes overage cost and bandwidth cost simultaneously. The migration converted a static configuration parameter into a runtime variable navigated by a stochastic decision process. The achievable region’s impossibility boundaries — its excluded corners — did not change; the mechanism for finding the operating point closest to the frontier within them did. The frontier’s precise location, shaped by
and the deployment environment, remains a live measurement target rather than a fixed coordinate. That upgrade — and the new class of costs it introduces — is where the next post begins.
References
-
P. Viotti, M. Vukolic. “Consistency in Non-Transactional Distributed Storage Systems.” ACM Computing Surveys, 49(1), 2016.
-
P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. Hellerstein, I. Stoica. “Highly Available Transactions: Virtues and Limitations.” VLDB, 7(3), 2014.
-
M. Shapiro, N. Preguica, C. Baquero, M. Zawirski. “Conflict-free Replicated Data Types.” SSS, 2011.
-
J. Corbett et al. “Spanner: Google’s Globally-Distributed Database.” ACM Transactions on Computer Systems, 31(3), 2013.
-
S. Kulkarni, M. Demirbas, D. Madappa, B. Avva, M. Leone. “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases.” OPODIS, 2014.
-
D. Ongaro, J. Ousterhout. “In Search of an Understandable Consensus Algorithm.” USENIX ATC, 2014.
-
I. Moraru, D. Andersen, M. Kaminsky. “There Is More Consensus in Egalitarian Parliaments.” SOSP, 2013.
-
V. Enes, C. Baquero, T. Rezende, A. Gotsman, M. Perrin, P. Sutra. “State-Machine Replication for Planet-Scale Systems.” EuroSys, 2020.
-
C. Dwork, N. Lynch, L. Stockmeyer. “Consensus in the Presence of Partial Synchrony.” Journal of the ACM, 35(2):288–323, 1988.
-
L. Lamport. “Fast Paxos.” Distributed Computing, 19(2):79–103, 2006.
-
M. Yin, D. Malkhi, M. Reiter, G. Golan Gueta, I. Abraham. “HotStuff: BFT Consensus with Linearity and Responsiveness.” PODC, 2019.
-
D. D. Akkoorath, A. Z. Tomsic, M. Bravo, Z. Li, T. Crain, A. Bieniusa, N. Preguiça, M. Shapiro. “Cure: Strong Semantics Meets High Availability and Low Latency.” ICDCS, 2016.
-
M. Bravo, L. Rodrigues, P. Van Roy. “Saturn: A Distributed Metadata Service for Causal Consistency.” EuroSys, 2017.
-
M. Kleppmann, A. R. Beresford. “A Conflict-Free Replicated JSON Datatype.” IEEE Transactions on Parallel and Distributed Systems, 28(10):2733–2746, 2017.
-
P. Nicolaescu, K. Jahns, M. Derntl, R. Klamma. “Near Real-Time Peer-to-Peer Shared Editing on Extensible Data Types.” GROUP, 2016.
-
H. Howard, D. Malkhi, A. Spiegelman. “Flexible Paxos: Quorum Intersection Revisited.” arXiv:1608.06696, 2016.
-
A. Thomson, T. Diamond, S. Weng, K. Ren, P. Shao, D. Abadi. “Calvin: Fast Distributed Transactions for Partitioned Database Systems.” SIGMOD, 2012.