Fleet Coherence Under Partition
Prerequisites
This article addresses the coordination challenge that emerges from the preceding foundations:
- Contested Connectivity: The Markov connectivity model establishes partition as the default state. The capability hierarchy (L0-L4) defines what must remain coherent under partition.
- Self-Measurement: Gossip-based health propagation creates distributed health knowledge, but gossip cannot reach all nodes during partition. State diverges.
- Self-Healing: Self-healing requires local decisions. Each cluster heals independently. When clusters reconnect, their healing histories may conflict.
The preceding articles give each node and cluster the capability to survive independently. But survival is not the mission. The mission requires coordination across the fleet. When partition separates clusters, each makes decisions based on local information. When partition heals, those decisions must be reconciled.
This is the coherence problem: maintaining consistent fleet-wide state when the network prevents communication. The CAP theorem tells us we cannot have both consistency and availability during partition. Edge systems choose availability—continue operating—and must reconcile consistency when partition heals.
Theoretical Contributions
This article develops the theoretical foundations for maintaining fleet coherence in partitioned distributed systems. We make the following contributions:
- State Divergence Metric: We formalize divergence as a normalized symmetric difference and derive its growth rate as a function of partition duration and event arrival rate.
- CRDT Applicability Analysis: We characterize the class of edge state that admits conflict-free replication and identify the semantic constraints imposed by different CRDT types.
- Hierarchical Authority Framework: We formalize decision scope classification and derive conditions for safe authority delegation during partition.
- Merkle-Based Reconciliation Protocol: We analyze the communication complexity of state reconciliation and prove \(O(\log n + k)\) message complexity for \(k\) divergent items in \(n\)-item state.
- Entity Resolution Theory: We formalize the observation merge problem and derive confidence update rules for multi-observer scenarios.
These contributions connect to and extend prior work on eventual consistency, CRDTs, and Byzantine agreement, adapting these frameworks for edge deployments with physical constraints.
Opening Narrative: CONVOY Split
CONVOY: 12 vehicles traverse a mountain pass. At km 47, terrain creates radio shadow.
Forward group (vehicles 1-5) receives SATCOM: bridge at km 78 destroyed, reroute via Route B. They adjust course.
Rear group (vehicles 6-12) receives ground relay minutes later: Route B blocked by landslide, continue to bridge. They maintain course.
When both groups emerge from the radio shadow with full connectivity:
- Vehicles 1-5: 8km west on Route B
- Vehicles 6-12: 8km east toward bridge
- Both acted correctly on available information
The coherence challenge: physical positions cannot be reconciled, but fleet state—route plan, decisions, threat assessments—must converge to consistent view.
The Coherence Challenge
Local Autonomy vs Fleet Coordination
Parts 1-3 developed local autonomy—essential, since without it partition means failure. But local autonomy creates coordination problems. Independent actions may:
- Complement: Node A handles zone X, Node B zone Y (good)
- Duplicate: Both handle zone X (wasted resources)
- Conflict: Incompatible actions (mission failure)
| Dimension | Local Autonomy | Fleet Coordination |
|---|---|---|
| Decision speed | Fast (local) | Slow (consensus) |
| Information used | Local sensors only | Fleet-wide picture |
| Failure mode | Suboptimal but functional | Complete if quorum lost |
| Partition behavior | Continues operating | Blocks waiting for consensus |
Coordination without communication is only possible through predetermined rules. If every node follows the same rules and starts with the same information, they will make the same decisions. But partition means information diverges—different nodes observe different events.
The tradeoff: more predetermined rules enable more coherence, but reduce adaptability. A fleet that pre-specifies every possible decision achieves perfect coherence but cannot adapt to novel situations. A fleet with maximum adaptability achieves minimum coherence—each node does its own thing.
Edge architecture must find the balance: enough rules for critical coherence, enough flexibility for operational adaptation.
State Divergence Sources
Definition 11 (State Divergence). For state sets \(S_A\) and \(S_B\) represented as key-value pairs, the divergence \(D(S_A, S_B)\) is the normalized symmetric difference:

\[D(S_A, S_B) = \frac{|S_A \,\triangle\, S_B|}{|S_A \cup S_B|}\]

where \(D \in [0, 1]\), with \(D = 0\) indicating identical states and \(D = 1\) indicating completely disjoint states.
During partition, state diverges through multiple mechanisms:
Environmental inputs differ. Each cluster observes different events. Cluster A sees threat T1 approach from the west. Cluster B, on the other side of the partition, sees nothing. Their threat models diverge.
Decisions made independently. Self-healing requires local decisions. Cluster A decides to redistribute workload after node failure. Cluster B, unaware of the failure, continues assuming the failed node is operational. Their understanding of fleet configuration diverges.
Time drift. Without network time synchronization, clocks diverge. After 6 hours of partition at 100 ppm drift, clocks differ by roughly 2 seconds. Timestamps become unreliable for ordering events.
Message loss. Before partition fully established, some gossip messages reach some nodes. The partial propagation creates uneven knowledge. Node A heard about event E before partition. Node B did not. Their histories diverge.
Proposition 12 (Divergence Growth Rate). If state-changing events arrive according to a Poisson process with rate \(\lambda\), the expected divergence after partition duration \(\tau\) is:

\[E[D(\tau)] \approx 1 - e^{-\lambda \tau}\]

Proof sketch: Model state as a binary indicator per key: identical (0) or divergent (1). Under independent Poisson arrivals with rate \(\lambda\), the probability that a given key remains synchronized is \(e^{-\lambda \tau}\). The expected fraction of divergent keys follows the complementary probability. For sparse state changes, \(E[D(\tau)] \approx 1 - e^{-\lambda \tau}\) provides a tight upper bound.

Corollary 5. Reconciliation cost is linear in divergence: \(\text{Cost}(\tau) = c \cdot D(\tau) \cdot |S_A \cup S_B|\), where \(c\) is the per-item sync cost.
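As a worked illustration of Proposition 12 and Corollary 5, the following Python sketch evaluates the expected divergence and reconciliation cost; the event rate, partition duration, state size, and per-item cost are illustrative values, not measurements from the article.

```python
import math

def expected_divergence(event_rate_hz: float, partition_s: float) -> float:
    """Proposition 12: E[D(tau)] ~ 1 - exp(-lambda * tau)."""
    return 1.0 - math.exp(-event_rate_hz * partition_s)

def reconciliation_cost(event_rate_hz: float, partition_s: float,
                        state_items: int, per_item_cost: float) -> float:
    """Corollary 5: cost = c * D(tau) * |S_A union S_B|."""
    return per_item_cost * expected_divergence(event_rate_hz, partition_s) * state_items

# Illustrative numbers: one state-changing event per hour on average,
# a 45-minute partition, 1,000 keys, 64 bytes transferred per divergent key.
lam = 1 / 3600.0          # events per second
tau = 45 * 60.0           # partition duration in seconds
print(f"E[D] = {expected_divergence(lam, tau):.2f}")                  # ~0.53
print(f"sync bytes ~ {reconciliation_cost(lam, tau, 1000, 64):.0f}")  # ~34 KB
```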
Conflict-Free Data Structures
CRDTs at the Edge
Definition 12 (Conflict-Free Replicated Data Type). A state-based CRDT is a tuple \((S, s^0, q, u, m)\) where \(S\) is the state space, \(s^0\) is the initial state, \(q\) is the query function, \(u\) is the update function, and \(m: S \times S \rightarrow S\) is a merge function satisfying:
- Commutativity: \(m(s_1, s_2) = m(s_2, s_1)\)
- Associativity: \(m(m(s_1, s_2), s_3) = m(s_1, m(s_2, s_3))\)
- Idempotency: \(m(s, s) = s\)
These properties make \((S, m)\) a join-semilattice, guaranteeing convergence regardless of merge order.
Conflict-free Replicated Data Types (CRDTs) are data structures designed for eventual consistency without coordination. Each node can update its local replica independently. When nodes reconnect, replicas merge deterministically to the same result regardless of message ordering.
If the merge operation is mathematically well-behaved, you get consistency for free.
| CRDT Type | Operation | Edge Application |
|---|---|---|
| G-Counter | Increment only | Message counts, observation counts |
| PN-Counter | Increment and decrement | Resource tracking (±) |
| G-Set | Add only | Surveyed zones, detected threats |
| 2P-Set | Add and remove (once) | Active targets, current alerts |
| LWW-Register | Last-writer-wins value | Configuration, status |
| MV-Register | Multi-value (preserve conflicts) | Concurrent updates |
G-Set example: RAVEN surveillance coverage
Each drone maintains a local set of surveyed grid cells. When drones reconnect, replicas merge by set union:

\[S_{\text{merged}} = S_A \cup S_B\]

The union is commutative (order doesn’t matter), associative (grouping doesn’t matter), and idempotent (merging twice gives the same result). These properties guarantee convergence.
Proposition 13 (CRDT Convergence). If all updates eventually propagate to all nodes (eventual delivery), and the merge function satisfies commutativity, associativity, and idempotency, then all replicas converge to the same state.
Proof sketch: Eventual delivery ensures all nodes receive all updates. The semilattice properties ensure merge order doesn’t matter. Therefore, all nodes applying all updates in any order reach the same state.

Edge suitability: CRDTs require no coordination during partition. Updates are local. Merge is deterministic. This matches edge constraints perfectly.
graph TD
subgraph During_Partition["During Partition (independent updates)"]
A1["Cluster A
State: {1,2,3}"] -->|"adds item 4"| A2["Cluster A
State: {1,2,3,4}"]
B1["Cluster B
State: {1,2,3}"] -->|"adds item 5"| B2["Cluster B
State: {1,2,3,5}"]
end
subgraph After_Reconnection["After Reconnection"]
M["CRDT Merge
(set union)"]
R["Merged State
{1,2,3,4,5}"]
end
A2 --> M
B2 --> M
M --> R
style M fill:#c8e6c9,stroke:#388e3c
style R fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
style During_Partition fill:#fff3e0
style After_Reconnection fill:#e8f5e9
The merge operation is automatic and deterministic—no conflict resolution logic needed. Both clusters’ contributions are preserved.
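A minimal G-Set sketch makes the merge-by-union behavior concrete. The grid-cell representation and class name below are illustrative, not a prescribed implementation.

```python
from dataclasses import dataclass, field

@dataclass
class GSet:
    """Grow-only set: local adds, merge by union (commutative, associative, idempotent)."""
    elements: set = field(default_factory=set)

    def add(self, element) -> None:
        self.elements.add(element)                     # local update, no coordination

    def merge(self, other: "GSet") -> "GSet":
        return GSet(self.elements | other.elements)    # set union

# Two clusters survey different grid cells during partition, then merge.
cluster_a, cluster_b = GSet(), GSet()
cluster_a.add((1, 2)); cluster_a.add((1, 3))
cluster_b.add((1, 3)); cluster_b.add((4, 7))
merged = cluster_a.merge(cluster_b)
assert merged.elements == cluster_b.merge(cluster_a).elements   # merge order irrelevant
```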
Limitations: CRDTs impose semantic constraints. A counter that only increments cannot represent a value that should decrease. A set that only adds cannot represent removal. Application data must be structured to fit available CRDT semantics.
Choosing the right CRDT: The choice depends on application semantics:
- G-Set: Simplest, lowest overhead, but no removal
- 2P-Set: Supports removal but element cannot be re-added
- OR-Set: Full add/remove semantics but higher overhead (unique tags per add)
- LWW-Element-Set: Timestamp-based resolution, requires clock synchronization
Bounded-Memory Tactical CRDT Variants
Standard CRDTs assume unbounded state growth—problematic for edge nodes with constrained memory. We introduce bounded-memory variants tailored for tactical operations.
Sliding-Window G-Counter:
Maintain counts only for recent time windows, discarding old history:

\[C(t) = \sum_{w \in W(t)} c_w\]

where \(W(t) = \{w : t - T_{\text{window}} \leq w < t\}\) is the active window set and \(c_w\) is the count accumulated in window \(w\). Memory: \(O(T_{\text{window}} / \Delta_w)\) instead of unbounded.
RAVEN application: Track observation counts per sector for the last hour. Older counts archived to fusion node when connectivity permits, then pruned locally.
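One possible realization of the sliding-window counter, assuming fixed-width time buckets and per-node entries so the merge remains an elementwise max; the bucket width and window length are illustrative parameters.

```python
from collections import defaultdict

class SlidingWindowCounter:
    """G-Counter restricted to recent time buckets; older buckets are pruned (sketch)."""

    def __init__(self, node_id: str, window_s: float = 3600.0, bucket_s: float = 60.0):
        self.node_id = node_id
        self.window_s = window_s        # T_window: history to retain
        self.bucket_s = bucket_s        # Delta_w: bucket width
        # counts[bucket_start][node_id] -> count; per-node entries keep merge idempotent
        self.counts = defaultdict(dict)

    def increment(self, now: float) -> None:
        bucket = now - (now % self.bucket_s)
        per_node = self.counts[bucket]
        per_node[self.node_id] = per_node.get(self.node_id, 0) + 1

    def merge(self, other: "SlidingWindowCounter") -> None:
        """Elementwise max per (bucket, node): commutative, associative, idempotent."""
        for bucket, per_node in other.counts.items():
            mine = self.counts[bucket]
            for node, count in per_node.items():
                mine[node] = max(mine.get(node, 0), count)

    def value(self, now: float) -> int:
        self._prune(now)
        return sum(sum(v.values()) for v in self.counts.values())

    def _prune(self, now: float) -> None:
        cutoff = now - self.window_s
        self.counts = defaultdict(dict, {b: v for b, v in self.counts.items() if b >= cutoff})
```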
Bounded OR-Set with Eviction:
Limit set cardinality with priority-based eviction:

\[\text{add}(S, e) = \begin{cases} S \cup \{e\} & \text{if } |S| < M_{\text{max}} \\ (S \setminus \{e_{\text{min}}\}) \cup \{e\} & \text{otherwise} \end{cases}\]

where \(e_{\text{min}} = \arg\min_{e' \in S} \text{priority}(e')\). The eviction maintains CRDT properties.

Eviction commutativity proof sketch: Define \(\text{evict}(S) = S \setminus \{e_{\text{min}}\}\). For a deterministic priority function, \(\text{evict}(\text{merge}(S_A, S_B)) = \text{merge}(\text{evict}(S_A), \text{evict}(S_B))\) when both exceed \(M_{\text{max}}\).
Priority functions for tactical state:
- Threat entities: Priority = threat level × recency
- Coverage cells: Priority = strategic value × observation freshness
- Health records: Priority = criticality × staleness (inverse)
CONVOY application: Track at most 50 active threats. When capacity exceeded, evict lowest-priority (low-threat, stale) entities. Memory: fixed 50 × sizeof(entity) regardless of operation duration.
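A simplified sketch of priority-based eviction at merge time. It omits the unique add-tags a full OR-Set carries for remove semantics, and the priority function, record fields, and capacity are illustrative.

```python
def merge_bounded(set_a: dict, set_b: dict, max_size: int, priority) -> dict:
    """Merge two bounded entity sets (entity_id -> record), then evict deterministically.

    Sketch only: `priority` is an application-supplied function such as level * freshness.
    """
    merged = dict(set_a)
    for key, record in set_b.items():
        # keep the higher-priority record when both replicas carry the same entity
        if key not in merged or priority(record) > priority(merged[key]):
            merged[key] = record
    # deterministic eviction: drop lowest-priority entries until within budget
    while len(merged) > max_size:
        victim = min(merged, key=lambda k: priority(merged[k]))
        del merged[victim]
    return merged

# Illustrative usage: keep at most 2 threats, priority = level * freshness.
a = {"T1": {"level": 0.9, "freshness": 1.0}, "T2": {"level": 0.2, "freshness": 0.3}}
b = {"T1": {"level": 0.9, "freshness": 0.8}, "T3": {"level": 0.7, "freshness": 0.9}}
prio = lambda r: r["level"] * r["freshness"]
merged = merge_bounded(a, b, max_size=2, priority=prio)   # evicts the stale, low-level T2
```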
Compressed Delta-CRDT:
Standard delta-CRDTs transmit state changes. We compress deltas using domain-specific encoding, driving the transmitted size toward the entropy \(H(\Delta)\) of the delta. For tactical state with predictable patterns, compression achieves a 3-5× reduction.
Compression techniques:
- Spatial encoding: Position updates as offsets from predicted trajectory
- Temporal batching: Multiple updates to same entity merged before transmission
- Dictionary encoding: Common values (status codes, threat types) as indices
OUTPOST application: Sensor health updates compress to 2-3 bytes per sensor versus 32 bytes uncompressed; the health of a 127-sensor mesh fits in a single packet.
Hierarchical State Pruning:
Tactical systems naturally have hierarchical state importance:
| Level | Retention | Pruning Trigger |
|---|---|---|
| Critical (threats, failures) | Indefinite | Never auto-prune |
| Operational (positions, status) | 1 hour | Time-based |
| Diagnostic (detailed health) | 10 minutes | Memory pressure |
| Debug (raw sensor data) | 1 minute | Aggressive |
State automatically demotes under memory pressure:

\[\text{level}(s) \gets \max\big(\text{level}(s) - 1,\ \text{level}_{\min}(s)\big)\]

where \(\text{level}_{\min}(s)\) is the minimum level for state type \(s\).
Memory budget enforcement:
Each CRDT type has a memory budget \(B_i\). Total memory must stay within the available budget:

\[M_{\text{total}} = \sum_i B_i \leq M_{\text{available}}\]

When approaching the limit, the system:
- Prunes diagnostic/debug state
- Compresses operational state
- Evicts low-priority entries from bounded sets
- Archives to persistent storage if available
- Drops new low-priority updates as last resort
RAVEN memory profile: 50 drones × 2KB state budget = 100KB CRDT state. Bounded OR-Set for 200 threats (4KB), sliding-window counters for 100 sectors (2KB), health registers for 50 nodes (1.6KB). Total: ~8KB active CRDT state, well within budget.
Last-Writer-Wins vs Application Semantics
Last-Writer-Wins (LWW) is a common conflict resolution strategy: when values conflict, the most recent timestamp wins.
LWW works for:
- Configuration values (latest config should apply)
- Status updates (latest status is most relevant)
- Position reports (latest position is current)
LWW fails for:
- Counters (later increment doesn’t override earlier; both should apply)
- Sets with removal (later add doesn’t mean earlier remove didn’t happen)
- Causal chains (effect can have earlier timestamp than cause)
Edge complication: LWW assumes reliable timestamps. Clock drift makes “latest” ambiguous. If Cluster A’s clock is 3 seconds ahead of Cluster B, Cluster A’s updates always win—even if they’re actually older.
Vector Clocks for Causality
Before examining hybrid approaches, consider pure vector clocks. Each node \(i\) maintains a vector \(V_i[1..n]\) where \(V_i[j]\) represents node \(i\)’s knowledge of node \(j\)’s logical time.
Definition 13 (Vector Clock). A vector clock \(V\) is a function from node identifiers to non-negative integers. The vector clock ordering \(\leq\) is defined as:

\[V_1 \leq V_2 \iff \forall j:\ V_1[j] \leq V_2[j], \qquad V_1 < V_2 \iff V_1 \leq V_2 \text{ and } V_1 \neq V_2\]

Events are causally related iff their vector clocks are comparable; concurrent events have incomparable vectors.
Proposition 14 (Vector Clock Causality). For events \(e_1\) and \(e_2\) with vector timestamps \(V_1\) and \(V_2\):
- \(e_1 \rightarrow e_2\) (\(e_1\) happened before \(e_2\)) iff \(V_1 < V_2\)
- \(e_1 \parallel e_2\) (concurrent) iff \(V_1 \not\leq V_2\) and \(V_2 \not\leq V_1\)
The update rules are:
- Local event: \(V_i[i] \gets V_i[i] + 1\)
- Send message: Attach current \(V_i\) to message
- Receive message with \(V_m\): \(V_i[j] \gets \max(V_i[j], V_m[j])\) for all \(j\), then \(V_i[i] \gets V_i[i] + 1\)
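A minimal Python sketch of these update rules; the node identifiers and fixed membership list are assumptions for illustration.

```python
class VectorClock:
    """Per-node vector clock implementing the update rules listed above (sketch)."""

    def __init__(self, node_id: str, all_nodes: list):
        self.node_id = node_id
        self.clock = {n: 0 for n in all_nodes}

    def local_event(self) -> None:
        self.clock[self.node_id] += 1                 # local event: V_i[i] += 1

    def on_send(self) -> dict:
        return dict(self.clock)                       # attach a copy of V_i to the message

    def on_receive(self, remote: dict) -> None:
        for node, t in remote.items():                # elementwise max, then tick own entry
            self.clock[node] = max(self.clock.get(node, 0), t)
        self.clock[self.node_id] += 1

    @staticmethod
    def leq(a: dict, b: dict) -> bool:
        return all(a.get(n, 0) <= b.get(n, 0) for n in set(a) | set(b))

    @staticmethod
    def happened_before(a: dict, b: dict) -> bool:
        return VectorClock.leq(a, b) and a != b       # a -> b iff V_a < V_b

    @staticmethod
    def concurrent(a: dict, b: dict) -> bool:
        return not VectorClock.leq(a, b) and not VectorClock.leq(b, a)
```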
Edge limitation: Vector clocks grow linearly with node count. For a 50-drone swarm, each message carries 50 integers. For CONVOY with 12 vehicles, overhead is acceptable. For larger fleets, compressed representations or hierarchical clocks are needed.
Mitigation: Hybrid Logical Clocks (HLC) combine physical time with logical counters: each timestamp is a pair \((pt, c)\) compared lexicographically, where \(pt\) tracks the local physical clock and \(c\) is a logical counter that breaks ties and preserves causality.

HLCs provide causal ordering when clocks are close and total ordering otherwise. The physical component bounds divergence even when logical ordering fails.
CONVOY routing example: Vehicles 3 and 8 both update route:
- Vehicle 3: “Route via checkpoint A” at 14:32:17
- Vehicle 8: “Route via checkpoint B” at 14:32:19
With LWW, Vehicle 8’s route wins. But what if Vehicle 3 had more recent intel that arrived at 14:32:15 and took 2 seconds to process? The “winning” route may be based on stale information.
Application semantics matter. Route decisions should consider information freshness, not just decision timestamp.
Custom Merge Functions
When standard CRDTs don’t fit, define custom merge functions. The requirements are the same:
Commutative: \(\text{merge}(A, B) = \text{merge}(B, A)\)
Associative: \(\text{merge}(\text{merge}(A, B), C) = \text{merge}(A, \text{merge}(B, C))\)
Idempotent: \(\text{merge}(A, A) = A\)
Example: Surveillance priority list
Each cluster maintains a list of priority targets. During partition, both clusters may add or reorder targets.
Merge function:
- Union of all targets: \(T_{\text{merged}} = T_A \cup T_B\)
- Priority = maximum priority assigned by any cluster
- Flag conflicts where clusters assigned significantly different priorities
This is commutative and associative. Conflicts are flagged for human review rather than silently resolved.
Example: Engagement authorization
Critical: a target should only be engaged if both clusters agree.
Merge function: intersection, not union.
If Cluster A authorized target T but Cluster B did not, the merged state does not authorize T. Conservative resolution for high-stakes decisions.
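The two merge functions above might be sketched as follows; the conflict threshold and data shapes are illustrative.

```python
CONFLICT_THRESHOLD = 2   # illustrative: flag when assigned priorities differ by more than this

def merge_priority_lists(targets_a: dict, targets_b: dict):
    """Union of targets; priority = max over clusters; large disagreements are flagged."""
    merged, flagged = {}, []
    for target in set(targets_a) | set(targets_b):
        pa, pb = targets_a.get(target), targets_b.get(target)
        if pa is not None and pb is not None and abs(pa - pb) > CONFLICT_THRESHOLD:
            flagged.append(target)                    # human review, not silent resolution
        merged[target] = max(p for p in (pa, pb) if p is not None)
    return merged, flagged

def merge_engagement_auth(auth_a: set, auth_b: set) -> set:
    """Conservative merge for high-stakes decisions: only targets both clusters authorized."""
    return auth_a & auth_b

merged, review = merge_priority_lists({"T1": 5, "T2": 1}, {"T1": 9, "T3": 4})
assert merge_engagement_auth({"T1", "T2"}, {"T1"}) == {"T1"}
```

Both functions are symmetric in their arguments (commutative) and operate elementwise over keys, so repeated merging is idempotent; the flagged-conflict list is side information for review rather than part of the converged state.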
Verification: Custom merge functions must be proven correct. For each function, verify:
- Commutativity: formal proof or exhaustive testing
- Associativity: formal proof or exhaustive testing
- Idempotency: formal proof or exhaustive testing
- Safety: merged state satisfies application invariants
Hierarchical Decision Authority
Decision Scope Classification
Definition 14 (Decision Scope). The scope \(\text{scope}(d)\) of a decision \(d\) is the set of nodes whose state is affected by \(d\). Decisions are classified by scope cardinality: L0 (single node), L1 (local cluster), L2 (fleet-wide), L3 (command-level).
Not all decisions have the same scope. A decision affecting only one node is different from a decision affecting the entire fleet. Decision authority should match decision scope.
| Level | Scope | Examples |
|---|---|---|
| L0 | Single node | Self-healing, local sensor adjustment, power management |
| L1 | Local cluster | Formation adjustment, local task redistribution, cluster healing |
| L2 | Fleet-wide | Route changes, objective prioritization, resource reallocation |
| L3 | Command | Rules of engagement, mission abort, strategic reposition |
L0 decisions can always be made locally. No coordination required. If a drone’s sensor needs recalibration, it recalibrates. No need to consult the swarm.
L1 decisions require cluster-level coordination but not fleet-wide. If a cluster needs to adjust formation due to member failure, the cluster lead coordinates locally. Other clusters don’t need to know immediately.
L2 decisions should involve fleet-wide coordination when possible. Route changes affect the entire convoy. Objective prioritization affects how all clusters allocate effort. These decisions benefit from fleet-wide information.
L3 decisions require external authority. Engagement rules come from command. Mission abort requires command approval. These cannot be made autonomously regardless of connectivity.
During partition: L0 and L1 decisions continue normally. L2 decisions become problematic—fleet-wide coordination is impossible. L3 decisions cannot be made; the system must operate within pre-authorized bounds.
graph TD
subgraph Connected["Connected State (full hierarchy)"]
L3C["L3: Command
(strategic decisions)"] --> L2C["L2: Fleet
(fleet-wide coordination)"]
L2C --> L1C["L1: Cluster
(local coordination)"]
L1C --> L0C["L0: Node
(self-management)"]
end
subgraph Partitioned["Partitioned State (delegated authority)"]
L1P["L1: Cluster Lead
(elevated to L2 authority)"] --> L0P["L0: Node
(autonomous operation)"]
end
L1C -.->|"partition
event"| L1P
style L3C fill:#ffcdd2,stroke:#c62828
style L2C fill:#fff9c4,stroke:#f9a825
style L1C fill:#c8e6c9,stroke:#388e3c
style L0C fill:#e8f5e9,stroke:#388e3c
style L1P fill:#fff9c4,stroke:#f9a825
style L0P fill:#e8f5e9,stroke:#388e3c
Authority elevation during partition: When connectivity is lost, authority must be explicitly delegated downward. The system cannot simply assume lower levels can make higher-level decisions.
Authority Delegation Under Partition
When fleet-wide coordination is impossible, what authority do local nodes have?
Pre-delegated authority: Before mission start, define contingency authorities.
- “If partitioned for more than 30 minutes, cluster leads have L2 authority for routing decisions.”
- “If command unreachable for more than 2 hours, convoy lead has L3 authority for mission continuation.”
Bounded delegation: Authority expires or is limited in scope.
- “L2 authority for maximum 4 hours, then revert to L1.”
- “L2 authority for route changes only, not for objective changes.”
Mission-phase dependent: Authority varies by mission phase.
- “During critical phases, maintain strict L1 only.”
- “During emergency withdrawal, cluster leads have emergency L2 authority.”
Risk: Parallel partitions may both claim authority. Cluster A and Cluster B both think they’re the senior cluster and both make L2 decisions. On reconnection, they have conflicting fleet-wide decisions.
Mitigation: Tie-breaking rules defined in advance.
- “Cluster containing node with lowest ID has priority.”
- “Cluster with most recent command contact has priority.”
- GPS-based: “Cluster closest to objective has priority.”
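One possible encoding of pre-delegated, bounded authority with a deterministic tie-breaker; all field names, durations, and rules below are illustrative rather than prescribed.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationRule:
    """One pre-delegated authority grant (all fields illustrative)."""
    level: str                 # authority granted, e.g. "L2"
    scope: frozenset           # decision kinds covered, e.g. {"routing"}
    after_partition_s: float   # delegation activates after this partition duration
    expires_after_s: float     # delegation reverts after this long

RULES = [
    DelegationRule("L2", frozenset({"routing"}),
                   after_partition_s=30 * 60, expires_after_s=4 * 3600),
]

def authority_for(decision_kind: str, partition_s: float) -> str:
    """Highest authority level currently delegated for this decision kind."""
    for rule in RULES:
        active_until = rule.after_partition_s + rule.expires_after_s
        if decision_kind in rule.scope and rule.after_partition_s <= partition_s <= active_until:
            return rule.level
    return "L1"   # default: cluster-level authority only

def tie_break(cluster_a_nodes: set, cluster_b_nodes: set) -> str:
    """Deterministic tie-breaker: the cluster containing the lowest node ID has priority."""
    return "A" if min(cluster_a_nodes) < min(cluster_b_nodes) else "B"
```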
Conflict Detection at Reconciliation
When clusters reconnect, compare decision logs:
Detection: Identify overlapping authority claims.
Two decisions conflict if they affect overlapping scope and differ.
Classification: Reversible vs irreversible.
- Reversible: Route decisions before execution, target prioritization, resource allocation
- Irreversible: Physical actions taken, resources consumed, information disclosed
Resolution for reversible: Apply hierarchy.
If Cluster A made decision \(d_A\) and Cluster B made decision \(d_B\):
- If \(\text{authority}(A) > \text{authority}(B)\): \(d_A\) wins
- If \(\text{authority}(A) = \text{authority}(B)\): Apply tie-breaker
- Update both clusters to winning decision
Resolution for irreversible: Flag for human review.
Cannot undo physical actions. Log the conflict, document both decisions and outcomes, present to command for analysis. Learn from the conflict to improve future protocols.
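A sketch of the detection-and-resolution logic for decision logs exchanged at reconciliation; the record fields and tie-breaker are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    """Illustrative decision-log entry exchanged at reconciliation."""
    cluster: str
    scope: frozenset      # nodes or resources affected
    action: str
    authority: int        # numeric authority rank of the deciding cluster
    reversible: bool

def conflicts(d1: Decision, d2: Decision) -> bool:
    """Two decisions conflict if their scopes overlap and their actions differ."""
    return bool(d1.scope & d2.scope) and d1.action != d2.action

def resolve(d1: Decision, d2: Decision, tie_break):
    """Apply the hierarchy for reversible conflicts.

    Returns the winning decision, or None when no automatic resolution applies
    (no conflict, or an irreversible conflict that goes to human review).
    """
    if not conflicts(d1, d2):
        return None
    if not (d1.reversible and d2.reversible):
        return None                       # irreversible: log and escalate
    if d1.authority != d2.authority:
        return d1 if d1.authority > d2.authority else d2
    return tie_break(d1, d2)              # equal authority: predetermined tie-breaker

winner = resolve(
    Decision("A", frozenset({"route"}), "route_B", authority=2, reversible=True),
    Decision("B", frozenset({"route"}), "bridge", authority=2, reversible=True),
    tie_break=lambda a, b: a if a.cluster < b.cluster else b,
)
```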
Reconnection Protocols
State Reconciliation Sequence
When partition heals, clusters must reconcile state efficiently. Bandwidth may be limited during reconnection window. Protocol must be robust to partial completion if partition recurs.
Phase 1: State Summary Exchange
Each cluster computes a compact summary of its state using Merkle trees:

\[\text{leaf}_i = H(s_i), \qquad \text{node} = H(\text{child}_{\text{left}} \,\|\, \text{child}_{\text{right}})\]

where \(H\) is a hash function and \(s_i\) are state elements; the root hash summarizes the entire state.
Exchange roots. If roots match, states are identical—no further sync needed.
Phase 2: Divergence Identification
If roots differ, descend Merkle tree to identify divergent subtrees. Exchange hashes at each level until divergent leaves are found.
Proposition 15 (Reconciliation Complexity). For \(n\)-item state with \(k\) divergent items, Merkle-based reconciliation requires \(O(\log n + k)\) messages: \(O(\log n)\) to traverse the tree and identify divergences, plus \(O(k)\) to transfer divergent data.
Proof: The Merkle tree has height \(O(\log n)\). In each round, parties exchange hashes for differing subtrees. At level \(i\), at most \(\min(k, 2^i)\) subtrees differ. Summing across \(O(\log(n/k))\) levels until subtrees contain \(\leq 1\) divergent item yields \(O(k)\) hash comparisons. Adding \(O(k)\) data transfers gives total complexity \(O(k \log(n/k) + k) = O(k \log n)\) in the worst case, or \(O(\log n + k)\) when divergent items cluster spatially.

Phase 3: Divergent Data Exchange
Transfer the actual divergent key-value pairs. Prioritize by importance (see Priority Ordering for Sync below).
Phase 4: Merge Execution
Apply CRDT merge or custom merge functions to divergent items. Compute unified state.
Phase 5: Consistency Verification
Recompute Merkle roots. Exchange and verify they now match. If mismatch, identify remaining divergences and repeat from Phase 3.
Phase 6: Coordinated Operation Resumption
With consistent state, resume fleet-wide coordination. Notify all nodes that coherence is restored.
graph TD
A["Partition Heals
(connectivity restored)"] --> B["Exchange Merkle Roots
(state fingerprints)"]
B --> C{"Roots
Match?"}
C -->|"Yes"| G["Resume Coordination
(fleet coherent)"]
C -->|"No"| D["Identify Divergences
(traverse Merkle tree)"]
D --> E["Exchange Divergent Data
(priority-ordered)"]
E --> F["Merge States
(CRDT merge)"]
F --> B
style A fill:#c8e6c9,stroke:#388e3c
style G fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
style C fill:#fff9c4,stroke:#f9a825
style D fill:#bbdefb
style E fill:#bbdefb
style F fill:#bbdefb
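A compact sketch of Merkle-style divergence identification: subtree hashes are compared and only differing branches are descended. Both replicas are local here, so the hash exchange that would occur over the link is simulated, and the shared sorted key universe is a simplification of the key-range agreement a real protocol would perform.

```python
import hashlib

def subtree_hash(state: dict, keys: list) -> bytes:
    """Hash of one replica's values over a key range (stands in for a Merkle subtree root)."""
    payload = repr([(k, state.get(k)) for k in keys]).encode()
    return hashlib.sha256(payload).digest()

def find_divergent(state_a: dict, state_b: dict, keys=None) -> list:
    """Descend the key space only where subtree hashes differ.

    Returns the keys whose values must actually be transferred and merged.
    """
    if keys is None:
        keys = sorted(set(state_a) | set(state_b))
    if subtree_hash(state_a, keys) == subtree_hash(state_b, keys):
        return []                       # identical subtree: prune this branch
    if len(keys) == 1:
        return keys                     # divergent leaf: ship the key/value pair
    mid = len(keys) // 2
    return (find_divergent(state_a, state_b, keys[:mid]) +
            find_divergent(state_a, state_b, keys[mid:]))

a = {"route": "B",      "threat_T1": "west",  "fuel": 0.7}
b = {"route": "bridge", "threat_T2": "north", "fuel": 0.7}
print(find_divergent(a, b))   # ['route', 'threat_T1', 'threat_T2']; 'fuel' subtree pruned
```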
Priority Ordering for Sync
Limited bandwidth during reconnection requires prioritization.
Priority 1: Safety-critical state
- Node availability (who is alive?)
- Threat locations (where is danger?)
- Critical failures (what is broken?)
Priority 2: Mission-critical state
- Objective status (what is complete?)
- Resource levels (what remains?)
- Current positions (where is everyone?)
Priority 3: Operational state
- Detailed sensor readings
- Historical positions
- Non-critical health metrics
Priority 4: Audit and logging
- Decision logs
- Event timestamps
- Diagnostic data
Sync Priority 1 first. If partition recurs, at least safety-critical state is consistent. Lower priorities can wait for more stable connectivity.
Optimization: Order sync items by expected information value:

\[\text{value}(i) = \text{impact}(i) \times \text{staleness}(i)\]

High-impact, stale items should sync first. Low-impact, fresh items can wait.
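A minimal sketch of priority-ordered sync: items sort by priority class first, then by impact × staleness within a class. The type names and numbers are illustrative.

```python
# Illustrative mapping of state types to the priority classes above (1 = safety-critical).
PRIORITY_CLASS = {"node_availability": 1, "threat_location": 1,
                  "objective_status": 2, "resource_level": 2,
                  "sensor_detail": 3, "decision_log": 4}

def sync_order(items: list) -> list:
    """Safety-critical classes first; within a class, highest impact x staleness first."""
    return sorted(items, key=lambda i: (PRIORITY_CLASS[i["type"]],
                                        -(i["impact"] * i["staleness_s"])))

queue = sync_order([
    {"type": "decision_log",     "impact": 0.2, "staleness_s": 900},
    {"type": "threat_location",  "impact": 0.9, "staleness_s": 300},
    {"type": "objective_status", "impact": 0.6, "staleness_s": 600},
])   # threat_location syncs first, decision_log last
```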
Handling Actions Taken During Partition
Physical actions cannot be “merged” logically. If Cluster A drove north and Cluster B drove south, they cannot merge to “drove north and south simultaneously.”
Classification of partition actions:
Complementary actions: Both clusters did useful, non-overlapping work.
- Cluster A surveyed zone X, Cluster B surveyed zone Y
- Combined coverage is union: excellent outcome
Redundant actions: Both clusters did the same work.
- Both surveyed zone X
- Wasted effort but no harm
Conflicting actions: Actions are mutually incompatible.
- Cluster A classified entity T as anomaly and flagged for intervention
- Cluster B classified entity T as normal and continued monitoring
- Cannot reconcile: T was either anomalous or normal
Resolution by type:
| Type | Detection | Resolution |
|---|---|---|
| Complementary | Non-overlapping scope | Accept both; update state |
| Redundant | Identical scope and action | Deduplicate; note inefficiency |
| Conflicting | Overlapping scope, different action | Flag for review; assess damage |
Audit trail: All partition decisions must be logged with:
- Timestamp and node ID
- Information available at decision time
- Decision made and rationale
- Outcome observed
Post-mission review uses audit trail to:
- Identify conflict patterns
- Improve decision rules
- Update training data for future operations
CONVOY Coherence Protocol
Return to the CONVOY partition at the mountain pass.
State During Partition
Forward group (vehicles 1-5):
- Route: Via Route B
- Lead: Vehicle 1
- Intel: Bridge destroyed (received first)
- Decision authority: L2 (lead assumed authority after 35 minutes of partition)
Rear group (vehicles 6-12):
- Route: Via bridge
- Lead: Vehicle 6
- Intel: Route B blocked (received minutes later)
- Decision authority: L2 (lead assumed authority after 35 minutes of partition)
State divergence:
- Route plan: CONFLICTING (Route B vs bridge)
- Position: DIVERGENT (8 km separation)
- Intel database: DIVERGENT (different threat reports)
Reconnection at Mountain Base
Radio contact restored as both groups clear the mountain pass.
Phase 1: Vehicle 1 and Vehicle 6 exchange state summaries.
- Merkle roots differ
- Quick comparison shows route divergence
Phase 2: Identify specific divergences.
- Route decision differs
- Position differs
- Intel items differ
Phase 3: Exchange divergent data.
- Forward group shares Route B success
- Rear group shares bridge status (actually intact!)
- Both share complete intel received
Phase 4: Merge states.
Intel merge reconciles the conflicting reports: the bridge status, initially UNCERTAIN because regional command intel reported it destroyed, is updated to INTACT based on the rear group's confirmation. Route B's status, initially UNCERTAIN because the relayed landslide report claimed it was blocked, is updated to PASSABLE based on the forward group's successful traverse.
Route decision merge:
- Both groups made valid L2 decisions
- Neither can be “undone” (physical positions fixed)
- Resolution: Accept current positions, plan convergence point
Phase 5: Verify consistency.
- Both groups now have unified intel
- Both acknowledge divergent routes are fait accompli
- Both agree on convergence plan
Phase 6: Resume coordinated operation.
- Forward group continues on Route B
- Rear group continues to bridge
- Groups converge at km 95 junction
- Unified convoy from km 95 onward
Lessons Learned
- Intel conflict: Regional command (via SATCOM) and the ground relay gave conflicting information. Neither was fully accurate. The convoy should maintain intel confidence scores.
- Route lock: Once route decisions are executed, they cannot be reversed. Pre-agree routing rules for partition scenarios.
- Communication shadow mapped: km 47-52 is now a known radio shadow. Future transits prepare for partition at this location.
- Independent operation validated: Vehicles 6-12 operated successfully for 45 minutes under a local lead. This confirms that L2 delegation works.
The fleet emerges from partition with improved knowledge—an anti-fragile outcome.
RAVEN Coherence Protocol
The RAVEN swarm of 47 drones experiences partition due to terrain and jamming, splitting into three clusters.
State During Partition
Cluster A (20 drones, led by Drone 1):
- Coverage: Zones X1-X5
- Detections: Threat T1 at position (34.5, -118.2)
- Health: 2 drones degraded (low battery)
Cluster B (18 drones, led by Drone 21):
- Coverage: Zones Y1-Y4
- Detections: Threat T2 at position (34.7, -118.4)
- Health: 1 drone lost (collision with terrain)
Cluster C (9 drones, led by Drone 40):
- Coverage: Zones Z1-Z2
- Detections: None
- Health: All nominal
Reconnection as Swarm Reforms
Clusters gradually reconnect as jamming subsides.
Coverage merge (G-Set):
Simple union. No conflicts possible.
Threat merge:
Union of detected threats. No conflict—different threats at different positions.
Health merge:
Each drone’s health is LWW-Register. Latest observation wins.
- Cluster A degraded drones: Update swarm health map
- Cluster B lost drone: Mark as LOST in swarm roster
Coherence challenge: What if Cluster A and Cluster B both detected threats near the boundary between their coverage zones?
Entity resolution: Compare threat attributes.
| Attribute | Cluster A (T1) | Cluster B (T3) |
|---|---|---|
| Position | (34.5102, -118.2205) | (34.5114, -118.2193) |
| Time offset | First observation | +2.5 minutes |
| Signature | Vehicle, moving NE | Vehicle, moving NE |
Position difference: 170 meters. Time difference: roughly 2.5 minutes. Same signature. Likely same entity observed from different angles at different times.
Resolution: Merge into single threat T1 with combined observations:
- Position: Average weighted by observation confidence
- Trajectory: Computed from multiple observations
- Confidence: Increased (multiple independent observations)
\[p_{\text{merged}} = \frac{c_A\, p_A + c_B\, p_B}{c_A + c_B}\]

where \(c\) is confidence and \(p\) is position.
Entity Resolution Formalization
For distributed observation systems, entity resolution is critical. Multiple observers may detect the same entity and assign different identifiers.
Observation tuple: \((id, pos, time, sig, observer)\)
Match probability:

\[P(\text{same}) = f\big(\lVert pos_1 - pos_2 \rVert,\ |t_1 - t_2|,\ \text{sim}(sig_1, sig_2)\big)\]

where \(\text{sim}\) is the signature similarity function: closer positions, smaller time gaps, and more similar signatures all increase the match probability.
Merge criteria: If \(P(\text{same}) > \theta\), merge observations. Otherwise, keep as separate entities.
Confidence update: Merged entity has increased confidence:

\[c_{\text{merged}} = 1 - (1 - c_1)(1 - c_2)\]

Two 80% confident observations merge to a 96% confident entity.
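A sketch of the observation-matching and confidence-update rules. The exponential match score, its scale parameters, and the use of local metric coordinates are assumptions for illustration; the confidence update follows the formula above.

```python
import math

def match_probability(obs_a: dict, obs_b: dict,
                      pos_scale_m: float = 200.0, time_scale_s: float = 300.0) -> float:
    """Heuristic match score from position gap, time gap, and signature similarity.

    Sketch only: positions are assumed to be in local metric coordinates (meters),
    and the scales are illustrative, not a calibrated model.
    """
    dx = obs_a["pos"][0] - obs_b["pos"][0]
    dy = obs_a["pos"][1] - obs_b["pos"][1]
    pos_term = math.exp(-math.hypot(dx, dy) / pos_scale_m)
    time_term = math.exp(-abs(obs_a["time"] - obs_b["time"]) / time_scale_s)
    sig_term = 1.0 if obs_a["sig"] == obs_b["sig"] else 0.2
    return pos_term * time_term * sig_term

def merge_confidence(c1: float, c2: float) -> float:
    """Independent-observer update: 1 - (1 - c1)(1 - c2); two 0.8s merge to 0.96."""
    return 1.0 - (1.0 - c1) * (1.0 - c2)

def merge_position(p1, c1, p2, c2):
    """Confidence-weighted position average."""
    w = c1 + c2
    return ((c1 * p1[0] + c2 * p2[0]) / w, (c1 * p1[1] + c2 * p2[1]) / w)

assert abs(merge_confidence(0.8, 0.8) - 0.96) < 1e-9
```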
OUTPOST Coherence Protocol
The OUTPOST sensor mesh faces distinct coherence challenges: ultra-low bandwidth, extended partition durations (days to weeks), and hierarchical fusion architecture.
State Classification for Mesh Coherence
OUTPOST state partitions into categories with different reconciliation priorities:
| State Type | Update Frequency | Reconciliation Strategy | Priority |
|---|---|---|---|
| Detection events | Per-event | Union with deduplication | Highest |
| Sensor health | Per-minute | Latest-timestamp-wins | High |
| Coverage map | Per-hour | Merge with confidence weighting | Medium |
| Configuration | Per-day | Version-based with rollback | Low |
Multi-Fusion Coordination
When multiple fusion nodes operate, they must coordinate coverage and avoid duplicate alerts:
graph TD
subgraph Zone_A["Zone A (Fusion A responsibility)"]
S1[Sensor 1]
S2[Sensor 2]
S3[Sensor 3]
end
subgraph Zone_B["Zone B (Fusion B responsibility)"]
S4[Sensor 4]
S5[Sensor 5]
end
subgraph Overlap["Overlap Zone (shared responsibility)"]
S6["Sensor 6
(reports to both)"]
end
subgraph Fusion_Layer["Fusion Layer"]
F1[Fusion A]
F2[Fusion B]
end
S1 --> F1
S2 --> F1
S3 --> F1
S4 --> F2
S5 --> F2
S6 --> F1
S6 --> F2
F1 <-.->|"deduplication
coordination"| F2
style Overlap fill:#fff3e0,stroke:#f57c00
style Zone_A fill:#e3f2fd
style Zone_B fill:#e8f5e9
Overlapping coverage reconciliation: When sensors report to multiple fusion nodes, both fusion nodes may raise alerts for the same underlying event.
Resolution rules:
- Same event, same timestamp: Deduplicate by event ID
- Same event, different timestamps: Use earliest detection time
- Conflicting assessments: Combine confidence, flag for review
Long-Duration Partition Handling
OUTPOST may operate for days without fusion node contact. Special handling for extended autonomy:
Local decision authority: Each sensor can make detection decisions locally. Decisions are logged for later reconciliation.
Detection event structure for eventual consistency: each locally logged event carries a \(\text{reconciled}\) flag that tracks whether the event has been confirmed by a fusion node. Unreconciled events are treated with lower confidence.
Bandwidth-efficient reconciliation: Given ultra-low bandwidth (often < 1 kbps), OUTPOST uses compact delta encoding: only changed state is transmitted, and Merkle tree roots validate completeness without transmitting full state.
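A sketch of an unreconciled detection record and delta encoding. Field names are illustrative, and generic zlib-over-JSON compression stands in for the domain-specific encoding described earlier.

```python
from dataclasses import dataclass, asdict
import json
import zlib

@dataclass
class DetectionEvent:
    """Locally logged detection awaiting fusion-node confirmation (field names illustrative)."""
    event_id: str
    sensor_id: int
    timestamp: float
    event_type: str
    confidence: float
    reconciled: bool = False      # set True once a fusion node confirms the event

def encode_delta(events: list, last_synced_ts: float) -> bytes:
    """Transmit only events newer than the last successful sync, compressed.

    Sketch only: a production encoder would use the dictionary/offset techniques
    above rather than generic zlib over JSON.
    """
    changed = [asdict(e) for e in events if e.timestamp > last_synced_ts]
    return zlib.compress(json.dumps(changed, separators=(",", ":")).encode())
```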
Sensor-Fusion Authority Hierarchy
Decision scopes:
- Sensor authority: Detection reporting, self-health assessment, local alert
- Fusion authority: Alert correlation, threat classification, response recommendation
- Uplink authority: Response authorization, policy updates, threat escalation
During partition:
- Sensors continue detecting and logging
- Fusion (if reachable) continues correlating
- Uplink authority decisions are deferred until reconnection
Proposition 16 (OUTPOST Coherence Bound). For an OUTPOST mesh with \(n\) sensors, \(k\) fusion nodes, and partition duration \(T_p\), the expected state divergence is bounded by:

\[E[D(T_p)] \leq \frac{n - k}{k}\left(1 - e^{-\lambda T_p}\right)\]

where \(\lambda\) is the event arrival rate and the factor \((n-k)/k\) reflects the sensor-to-fusion ratio.
The Limits of Coherence
Irreconcilable Conflicts
Some conflicts cannot be resolved through merge functions or hierarchy.
Physical impossibilities: Cluster A reports target destroyed. Cluster B reports target escaped. Both cannot be true. The merge function cannot determine which is correct from state alone.
Resolution: Flag for external verification. Use sensor data from both clusters. Accept uncertainty if verification impossible.
Resource allocation conflicts: Cluster A allocated sensor drones to zone X. Cluster B allocated same drones to zone Y. The drones are physically in one place—but which?
Resolution: Trust current position reports. Update state to reflect actual positions. Flag allocation discrepancy for review.
Byzantine Actors
A compromised node may deliberately create conflicts:
- Inject false threat reports to trigger responses
- Report false positions to disrupt coordination
- Create state inconsistencies that prevent merge
Detection: Byzantine behavior often creates patterns:
- Inconsistent with multiple other observers
- Reports change implausibly fast
- State updates violate physical constraints
Isolation: Nodes detected as potentially Byzantine:
- Reduce trust weight in aggregation
- Quarantine from decision-making
- Flag for human review
Byzantine-tolerant CRDTs exist but are expensive. Recent work by Kleppmann et al. addresses making CRDTs Byzantine fault-tolerant, but the overhead is significant. Edge systems often use lightweight detection plus isolation rather than full Byzantine tolerance.
Stale-Forever State
Some state may never reconcile:
- Node destroyed before sync completes
- Observation made during partition lost when node fails
- History gap cannot be filled
Acceptance: Perfect consistency is impossible in distributed systems under partition and failure. The fleet must operate with incomplete history.
Mitigation: Redundant observation. If multiple nodes observe the same event, loss of one doesn’t lose the observation.
The Coherence-Autonomy Tradeoff
Perfect coherence requires consensus before action. Consensus requires communication. Communication may be impossible.
Maximum coherence means no action without agreement—the system blocks during partition. Maximum autonomy means action without coordination—coherence is minimal.
Edge architecture accepts imperfect coherence in exchange for operational autonomy. The question is not “how to achieve perfect coherence” but “how to achieve sufficient coherence for mission success.”
Sufficient coherence: The minimum consistency needed for the mission to succeed.
- Safety-critical state: High coherence required
- Mission-critical state: Medium coherence acceptable
- Operational state: Low coherence tolerable
- Logging state: Eventual consistency sufficient
Engineering Judgment
When should the system accept incoherence as the lesser evil?
- When enforcing coherence would prevent critical action
- When coherence delay exceeds mission window
- When coherence cost exceeds incoherence cost
This is engineering judgment, not algorithmic decision. The architect must define coherence requirements per state type and accept that perfect coherence is unachievable.
Closing: From Coherence to Anti-Fragility
The preceding articles developed resilience: the ability to survive partition and return to coordinated operation.
- Contested connectivity: Survive by designing for disconnection
- Self-measurement: Measure health even when central observability is unavailable
- Self-healing: Heal locally when human escalation is impossible
- Fleet coherence: Restore coordinated state when partition heals
But resilience—returning to baseline—is not the complete goal. The fleet that experiences partition should emerge better than before.
CONVOY at the mountain pass learned:
- Intel conflicts require confidence scoring
- Route B is actually passable (despite initial report)
- Vehicles 6-12 can operate independently for 45+ minutes
- Communication shadow exists at km 47-52
- Local lead authority delegation works in practice
This knowledge makes future operations stronger. The partition was stressful—but it generated valuable information.
The next article on anti-fragility develops systems that improve from stress rather than merely surviving it. The coherence challenge becomes a learning opportunity. Conflicts reveal hidden assumptions. Reconciliation tests merge logic. Each partition makes the fleet more robust.
The goal is not to prevent partition. The goal is to design systems that thrive despite partition—and grow stronger through it.