
Fleet Coherence Under Partition


Prerequisites

This article addresses the coordination challenge that emerges from the preceding foundations:

The preceding articles give each node and cluster the capability to survive independently. But survival is not the mission. The mission requires coordination across the fleet. When partition separates clusters, each makes decisions based on local information. When partition heals, those decisions must be reconciled.

This is the coherence problem: maintaining consistent fleet-wide state when the network prevents communication. The CAP theorem tells us we cannot have both consistency and availability during partition. Edge systems choose availability—continue operating—and must reconcile consistency when partition heals.


Theoretical Contributions

This article develops the theoretical foundations for maintaining fleet coherence in partitioned distributed systems. We make the following contributions:

  1. State Divergence Metric: We formalize divergence as a normalized symmetric difference and derive its growth rate as a function of partition duration and event arrival rate.

  2. CRDT Applicability Analysis: We characterize the class of edge state that admits conflict-free replication and identify the semantic constraints imposed by different CRDT types.

  3. Hierarchical Authority Framework: We formalize decision scope classification and derive conditions for safe authority delegation during partition.

  4. Merkle-Based Reconciliation Protocol: We analyze the communication complexity of state reconciliation and prove \(O(\log n + k)\) message complexity for \(k\) divergent items in \(n\)-item state.

  5. Entity Resolution Theory: We formalize the observation merge problem and derive confidence update rules for multi-observer scenarios.

These contributions connect to and extend prior work on eventual consistency, CRDTs, and Byzantine agreement, adapting these frameworks for edge deployments with physical constraints.


Opening Narrative: CONVOY Split

CONVOY: 12 vehicles traverse a mountain pass. At km 47, terrain creates radio shadow.

Forward group (vehicles 1-5) receives SATCOM: bridge at km 78 destroyed, reroute via Route B. They adjust course.

Rear group (vehicles 6-12) receives ground relay minutes later: Route B blocked by landslide, continue to bridge. They maintain course.

When both groups emerge from the radio shadow with full connectivity, the forward group is committed to Route B while the rear group is still heading for the bridge.

The coherence challenge: the physical positions cannot be reconciled, but fleet state—route plan, decisions, threat assessments—must converge to a consistent view.


The Coherence Challenge

Local Autonomy vs Fleet Coordination

Parts 1-3 developed local autonomy—essential, since without it partition means failure. But local autonomy creates coordination problems: independent actions may conflict, duplicate effort, or diverge from the fleet-wide plan.

| Dimension | Local Autonomy | Fleet Coordination |
| --- | --- | --- |
| Decision speed | Fast (local) | Slow (consensus) |
| Information used | Local sensors only | Fleet-wide picture |
| Failure mode | Suboptimal but functional | Complete if quorum lost |
| Partition behavior | Continues operating | Blocks waiting for consensus |

Coordination without communication is only possible through predetermined rules. If every node follows the same rules and starts with the same information, they will make the same decisions. But partition means information diverges—different nodes observe different events.

The tradeoff: more predetermined rules enable more coherence, but reduce adaptability. A fleet that pre-specifies every possible decision achieves perfect coherence but cannot adapt to novel situations. A fleet with maximum adaptability achieves minimum coherence—each node does its own thing.

Edge architecture must find the balance: enough rules for critical coherence, enough flexibility for operational adaptation.

State Divergence Sources

Definition 11 (State Divergence). For state sets \(S_A\) and \(S_B\) represented as key-value pairs, the divergence \(D(S_A, S_B)\) is the normalized symmetric difference:

\[D(S_A, S_B) = \frac{|S_A \triangle S_B|}{|S_A \cup S_B|} = \frac{|(S_A \setminus S_B) \cup (S_B \setminus S_A)|}{|S_A \cup S_B|}\]

where \(D \in [0, 1]\), with \(D = 0\) indicating identical states and \(D = 1\) indicating completely disjoint states.

During partition, state diverges through multiple mechanisms:

Environmental inputs differ. Each cluster observes different events. Cluster A sees threat T1 approach from the west. Cluster B, on the other side of the partition, sees nothing. Their threat models diverge.

Decisions made independently. Self-healing requires local decisions. Cluster A decides to redistribute workload after node failure. Cluster B, unaware of the failure, continues assuming the failed node is operational. Their understanding of fleet configuration diverges.

Time drift. Without network time synchronization, clocks diverge. After 6 hours of partition at 100ppm drift, clocks differ by 2 seconds. Timestamps become unreliable for ordering events.

Message loss. Before partition fully established, some gossip messages reach some nodes. The partial propagation creates uneven knowledge. Node A heard about event E before partition. Node B did not. Their histories diverge.

Proposition 12 (Divergence Growth Rate). If state-changing events arrive according to a Poisson process with rate \(\lambda\), the expected divergence after partition duration \(\tau\) is:

\[E[D(\tau)] \approx 1 - e^{-\lambda \tau}\]

Proof sketch: Model state as a binary indicator per key: identical (0) or divergent (1). Under independent Poisson arrivals with rate \(\lambda\), the probability a given key remains synchronized is \(e^{-\lambda \tau}\). The expected fraction of divergent keys follows the complementary probability. For sparse state changes, \(E[D(\tau)] \approx 1 - e^{-\lambda \tau}\) provides a tight upper bound.

Corollary 5. Reconciliation cost is linear in divergence: \(\text{Cost}(\tau) = c \cdot D(\tau) \cdot |S_A \cup S_B|\), where \(c\) is the per-item sync cost.
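As a worked example with assumed numbers: at \(\lambda = 0.1\) state-changing events per minute, a 5-minute partition gives \(E[D] \approx 1 - e^{-0.5} \approx 0.39\), while a 45-minute partition gives \(E[D] \approx 1 - e^{-4.5} \approx 0.99\); per the corollary, the longer partition therefore costs roughly 2.5 times as much to reconcile over the same key space.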


Conflict-Free Data Structures

CRDTs at the Edge

Definition 12 (Conflict-Free Replicated Data Type). A state-based CRDT is a tuple \((S, s^0, q, u, m)\) where \(S\) is the state space, \(s^0\) is the initial state, \(q\) is the query function, \(u\) is the update function, and \(m: S \times S \rightarrow S\) is a merge function satisfying:

  1. Commutativity: \(m(a, b) = m(b, a)\)
  2. Associativity: \(m(m(a, b), c) = m(a, m(b, c))\)
  3. Idempotency: \(m(a, a) = a\)

These properties make \((S, m)\) a join-semilattice, guaranteeing convergence regardless of merge order.

Conflict-free Replicated Data Types (CRDTs) are data structures designed for eventual consistency without coordination. Each node can update its local replica independently. When nodes reconnect, replicas merge deterministically to the same result regardless of message ordering.

If the merge operation is mathematically well-behaved, you get consistency for free.

| CRDT Type | Operation | Edge Application |
| --- | --- | --- |
| G-Counter | Increment only | Message counts, observation counts |
| PN-Counter | Increment and decrement | Resource tracking (±) |
| G-Set | Add only | Surveyed zones, detected threats |
| 2P-Set | Add and remove (once) | Active targets, current alerts |
| LWW-Register | Last-writer-wins value | Configuration, status |
| MV-Register | Multi-value (preserve conflicts) | Concurrent updates |

G-Set example: RAVEN surveillance coverage

Each drone maintains a local set of surveyed grid cells. When drones reconnect, coverage merges by set union: \(S_{\text{merged}} = S_A \cup S_B\).

The union is commutative (order doesn't matter), associative (grouping doesn't matter), and idempotent (merging twice gives the same result). These properties guarantee convergence.
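As a minimal sketch of this merge, assuming Python and hypothetical grid-cell identifiers:

    class GSet:
        """Grow-only set CRDT: local adds, merge by union."""

        def __init__(self, items=None):
            self.items = set(items or [])

        def add(self, item):
            # Local update: no coordination needed.
            self.items.add(item)

        def merge(self, other):
            # Commutative, associative, idempotent: set union.
            return GSet(self.items | other.items)

    # Hypothetical usage: two drones survey cells during a partition.
    drone_a = GSet({"cell-12", "cell-13"})
    drone_b = GSet({"cell-13", "cell-14"})
    merged = drone_a.merge(drone_b)
    assert merged.items == {"cell-12", "cell-13", "cell-14"}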

Proposition 13 (CRDT Convergence). If all updates eventually propagate to all nodes (eventual delivery), and the merge function satisfies commutativity, associativity, and idempotency, then all replicas converge to the same state.

Proof sketch: Eventual delivery ensures all nodes receive all updates. The semilattice properties ensure merge order doesn't matter. Therefore, all nodes applying all updates in any order reach the same state.

Edge suitability: CRDTs require no coordination during partition. Updates are local. Merge is deterministic. This matches edge constraints perfectly.

    
    graph TD
        subgraph During_Partition["During Partition (independent updates)"]
            A1["Cluster A<br/>State: {1,2,3}"] -->|"adds item 4"| A2["Cluster A<br/>State: {1,2,3,4}"]
            B1["Cluster B<br/>State: {1,2,3}"] -->|"adds item 5"| B2["Cluster B<br/>State: {1,2,3,5}"]
        end
        subgraph After_Reconnection["After Reconnection"]
            M["CRDT Merge<br/>(set union)"]
            R["Merged State<br/>{1,2,3,4,5}"]
        end
        A2 --> M
        B2 --> M
        M --> R
        style M fill:#c8e6c9,stroke:#388e3c
        style R fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
        style During_Partition fill:#fff3e0
        style After_Reconnection fill:#e8f5e9

The merge operation is automatic and deterministic—no conflict resolution logic needed. Both clusters’ contributions are preserved.

Limitations: CRDTs impose semantic constraints. A counter that only increments cannot represent a value that should decrease. A set that only adds cannot represent removal. Application data must be structured to fit available CRDT semantics.

Choosing the right CRDT: The choice depends on application semantics, such as whether state only grows (G-Set, G-Counter), needs bounded removal (2P-Set, PN-Counter), or needs a single authoritative value (LWW- or MV-Register).

Bounded-Memory Tactical CRDT Variants

Standard CRDT designs let state grow without bound—problematic for edge nodes with constrained memory. We introduce bounded-memory variants tailored for tactical operations.

Sliding-Window G-Counter:

Maintain counts only for recent time windows, discarding old history:

\[\text{value}(t) = \sum_{w \in W(t)} \text{count}[w]\]

where \(W(t) = \{w : t - T_{\text{window}} \leq w < t\}\) is the active window set. Memory: \(O(T_{\text{window}} / \Delta_w)\) instead of unbounded.

RAVEN application: Track observation counts per sector for the last hour. Older counts archived to fusion node when connectivity permits, then pruned locally.
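A minimal sketch of one way to implement this, assuming Python, counts keyed by (replica, window index), and hypothetical window and retention parameters:

    import time
    from collections import defaultdict

    class SlidingWindowGCounter:
        """Bounded-memory G-Counter: per-(replica, window) counts, old windows pruned."""

        def __init__(self, replica_id, window_s=60.0, horizon_s=3600.0):
            self.replica_id = replica_id
            self.window_s = window_s          # bucket width (delta_w, assumed)
            self.horizon_s = horizon_s        # retention (T_window, assumed)
            self.counts = defaultdict(int)    # (replica, window index) -> count

        def increment(self, now=None):
            now = time.time() if now is None else now
            self.counts[(self.replica_id, int(now // self.window_s))] += 1

        def value(self, now=None):
            now = time.time() if now is None else now
            self._prune(now)
            return sum(self.counts.values())

        def merge(self, other, now=None):
            # Standard G-Counter merge (per-key max), restricted to live windows.
            for key, count in other.counts.items():
                self.counts[key] = max(self.counts[key], count)
            self._prune(time.time() if now is None else now)

        def _prune(self, now):
            oldest = int((now - self.horizon_s) // self.window_s)
            for key in [k for k in self.counts if k[1] < oldest]:
                del self.counts[key]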

Bounded OR-Set with Eviction:

Limit set cardinality to \(M_{\text{max}}\) with priority-based eviction: whenever \(|S| > M_{\text{max}}\), remove the lowest-priority element, \(S \leftarrow S \setminus \{e_{\text{min}}\}\),

where \(e_{\text{min}} = \arg\min_{e' \in S} \text{priority}(e')\). The eviction maintains CRDT convergence for a deterministic priority function.

Eviction commutativity proof sketch: Define \(\text{evict}(S) = S \setminus \{e_{\text{min}}\}\). For a deterministic priority function, \(\text{evict}(\text{merge}(S_A, S_B)) = \text{merge}(\text{evict}(S_A), \text{evict}(S_B))\) when both exceed \(M_{\text{max}}\).

Priority functions for tactical state weigh threat level and recency, so that low-threat, stale entities are evicted first.

CONVOY application: Track at most 50 active threats. When capacity exceeded, evict lowest-priority (low-threat, stale) entities. Memory: fixed 50 × sizeof(entity) regardless of operation duration.
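A simplified sketch of the eviction policy, assuming Python, entities keyed by id, and a hypothetical priority function; OR-Set remove tags are omitted for brevity:

    M_MAX = 50  # capacity from the CONVOY example

    def priority(entity):
        # Hypothetical priority: higher threat level and fresher sightings rank higher.
        return (entity["threat_level"], entity["last_seen"])

    def evict(entities):
        """Deterministically drop lowest-priority entries until within capacity."""
        while len(entities) > M_MAX:
            e_min = min(entities.values(), key=priority)
            del entities[e_min["id"]]
        return entities

    def add(entities, entity):
        entities[entity["id"]] = entity
        return evict(entities)

    def merge(a, b):
        """Union keyed by entity id; on collision keep the fresher record; then bound by eviction."""
        merged = dict(a)
        for eid, ent in b.items():
            if eid not in merged or ent["last_seen"] > merged[eid]["last_seen"]:
                merged[eid] = ent
        return evict(merged)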

Compressed Delta-CRDT:

Standard delta-CRDTs transmit state changes. We compress deltas using domain-specific encoding:

where \(H(\Delta)\) is the entropy of the delta. For tactical state with predictable patterns, compression achieves 3-5× reduction.

Compression techniques:

  1. Spatial encoding: Position updates as offsets from predicted trajectory
  2. Temporal batching: Multiple updates to same entity merged before transmission
  3. Dictionary encoding: Common values (status codes, threat types) as indices

OUTPOST application: Sensor health updates compressed to 2-3 bytes per sensor versus 32 bytes uncompressed. 127-sensor mesh health fits in single packet.
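A minimal sketch of the dictionary-encoding idea, assuming Python, a hypothetical status-code table, and one-byte sensor IDs (sufficient for the 127-sensor mesh):

    import struct

    # Hypothetical dictionary: common status values map to 1-byte codes.
    STATUS_CODES = {"healthy": 0, "degraded": 1, "low_battery": 2, "offline": 3}

    def encode_health_delta(sensor_id, status, battery_pct):
        """Pack one sensor-health update into 3 bytes: id, status code, battery."""
        return struct.pack("BBB", sensor_id, STATUS_CODES[status], battery_pct)

    def decode_health_delta(blob):
        sensor_id, code, battery_pct = struct.unpack("BBB", blob)
        status = {v: k for k, v in STATUS_CODES.items()}[code]
        return sensor_id, status, battery_pct

    # 127 sensors x 3 bytes = 381 bytes of payload, fitting in a single packet,
    # versus roughly 32 bytes per sensor uncompressed (~4 KB).
    frame = b"".join(encode_health_delta(i, "healthy", 87) for i in range(127))
    assert len(frame) == 381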

Hierarchical State Pruning:

Tactical systems naturally have hierarchical state importance:

| Level | Retention | Pruning Trigger |
| --- | --- | --- |
| Critical (threats, failures) | Indefinite | Never auto-prune |
| Operational (positions, status) | 1 hour | Time-based |
| Diagnostic (detailed health) | 10 minutes | Memory pressure |
| Debug (raw sensor data) | 1 minute | Aggressive |

State automatically demotes under memory pressure:

where \(\text{level}_{\min}(s)\) is the minimum level for state type \(s\).

Memory budget enforcement:

Each CRDT type has a memory budget \(B_i\). Total memory must stay within the node's allocation:

\[\sum_i B_i \leq M_{\text{total}}\]

When approaching the limit, the system:

  1. Prunes diagnostic/debug state
  2. Compresses operational state
  3. Evicts low-priority entries from bounded sets
  4. Archives to persistent storage if available
  5. Drops new low-priority updates as last resort

RAVEN memory profile: 50 drones × 2KB state budget = 100KB CRDT state. Bounded OR-Set for 200 threats (4KB), sliding-window counters for 100 sectors (2KB), health registers for 50 nodes (1.6KB). Total: ~8KB active CRDT state, well within budget.

Last-Writer-Wins vs Application Semantics

Last-Writer-Wins (LWW) is a common conflict resolution strategy: when values conflict, the most recent timestamp wins.

LWW works for simple status and configuration values, where only the most recent value matters.

LWW fails for state where both writes carry information (counters, accumulating sets of observations) or where the latest timestamp does not reflect the freshest underlying information.

Edge complication: LWW assumes reliable timestamps. Clock drift makes “latest” ambiguous. If Cluster A’s clock is 3 seconds ahead of Cluster B, Cluster A’s updates always win—even if they’re actually older.

Vector Clocks for Causality

Before examining hybrid approaches, consider pure vector clocks. Each node \(i\) maintains a vector \(V_i[1..n]\) where \(V_i[j]\) represents node \(i\)’s knowledge of node \(j\)’s logical time.

Definition 13 (Vector Clock). A vector clock \(V\) is a function from node identifiers to non-negative integers. The vector clock ordering \(\leq\) is defined pointwise:

\[V \leq V' \iff \forall j : V[j] \leq V'[j]\]

Events are causally related iff their vector clocks are comparable; concurrent events have incomparable vectors.

Proposition 14 (Vector Clock Causality). For events \(e_1\) and \(e_2\) with vector timestamps \(V_1\) and \(V_2\): \(e_1\) causally precedes \(e_2\) iff \(V_1 \leq V_2\) and \(V_1 \neq V_2\); if neither \(V_1 \leq V_2\) nor \(V_2 \leq V_1\), the events are concurrent.

The update rules are: node \(i\) increments \(V_i[i]\) on each local event and attaches \(V_i\) to outgoing messages; on receiving a message with clock \(V_m\), it sets \(V_i[j] \leftarrow \max(V_i[j], V_m[j])\) for all \(j\) and then increments \(V_i[i]\).
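A minimal sketch of these rules, assuming Python and hypothetical node identifiers:

    class VectorClock:
        """Per-node logical clock: map of node id -> counter."""

        def __init__(self, node_id, nodes):
            self.node_id = node_id
            self.clock = {n: 0 for n in nodes}

        def tick(self):
            # Local event or message send.
            self.clock[self.node_id] += 1
            return dict(self.clock)

        def receive(self, other_clock):
            # Merge: elementwise max, then count the receive as a local event.
            for node, count in other_clock.items():
                self.clock[node] = max(self.clock.get(node, 0), count)
            self.tick()

        @staticmethod
        def compare(a, b):
            """Return 'before', 'after', 'equal', or 'concurrent'."""
            keys = set(a) | set(b)
            a_le_b = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            b_le_a = all(b.get(k, 0) <= a.get(k, 0) for k in keys)
            if a_le_b and b_le_a:
                return "equal"
            if a_le_b:
                return "before"
            if b_le_a:
                return "after"
            return "concurrent"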

Edge limitation: Vector clocks grow linearly with node count. For a 50-drone swarm, each message carries 50 integers. For CONVOY with 12 vehicles, overhead is acceptable. For larger fleets, compressed representations or hierarchical clocks are needed.

Mitigation: Hybrid Logical Clocks (HLC) combine physical time with logical counters:

HLCs provide causal ordering when clocks are close and total ordering otherwise. The physical component bounds divergence even when logical ordering fails.
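A sketch of one common formulation of the HLC update rules, assuming Python and a millisecond wall clock; the clock source and tuple representation are assumptions:

    import time

    class HybridLogicalClock:
        """HLC timestamp: (physical component l, logical counter c)."""

        def __init__(self):
            self.l = 0   # max physical time observed so far
            self.c = 0   # logical counter to break ties within the same l

        def _pt(self):
            return int(time.time() * 1000)  # assumed millisecond wall clock

        def now(self):
            """Timestamp a local or send event."""
            pt = self._pt()
            if pt > self.l:
                self.l, self.c = pt, 0
            else:
                self.c += 1
            return (self.l, self.c)

        def update(self, l_msg, c_msg):
            """Merge a received timestamp, preserving causality."""
            pt = self._pt()
            l_old = self.l
            self.l = max(l_old, l_msg, pt)
            if self.l == l_old and self.l == l_msg:
                self.c = max(self.c, c_msg) + 1
            elif self.l == l_old:
                self.c += 1
            elif self.l == l_msg:
                self.c = c_msg + 1
            else:
                self.c = 0
            return (self.l, self.c)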

CONVOY routing example: Vehicles 3 and 8 both update the route during the partition; Vehicle 8's update carries the later timestamp.

With LWW, Vehicle 8’s route wins. But what if Vehicle 3 had more recent intel that arrived at 14:32:15 and took 2 seconds to process? The “winning” route may be based on stale information.

Application semantics matter. Route decisions should consider information freshness, not just decision timestamp.

Custom Merge Functions

When standard CRDTs don’t fit, define custom merge functions. The requirements are the same:

Commutative: \(\text{merge}(A, B) = \text{merge}(B, A)\)

Associative: \(\text{merge}(\text{merge}(A, B), C) = \text{merge}(A, \text{merge}(B, C))\)

Idempotent: \(\text{merge}(A, A) = A\)

Example: Surveillance priority list

Each cluster maintains a list of priority targets. During partition, both clusters may add or reorder targets.

Merge function:

  1. Union of all targets: \(T_{\text{merged}} = T_A \cup T_B\)
  2. Priority = maximum priority assigned by any cluster
  3. Flag conflicts where clusters assigned significantly different priorities

This is commutative and associative. Conflicts are flagged for human review rather than silently resolved.
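A minimal sketch of this merge, assuming Python, targets keyed by id, integer priorities, and a hypothetical conflict threshold:

    CONFLICT_THRESHOLD = 2  # assumed: flag if clusters disagree by more than this

    def merge_priority_lists(targets_a, targets_b):
        """Union of targets; priority = max across clusters; large disagreements flagged."""
        merged, conflicts = {}, []
        for tid in set(targets_a) | set(targets_b):
            pa, pb = targets_a.get(tid), targets_b.get(tid)
            if pa is not None and pb is not None:
                merged[tid] = max(pa, pb)
                if abs(pa - pb) > CONFLICT_THRESHOLD:
                    conflicts.append((tid, pa, pb))   # for human review
            else:
                merged[tid] = pa if pa is not None else pb
        return merged, conflicts

    # The merged state is commutative and idempotent because set union and max are:
    # merge(A, B) and merge(B, A) yield the same priorities, and merge(A, A) == (A, []).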

Example: Engagement authorization

Critical: a target should only be engaged if both clusters agree.

Merge function: intersection, not union.

If Cluster A authorized target T but Cluster B did not, the merged state does not authorize T. Conservative resolution for high-stakes decisions.

Verification: Custom merge functions must be proven correct. For each function, verify:

  1. Commutativity: formal proof or exhaustive testing
  2. Associativity: formal proof or exhaustive testing
  3. Idempotency: formal proof or exhaustive testing
  4. Safety: merged state satisfies application invariants
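One lightweight way to check the first three properties is randomized testing over generated states, a supplement to formal proof rather than a substitute; a sketch assuming Python:

    import random

    def random_state():
        # Hypothetical generator: a handful of targets with priorities 1-5.
        return {f"T{i}": random.randint(1, 5) for i in random.sample(range(10), 4)}

    def check_merge_properties(merge_fn, trials=1000):
        """Randomized check of the three CRDT merge laws."""
        for _ in range(trials):
            a, b, c = random_state(), random_state(), random_state()
            assert merge_fn(a, b) == merge_fn(b, a)                            # commutativity
            assert merge_fn(merge_fn(a, b), c) == merge_fn(a, merge_fn(b, c))  # associativity
            assert merge_fn(a, a) == a                                         # idempotency

    # Example (using the priority-list sketch above, keeping only the merged state):
    # check_merge_properties(lambda a, b: merge_priority_lists(a, b)[0])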

Hierarchical Decision Authority

Decision Scope Classification

Definition 14 (Decision Scope). The scope \(\text{scope}(d)\) of a decision \(d\) is the set of nodes whose state is affected by \(d\). Decisions are classified by scope cardinality: L0 (single node), L1 (local cluster), L2 (fleet-wide), L3 (command-level).

Not all decisions have the same scope. A decision affecting only one node is different from a decision affecting the entire fleet. Decision authority should match decision scope.

| Level | Scope | Examples |
| --- | --- | --- |
| L0 | Single node | Self-healing, local sensor adjustment, power management |
| L1 | Local cluster | Formation adjustment, local task redistribution, cluster healing |
| L2 | Fleet-wide | Route changes, objective prioritization, resource reallocation |
| L3 | Command | Rules of engagement, mission abort, strategic reposition |

L0 decisions can always be made locally. No coordination required. If a drone’s sensor needs recalibration, it recalibrates. No need to consult the swarm.

L1 decisions require cluster-level coordination but not fleet-wide. If a cluster needs to adjust formation due to member failure, the cluster lead coordinates locally. Other clusters don’t need to know immediately.

L2 decisions should involve fleet-wide coordination when possible. Route changes affect the entire convoy. Objective prioritization affects how all clusters allocate effort. These decisions benefit from fleet-wide information.

L3 decisions require external authority. Engagement rules come from command. Mission abort requires command approval. These cannot be made autonomously regardless of connectivity.

During partition: L0 and L1 decisions continue normally. L2 decisions become problematic—fleet-wide coordination is impossible. L3 decisions cannot be made; the system must operate within pre-authorized bounds.

    
    graph TD
        subgraph Connected["Connected State (full hierarchy)"]
            L3C["L3: Command<br/>(strategic decisions)"] --> L2C["L2: Fleet<br/>(fleet-wide coordination)"]
            L2C --> L1C["L1: Cluster<br/>(local coordination)"]
            L1C --> L0C["L0: Node<br/>(self-management)"]
        end
        subgraph Partitioned["Partitioned State (delegated authority)"]
            L1P["L1: Cluster Lead<br/>(elevated to L2 authority)"] --> L0P["L0: Node<br/>(autonomous operation)"]
        end
        L1C -.->|"partition<br/>event"| L1P
        style L3C fill:#ffcdd2,stroke:#c62828
        style L2C fill:#fff9c4,stroke:#f9a825
        style L1C fill:#c8e6c9,stroke:#388e3c
        style L0C fill:#e8f5e9,stroke:#388e3c
        style L1P fill:#fff9c4,stroke:#f9a825
        style L0P fill:#e8f5e9,stroke:#388e3c

Authority elevation during partition: When connectivity is lost, authority must be explicitly delegated downward. The system cannot simply assume lower levels can make higher-level decisions.

Authority Delegation Under Partition

When fleet-wide coordination is impossible, what authority do local nodes have?

Pre-delegated authority: Before mission start, define contingency authorities.

Bounded delegation: Authority expires or is limited in scope.

Mission-phase dependent: Authority varies by mission phase.

Risk: Parallel partitions may both claim authority. Cluster A and Cluster B both think they’re the senior cluster and both make L2 decisions. On reconnection, they have conflicting fleet-wide decisions.

Mitigation: Tie-breaking rules defined in advance.

Conflict Detection at Reconciliation

When clusters reconnect, compare decision logs:

Detection: Identify overlapping authority claims.

Two decisions conflict if they affect overlapping scope and differ.

Classification: Reversible vs irreversible.

Resolution for reversible: Apply hierarchy.

If Cluster A made decision \(d_A\) and Cluster B made decision \(d_B\):

  1. If \(\text{authority}(A) > \text{authority}(B)\): \(d_A\) wins
  2. If \(\text{authority}(A) = \text{authority}(B)\): Apply tie-breaker
  3. Update both clusters to winning decision

Resolution for irreversible: Flag for human review.

Cannot undo physical actions. Log the conflict, document both decisions and outcomes, present to command for analysis. Learn from the conflict to improve future protocols.
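A sketch of this resolution logic, assuming Python, numeric authority levels, a lowest-cluster-ID tie-breaker, and a hypothetical find_conflicts helper that pairs decisions with overlapping scope:

    def resolve_reversible_conflict(decision_a, decision_b):
        """Pick the surviving decision by authority, then by pre-agreed tie-breaker."""
        if decision_a["authority"] != decision_b["authority"]:
            return max(decision_a, decision_b, key=lambda d: d["authority"])
        # Tie-breaker must be deterministic and agreed before the mission;
        # lowest cluster ID is assumed here for illustration.
        return min(decision_a, decision_b, key=lambda d: d["cluster_id"])

    def reconcile(log_a, log_b):
        """Compare decision logs; resolve reversible conflicts, flag irreversible ones."""
        resolved, for_review = [], []
        for d_a, d_b in find_conflicts(log_a, log_b):   # hypothetical scope-overlap check
            if d_a["reversible"] and d_b["reversible"]:
                resolved.append(resolve_reversible_conflict(d_a, d_b))
            else:
                for_review.append((d_a, d_b))           # irreversible: human review
        return resolved, for_review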


Reconnection Protocols

State Reconciliation Sequence

When partition heals, clusters must reconcile state efficiently. Bandwidth may be limited during reconnection window. Protocol must be robust to partial completion if partition recurs.

Phase 1: State Summary Exchange

Each cluster computes a compact summary of its state using a Merkle tree: leaves are hashes \(H(s_i)\) of individual state elements and each interior node is the hash of its children's hashes, so the root summarizes the entire state. Here \(H\) is a hash function and the \(s_i\) are state elements.

Exchange roots. If roots match, states are identical—no further sync needed.

Phase 2: Divergence Identification

If roots differ, descend Merkle tree to identify divergent subtrees. Exchange hashes at each level until divergent leaves are found.
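A minimal sketch of Phases 1 and 2, assuming Python, SHA-256, and that both sides hold the same sorted key set so the trees have identical shape (keys present on only one side would need padding in a real protocol):

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def build_tree(items):
        """items: sorted list of (key, value) pairs -> list of tree levels, leaves first."""
        level = [h(f"{k}={v}".encode()) for k, v in items]
        levels = [level]
        while len(level) > 1:
            level = [h(level[i] + level[i + 1] if i + 1 < len(level) else level[i])
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    def root(levels):
        return levels[-1][0]

    def divergent_leaves(levels_a, levels_b):
        """Descend from the roots, keeping only subtrees whose hashes differ."""
        depth = len(levels_a) - 1
        suspects = [0] if root(levels_a) != root(levels_b) else []
        for d in range(depth - 1, -1, -1):
            next_suspects = []
            for idx in suspects:
                for child in (2 * idx, 2 * idx + 1):
                    if child < len(levels_a[d]) and levels_a[d][child] != levels_b[d][child]:
                        next_suspects.append(child)
            suspects = next_suspects
        return suspects   # indices of leaves that need data exchange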

Proposition 15 (Reconciliation Complexity). For \(n\)-item state with \(k\) divergent items, Merkle-based reconciliation requires \(O(\log n + k)\) messages: \(O(\log n)\) to traverse the tree and identify divergences, plus \(O(k)\) to transfer divergent data.

Proof: The Merkle tree has height \(O(\log n)\). In each round, parties exchange hashes for differing subtrees. At level \(i\), at most \(\min(k, 2^i)\) subtrees differ. Summing across \(O(\log(n/k))\) levels until subtrees contain \(\leq 1\) divergent item yields \(O(k)\) hash comparisons. Adding \(O(k)\) data transfers gives total complexity \(O(k \log(n/k) + k) = O(k \log n)\) in the worst case, or \(O(\log n + k)\) when divergent items cluster spatially.

Phase 3: Divergent Data Exchange

Transfer the actual divergent key-value pairs. Prioritize by importance (Phase 4.2).

Phase 4: Merge Execution

Apply CRDT merge or custom merge functions to divergent items. Compute unified state.

Phase 5: Consistency Verification

Recompute Merkle roots. Exchange and verify they now match. If mismatch, identify remaining divergences and repeat from Phase 3.

Phase 6: Coordinated Operation Resumption

With consistent state, resume fleet-wide coordination. Notify all nodes that coherence is restored.

    
    graph TD
        A["Partition Heals<br/>(connectivity restored)"] --> B["Exchange Merkle Roots<br/>(state fingerprints)"]
        B --> C{"Roots<br/>Match?"}
        C -->|"Yes"| G["Resume Coordination<br/>(fleet coherent)"]
        C -->|"No"| D["Identify Divergences<br/>(traverse Merkle tree)"]
        D --> E["Exchange Divergent Data<br/>(priority-ordered)"]
        E --> F["Merge States<br/>(CRDT merge)"]
        F --> B
        style A fill:#c8e6c9,stroke:#388e3c
        style G fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
        style C fill:#fff9c4,stroke:#f9a825
        style D fill:#bbdefb
        style E fill:#bbdefb
        style F fill:#bbdefb

Priority Ordering for Sync

Limited bandwidth during reconnection requires prioritization.

Priority 1: Safety-critical state

Priority 2: Mission-critical state

Priority 3: Operational state

Priority 4: Audit and logging

Sync Priority 1 first. If partition recurs, at least safety-critical state is consistent. Lower priorities can wait for more stable connectivity.

Optimization: Order sync items by expected information value.

High-impact, stale items should sync first. Low-impact, fresh items can wait.
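A hypothetical scoring function consistent with this rule, assuming Python; the exact weighting of impact, staleness, and size is an assumption:

    def sync_priority(item):
        """Higher score syncs first: impact weighted by staleness, discounted by size."""
        staleness_s = item["now"] - item["last_synced"]
        return item["impact"] * staleness_s / max(item["size_bytes"], 1)

    def sync_order(items):
        return sorted(items, key=sync_priority, reverse=True)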

Handling Actions Taken During Partition

Physical actions cannot be “merged” logically. If Cluster A drove north and Cluster B drove south, they cannot merge to “drove north and south simultaneously.”

Classification of partition actions:

Complementary actions: Both clusters did useful, non-overlapping work.

Redundant actions: Both clusters did the same work.

Conflicting actions: Actions are mutually incompatible.

Resolution by type:

| Type | Detection | Resolution |
| --- | --- | --- |
| Complementary | Non-overlapping scope | Accept both; update state |
| Redundant | Identical scope and action | Deduplicate; note inefficiency |
| Conflicting | Overlapping scope, different action | Flag for review; assess damage |

Audit trail: All partition decisions must be logged with enough context to reconstruct them afterward: what was decided, by which node, and under what authority.

Post-mission review uses the audit trail to resolve flagged conflicts, assess the quality of partition-time decisions, and improve future delegation rules.


CONVOY Coherence Protocol

Return to the CONVOY partition at the mountain pass.

State During Partition

Forward group (vehicles 1-5): committed to Route B on the basis of the SATCOM report that the bridge at km 78 is destroyed.

Rear group (vehicles 6-12): continuing toward the bridge on the basis of the ground relay report that Route B is blocked.

State divergence: the two groups hold conflicting route plans and conflicting intel about both the bridge and Route B.

Reconnection at Mountain Base

Radio contact restored as both groups clear the mountain pass.

Phase 1: Vehicle 1 and Vehicle 6 exchange state summaries.

Phase 2: Identify specific divergences.

Phase 3: Exchange divergent data.

Phase 4: Merge states.

Intel merge reconciles conflicting reports: bridge status marked UNCERTAIN from conflicting regional command intel, but updated to INTACT based on rear group visual confirmation. Route B status marked UNCERTAIN from forward group initial report, but updated to PASSABLE based on forward group successful traverse.

Route decision merge: the two route decisions are physically irreversible; the merged state records both, and the conflict is handled under the resolution rules for irreversible actions.

Phase 5: Verify consistency.

Phase 6: Resume coordinated operation.

Lessons Learned

  1. Intel conflict: Regional command and forward group gave conflicting information. Neither was fully accurate. Convoy should have intel confidence scores.

  2. Route lock: Once route decisions executed, cannot reverse. Pre-agree routing rules for partition scenarios.

  3. Communication shadow mapped: km 47-52 is now known radio shadow. Future transits prepare for partition at this location.

  4. Independent operation validated: Vehicles 6-12 operated successfully for 45 minutes under local lead. Confirms L2 delegation works.

The fleet emerges from partition with improved knowledge—an anti-fragile outcome.


RAVEN Coherence Protocol

The RAVEN swarm of 47 drones experiences partition due to terrain and jamming, splitting into three clusters.

State During Partition

Cluster A (20 drones, led by Drone 1):

Cluster B (18 drones, led by Drone 21):

Cluster C (9 drones, led by Drone 40):

Reconnection as Swarm Reforms

Clusters gradually reconnect as jamming subsides.

Coverage merge (G-Set): the merged coverage is the union of the three clusters' surveyed sets, \(C_{\text{merged}} = C_A \cup C_B \cup C_C\).

Simple union. No conflicts possible.

Threat merge:

Union of detected threats. No conflict—different threats at different positions.

Health merge:

Each drone’s health is LWW-Register. Latest observation wins.

Coherence challenge: What if Cluster A and B both detected threats near zone W boundary?

Entity resolution: Compare threat attributes.

| Attribute | Cluster A (T1) | Cluster B (T3) |
| --- | --- | --- |
| Position | (34.5102, -118.2205) | (34.5114, -118.2193) |
| Time offset | First observation | +2.5 minutes |
| Signature | Vehicle, moving NE | Vehicle, moving NE |

Position difference: 170 meters. Time difference: roughly 2.5 minutes. Same signature. Likely same entity observed from different angles at different times.

Resolution: Merge into a single threat T1 with combined observations: confidence combines as \(c_{T1} = 1 - (1 - c_A)(1 - c_B)\), and the merged position track \(p\) retains both timestamped observations, where \(c\) is confidence and \(p\) is position.

Entity Resolution Formalization

For distributed observation systems, entity resolution is critical. Multiple observers may detect the same entity and assign different identifiers.

Observation tuple: \((id, pos, time, sig, observer)\)

Match probability: \(P(\text{same})\) is computed from position proximity, time proximity, and signature similarity, where \(\text{sim}\) is the signature similarity function.

Merge criteria: If \(P(\text{same}) > \theta\), merge observations. Otherwise, keep as separate entities.

Confidence update: Treating the observations as independent, the merged entity has increased confidence:

\[c_{\text{merged}} = 1 - (1 - c_A)(1 - c_B)\]

Two 80% confident observations merge to a 96% confident entity.
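A sketch of the merge decision, assuming Python, a hypothetical geo-distance helper, and assumed thresholds and decay scales:

    import math

    THETA = 0.5           # merge threshold (assumed)
    POS_SCALE_M = 1000.0  # position tolerance (assumed)
    TIME_SCALE_S = 600.0  # time tolerance (assumed)

    def match_probability(obs_a, obs_b, sim):
        """Heuristic P(same): decays with distance and time gap, scaled by signature similarity."""
        d = distance_m(obs_a["pos"], obs_b["pos"])   # hypothetical geo-distance helper
        dt = abs(obs_a["time"] - obs_b["time"])
        return sim(obs_a["sig"], obs_b["sig"]) * math.exp(-d / POS_SCALE_M) * math.exp(-dt / TIME_SCALE_S)

    def merge_entities(obs_a, obs_b, sim):
        if match_probability(obs_a, obs_b, sim) <= THETA:
            return None   # keep as separate entities
        return {
            "id": obs_a["id"],                # keep the earlier identifier
            "observations": [obs_a, obs_b],   # retain both tracks
            "confidence": 1 - (1 - obs_a["conf"]) * (1 - obs_b["conf"]),
        }

    # With the RAVEN pair above (170 m apart, ~2.5 min apart, identical signature),
    # the score is roughly 0.66 > THETA under these assumed scales, so the observations merge.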


OUTPOST Coherence Protocol

The OUTPOST sensor mesh faces distinct coherence challenges: ultra-low bandwidth, extended partition durations (days to weeks), and hierarchical fusion architecture.

State Classification for Mesh Coherence

OUTPOST state partitions into categories with different reconciliation priorities:

| State Type | Update Frequency | Reconciliation Strategy | Priority |
| --- | --- | --- | --- |
| Detection events | Per-event | Union with deduplication | Highest |
| Sensor health | Per-minute | Latest-timestamp-wins | High |
| Coverage map | Per-hour | Merge with confidence weighting | Medium |
| Configuration | Per-day | Version-based with rollback | Low |

Multi-Fusion Coordination

When multiple fusion nodes operate, they must coordinate coverage and avoid duplicate alerts:

    
    graph TD
    subgraph Zone_A["Zone A (Fusion A responsibility)"]
    S1[Sensor 1]
    S2[Sensor 2]
    S3[Sensor 3]
    end
    subgraph Zone_B["Zone B (Fusion B responsibility)"]
    S4[Sensor 4]
    S5[Sensor 5]
    end
    subgraph Overlap["Overlap Zone (shared responsibility)"]
    S6["Sensor 6
(reports to both)"] end subgraph Fusion_Layer["Fusion Layer"] F1[Fusion A] F2[Fusion B] end S1 --> F1 S2 --> F1 S3 --> F1 S4 --> F2 S5 --> F2 S6 --> F1 S6 --> F2 F1 <-.->|"deduplication
coordination"| F2 style Overlap fill:#fff3e0,stroke:#f57c00 style Zone_A fill:#e3f2fd style Zone_B fill:#e8f5e9

Overlapping coverage reconciliation: When sensors report to multiple fusion nodes, the same physical event may reach both fusion nodes as separate reports.

Resolution rules:

  1. Same event, same timestamp: Deduplicate by event ID
  2. Same event, different timestamps: Use earliest detection time
  3. Conflicting assessments: Combine confidence, flag for review
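A sketch of these rules, assuming Python and report records with event_id, detection_time, assessment, and confidence fields (the field names and the confidence-combination rule are assumptions):

    def reconcile_reports(report_a, report_b):
        """Apply the resolution rules to two fusion nodes' views of one event."""
        assert report_a["event_id"] == report_b["event_id"]   # rule 1: dedup by event ID
        merged = dict(report_a)
        # Rule 2: keep the earliest detection time.
        merged["detection_time"] = min(report_a["detection_time"], report_b["detection_time"])
        if report_a["assessment"] == report_b["assessment"]:
            # Same assessment: simple deduplication, keep the stronger confidence.
            merged["confidence"] = max(report_a["confidence"], report_b["confidence"])
        else:
            # Rule 3: conflicting assessments; one possible combination rule, then flag.
            merged["confidence"] = 1 - (1 - report_a["confidence"]) * (1 - report_b["confidence"])
            merged["needs_review"] = True
        return merged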

Long-Duration Partition Handling

OUTPOST may operate for days without fusion node contact. Special handling for extended autonomy:

Local decision authority: Each sensor can make detection decisions locally. Decisions are logged for later reconciliation.

Detection event structure for eventual consistency:

The \(\text{reconciled}\) flag tracks whether the event has been confirmed by fusion node. Unreconciled events are treated with lower confidence.
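A sketch of one possible event record, assuming Python dataclasses; the field set is an assumption guided by the reconciliation rules above:

    from dataclasses import dataclass, field
    import uuid

    @dataclass
    class DetectionEvent:
        sensor_id: int
        detection_time: float          # local clock; may need correction at the fusion node
        event_type: str                # e.g. "vehicle", "person" (assumed categories)
        confidence: float
        event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
        reconciled: bool = False       # set True once confirmed by a fusion node

    # Unreconciled events are treated with lower confidence until fusion confirms them.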

Bandwidth-efficient reconciliation: Given ultra-low bandwidth (often < 1 Kbps), OUTPOST uses compact delta encoding:

Only changed state transmits. Merkle tree roots validate completeness without transmitting full state.

Sensor-Fusion Authority Hierarchy

Decision scopes: detection decisions are local to each sensor (L0); zone-level coordination belongs to the responsible fusion node (L1); mesh-wide reconfiguration and mission changes sit at L2 and above.

During partition: sensors continue to make and log detection decisions locally, while decisions that require fusion-level or higher authority wait for reconnection.

Proposition 16 (OUTPOST Coherence Bound). For an OUTPOST mesh with \(n\) sensors, \(k\) fusion nodes, and partition duration \(T_p\), the expected state divergence is bounded by:

\[E[D(T_p)] \leq \frac{n - k}{k}\left(1 - e^{-\lambda T_p}\right)\]

where \(\lambda\) is the event arrival rate and the factor \((n-k)/k\) reflects the sensor-to-fusion ratio.


The Limits of Coherence

Irreconcilable Conflicts

Some conflicts cannot be resolved through merge functions or hierarchy.

Physical impossibilities: Cluster A reports target destroyed. Cluster B reports target escaped. Both cannot be true. The merge function cannot determine which is correct from state alone.

Resolution: Flag for external verification. Use sensor data from both clusters. Accept uncertainty if verification impossible.

Resource allocation conflicts: Cluster A allocated sensor drones to zone X. Cluster B allocated same drones to zone Y. The drones are physically in one place—but which?

Resolution: Trust current position reports. Update state to reflect actual positions. Flag allocation discrepancy for review.

Byzantine Actors

A compromised node may deliberately create conflicts rather than merely failing.

Detection: Byzantine behavior often creates patterns that distinguish it from honest failure, such as reports that repeatedly contradict multiple independent observers.

Isolation: For nodes detected as potentially Byzantine, the system can:

  1. Reduce trust weight in aggregation
  2. Quarantine from decision-making
  3. Flag for human review

Byzantine-tolerant CRDTs exist but are expensive. Recent work by Kleppmann et al. addresses making CRDTs Byzantine fault-tolerant, but the overhead is significant. Edge systems often use lightweight detection plus isolation rather than full Byzantine tolerance.

Stale-Forever State

Some state may never reconcile, for example observations held only by a node that was lost before it could report them.

Acceptance: Perfect consistency is impossible in distributed systems under partition and failure. The fleet must operate with incomplete history.

Mitigation: Redundant observation. If multiple nodes observe the same event, loss of one doesn’t lose the observation.

The Coherence-Autonomy Tradeoff

Perfect coherence requires consensus before action. Consensus requires communication. Communication may be impossible.

Maximum coherence means no action without agreement—the system blocks during partition. Maximum autonomy means action without coordination—coherence is minimal.

Edge architecture accepts imperfect coherence in exchange for operational autonomy. The question is not “how to achieve perfect coherence” but “how to achieve sufficient coherence for mission success.”

Sufficient coherence: The minimum consistency needed for the mission to succeed.

Engineering Judgment

When should the system accept incoherence as the lesser evil?

This is engineering judgment, not algorithmic decision. The architect must define coherence requirements per state type and accept that perfect coherence is unachievable.


Closing: From Coherence to Anti-Fragility

The preceding articles developed resilience: the ability to survive partition and return to coordinated operation.

But resilience—returning to baseline—is not the complete goal. The fleet that experiences partition should emerge better than before.

CONVOY at the mountain pass learned where the radio shadow lies, how its intel sources can conflict, and that delegated authority works under partition.

This knowledge makes future operations stronger. The partition was stressful—but it generated valuable information.

The next article on anti-fragility develops systems that improve from stress rather than merely surviving it. The coherence challenge becomes a learning opportunity. Conflicts reveal hidden assumptions. Reconciliation tests merge logic. Each partition makes the fleet more robust.

The goal is not to prevent partition. The goal is to design systems that thrive despite partition—and grow stronger through it.

