Why Edge Is Not Cloud Minus Bandwidth

🛈 Info:

This series targets engineers building systems where connectivity cannot be guaranteed: tactical military platforms, remote industrial operations, autonomous mining fleets, smart grid substations, disaster response networks, and autonomous vehicle fleets. The mathematical frameworks - optimization theory, Markov processes, queueing theory, control systems - apply wherever systems must make autonomous decisions under uncertainty. Each part builds toward a unified theory of autonomic edge architecture: self-measuring, self-healing, self-optimizing systems that improve under stress rather than merely survive it.

The RAVEN swarm — forty-seven autonomous drones holding a 12-kilometer surveillance grid — loses backhaul without warning. One moment they’re streaming 2.4 Gb/s (illustrative value) of sensor data to operations. The next, forty-seven nodes face a decision that cloud-native systems are never designed to answer:

What do we do when no one is listening?

The behavioral envelope was designed for brief interruptions — thirty seconds, maybe sixty. Jamming shows no sign of clearing. The mission hasn’t changed: maintain surveillance, detect threats, report findings.

Continue the patrol pattern? Contract formation? Break off a subset to seek connectivity at altitude? And critically — who decides? Leadership was an emergent property of connectivity. Now everyone reads link quality zero.

This is not an edge case. Partition is the baseline operating condition. The rest of this series builds the formal machinery to architect systems that treat it that way.


Overview

This article establishes the formal framework for contested connectivity. Each concept connects theory directly to a structural design consequence — if you understand the design consequence, you can implement the right architecture without necessarily working through every proof.

| Concept | What It Tells You | Design Consequence |
| --- | --- | --- |
| Inversion Thesis | Once disconnection exceeds 15% of operating time — \(P(C=0)>0.15\) — cloud-first architecture costs more than it saves | Design for disconnection as baseline, not as an exception handler |
| Connectivity Model | A semi-Markov chain with Weibull sojourn times captures how long partitions actually last, including the heavy tail of very long blackouts | Size buffers and timeouts from the stationary distribution, not from a worst-case guess |
| Capability Coupling | Capability gain accumulates only above connectivity thresholds | Place feature-enable thresholds in the tail of the connectivity distribution, so they activate reliably when connectivity exists |
| Coordination Crossover | Distributed coordination dominates when the fraction of time in Connected + Degraded regimes drops below 80% | Pick your coordination mode from the regime distribution, not from peak-link assumptions |
| Constraint Sequence | Capabilities must be built in a strict prerequisite order: survival \(\to\) measurement \(\to\) healing \(\to\) coherence \(\to\) anti-fragility | You cannot safely build fleet coordination before self-healing is stable |
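The buffer-sizing consequence in the Connectivity Model row can be made concrete. The sketch below computes Weibull sojourn quantiles via the closed-form inverse CDF; the shape value \(k_{\mathcal{N}} = 0.62\) matches the RAVEN illustration later in the article, while the 40-second scale is a hypothetical placeholder, not a measurement.

```python
import math

def weibull_quantile(p, k, lam):
    """Inverse CDF of Weibull(shape k, scale lam):
    t = lam * (-ln(1 - p))^(1/k).
    On an MCU this is one log and one pow call."""
    return lam * (-math.log(1.0 - p)) ** (1.0 / k)

# Illustrative None-regime parameters: heavy tail (k < 1).
k_none, lam_none = 0.62, 40.0  # shape, scale in seconds (hypothetical)

median = weibull_quantile(0.50, k_none, lam_none)
p95 = weibull_quantile(0.95, k_none, lam_none)

# Heavy tail: P95 is many times the median, so sizing buffers from a
# "typical" partition badly underestimates the planning threshold.
print(f"median sojourn {median:.0f} s, P95 {p95:.0f} s")
```

With a heavy-tailed shape like this, the P95 quantile comes out roughly an order of magnitude above the median, which is exactly why the design consequence says to plan from the distribution rather than a typical-case guess.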

Physical translation: Think of this as a weighted sum over capability levels. Each capability — anomaly detection, fleet sync, anti-fragility learning — only “pays out” when connectivity is high enough to enable it. If connectivity almost never reaches a capability’s threshold, that capability contributes nothing to expected mission performance, no matter how well it’s implemented.
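That weighted sum can be computed directly over an empirical connectivity trace. In the sketch below, every threshold and payoff gain is hypothetical — chosen for illustration, not taken from the article:

```python
# Expected mission payoff as a weighted sum over capability thresholds.
# A capability contributes only when connectivity clears its threshold.
capabilities = [
    ("survival",          0.00, 1.0),   # (name, C threshold, payoff gain)
    ("anomaly detection", 0.10, 0.8),
    ("self-healing",      0.25, 0.6),
    ("fleet sync",        0.60, 0.5),
    ("anti-fragility",    0.85, 0.3),
]

def expected_payoff(c_samples):
    """Average payoff over an empirical distribution of C(t) samples."""
    total = 0.0
    for c in c_samples:
        total += sum(gain for _, thresh, gain in capabilities if c >= thresh)
    return total / len(c_samples)

# If C(t) almost never exceeds 0.85, anti-fragility contributes nearly
# nothing to the expectation, no matter how well it is implemented.
contested = [0.0] * 70 + [0.3] * 20 + [0.7] * 10   # 70% partitioned
print(expected_payoff(contested))  # 1.47
```

Under this trace the fleet-sync and anti-fragility tiers together add only 0.05 to the expectation; almost all realized value comes from the bottom three tiers.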

This framework builds on partition-tolerant systems [1, 2], delay-tolerant networking [3], and autonomic computing [5]. It targets contested environments where adversarial interference compounds natural connectivity challenges.

Six constraints must be answered before the framework can claim practical validity. They are introduced here as structural design constraints — not afterthoughts — because each shapes the formal machinery from the ground up:

  1. Clock drift. Crystal-oscillator nodes drift by seconds per hour and by minutes over 30-day (illustrative value) partitions. Last-Write-Wins conflict resolution silently inverts causal order when physical timestamps are untrustworthy. Constraint: the framework must not assume NTP availability for correctness guarantees. Answered in Fleet Coherence Under Partition by Definition 59 (Clock Trust Window) and Definition 71 (Causality Header). Both pivot from physical HLC ordering to pure logical ordering when the partition accumulator exceeds the drift tolerance.

  2. Resource floor. The full autonomic monitoring stack — EWMA, Merkle health tree, gossip table, EXP3-IX weight vector, event queue, vector clock — totals approximately 13 KB (illustrative value) of SRAM. A 4 KB (illustrative value) MCU has an autonomic ceiling of roughly 800 bytes (theoretical bound) under Proposition 68 . That means the stack exceeds its own budget by \(16\times\) before a single mission byte runs. Constraint: the stack must have a zero-tax tier that fits within 200 bytes (threshold — requires the full Zero-Tax implementation tier from Definitions 95–97). Answered in The Constraint Sequence and the Handover Boundary by Definitions 95–97 (Zero-Tax State Machine, In-Place Hash Chain, Fixed-Point EWMA) and Proposition 69 (Wakeup Latency Bound).

  3. Stability under mode-switching. Proposition 22 ’s gain bound assumes linear, time-invariant plant dynamics. Power-shedding makes a discrete function of capability level \(q\), turning the closed loop into a switched linear system. The LTI stability proof does not transfer across mode boundaries. Constraint: the stability proof must remain valid under mode transitions, including transitions forced by the resource floor constraint above. Answered in Self-Healing Without Connectivity by Proposition 25 (CBF mode-invariant safety), Theorem PWL (SMJLS mean-square stability), and Proposition 31 (CBF-derived refractory bound).

  4. Health observability without central infrastructure. This article defines the health vector \(\mathbf{H}(t)\) and the resource state \(R(t)\), but provides no mechanism for a node to populate those vectors without a central collector. During partition, there is no ground truth: a node can only observe its own sensors and neighbors it can reach. Constraint: anomaly detection and health propagation must operate locally on partial information. Answered in Self-Measurement Without Central Observability by Definitions 19–27 (Local Anomaly Detection, Gossip Health Protocol, Staleness, Byzantine Node) and Propositions 9–20 .

  5. State divergence under partition — the reconciliation debt. The state divergence metric \(D(t)\) grows during every partition but this article provides no mechanism for bounding or resolving that debt when connectivity resumes. Fleet-wide consistency requires concurrent writes on disconnected nodes to converge without centralized arbitration. Constraint: merge must be deterministic, commutative, and correct regardless of partition duration. Answered in Fleet Coherence Under Partition by Definitions 57–68 (State Divergence, CRDT, Vector Clock, Authority Tier) and Propositions 41–56 .

  6. Decision quality under sustained uncertainty. The capability hierarchy specifies what to run at each connectivity level, but says nothing about how to improve decision quality over time. A system that degrades gracefully but never learns from partition events is structurally correct but operationally stagnant. Constraint: the system must improve its decisions after stress events, not merely survive them. Answered in Anti-Fragile Decision-Making at the Edge by Definition 79 (Anti-Fragility), Propositions 57–65 , and the EXP3-IX / UCB bandit framework.
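Constraint 1's failure mode is easy to reproduce. The sketch below uses hypothetical timestamps: node B's write causally follows node A's, but B's drifted oscillator stamps it earlier, so Last-Write-Wins silently inverts the order while a Lamport counter keeps it.

```python
# Last-Write-Wins vs. logical ordering under clock drift (Constraint 1).
# Values are hypothetical; B's oscillator runs slow.

# Causal order: A writes, B observes A's write, then B overwrites.
event_a = {"node": "A", "phys_ts": 1000.0, "lamport": 5, "value": "patrol"}
event_b = {"node": "B", "phys_ts": 940.0,  "lamport": 6, "value": "contract"}

def lww_winner(e1, e2):
    """Pick the write with the later *physical* timestamp."""
    return e1 if e1["phys_ts"] > e2["phys_ts"] else e2

def lamport_winner(e1, e2):
    """Pick the write with the larger *logical* counter."""
    return e1 if e1["lamport"] > e2["lamport"] else e2

print(lww_winner(event_a, event_b)["value"])      # patrol  (causal order inverted)
print(lamport_winner(event_a, event_b)["value"])  # contract (causal order kept)
```

This is the exact pivot Definitions 59 and 71 make: once accumulated drift can exceed the inter-write interval, physical timestamps stop being evidence of order.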

These six constraints are not independent. The resource floor (Constraint 2) directly tightens the mode-switching envelope (Constraint 3): a node entering OBSERVE state deactivates MAPE-K, rendering the healing stability proof vacuously satisfied but also providing zero healing coverage. The clock drift (Constraint 1) interacts with mode-switching: an OUTPOST crystal-oscillator node exceeds the drift tolerance in under three hours, meaning the causal pivot fires before most fault scenarios escalate. The first three fixes compose into a single 20-byte field — the Unified Autonomic Header (Definition 72 in Fleet Coherence Under Partition) — that carries the clock fix, stability flag, and resource-tier signal in every gossip exchange without increasing per-item MTU. Constraints 4–6 are addressed independently in their respective articles but share the same gossip transport and state-vector substrate.

Applicability note. These solutions apply when the deployment satisfies the Inversion Threshold condition ( Proposition 2 ): \(P(C(t) = 0) > \tau^*\). If partition probability is below \(\tau^*\), cloud-connected diagnostics may be more cost-effective than the local autonomy stack developed across this series.
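The applicability test itself reduces to one estimate over a connectivity trace. A minimal sketch, assuming the illustrative \(\tau^* = 0.15\) from the Inversion Thesis (your cost-derived \(\tau^*\) may differ):

```python
def architecture_choice(connectivity_samples, tau_star=0.15):
    """Pick the baseline architecture from a trace of C(t) samples.
    tau_star = 0.15 is the illustrative Inversion Threshold."""
    p_zero = sum(1 for c in connectivity_samples if c == 0.0) / len(connectivity_samples)
    return "edge-autonomous" if p_zero > tau_star else "cloud-first"

# 30% of samples fully disconnected: design for disconnection as baseline.
trace = [0.0] * 30 + [0.6] * 70
print(architecture_choice(trace))  # edge-autonomous
```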


Epistemic Positioning and Methodology

Before building any formal machinery, this section is precise about what kind of claims the framework makes — and what it does not. A common mistake is reading a theoretical bound as a measured benchmark; the two are structurally different and must not be conflated.

What Kind of Claims Does This Framework Make?

Definition 0 (Framework Scope). Every claim in this series is produced by exactly one of three operations:

This framework is:

(Notation: \(\mathcal{A}\) denotes the assumption set in this section; it also appears as authority tier in Fleet Coherence Under Partition, as the anti-fragility coefficient in Anti-Fragile Decision-Making at the Edge, and as the action space in game-theoretic contexts. Subscripts and section context differentiate the four roles throughout the series.)

This framework is not:

Methodological Principles

Three principles govern how every mechanism in this series was derived:

Principle 1 (Assumption Explicitness). Every mechanism \(M\) is paired with an assumption set \(\mathcal{A}_M\). The validity of \(M\) is conditional on \(\mathcal{A}_M\) holding in the deployment context. If you skip verifying the assumptions, you’ve skipped the validity check.

Principle 2 (Derivation from Constraints). Mechanisms are not chosen arbitrarily — they are derived as logical consequences of constraints. If constraint \(c\) implies mechanism \(m\), we write \(c \vdash m\). The notation \(c \vdash m\) signals that a mechanism is forced, not selected.

Principle 3 (Architectural Coherence over Empirical Benchmarking). The primary goal is internal consistency — mechanisms compose correctly and satisfy stated objectives — not comparison against measured baselines. The framework succeeds if it’s coherent; proving it’s fast requires a separate empirical study.

How to Read Quantitative Elements

Throughout this series, numbers appear in three distinct roles. Conflating them is the most common misreading:

| Element Type | What It Means | Example |
| --- | --- | --- |
| Bounds | Theoretical limits derived from model structure — the system cannot exceed these | Regret bound for the EXP3-IX bandit |
| Thresholds | Decision boundaries derived from cost analysis — if your costs match the model, use this value | The Inversion Threshold \(\tau^*\) |
| Illustrations | Concrete numbers for pedagogical clarity only — they show how the framework applies, not what will happen on your hardware | “3–24 observations” for anomaly detection |

Illustrations are not performance claims. They show the framework in action under specific parameter choices. Calibrate every illustration against your own hardware before relying on it.

From Theory to Practice

This framework provides architectural templates and reasoning patterns. Putting it into production requires three steps that the framework itself cannot do for you:

  1. Assumption verification — Does your deployment context satisfy the assumption set \(\mathcal{A}\)? If not, which mechanisms need re-derivation?
  2. Parameter instantiation — What are the concrete values of each framework parameter in your system?
  3. Empirical validation — Does the implemented system actually achieve acceptable performance?

Reading Conditional Claims

Throughout the series, every conclusion is expressed in this form:

Plain English: the conditional form reads as “if the listed assumptions hold, then property \(P\) is guaranteed.” It is not saying \(P\) holds in general — only conditionally. Every time you see this notation, ask yourself: do my deployment conditions actually satisfy the listed assumptions?


Formal Foundations

An edge node is always doing one of two things: adapting to changing conditions, or failing to. Six state variables capture everything the node needs to know about itself to make that distinction precise.

Core State Variables

Six quantities fully describe an edge node’s operational state at any instant. Every other framework parameter derives from these six.

| Variable | Range | What It Measures | What Happens at the Limits |
| --- | --- | --- | --- |
| \(C(t)\) — link quality | 0 = disconnected, 1 = full capacity | Fraction of nominal link capacity currently available | At \(C=0\): no bits flow, all cloud-dependent capabilities stop. At \(C=1\): full datarate available, all regimes enabled |
| \(\Xi(t)\) — operating regime | \(\mathcal{C}\) (Connected), \(\mathcal{D}\) (Degraded), \(\mathcal{I}\) (Intermittent), \(\mathcal{N}\) (None) | Discrete behavioral label derived from \(C(t)\) thresholds — the RAVEN swarm runs four distinct behavioral envelopes based on this variable alone | At \(\mathcal{C}\): full consensus and fleet sync. At \(\mathcal{N}\): local autonomy only, no external coordination |
| \(\mathcal{L}(t)\) — capability level | 0 = survival-only, 4 = full optimization | Integer tier of service the node currently delivers. \(\mathcal{L}_0\): basic sensing and local storage. \(\mathcal{L}_4\): full coordinated learning across the fleet | At \(\mathcal{L}_0\): the node survives but does nothing beyond staying alive. At \(\mathcal{L}_4\): every optimization layer is running |
| \(\mathbf{H}(t)\) — health vector | One score per subsystem, each in \([0,1]\) | Per-subsystem health across \(n\) monitored components. For RAVEN with \(n=6\), a vector like \([0.9, 0.3, 1.0, \ldots]\) immediately flags the second subsystem as critically degraded | Any component at 0 = failed; triggers self-healing. All at 1.0 = nominal; no action required |
| \(D(t)\) — state divergence | 0 = in sync with fleet, 1 = fully isolated state | How far this node’s local state has drifted from fleet consensus during disconnection — the debt the system accumulates by operating autonomously | At \(D=0\): zero reconciliation cost at reconnection. At \(D=1\): maximum reconciliation cost — the node may need to replay the entire partition period |
| \(R(t)\) — resource availability | \([0, 1]\) | Normalized composite of battery SOC, free memory, and idle CPU. Critical threshold: \(R < 0.2\). Formal definition: Definition 1 below | At \(R > 0.2\): normal degraded operation. At \(R < 0.2\): emergency resource shedding; capability drops to \(\mathcal{L}_0\) |

(\(\mathbf{H}(t)\) here is a per-node vector of cardinality \(k\), one entry per monitored subsystem within a single node. Self-Measurement Without Central Observability extends this to a fleet-wide health vector \(\mathbf{H} \in [0,1]^n\) over all \(n\) nodes; the per-node score used here is that article’s \(h_i\) component.)
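Deriving \(\Xi(t)\) from \(C(t)\) is a pure threshold map. In the sketch below, only the \(C = 0\) boundary is fixed by the definitions here; the 0.3 cut point echoes the illustrative ‘Denied’ boundary noted under Definition 1, and the 0.7 cut point is a hypothetical placeholder to calibrate per deployment.

```python
def regime(c):
    """Map link quality C(t) in [0,1] to the operating regime label.
    Cut points other than C == 0 are illustrative, not normative."""
    assert 0.0 <= c <= 1.0
    if c == 0.0:
        return "None"          # local autonomy only, no external coordination
    if c <= 0.3:
        return "Intermittent"  # illustrative boundary
    if c <= 0.7:
        return "Degraded"      # hypothetical cut point
    return "Connected"         # full consensus and fleet sync
```

The behavioral envelope switches on this label alone, which is why the cut points deserve the same calibration scrutiny as any control gain.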

Definition 1 (Resource State). Let \(R(t) \in [0, 1]\) denote the normalized composite resource availability at time \(t\):

\[
R(t) = w_E\,E_{\text{soc}}(t) + w_M\,M_{\text{free}}(t) + w_C\,U_{\text{idle}}(t)
\]

where \(E_{\text{soc}}\), \(M_{\text{free}}\), and \(U_{\text{idle}}\) are the normalized battery state of charge, free memory, and idle CPU fractions, with weights \(w_E + w_M + w_C = 1\). Critical threshold: \(R < 0.2\) — 20% composite resource availability triggers survival mode regardless of connectivity state.
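Definition 1 in code, with hypothetical weights (the article does not fix \(w_E\), \(w_M\), \(w_C\); only that they sum to 1):

```python
def resource_state(soc, mem_free, cpu_idle, weights=(0.5, 0.3, 0.2)):
    """Composite R(t) in [0,1] from normalized battery SOC, free memory,
    and idle CPU. The weights here are hypothetical placeholders."""
    w_e, w_m, w_c = weights
    assert abs(w_e + w_m + w_c - 1.0) < 1e-9  # definition requires sum to 1
    return w_e * soc + w_m * mem_free + w_c * cpu_idle

R_CRIT = 0.2  # survival-mode trigger, regardless of connectivity state

r = resource_state(soc=0.10, mem_free=0.20, cpu_idle=0.10)
if r < R_CRIT:
    print("emergency shedding: capability drops to L0")
```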

The constraint sequence article uses ‘Denied’ for \(0 < C \leq 0.3\) and ‘Emergency’ for \(C = 0\) as an illustrative simplification of the Intermittent/None boundary; the authoritative label for complete disconnection remains \(\mathcal{N}\) (None) as defined here.

Notation Legend

Several symbols carry different meanings depending on context. The table below lists every symbol with multiple roles across the series — the subscript or context always resolves ambiguity.

| Symbol | Primary Meaning | Other Roles (subscript disambiguates) |
| --- | --- | --- |
| \(\mathcal{A}_x\) | Assumption set — subscript names the scenario | Authority tier, anti-fragility coefficient, and action space in later articles |
| \(\mathcal{Q}_j\) | Authority tier: Node (0), Cluster (1), Fleet (2), Command (3) | Text variant for delegated authority |
| \(\mathbb{A}\) | Anti-fragility coefficient \((P_1 - P_0)/\sigma\) — scalar | Distinct from assumption set \(\mathcal{A}\); double-struck A signals this |
| \(\mathcal{U}\) | Action or control space in optimization | Domain of optimization: \(a \in \mathcal{U}\) |
| \(\Gamma\) | Constraint set of all deployment constraints | Appears as \(c \in \Gamma\) |
| \(\mathcal{C}\) | Connected regime — highest connectivity state | Also: constraint set in the constraint sequence article; regime tuple |
| \(E\) | Edge-ness Score \(\in [0,1]\) classifying deployment type | Threshold comparisons: \(E < 0.3\) = edge, \(E \geq 0.6\) = cloud |
| \(T_d\) | Energy per local compute decision — joules, range 10–72 \(\mu\text{J}\) | Subscript d = “decide”; never a time value |
| \(T_s\) | Energy per radio packet transmission — joules, range 1–10 mJ | Subscript s = “send”; \(T_s / T_d \approx 10^2\)–\(10^3\) |
| \(\tau\) | Loop delay; staleness; partition duration; burst duration | Subscript selects role; bare \(\tau\) = loop delay; \(\tau^*\) = Inversion Threshold |
| \(\gamma\) | Semantic convergence factor (Def 1b); staleness-weight decay \(\gamma_s\); RBF kernel bandwidth \(\gamma_{\text{rbf}}\); Byzantine reputation rates | Bare \(\gamma\) = Def 1b in this article; Holt-Winters seasonal smoothing is \(s_{\text{hw}}\); subscript selects other roles |
| \(k_i\) | Weibull shape for regime \(i\) — controls partition tail heaviness | \(k_\mathcal{N} < 1\) = heavy tail; \(k=1\) = exponential; \(k>1\) = light tail. Distinct from uppercase \(K\) (control loop gain) — \(k_\mathcal{N}\) is a Weibull parameter never used as a gain |
| \(\lambda_i\) | Weibull scale for regime \(i\) — sets characteristic sojourn time; at \(k=1\): \(\lambda_i = 1/q_i\) | State update rate in Proposition 41 uses a separate subscripted form; subscript always disambiguates |
|  | Partition duration accumulator — contiguous time in \(\mathcal{N}\) | Reset to 0 on partition end; input to \(\theta^*(t)\) and circuit breaker (Proposition 37 — forward reference, defined in Self-Healing Without Connectivity) |
|  | P95 partition duration planning threshold | MCU cost: one pow() call |
|  | False-negative cost escalation rate \(\geq 0\) | 0 = static threshold; 2.0 = OUTPOST calibration; bounded by \([0, 5]\) in practice |
| \(\beta\) | Reconciliation cost (Prop 1); Holt-Winters trend coefficient; bandwidth asymmetry; Gamma prior rate \(\beta_i^0\) | Subscript or context selects meaning across articles |
| \(\rho_{\text{energy}}\) | Compute-to-transmit energy ratio — used in Proposition 1 dominance threshold; the local-dominant compute-cycle ceiling | \(\rho_q\) = CBF stability margin in Proposition 25 (defined in Self-Healing Without Connectivity); \(\rho\) (bare, without subscript) = retry-overhead multiplier in the inversion derivation — a distinct quantity |
|  | Spatial jamming correlation factor — scales how strongly a neighbor’s denial state elevates a node’s own Denied transition rate (Definition 17) | Distinct from \(\rho_{\text{energy}}\) (compute-to-transmit ratio above); 0 = independent fading; 1 = full area-denial coupling |
| \(f_g\) | Gossip contact rate (events/s) — fanout rate at which a node contacts random neighbors (Proposition 12) | Replaces former bare \(\lambda\) for gossip rate; distinct from Weibull scale \(\lambda_i\) |
| \(\kappa_{\text{drift}}\) | Kalman process noise rate (\(s^{-1}\)) — controls how fast the adaptive baseline adjusts across capability levels | Replaces former \(\lambda_{\text{drift}}\); distinct from Weibull scale \(\lambda_i\), gossip contact rate \(f_g\), and information decay rate \(\lambda_c\) |

Multi-Meaning Symbol Index

The following symbols carry distinct meanings across the series. When in doubt, the subscript always disambiguates. Use this table as a cross-reference anchor.

| Symbol | Meaning in this article | Same symbol, different meaning | Defined where |
| --- | --- | --- | --- |
| \(\rho_{\text{energy}}\) | Compute-to-transmit energy ratio \(T_s/T_d\) | \(\rho_q\): CBF stability margin (dimensionless); \(\rho\) (bare): retry-overhead multiplier in inversion derivation | Self-Healing Without Connectivity, Definition 39 |
| \(\rho\) (bare) | Retry-overhead multiplier in the Inversion Threshold derivation | \(\rho_{\text{energy}}\): compute-to-transmit energy ratio; \(\delta_{\text{ppm}}\): physical clock drift rate | Fleet Coherence Under Partition, HLC section |
| \(\lambda_i\) | Weibull scale parameter per regime (subscript \(i\) always present) | State update rate (subscripted separately); information decay rate \(\lambda_c\) | Subscript always disambiguates |
| \(f_g\) | Gossip contact rate (events/s) | Replaces former bare \(\lambda\) for gossip; distinct from \(\lambda_i\) and \(\kappa_{\text{drift}}\) | Self-Measurement Without Central Observability, Proposition 12 |
| \(\kappa_{\text{drift}}\) | Kalman process noise rate (\(s^{-1}\)) | Replaces former \(\lambda_{\text{drift}}\); distinct from \(\lambda_i\), \(f_g\), and \(\lambda_c\) | Self-Measurement Without Central Observability |
| \(\gamma\) | Semantic convergence factor (Definition 5b) | \(\gamma_s\): staleness-weight decay; \(\gamma_{\text{rbf}}\): RBF kernel; \(s_{\text{hw}}\): Holt-Winters seasonal; dCBF contraction rate | Subscript always disambiguates; bare Holt-Winters \(\gamma\) renamed \(s_{\text{hw}}\) |
| \(\beta\) | Reconciliation cost (Proposition 2) | Holt-Winters trend coefficient; bandwidth asymmetry \(B_{\text{backhaul}}/B_{\text{local}}\); Gamma prior rate | Subscript always disambiguates |
| \(\tau^*\) | Inversion threshold (partition probability) | \(\theta^*\): anomaly classification threshold | Self-Measurement Without Central Observability, Definition 19 |
| \(\mathbf{H}(t)\) | Per-component health vector for one node | \(\mathbf{H}\) = fleet-wide health vector over \(n\) nodes | Self-Measurement Without Central Observability, Definition 24 |
| \(K\) | Control loop gain (Constraint Structure; Definitions 3–4) | EXP3-IX arm count in Anti-Fragile Decision-Making at the Edge (Definition 81); \(k_\mathcal{N}\) = Weibull shape (subscript always present, never a gain) | Definition 3 (control gain) vs. Definition 81, Part 5 (arm count) |
| L0–L4 | Capability levels (functional service tier) | Authority tier \(\mathcal{Q}_j\): decision-scope hierarchy | Definition 3 (capability) vs. Definition 68 in Fleet Coherence Under Partition (authority) |

The complete series notation registry is maintained at Notation Registry.

Constraint Structure

Three hard constraints govern everything that follows. Violate any one of them and the architecture fails — not degrades, fails. \(B(t)\) is available bandwidth, \(R(t)\) is remaining resource budget (power, compute, memory combined), \(K\) is control loop gain throughout this article (the EXP3-IX arm count in Anti-Fragile Decision-Making at the Edge uses the same letter; context always resolves the role), and \(\tau\) is loop delay. Bare \(\tau\) here always means the loop delay; the Notation Legend above lists all four roles \(\tau\) plays across the series.

Scope — LTI assumption: The following constraint is derived under Linear-Time-Invariant dynamics. When power-shedding makes \(T_\text{tick}\) mode-dependent, the system becomes a switched linear system and this bound must be tightened by the SMJLS factor established in Self-Healing Without Connectivity.

This gain bound assumes linear time-invariant dynamics. Because Definition 3 introduces a switched system across capability modes, the loop delay \(\tau\) and the sampling period \(T_\text{tick}\) both become mode-dependent; implementers must read the dwell-time analysis in Self-Healing Without Connectivity before using this gain bound for production control tuning.


A fourth constraint captures the energy asymmetry that distinguishes edge from cloud — and it forces every architectural choice from the ground up.

Physical translation: Transmitting one radio packet costs roughly 1,000 times more energy than running one local compute operation. The system that offloads decisions to the cloud to “save compute” actually burns orders of magnitude more energy on the radio link than it saves on silicon.

Definition 2 (Energy-per-Decision Metric). The total energy cost of decision \(a\) is:

\[
E_{\text{dec}}(a) = n_c(a)\, T_d + n_s(a, C)\, T_s
\]

where \(n_c(a)\) is the number of local compute cycles required, \(n_s(a, C)\) is the number of radio packets required (zero when \(C = 0\)), \(T_d\) is joules per compute operation, and \(T_s\) is joules per transmitted packet.

This metric reframes every architectural choice as an energy budget problem: sending a single gossip packet costs as much energy as running hundreds to a thousand local compute cycles, so offloading a small decision to the cloud to “save compute” spends far more on the radio link than it saves on silicon.

Proposition 1 (Compute-Transmit Dominance Threshold). Local computation is energetically dominant — cheaper than radio-assisted offloading — for any decision requiring fewer than \(\rho_{\text{energy}}\) compute cycles, where \(\rho_{\text{energy}}\) is the energy ratio:

\[
\rho_{\text{energy}} = \frac{T_s}{T_d}
\]

On RAVEN, CONVOY, and OUTPOST hardware, one radio packet costs as much energy as hundreds to a thousand local compute operations — so any algorithm below that cycle count must run locally.

Physical translation: Running a decision locally is cheaper than transmitting it whenever the local compute footprint is less than \(T_s/T_d \approx 1000\) cycles (illustrative value) — a hardware constant, not a policy choice.

The “send everything to the cloud” model carries a hidden energy penalty: on most edge hardware, running a local inference pass costs less than transmitting a single sensor packet. Any algorithm with fewer than ~1,000 (illustrative value) operations should run locally — unconditionally, regardless of connectivity — because the radio link costs more energy than the silicon. This only becomes visible when both sides of the comparison are instrumented.

Empirical status: The threshold \(\rho_{\text{energy}} \approx 1000\) cycles (illustrative value) is derived from order-of-magnitude datasheet estimates for representative hardware classes; actual \(T_d\) and \(T_s\) values vary by \(2{-}5\times\) (illustrative value) across specific MCU and radio combinations and must be measured on the target platform before using this threshold for design decisions.

For an OUTPOST-class tactical radio link: any decision requiring fewer than 1,000 (illustrative value) local compute cycles is cheaper to run locally than to transmit — even when connectivity is available.

Design consequence: The inversion threshold \(\tau^*\) from Proposition 2 has an energy analog. Even when \(C(t) > \tau^*\) and distributed autonomy does not strictly dominate cloud control on latency or capability grounds, it may still dominate on energy grounds whenever the decision’s compute footprint satisfies \(n_c < \rho_{\text{energy}}\). At the edge, physics — not just connectivity — mandates local compute.

Watch out for: \(T_d\) and \(T_s\) must be measured on the actual hardware at operating temperature; if \(\rho_{\text{energy}}\) is estimated from datasheets, thermal throttling at high CPU load can double \(T_d\), halving the dominance threshold and reclassifying algorithms from local-dominant to radio-optimal.
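Both the threshold and its thermal sensitivity are one division each. A sketch using the RAVEN-class illustrative values from the table below (not measured figures):

```python
def dominance_threshold(t_d_joules, t_s_joules):
    """Max compute cycles for which local execution beats one radio packet
    (Proposition 1): n_c < T_s / T_d."""
    return t_s_joules / t_d_joules

T_D = 50e-6   # J per compute op, RAVEN-class MCU (illustrative)
T_S = 5e-3    # J per radio packet (illustrative)

nominal = dominance_threshold(T_D, T_S)        # ~100 cycles
throttled = dominance_threshold(2 * T_D, T_S)  # throttling doubles T_d -> ~50

# An 80-cycle algorithm is local-dominant nominally (80 < 100) but
# radio-optimal once throttled (80 > 50): measure T_d at temperature.
print(nominal, throttled)
```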

Illustrative hardware parameters (order-of-magnitude estimates consistent with representative datasheets for each platform class; not measured values — calibrate \(T_d\) and \(T_s\) for the target hardware):

| System | \(T_d\) | \(T_s\) | \(\rho_{\text{energy}}\) | Local-dominant threshold |
| --- | --- | --- | --- | --- |
| RAVEN drone MCU | \(50\,\mu\text{J}\) | \(5\,\text{mJ}\) | 100 | \(<100\) compute cycles |
| CONVOY vehicle ECU | \(20\,\mu\text{J}\) | \(8\,\text{mJ}\) | 400 | \(<400\) compute cycles |
| OUTPOST sensor node | \(10\,\mu\text{J}\) | \(10\,\text{mJ}\) | 1000 | \(<1000\) compute cycles |

Detection-value extension: Proposition 1 assumes all local computation has equivalent value per unit energy. For decision processes that prevent high-cost downstream events — anomaly detection avoiding cascading failure, intrusion detection preventing node compromise — the effective dominance threshold extends. Let \(V_{\text{det}}\) denote the energy-equivalent value of a correct detection (joules of downstream cost avoided). The extended dominance condition is:

\[
n_c \, T_d < T_s + V_{\text{det}}
\]

For \(V_{\text{det}} = k\,T_s\), the local-dominant region expands by factor \((1 + k)\):

\[
n_c < \rho_{\text{energy}}\,(1 + k)
\]

| \(V_{\text{det}}\) | RAVEN extended threshold | Design implication |
| --- | --- | --- |
| \(T_s\) (one avoided spurious alert) | \(n_c < 200\) | Models up to 2x more complex are local-dominant |
| \(5\,T_s\) (cluster-level false positive) | \(n_c < 600\) | Medium-complexity models (autoencoder, small TCN) justified |
| \(10\,T_s\) (mission-abort cost) | \(n_c < 1{,}100\) | Full TCN ensemble remains energetically dominant |

Quantifying the detection value requires estimating the failure cost — the energy and mission consequence of missing an anomaly — which is system-specific. The anomaly detection framework applies this extended threshold when selecting between EWMA, TCN, and ensemble models on resource-constrained edge nodes.
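The extension is a one-line scaling of the base threshold. A sketch, reproducing the table rows above from the RAVEN-class illustrative energies:

```python
def extended_threshold(t_d, t_s, detection_value_joules):
    """Local-dominant cycle ceiling when a correct detection avoids
    downstream cost V (joules): n_c < (T_s / T_d) * (1 + V / T_s)."""
    k = detection_value_joules / t_s
    return (t_s / t_d) * (1.0 + k)

T_D, T_S = 50e-6, 5e-3  # RAVEN-class illustrative values

print(extended_threshold(T_D, T_S, 0.0))       # base Proposition 1: ~100
print(extended_threshold(T_D, T_S, T_S))       # one avoided alert: ~200
print(extended_threshold(T_D, T_S, 10 * T_S))  # mission-abort cost: ~1100
```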

The autonomic floor problem. The energy analysis above assumes the autonomic management stack itself fits within the available headroom. On ultra-constrained MCUs with 4–80 KB (illustrative value) of SRAM, even a minimal EWMA baseline (80 B (illustrative value)) paired with a Merkle-tree health ledger (8 KB (illustrative value)) and gossip table (1 KB (illustrative value)) can exceed the autonomic ceiling from Proposition 68 — before a single mission byte runs. This is the autonomic floor problem: the monitoring stack outweighs its subject. The Constraint Sequence and the Handover Boundary addresses this directly with a Zero-Tax implementation tier that drops the active footprint from 13 KB (illustrative value) to under 200 bytes (threshold — requires the full Zero-Tax implementation tier) by deferring stack initialization until anomaly evidence accumulates.
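The flavor of that zero-tax tier can be illustrated with a fixed-point EWMA: integer-only state, no multiplies, a few bytes per stream. This is a sketch of the technique, not the series' reference implementation of Definition 97; the Q8.8 format and \(\alpha = 1/8\) are arbitrary choices.

```python
class FixedPointEWMA:
    """EWMA with alpha = 1/2**SHIFT implemented via shifts only.
    State is Q8.8 fixed point and fits in 16 bits of SRAM."""
    SHIFT = 3  # alpha = 0.125

    def __init__(self, initial_q88=0):
        self.state = initial_q88

    def update(self, sample_q88):
        # state += (sample - state) >> SHIFT : no multiply, no float
        self.state += (sample_q88 - self.state) >> self.SHIFT
        return self.state

ewma = FixedPointEWMA(initial_q88=100 << 8)
for _ in range(50):
    ewma.update(120 << 8)  # constant input: state converges toward it
print(ewma.state >> 8)     # near 120; truncation leaves a small residual
```

The integer shift floors each correction, so convergence stalls within a fraction of one unit of the target; that residual is the price of fitting the estimator in a handful of bytes.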

Memory Tier Summary: The capability hierarchy maps directly to memory allocation. \(\mathcal{L}_0\) requires only volatile SRAM for heartbeat state — a few hundred bytes (illustrative value). \(\mathcal{L}_1\) (anomaly detection) requires enough working memory for the adaptive baseline estimator, typically 4–8 KB (illustrative value) per sensor stream. \(\mathcal{L}_2\) (self-healing) requires persistent storage for the recovery plan DAG, typically 32–128 KB (illustrative value). \(\mathcal{L}_3\) (fleet coherence) requires proportional CRDT state, growing with fleet size. \(\mathcal{L}_4\) (anti-fragility) requires reinforcement learning weight tables, typically 2–16 KB (illustrative value) depending on the arm count.

Power-contingent operating modes and the switched-system regime. The energy analysis above treats \(T_d\) and \(T_s\) as fixed hardware constants. In practice, thermal throttling and power-shedding make \(T_d\) a function of the current capability level \(q\). A node under 50% thermal throttle doubles compute time per operation, directly inflating \(T_d\):

| System | \(T_d\) (L3, full) | \(T_d\) (L1, 50% throttle) | \(T_d\) (L0, monitor only) |
| --- | --- | --- | --- |
| RAVEN drone MCU | \(50\,\mu\text{J}\) | \(100\,\mu\text{J}\) | \(150\,\mu\text{J}\) |
| CONVOY vehicle ECU | \(20\,\mu\text{J}\) | \(40\,\mu\text{J}\) | \(65\,\mu\text{J}\) |
| OUTPOST sensor node | \(10\,\mu\text{J}\) | \(22\,\mu\text{J}\) | \(35\,\mu\text{J}\) |

This mode-dependence is the surface symptom of a deeper structural problem. The healing control loop — developed formally in Self-Healing Without Connectivity — assumes constant \(A\) and \(B\) matrices in \(\dot{x} = Ax + Bu\). Power-shedding changes \(T_\text{tick}(q)\) (the MAPE-K sampling period) and \(K(q)\) (the control gain); the system becomes a switched linear system that jumps between discrete stability envelopes as capability degrades. Two definitions anchor the full stability analysis in Self-Healing Without Connectivity.

Definition 3 (Hybrid Capability Automaton). The edge node’s closed-loop autonomic dynamics are modeled as a hybrid automaton:

\[
\mathcal{H} = (Q, \mathcal{X}, f, \mathrm{Inv}, G)
\]

The discrete and continuous state components are: the mode set \(Q = \{\mathcal{L}_0, \ldots, \mathcal{L}_4\}\) of capability levels, and the continuous state \(x \in \mathcal{X} \subseteq \mathbb{R}^n\) carrying the control error history.

The flow and invariant for each mode are: in mode \(q\), the error evolves as \(\dot{x} = A_q x + B_q u\) with mode-dependent sampling period \(T_\text{tick}(q)\) and gain \(K(q)\), valid while the resource state satisfies \(R(t) \in \mathrm{Inv}(q)\).

Transitions are governed by: guard conditions on \(R(t)\) — when \(R(t)\) exits \(\mathrm{Inv}(q)\), the automaton jumps to the mode whose invariant contains \(R(t)\), and \(x\) is carried over unchanged.
Physical translation: Think of this as a state machine where each mode has its own stability rules. The guard fires when the resource state \(R(t)\) drops out of a mode’s valid range — like a thermostat tripping a relay. Crucially, the error history \(x\) carries over across transitions with no reset, meaning every mode jump inherits the full consequence of what came before.

State resets are absent: the error history \(x\) is continuous across mode transitions. Every capability-level change is therefore a potential stability hazard that requires explicit pre-transition verification by Definition 39 (Nonlinear Safety Guardrail, defined in Self-Healing Without Connectivity).
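The mode-switching skeleton is small enough to sketch directly. The invariant ranges below are hypothetical placeholders; what matters is the structure: the guard fires on the resource state alone, and the error history is never reset.

```python
# Hypothetical resource invariants per capability mode: (R_min, R_max].
INVARIANTS = {
    "L3": (0.6, 1.0),
    "L2": (0.4, 0.6),
    "L1": (0.2, 0.4),
    "L0": (0.0, 0.2),
}

def step_mode(mode, r):
    """Return the mode whose invariant contains r. The continuous error
    state x is deliberately untouched: no reset across transitions."""
    lo, hi = INVARIANTS[mode]
    if lo < r <= hi:
        return mode              # invariant holds: stay
    for q, (qlo, qhi) in INVARIANTS.items():
        if qlo < r <= qhi:
            return q             # guard fired: jump, inherit x as-is
    return "L0"

x = [0.0, 0.3]  # error history: carried across every transition
mode = "L3"
for r in (0.8, 0.55, 0.35, 0.15):  # resource drains over time
    mode = step_mode(mode, r)
print(mode)  # L0
```

Because `x` survives every jump, a large error accumulated in L3 is still present when the node lands in L1's much smaller stability envelope — which is exactly the hazard the pre-transition check guards against.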

Definition 4 (Stability Region). For capability level \(q \in Q\), the Stability Region is the maximal forward-invariant ellipsoidal set under mode-\(q\) dynamics — the set of error states from which the healing loop provably converges to equilibrium:

\[\mathcal{R}_q = \{\, x : x^\top P_q\, x \leq c_q \,\}\]

where \(P_q \succ 0\) is the mode-\(q\) Lyapunov matrix computed offline via LMI (Theorem PWL, proved in Self-Healing Without Connectivity) and \(c_q > 0\) is the level-set radius. The stability margin at time \(t\) is:

\[\rho_q(t) = 1 - \frac{x(t)^\top P_q\, x(t)}{c_q}\]

Physical translation: \(\mathcal{R}_q\) is the safety envelope for the healing loop at capability level \(q\). When the error state \(x(t)\) stays inside this ellipse, healing converges. When it exits — \(\rho_q < 0\) — the loop diverges and must be suspended. The ellipse shrinks as the node degrades.

The shrinking stability envelope. Computed from the LMI solution for RAVEN parameters (P99 delay \(= 25\,\text{s}\), Weibull \(k_N = 0.62\) — Definition 13, below):

| Mode \(q\) | | | at \(\rho = 0.5\) | \(\mathcal{R}_q\) diameter |
|---|---|---|---|---|
| L3 (nominal) | 5 s | 0.50 | 0.43 | \(4.2\sigma\) |
| L2 (reduced sensing) | 8 s | 0.38 | 0.32 | \(3.5\sigma\) |
| L1 (thermal throttle) | 10 s | 0.33 | 0.28 | \(2.8\sigma\) |
| L0 (monitoring only) | 60 s | 0 | — | — |

A fault at \(3.2\sigma\) lies safely inside \(\mathcal{R}_{L3}\) (diameter \(4.2\sigma\)) and is correctable. The identical fault under L1 thermal throttle exceeds \(\mathcal{R}_{L1}\) (diameter \(2.8\sigma\)) — the healing loop diverges without a prior gain reduction. Definition 39 (Nonlinear Safety Guardrail, Self-Healing Without Connectivity) detects this pre-transition and enforces the required derate automatically.
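The pre-transition check implied by Definition 4 can be sketched in a few lines — a minimal sketch assuming a 2-D error state, an identity Lyapunov matrix, and region extents read off the table above (all placeholder values, not calibrated parameters):

```python
# Stability-region margin check before a capability-level transition.
# P_q is the mode-q Lyapunov matrix, c_q the level-set radius (Definition 4).
# The matrix and radii below are illustrative placeholders, not LMI output.

def quad_form(x, P):
    """x^T P x for a 2-D state and a 2x2 matrix given as nested lists."""
    y0 = P[0][0] * x[0] + P[0][1] * x[1]
    y1 = P[1][0] * x[0] + P[1][1] * x[1]
    return x[0] * y0 + x[1] * y1

def stability_margin(x, P_q, c_q):
    """rho_q = 1 - x^T P_q x / c_q; negative means x lies outside R_q."""
    return 1.0 - quad_form(x, P_q) / c_q

def safe_to_transition(x, P_target, c_target):
    """Guard: enter the target mode only if the inherited error state
    (which is NOT reset at the transition) lies inside the target's
    stability region."""
    return stability_margin(x, P_target, c_target) >= 0.0

P = [[1.0, 0.0], [0.0, 1.0]]   # identity Lyapunov matrix (placeholder)
x_fault = [3.2, 0.0]           # the "3.2-sigma" fault from the example
c_L3 = 4.2 ** 2                # L3 envelope extends ~4.2 sigma (illustrative)
c_L1 = 2.8 ** 2                # L1 envelope extends ~2.8 sigma (illustrative)
```

With these placeholder values, the L3 check passes and the L1 check fails — the same asymmetry the prose describes: the fault is correctable at full capability but not under thermal throttle.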

Prerequisite Ordering

Capabilities form a directed acyclic graph where \(A \prec B\) means “\(A\) must be validated before \(B\) is useful”:

Design consequence: Building anti-fragility before self-healing wastes effort. A node that learns from stress but cannot heal itself amplifies its own failures.

Objective Hierarchy

The system optimizes four objectives in strict lexicographic order — each must be satisfied before the next is considered. This ordering is not a preference; it is a correctness condition.

| Priority | Objective | Formula | Design Consequence |
|---|---|---|---|
| 1 | Survival | | Never sacrifice L0 for higher capability |
| 2 | Autonomy | | Capability under partition drives architecture |
| 3 | Coherence | | Design for fast merge at reconnection |
| 4 | Anti-fragility | \(\max \mathbb{A}\) | Learn from stress; improve under adversity |

Primary metric: Expected integrated capability. This quantity drives threshold placement, resource allocation, and every protocol design trade-off throughout the series.

System Boundaries

Decision scope determines protocol complexity and partition tolerance. Wider scope demands higher connectivity to execute — narrower scope succeeds even in full partition.

| Boundary | Timescale | Protocol | What Happens at Partition |
|---|---|---|---|
| Node | Milliseconds | Local state only | Fully autonomous — partition has no effect |
| Cluster | Seconds–minutes | Gossip | Cluster operates independently; no external coordination needed |
| Fleet | Minutes–hours | Hierarchical sync | Delegate to cluster leads with pre-authorized bounds |
| Command | Hours–days | Human-in-loop | Defer non-critical decisions; execute within pre-authorized envelope |

Authority tiers classify decisions by scope — node, cluster, fleet, command. Higher authority requires higher connectivity; partition triggers delegation to lower tiers with bounded autonomy.
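Partition-triggered delegation down the authority tiers can be sketched as follows (tier names follow the table above; the per-tier connectivity requirements are illustrative assumptions, not values from the article):

```python
# Hypothetical sketch of partition-triggered authority delegation.
# Tier names follow the System Boundaries table; the minimum-connectivity
# values are illustrative placeholders.

TIERS = ["node", "cluster", "fleet", "command"]  # ascending authority

# Minimum connectivity fraction C(t) assumed needed to exercise each tier.
MIN_CONNECTIVITY = {"node": 0.0, "cluster": 0.2, "fleet": 0.6, "command": 0.9}

def effective_tier(requested: str, c_t: float) -> str:
    """Delegate a decision down to the highest tier whose connectivity
    requirement is met; 'node' always works (fully autonomous)."""
    idx = TIERS.index(requested)
    for tier in reversed(TIERS[: idx + 1]):
        if c_t >= MIN_CONNECTIVITY[tier]:
            return tier
    return "node"
```

Under full partition (`c_t = 0.0`) every decision collapses to node scope; as connectivity recovers, authority re-aggregates upward — the "distributed, aggregated on connection" row of the assumption table.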


The Inversion Thesis

Fog computing [7], mobile edge computing [8], and the edge-cloud continuum share one foundational assumption: connectivity is the baseline state, and disconnection is an exception to recover from. This section inverts that assumption formally. Partition is the baseline. Connectivity is the opportunity. The derivation below establishes exactly where the crossover happens.

Cloud architecture assumes \(P(C = 0) < 0.01\) and reconnection within seconds. Partition handling exists but receives minimal optimization effort — it’s a fallback, not a design mode.

Edge architecture [9] operates under \(P(C = 0) > 0.15\), with partition durations measured in minutes rather than seconds. Under these conditions, designing for disconnection as baseline provably outperforms designing for connectivity as baseline — above a computable threshold.

Bounded claim: The difference becomes categorical above threshold \(\tau^*\). Below \(\tau^*\), cloud patterns may suffice. Above it, they cannot.

Assumption Set \(\mathcal{A}_{\text{inv}}\):

The table below makes explicit the eight structural assumptions where cloud-native and tactical edge systems differ. Each row is not a performance difference — it is a different architectural universe.

| Assumption | Cloud-Native Systems | Tactical Edge Systems |
|---|---|---|
| Connectivity baseline | Available, reliable, optimizable | Contested, intermittent, adversarial |
| Partition frequency | Exceptional (<0.1% of operating time)* | Normal (>50% of operating time) |
| Latency character | Variable but bounded | Unbounded (including \(\infty\)) |
| Central coordination | Always reachable (eventually) | May never be reachable |
| Human operators | Available for escalation | Cannot assume availability |
| Decision authority | Centralized, delegated on failure | Distributed, aggregated on connection |
| State synchronization | Continuous or near-continuous | Opportunistic, burst-oriented |
| Trust model | Network is trusted | Network is actively hostile |

*Based on major cloud provider SLAs (AWS, GCP, Azure) targeting 99.9%+ availability. Actual partition rates vary by region and service tier.

Definition 5 (Connectivity State). The connectivity state is a right-continuous stochastic process where \(C(t) = 1\) denotes full connectivity, \(C(t) = 0\) denotes complete partition, and intermediate values represent degraded connectivity as a fraction of nominal bandwidth.

Plain English: \(C(t)\) is simply the fraction of your radio link that’s working right now — a continuous dial from zero (blackout) to one (full capacity). “Right-continuous” means when the link drops, it drops instantly; there’s no grace period. The regime \(\Xi(t)\) then maps this continuous signal to one of four discrete behaviors.

Definition 6 (Connectivity Regime). A system operates in the cloud regime if \(P(C(t) = 0) < 0.01\). A system operates in the contested edge regime if \(\mathbb{E}[C(t)] < 0.5\) and \(P(C(t) = 0) > 0.1\) [9].

Plain English: Cloud regime means connectivity is nearly always there — less than 1% chance of full blackout. Contested edge means the link is below half-capacity on average and fully dark more than 10% of the time. Most tactical and industrial deployments measured in the field sit firmly in the edge regime before architecture is chosen.

Note on terminology: “partition” refers to a contiguous duration spent in the Denied regime (\(C(t) = 0\)); “Denied regime” is the connectivity state itself; “disconnection” is a generic informal term for either. These are used precisely: “partition duration” is always a time interval, never a state label.

Analogy: A ship’s navigator switching from GPS to dead reckoning to compass-only as signal degrades — each mode uses a different accuracy/resource tradeoff, with no guarantee of returning to the previous mode.

Logic: The regime process \(\Xi(t)\) discretizes the continuous link-quality signal \(C(t)\) via fixed thresholds; the semi-Markov model (Definition 12) governs how long the system stays in each regime and which regime follows next.

    
```mermaid
stateDiagram-v2
    [*] --> Connected
    Connected --> Degraded : RTT spike / partial loss
    Connected --> Denied : total link failure
    Degraded --> Connected : RTT < tau_low + sync complete
    Degraded --> Denied : loss_rate > L_max
    Denied --> Degraded : partial reconnect
    Denied --> Connected : full reconnect + sync complete
    Connected --> Connected : normal ops
    Degraded --> Degraded : adaptive throttling
    Denied --> Denied : local autonomy
```
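The regime discretization and the sync-gated re-entry guard can be sketched as follows (thresholds are illustrative, and the `classify`/`step` helper names are hypothetical):

```python
# Discretizing C(t) into Connected / Degraded / Denied — a minimal sketch
# of the state machine above. Threshold values are illustrative; real
# deployments calibrate them per Definition 6.

def classify(c: float, tau_low: float = 0.5) -> str:
    """Map the continuous link fraction C(t) to a discrete regime."""
    if c == 0.0:
        return "Denied"
    return "Degraded" if c < tau_low else "Connected"

def step(state: str, c: float, sync_complete: bool) -> str:
    """One transition. Re-entering Connected requires sync to have
    completed, mirroring the 'reconnect + sync complete' guards."""
    target = classify(c)
    if target == "Connected" and state != "Connected" and not sync_complete:
        # Partial reconnect: hold in Degraded (or Denied if still dark).
        return "Degraded" if c > 0.0 else "Denied"
    return target
```

Note the asymmetry: dropping into Denied is instantaneous (right-continuity of \(C(t)\)), but climbing back to Connected is gated on state reconciliation.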

Proposition 2 (Inversion Threshold). Under assumption set \(\mathcal{A}_{\text{inv}}\), there exists a threshold \(\tau^*\) such that cloud-native coordination patterns yield lower expected utility than partition-first patterns when \(P(C(t) = 0) > \tau^*\).

When disconnection exceeds a critical fraction — roughly 15% (illustrative value) for CONVOY — building for partition beats assuming connectivity, because retry-storm latency grows faster than reconciliation cost [1, 2].

The Problem: Cloud-native systems wait for connectivity to make decisions. When the link is down 15–61% (illustrative value) of the time, that wait compounds into unbounded latency — coordination overhead grows superlinearly as partition probability approaches the retry-storm regime.

The Trade-off: Partition-first architecture pays a reconciliation cost \(\beta\) every time connectivity resumes. You are trading per-reconnection overhead for freedom from coordination stalls during partition.

Analogy: A field reporter deciding whether to phone headquarters or make the editorial call on-site — above a certain call cost, you just decide locally. The threshold is not a policy preference; it is the exact break-even point where the retry-queue overhead exceeds the reconciliation cost.

Logic: Proposition 2 derives \(\tau^*\) by setting \(U_{\text{cloud}}(p) = U_{\text{edge}}(p)\) and solving for the crossover; above \(\tau^*\), the \(1/(1-p)\) latency blowup on the cloud side dominates the fixed reconciliation cost \(\beta\) on the edge side.

Formal Derivation:

Note: throughout this derivation, \(T_d\) denotes decision latency (seconds) and \(T_s\) denotes synchronization period (seconds) — distinct from the energy-per-decision and energy-per-packet homonyms in the Notation Legend, which carry the explicit superscript and when disambiguation is needed.

Let \(U_{\text{cloud}}(p)\) denote expected utility under cloud-native patterns and \(U_{\text{edge}}(p)\) under partition-first patterns, where \(p = P(C(t) = 0)\).

Cloud-native utility — coordination waits for connectivity. Expected decision latency grows with partition probability \(p\), where \(T_s\) is the synchronization period and \(\rho\) (here \(\rho\) = retry-overhead multiplier; distinct from energy ratio in Proposition 1 ) is the per-attempt retry overhead:

Physical translation: The \(1/(1-p)\) factor is the geometric series of retry attempts. At \(p = 0.5\) (illustrative value), expected latency doubles (illustrative value). At \(p = 0.9\) (illustrative value), it is \(10\times\) nominal (illustrative value) — the system spends most of its time in the retry queue, not executing decisions.

Partition-first utility — decisions proceed locally; reconciliation cost \(\beta\) (here \(\beta\) = reconciliation cost ratio; see Notation Legend for other roles of \(\beta\) in this series) is paid at reconnection:

Physical translation: Edge utility has two costs — the fixed cost of a local decision (\(\alpha T_d\)) and the reconciliation cost at reconnection (\(\beta(1-p)\), which shrinks as \(p\) increases because reconnection happens less often). As \(p \to 1\), the edge pays almost no reconciliation cost — it never reconnects.

Threshold derivation — setting \(U_{\text{cloud}}(p) = U_{\text{edge}}(p)\) and solving for the crossover yields, to first order:

\[\tau^* = \frac{(T_s - T_d) - \beta/\alpha}{T_s + \rho}\]

To make the dimensional analysis explicit: the products \(E_s T_s\) (energy expended per synchronization period) and \(E_d T_d\) (energy expended per decision interval) convert the time-based utility comparison into an energy budget; within the comparison itself, \(T_s\) and \(T_d\) appear purely as time quantities (synchronization period and decision latency).

Notation: In this formula, \(T_s\) denotes the synchronization period and \(T_d\) denotes decision latency — both time quantities in seconds. These are distinct from the energy symbols \(T_d\) (joules, energy per decision) and \(T_s\) (joules, energy per transmission) defined in Definition 2 . To avoid ambiguity, the time quantities are also written \(\delta_s\) (synchronization period) and \(\delta_d\) (decision latency) elsewhere in this section.

Physical translation: \(\tau^*\) is the break-even disconnection rate. Below it, cloud coordination is cheaper; above it, local autonomy wins. The numerator is the savings from avoiding cloud sync (\(T_s - T_d\)) minus the amortized reconnection cost (\(\beta/\alpha\)). The denominator normalizes by the full retry-inclusive sync burden. Plug in your field-measured \(P(C=0)\): if it exceeds \(\tau^*\), cloud-native architecture is provably suboptimal for your deployment.

The inversion threshold is the exact partition probability at which the energy math switches sides. Below \(\tau^*\), transmitting to the cloud costs less than computing locally; above it, the reverse holds. This single number determines the entire architectural regime for a given deployment — a system with \(\tau^* = 0.3\) (illustrative value) is cloud-first under 30% (illustrative value) denial probability but must be edge-first above it.

Operating a cloud-native architecture past this bound means coordination stalls compound across decision cycles faster than they can be cleared, producing a sustained retry backlog — above \(\tau^*\), the system must operate partition-first to avoid it.

For systems where \(T_s = kT_d\) with \(k \geq 5\) (synchronization slower than decisions) and \(\rho \approx T_s\), the threshold simplifies to:

\[\tau^* \approx \frac{(k-1)T_d - \beta/\alpha}{2kT_d} \approx \frac{k-1}{2k} \quad (\beta/\alpha \ll T_d)\]

(where \(k = T_s/T_d\) is the sync-to-decision time ratio)

For \(k = 5\) (illustrative value): \(\tau^* = 0.4\) (threshold — requires the assumption set 𝒜_inv to hold). Including retry storms (\(\rho\) increases superlinearly with \(p\)), the effective threshold drops to \(\tau^* \in [0.12, 0.18]\) (threshold — requires TCP-like congestion backoff at the MAC layer).
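A quick calculator for the first-order threshold — a sketch assuming the linearized form \(\tau^* = ((T_s - T_d) - \beta/\alpha)/(T_s + \rho)\), whose numerator (sync savings minus amortized reconciliation) and denominator (retry-inclusive sync burden) match the description above; for \(k = T_s/T_d = 5\), \(\rho = T_s\), and negligible \(\beta\) it reproduces \(\tau^* = 0.4\):

```python
def tau_star(T_s: float, T_d: float, alpha: float, beta: float, rho: float) -> float:
    """First-order inversion threshold.

    Numerator: savings from avoiding cloud sync (T_s - T_d) minus the
    amortized reconciliation cost (beta/alpha).
    Denominator: the full retry-inclusive sync burden (T_s + rho).
    Linearized form only -- for large beta, solve the quadratic numerically.
    """
    return ((T_s - T_d) - beta / alpha) / (T_s + rho)
```

Usage: plug in field-measured values; if your measured \(P(C=0)\) exceeds the returned threshold, cloud-native coordination is suboptimal for the deployment under \(\mathcal{A}_{\text{inv}}\).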

The retry storm correction is derived as follows. Under TCP-like congestion collapse, each retry attempt contends with active retries: \(\rho(p) = \rho_0/(1-p)\) (linear in the availability pressure \(1/(1-p)\)).

Substituting into the \(\tau^*\) formula with \(k=5\) (illustrative value) and solving numerically: at \(\rho_0 = T_s\) (retry cost equals one sync period), the crossover shifts from \(p = 0.40\) (illustrative value) to \(p \approx 0.17\) (illustrative value); at \(\rho_0 = 2T_s\), it shifts to \(p \approx 0.13\) (illustrative value). The range \([0.12, 0.18]\) (threshold — requires TCP-like congestion backoff at the MAC layer) corresponds to \(\rho_0 \in [T_s, 2T_s]\) — one to two sync periods of retry overhead, consistent with measured backoff behavior on contested tactical links.

Uniqueness caveat: The derivation assumes \(\rho(p)\) is monotonically increasing, which guarantees at most one zero crossing — a unique \(\tau^*\). When \(\rho_0\) is itself correlated with \(p\) (e.g., load-dependent exponential backoff), the effective retry cost becomes nonlinear and can produce two crossover points: a lower crossover where partition-first becomes preferable, and an upper crossover where extreme availability loss reverses the advantage. In such cases, solve numerically and verify only one root exists in \([0, 1]\) before applying the threshold.
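The root-count verification can be done on a grid — a hedged sketch in which the two utility functions are illustrative stand-ins (only the sign-change counting logic matters):

```python
# Count sign changes of U_edge(p) - U_cloud(p) on (0, p_max) to verify
# the crossover is unique before applying a single-threshold model.

def count_crossovers(u_cloud, u_edge, p_max: float = 0.99, n: int = 10_000) -> int:
    """Grid-based sign-change count; p_max < 1 avoids the 1/(1-p) pole."""
    flips = 0
    prev = u_edge(0.0) - u_cloud(0.0)
    for i in range(1, n + 1):
        p = p_max * i / n
        cur = u_edge(p) - u_cloud(p)
        if prev * cur < 0:
            flips += 1
        prev = cur
    return flips

# Illustrative utilities: cloud with a 1/(1-p) retry blow-up, edge with a
# (1-p)-shrinking reconciliation cost. One crossover expected.
u_cloud_demo = lambda p: 1.0 - 0.1 / (1.0 - p)
u_edge_demo = lambda p: 0.95 - 0.15 * (1.0 - p)
```

If `count_crossovers` returns anything other than 1, the single-threshold model does not apply and the full utility curves must be compared directly.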

Retry elasticity measures the fractional change in retry cost per fractional change in partition rate.

Retry Model Sensitivity Analysis: The threshold \(\tau^*\) is sensitive to the functional form of \(\rho(p)\). Retry elasticity is defined as \(\eta_\rho = \frac{d\ln\rho}{d\ln p}\): \(\eta_\rho = 0\) means retry cost does not grow with partition rate; \(\eta_\rho = 1\) means it grows proportionally. Four representative models bracket the practical range (all entries assume \(k = T_s/T_d = 5\), \(\rho_0 = T_s\)):

| Retry Model | \(\rho(p)\) Form | Physical Mechanism | \(\eta_\rho\) | \(\tau^*\) range |
|---|---|---|---|---|
| Fixed overhead | \(\rho_0\) | TDMA slot reservation, Link-16 fixed slot | 0 | ~0.40 |
| Soft exponential backoff | | CSMA/CA at low-to-moderate channel load | 0 to 1 | ~0.25–0.38 |
| TCP-like linear congestion | \(\rho_0/(1-p)\) | AIMD; each retry competes with concurrent retries | 1 | ~0.12–0.18 |
| Hard channel saturation | | Frequency-limited tactical net; no retry above saturation | 1 to \(\infty\) near saturation | Below 0.12 near saturation |

Robust bound (model-agnostic): For any \(\rho(p)\) with \(\rho'(p) \geq 0\) — retry overhead non-decreasing in \(p\) — the threshold satisfies \(\tau^* \leq 0.40\) (threshold — requires fixed-overhead MAC such as TDMA or Link-16). No physically realistic MAC protocol with non-negative congestion response can produce a \(\tau^*\) above 0.40.

The MAC protocol, not the synchronization ratio \(k\), drives most of the variation — systems with identical \(k\) values but different protocols can differ in \(\tau^*\) by a factor of 3–19 (illustrative value).

Retry elasticity is estimated by measuring mean per-attempt retry cost at \(p_1 = 0.10\) (light jamming) and \(p_2 = 0.30\) (moderate jamming). The retry elasticity estimate is:

\[\hat{\eta}_\rho = \frac{\ln\!\big(\rho(p_2)/\rho(p_1)\big)}{\ln(p_2/p_1)}\]

The three regimes of retry elasticity determine the threshold range to use:

RAVEN’s elasticity, measured at two jamming intensities, fell in the TCP-like range (\(\eta_\rho \approx 1\)), justifying the TCP-like model and the \([0.12, 0.18]\) (threshold — requires TCP-like congestion backoff at the MAC layer) threshold.
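The two-point estimator can be written directly (the measurement values in the usage note are hypothetical, generated from the TCP-like \(\rho_0/(1-p)\) form for illustration):

```python
import math

def retry_elasticity(rho1: float, p1: float, rho2: float, p2: float) -> float:
    """Two-point estimate of eta_rho = d ln(rho) / d ln(p), from mean
    per-attempt retry cost measured at two jamming intensities
    (e.g. p1 = 0.10 light, p2 = 0.30 moderate)."""
    return math.log(rho2 / rho1) / math.log(p2 / p1)
```

Feeding in costs sampled from \(\rho(p) = \rho_0/(1-p)\) at \(p_1 = 0.10\) and \(p_2 = 0.30\) yields the local elasticity of that curve over the measured interval; the asymptotic \(\eta_\rho\) values in the table characterize the models' limiting behavior.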

Utility gain from switching to partition-first:

\[\Delta U(p) = U_{\text{edge}}(p) - U_{\text{cloud}}(p) > 0\]

when \(p > \tau^*\), because the coordination delay term grows as \(O(1/(1-p))\) while reconciliation cost grows only as \(O(1-p)\).

Validity domain — this derivation holds when:

where \(N_d\) is the CRDT-resolvable data-conflict count and \(\gamma\) is the semantic convergence factor (Definition 5b). The CRDT data-conflict term is bounded; the semantic term is not.

Heavy-tail correction: Under the Weibull partition model (Definition 13, below), individual partitions have heavy-tailed durations (shape \(k_N < 1\)), meaning the expected retry cost during a specific ongoing partition is higher than the time-average suggests — long partitions generate disproportionate storm traffic, so the effective threshold drops further.

Under the CONVOY Weibull calibration (theoretical bound under illustrative Weibull parameters), the effective threshold falls below the exponential-model estimate. Systems near \(\tau^*\) under the exponential assumption should re-evaluate — they may already be past the inversion point.

Empirical status: The \(\tau^* \approx 0.15\) (illustrative value) figure derives from the TCP-like retry elasticity (illustrative value) measured on CONVOY’s contested link; deployments with different MAC protocols (TDMA, CSMA/CA) may place \(\tau^*\) anywhere in \([0.10, 0.40]\) (theoretical bound) — measure retry elasticity at two jamming intensities before committing to a fixed threshold.

Definition 5b (Semantic Convergence Factor). Let \(\mathcal{S}\) be the set of all state items produced by a reconciliation event, and \(\mathcal{S}_{\text{ok}} \subseteq \mathcal{S}\) the subset with no policy violations after merge. The semantic convergence factor is:

\[\gamma = \frac{|\mathcal{S}_{\text{ok}}|}{|\mathcal{S}|}\]

\(\gamma = 1\) means all merged state satisfies system policy. When \(\gamma < 1 - \varepsilon\), policy violations accumulate faster than they can be resolved — nodes must re-negotiate conflicting decisions, driving the reconciliation cost into the storm regime regardless of CRDT sync speed.

Critical distinction: CRDTs guarantee data convergence but have no effect on \(\gamma\). CRDT merge is syntactic — it resolves which bytes win. Policy compliance is semantic — it resolves whether the merged state is valid. These are independent problems.

Note: Setting \(U_{\text{cloud}}(p) = U_{\text{edge}}(p)\) exactly yields a quadratic in \(p\). The closed-form \(\tau^*\) is a first-order linear approximation valid when \(\beta(1-p)\) is small relative to \(\alpha T_d\). For large \(\beta\), solve the quadratic numerically.

When the inversion fails — two counter-scenarios worth stress-testing against your deployment:

  1. Short partitions, tolerant latency: A system with \(P(C = 0) = 0.20\) (illustrative value) but mean partition duration of 5 seconds (illustrative value) and \(T_d > 30\) seconds (illustrative value). Store-and-forward suffices; partition-first architecture adds unnecessary complexity. The inversion threshold assumes partitions are long enough to matter.

  2. Conflict cascade: Two clusters independently allocate the same exclusive resource. Upon reconnection, one allocation must be revoked — potentially cascading to dependent decisions. When \(\beta > \alpha T_s\), partition-first yields lower utility than blocking. The \(\beta\) estimate must account for semantic conflict cost, not just data merge cost.

Watch out for: \(\tau^*\) is derived under the assumption that \(\rho(p)\) is monotonically increasing — if the MAC protocol has a saturation regime where heavy jamming briefly reduces contention (e.g., TDMA collapses cleanly rather than thrashing), the utility crossing may be non-unique and the single-threshold model will underestimate the true crossover, leaving the system in cloud-native architecture past the actual inversion point.

Game-Theoretic Extension: Adversarial Inversion Threshold

Proposition 2 treats partition probability \(p\) as a property of the environment — exogenous, stable, measurable. In contested deployments, \(p\) is set by an adversary who observes your architecture and responds to it. This changes the analysis fundamentally.

The Problem: A cloud-native system with \(p = 0.10\) in peacetime may face \(p = 0.80\) when an adversary learns it depends on connectivity. The architecture that was “below the threshold” in design becomes catastrophically above it in operation.

The Solution: Model the adversary as a rational actor in a Stackelberg game — the defender commits to an architecture first, then the adversary selects jamming intensity to minimize defender utility. The game reveals which architecture is strategically dominant, not just expected-value optimal.

The Trade-off: Game-theoretic robustness requires committing to partition-first architecture even when current \(p\) is below \(\tau^*\) — accepting mild reconciliation overhead in exchange for removing the adversary’s most effective lever.

Stackelberg Game: The defender commits to an architecture; the adversary observes it and selects jamming intensity \(p \in [0, \bar{p}]\) to minimize defender utility.

Under cloud-native architecture, \(U_{\text{cloud}}(p)\) is strictly decreasing and convex in \(p\) via the \(1/(1-p)\) term. The adversary’s best response is trivial: apply maximum feasible jamming \(p = \bar{p}\). Every unit of jamming degrades the defender.

Under partition-first architecture, defender utility depends on \(p\) only through the reconciliation term:

\[U_{\text{edge}}(p) = U_0 - \alpha T_d - \beta(1-p)\]

Physical translation: This expression is increasing in \(p\) — more jamming means fewer reconnections, which means lower reconciliation overhead. The adversary facing a partition-first defender has no beneficial jamming strategy. Their rational response is to restore connectivity, not deny it.

Adversarially robust guarantee: The worst-case utility under each architecture when the adversary applies maximum jamming \(\bar{p}\):

Physical translation: The partition-first guarantee is a constant — it does not depend on how hard the adversary jams. The cloud-native guarantee collapses as \(\bar{p} \to 1\); it diverges to \(-\infty\) because the \(1/(1-\bar{p})\) term blows up. An adversary with access to heavy jamming can make cloud-native architecture arbitrarily bad; they cannot do the same to a partition-first system.
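The collapsing cloud guarantee versus the bounded edge guarantee can be checked numerically — a hedged sketch in which the utility forms and constants are illustrative stand-ins, not the article's calibration:

```python
# Worst-case (max-jamming) utilities under each architecture.
# All parameter values are illustrative placeholders.

U0, alpha, beta, T_d, T_s = 1.0, 0.05, 0.1, 0.2, 1.0

def u_cloud_worst(p_bar: float) -> float:
    """Cloud guarantee with a 1/(1-p_bar) retry blow-up: diverges to
    -infinity as p_bar -> 1."""
    return U0 - alpha * T_s / (1.0 - p_bar)

def u_edge_worst(p_bar: float) -> float:
    """Edge guarantee: bounded below by U0 - alpha*T_d - beta regardless
    of p_bar, since the reconciliation term beta*(1-p_bar) only shrinks
    as jamming intensifies."""
    return U0 - alpha * T_d - beta * (1.0 - p_bar)
```

At heavy jamming the asymmetry is stark: the cloud-side guarantee falls without bound while the edge-side guarantee approaches \(U_0 - \alpha T_d\) — jamming a partition-first defender actively helps it.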

The game-theoretic threshold satisfies the same crossover condition evaluated at maximum feasible jamming \(\bar{p}\). Systems in the hybrid zone \(0.3 \leq E < 0.6\) (theoretical bound) of the Edge-ness Score should be reassessed: an adversary can push them past \(\tau^*\) at will, but cannot degrade a partition-first system below its adversarially-robust guarantee.

Practical implication: For contested deployments, evaluate the inversion threshold using \(\bar{p}\) — maximum feasible jamming given the threat model — rather than expected \(p\). Partition-first architecture is strategically dominant against jamming adversaries. This property does not appear in the expected-utility analysis of Proposition 2 and only emerges from the game-theoretic formulation.

Non-Linear Inversion Threshold: Age of Information and Tiered Value Decay

Proposition 2 treats \(\alpha\) as a constant loss-cost slope: every second of waiting incurs the same \(\alpha\) utility penalty. This linear \(\alpha\) assumption holds in bulk data systems where a 10-second delay is mildly worse than a 5-second delay and both are tolerable. It fails completely for tactical, medical, and safety-critical operations where information has a hard expiry. A drone position fix that is 3 seconds old is useful; the same fix 30 seconds old is operationally worthless. These are not points on a line — they are separated by a cliff.

The formal framework for this distinction is Age of Information (AoI): the elapsed time \(\Delta(t) = t - u(t)\) since the last update \(u(t)\) was generated. The value of an observation is a function of its AoI, not merely its transmission delay. Crucially, different data classes have fundamentally different value-versus-age shapes.

Definition 7 (Tiered Value Decay Function). The value decay function \(v_c : [0,\infty) \to [0,1]\) for data class \(c\) maps AoI \(\Delta\) to residual information value as a fraction of maximum value \(V_0(c)\). Three operational tiers are distinguished:

Note: \(\lambda_c\) here is the per-class information decay rate in \(\text{s}^{-1}\); distinct from the gossip fanout rate \(\lambda\) of Proposition 12 and from the connectivity-regime rate of Definition 20. Subscript \(c\) selects data class.

Physical translation: A configuration update (soft tier, \(\lambda_c \approx 0\)) sent 10 minutes late (illustrative value) costs 10% (illustrative value) of baseline utility if the linear model fits. A drone position fix (tactical tier, \(\lambda_c = 0.14\,\text{s}^{-1}\) (illustrative value)) sent 10 seconds late (illustrative value) retains only \(e^{-1.4} \approx 25\%\) (illustrative value) of its value — the remaining 75% (illustrative value) is lost regardless of how efficiently it eventually arrives. A fire control solution (safety-critical tier, \(D_c = 2\,\text{s}\) (illustrative value)) sent 3 seconds late (illustrative value) has zero utility. The linear model conflates all three into the same slope \(\alpha\); the error is not a rounding problem, it is a structural misrepresentation.
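The three tier shapes can be sketched as small residual-value functions — a minimal sketch; the function names and the linear slope parameter for the soft tier are illustrative, following Definition 7's tier descriptions:

```python
import math

# Residual value v_c(delta) as a fraction of V0(c), for the three
# operational tiers of Definition 7. Parameter values are illustrative.

def v_soft(delta: float, slope: float = 0.0) -> float:
    """Soft tier (config updates): little to no decay; linear if any."""
    return max(0.0, 1.0 - slope * delta)

def v_tactical(delta: float, lam: float) -> float:
    """Tactical tier (position fixes): exponential decay at rate lam."""
    return math.exp(-lam * delta)

def v_safety(delta: float, deadline: float) -> float:
    """Safety-critical tier (fire control): full value up to the
    deadline D_c, zero after -- a cliff, not a slope."""
    return 1.0 if delta <= deadline else 0.0
```

With \(\lambda_c = 0.14\) a 10-second-old tactical fix retains \(e^{-1.4} \approx 0.25\) of its value, and a fire-control solution 3 seconds past a 2-second deadline retains nothing — the cliff the linear model cannot represent.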

AoI-Corrected Utility Functions

Under exponential decay \(v_c(\Delta) = e^{-\lambda_c \Delta}\), the value of a cloud decision depends on how long the system waited before connectivity was available. Let \(r_c = e^{-\lambda_c \delta_s}\) denote the fraction of information value surviving one complete sync period \(\delta_s\) (the synchronization period in seconds; see Assumption \(A_3\) — distinct from the energy-per-transmission \(T_s\) in Definition 2).

The cloud system makes retry attempts spaced \(T_s\) apart; the number of attempts before success is geometrically distributed with success probability \(1-p\), so the waiting time is \(W = N T_s\) where \(N \sim \text{Geometric}(1-p)\). The expected residual value at the moment the decision executes is:

\[\mathbb{E}\!\left[r_c^{N}\right] = \frac{(1-p)\,r_c}{1 - p\,r_c}\]

using the probability generating function of the geometric distribution, \(\mathbb{E}[z^N] = \frac{(1-p)z}{1-pz}\). The AoI-corrected cloud utility is:

\[U^{\text{AoI}}_{\text{cloud}}(p) = U_0 \cdot \frac{(1-p)\,r_c}{1 - p\,r_c}\]

Physical translation: As \(p \to 1\), \(U^{\text{AoI}}_{\text{cloud}} \to 0\) — the information is worthless by the time connectivity is restored, regardless of how little reconciliation costs. At \(r_c = 0.5\) (illustrative value) (half-life equals one sync period), a 50% (illustrative value) partition probability reduces cloud utility to \(U_0/3\) (illustrative value) — a 67% (illustrative value) utility reduction, not the 50% (illustrative value) the linear model would predict.

The edge decision executes immediately at local latency \(T_d\); the AoI-corrected edge utility becomes:

\[U^{\text{AoI}}_{\text{edge}}(p) = U_0\, e^{-\lambda_c T_d} - \beta(1-p)\]

Since \(T_d \ll T_s\) in practice (local compute is \(10^3\times\) faster than sync), \(e^{-\lambda_c T_d} \approx 1\) and the edge retains essentially full information value.

Re-Deriving the Inversion Threshold

Proposition 3 (Non-Linear Inversion Threshold). Under tiered value decay (Definition 7), the AoI-corrected inversion threshold satisfies:

\[\tau^*_{\text{AoI}} \leq \tau^*_{\text{linear}}\]

For fast-decaying data like RAVEN collision alerts, the inversion point is lower than the linear model suggests — edge-first wins even when connectivity is nearly perfect.

with equality only in the degenerate case \(\lambda_c \to 0\) (no decay). For the exponential-decay tier, \(\tau^*_{\text{AoI}}\) solves:

\[U_0 \cdot \frac{(1-p)\,r_c}{1 - p\,r_c} = U_0\, e^{-\lambda_c T_d} - \beta(1-p)\]

Setting \(e^{-\lambda_c T_d} \approx 1\) and rearranging yields the exact crossover condition:

\[U_0 (1-p)\, r_c = \big(U_0 - \beta(1-p)\big)\,(1 - p\, r_c)\]

This quadratic in \(p\) admits a valid threshold \(\tau^*_{\text{AoI}} \in [0, 1)\) only when the reconciliation cost exceeds the per-period value loss:

\[\beta \geq U_0(1 - r_c)\]

When the condition fails — i.e., \(\beta < U_0(1-r_c)\) — there is no crossover: edge-first is unconditionally preferred regardless of partition probability \(p\).

The multiplicative structure — \(U_0\) scaled by residual value \(r_c\), not added to a connectivity penalty — is what makes this failure non-local: as information freshness tightens (\(r_c \to 0\)), the edge advantage \(U_0(1-r_c) \to U_0\) grows stronger, not weaker. An additive cost model of the form \(\alpha T + \beta E + \gamma C\) cannot reproduce this behaviour, because any such model produces a threshold that eases as connectivity improves, rather than one that hardens as information value decays.

Two tractable special cases illustrate the boundary:

Case A — fast decay (\(r_c \ll 1\), i.e., \(\lambda_c T_s \gg 1\)): Cloud utility collapses because information expires before connectivity resumes. The threshold-existence condition becomes \(\beta \geq U_0\) — reconciliation cost must be at least as large as the total information value. For typical deployments with \(\beta \ll U_0\), no valid threshold exists: edge-first dominates for every \(p \in [0, 1]\).

The edge advantage at \(p = 0\) (always connected) — the cloud’s best-case — is:

\[U^{\text{AoI}}_{\text{edge}}(0) - U^{\text{AoI}}_{\text{cloud}}(0) = U_0(1 - r_c) - \beta\]

For \(\beta/U_0 = 0.05\) (illustrative value) and \(r_c = 0.50\) (illustrative value) (RAVEN position), the advantage is \(0.45\,U_0\) (illustrative value). Edge-first delivers 45% (illustrative value) more utility even when cloud connectivity is perfect — the data expires during the sync period itself.
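The existence test and the best-case edge advantage reduce to two one-liners — assuming the condition \(\beta \geq U_0(1-r_c)\) and the \(p = 0\) advantage \(U_0(1-r_c) - \beta\) stated above (both in units of \(U_0\)):

```python
def threshold_exists(beta_over_U0: float, r_c: float) -> bool:
    """A finite AoI crossover exists only when reconciliation cost
    exceeds the per-period value loss: beta/U0 >= 1 - r_c."""
    return beta_over_U0 >= 1.0 - r_c

def edge_advantage_at_p0(beta_over_U0: float, r_c: float) -> float:
    """Edge minus cloud utility at p = 0, in units of U0:
    (1 - r_c) - beta/U0. Positive means edge-first wins even under
    perfect connectivity."""
    return (1.0 - r_c) - beta_over_U0
```

For the RAVEN numbers (\(\beta/U_0 = 0.05\), \(r_c = 0.50\)) no threshold exists and the advantage is \(0.45\,U_0\), matching the table below.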

Case B — hard deadline (safety-critical tier): When the deadline \(D_c < T_s\) (deadline shorter than a single sync period), only the first attempt can succeed within the window. The cloud utility collapses to:

\[U^{\text{AoI}}_{\text{cloud}}(p) = U_0(1-p)\]

Setting \(U_0(1-p) = U_0 - \beta(1-p)\):

\[\tau^*_{\text{AoI}} = \frac{\beta}{U_0 + \beta}\]

For \(\beta/U_0 = 0.05\) (illustrative value): \(\tau^*_{\text{AoI}} = 0.05/1.05 \approx 0.048\) (theoretical bound under illustrative parameters) — the inversion crosses at less than 5% (illustrative value) partition probability.

Proof sketch (Case B): From \(U_0(1-p) = U_0 - \beta(1-p)\): \((1-p)(U_0 + \beta) = U_0\), giving \(p = \beta/(U_0+\beta)\). For \(n\) sync periods available before the deadline (\(D_c = nT_s\)), the cloud utility becomes \(U_0(1 - p^n)\); the threshold satisfies \((\tau^*)^n = (\beta/U_0)(1 - \tau^*)\), which converges to 1 as \(n \to \infty\) — confirming that only tight deadlines drive the threshold below the linear result. \(\square\)

Threshold existence table: The following shows whether a valid \(\tau^*_{\text{AoI}}\) exists by data class (\(T_s = 5\,\text{s}\) (illustrative value), \(\beta/U_0 = 0.05\) (illustrative value)); where none exists, the edge advantage at \(p=0\) (best-case cloud) quantifies how unconditionally edge-first wins.

| Data class | \(\lambda_c\) (\(\text{s}^{-1}\)) | \(1 - r_c\) | Threshold exists? | Implication |
|---|---|---|---|---|
| Config / audit log | \(\approx 0\) | \(\approx 0\) | Yes | Linear model valid; no AoI correction needed |
| OUTPOST thermal reading | 0.01 | 0.05 | Marginal (\(\beta/U_0 = 1-r_c\)) | Edge preferred for any \(p > 0\); cloud viable only at \(p=0\) |
| CONVOY position fix | 0.07 | 0.29 | No — \(U_0(1-r_c) > \beta\) | Edge advantage at \(p=0\): \(+24\%\,U_0\). Always use edge-first |
| RAVEN position / collision | 0.14 | 0.50 | No — \(U_0(1-r_c) \gg \beta\) | Edge advantage at \(p=0\): \(+45\%\,U_0\). Always use edge-first |
| Threat alert / fire control | 1.00 | 0.993 | No — nearly all value lost per period | Edge advantage at \(p=0\): \(+94\%\,U_0\). Cloud-native is architecturally incoherent for this class |

Physical translation: The correct question is not “what threshold drives the switch to edge-first?” — it is “does a threshold even exist for this data class?” For RAVEN collision-avoidance data, cloud-native delivers 45% (illustrative value) less utility than local autonomy even when connectivity is perfect (\(p=0\)), because the data expires during the sync period itself. A system designer who uses the linear \(\tau^* = 0.40\) (threshold — requires fixed-overhead MAC such as TDMA or Link-16) threshold to justify cloud-native coordination for position tracking is not near the boundary — they are off the map entirely. The inversion is unconditional for any data with \(1 - r_c > \beta/U_0\), which is true of every tactical real-time data class.

Revised validity condition: The linear model’s condition \(\beta < \alpha T_s\) (“reconciliation cheaper than prolonged waiting”) generalizes under exponential decay to a bound on \(\beta\) that depends on \(r_c\). As \(r_c \to 0\) (fast decay): the bound approaches \(U_0\) — almost any reconciliation cost is acceptable because the stale data being reconciled has negligible value anyway. As \(r_c \to 1\) (slow decay): the bound approaches \(\alpha T_s / 2\) — stricter than the original condition by a factor of 2 in the slow-decay limit, due to the geometric compounding of sync latency.

Empirical status: The decay rates \(\lambda_c\) (0.07 (illustrative value) for CONVOY position, 0.14 (illustrative value) for RAVEN collision) and the \(\beta/U_0 = 0.05\) (illustrative value) reconciliation ratio are calibrated from scenario-specific field estimates; actual values are system-dependent and should be measured from reconnection logs before concluding that a threshold exists or does not exist for a given data class.

Watch out for: \(\lambda_c\) must be calibrated from observed value loss in the field rather than from design specifications; if a data class is assigned a lower \(\lambda_c\) than its true decay rate warrants, the threshold-existence condition \(\beta \geq U_0(1-r_c)\) may be satisfied on paper while the actual AoI behavior places the system in the unconditional edge-first regime — and the mistaken threshold will delay the switch past the true inversion point.
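The threshold-existence test above reduces to a one-line comparison. A minimal sketch, assuming the section's illustrative values (\(T_s = 5\) s, \(\beta/U_0 = 0.05\)); the data-class labels are taken from the table:

```python
import math

# Illustrative constants from this section (not normative).
T_S = 5.0          # sync period, seconds
BETA_RATIO = 0.05  # reconciliation cost beta / U0

def retention(lmbda: float, t_s: float = T_S) -> float:
    """Fraction of value retained after one sync period: r_c = exp(-lambda_c * T_s)."""
    return math.exp(-lmbda * t_s)

def threshold_exists(lmbda: float, beta_ratio: float = BETA_RATIO) -> bool:
    """A switching threshold exists only if beta/U0 >= 1 - r_c."""
    return beta_ratio >= 1.0 - retention(lmbda)

def edge_advantage_at_p0(lmbda: float, beta_ratio: float = BETA_RATIO) -> float:
    """Edge-first utility advantage at p = 0, as a fraction of U0."""
    return (1.0 - retention(lmbda)) - beta_ratio

for name, lam in [("config", 0.001), ("thermal", 0.01),
                  ("convoy_fix", 0.07), ("raven", 0.14), ("fire_control", 1.0)]:
    print(name, threshold_exists(lam), round(edge_advantage_at_p0(lam), 2))
```

Running this reproduces the table's verdicts: a threshold exists only for the near-zero-decay classes, and the \(p=0\) edge advantage grows with \(\lambda_c\).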

Value-Density Routing

Definition 8 (Value-Density Metric). For a message \(m\) of data class \(c\) with current AoI \(\Delta_m\), the value density is:

\[
\nu(m, \Delta_m) \;=\; \frac{\lambda_c\, U_0\, e^{-\lambda_c \Delta_m}}{s_m},
\]

where \(s_m\) is the transmission size in bytes. At generation time (\(\Delta_m = 0\)): \(\nu(m, 0) = \lambda_c U_0 / s_m\). Value density equals the marginal rate of value loss per byte of channel capacity consumed.

Proposition 4 (AoI-Optimal Routing Priority). Among all non-preemptive transmission schedules with bounded channel capacity \(B\) bytes/s and a queue of \(n\) messages with independent exponential decay, the schedule minimizing total value lost in \([0, T]\) is the greedy schedule that transmits messages in decreasing order of \(\nu(m, \Delta_m)\) at each decision epoch.

Always send the message losing value fastest per byte of bandwidth — a RAVEN collision alert should jump the queue over a large diagnostic bundle regardless of arrival order.

Proof sketch: By exchange argument — swapping any two adjacent messages that are out of value-density order (moving the higher-density message first) strictly increases total transmitted value. The greedy rule is therefore optimal among single-channel non-preemptive schedulers. \(\square\)

Physical translation: Transmit the message that is losing value fastest per byte of bandwidth it consumes. A brief high-urgency alert (small, fast-decaying) should always preempt a large low-urgency diagnostic bundle. This is the operational implementation of the non-linear inversion insight: data classes with high \(\lambda_c\) must move first, not just because they are important, but because their value-per-byte ratio deteriorates faster than any other resource cost.

Multi-class implementation: Partition messages into priority classes using static \(\lambda_c\) thresholds. Within each class, sort by \(\nu(m, \Delta_m)\). Head-of-line blocking rule: messages with \(\Delta_m > D_c/2\) (half the class deadline \(D_c\)) pre-empt any lower class unconditionally — the deadline is approaching and no soft-class transmission can recover the lost value.

Watch out for: the proposition assumes independent value decay for each message; when messages form bundles where a correction or context packet has no value without its anchor (a position fix without its associated reference frame, for instance), greedy single-message ordering can transmit the high-\(\lambda_c\) correction while the low-\(\lambda_c\) anchor is deferred, delivering a useless message at the cost of a useful one.
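The greedy rule of Proposition 4 can be sketched directly. The message values below are illustrative stand-ins (sizes, ages, and the diagnostic decay rate are assumptions, not the article's calibrated parameters):

```python
import math
from dataclasses import dataclass

@dataclass
class Message:
    name: str
    lam: float   # decay rate lambda_c (1/s)
    u0: float    # initial value U0
    size: int    # transmission size, bytes
    age: float   # current AoI Delta_m, seconds

def value_density(m: Message) -> float:
    """nu(m, Delta) = lambda_c * U0 * exp(-lambda_c * Delta) / size."""
    return m.lam * m.u0 * math.exp(-m.lam * m.age) / m.size

def greedy_schedule(queue: list[Message]) -> list[Message]:
    """Transmit in decreasing value-density order (Proposition 4)."""
    return sorted(queue, key=value_density, reverse=True)

queue = [
    Message("diagnostic_bundle", lam=0.001, u0=1.0, size=200_000, age=10.0),
    Message("collision_alert",   lam=0.14,  u0=1.0, size=48,      age=2.0),
    Message("position_fix",      lam=0.07,  u0=1.0, size=48,      age=5.0),
]
print([m.name for m in greedy_schedule(queue)])
# The small, fast-decaying collision alert jumps the queue.
```

Note that arrival order never appears in the key: a late-arriving alert still pre-empts an early diagnostic bundle, exactly as the proposition requires.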

Criticality-Aware TTL

Definition 9 (Criticality-Aware TTL). For data class \(c\) with decay rate \(\lambda_c\) and a minimum value floor \(V_{\min}\) below which a message is operationally worthless, the criticality-aware TTL is the AoI at which the floor is crossed:

\[
T_c^{\mathrm{TTL}} \;=\; \frac{1}{\lambda_c}\,\ln\!\frac{U_0}{V_{\min}}.
\]

A message that has not been reconciled (applied to state, forwarded, or acknowledged) by \(\Delta_m = T_c^{\mathrm{TTL}}\) is self-deleted from the queue. Delivery after this point consumes channel bandwidth without delivering operationally useful information.

Clock-drift correction: In practice, onboard clocks drift at 10–100 ppm (illustrative value) relative to GPS time. After a partition of duration \(T_p\), the effective TTL should be reduced by \(\rho T_p\), where \(\rho\) is the measured drift rate. For a 48-hour (illustrative value) partition at 100 ppm (illustrative value) drift, this adds a 17-second (illustrative value) correction — negligible for most applications but material for tight synchronization requirements.

| Data class | \(\lambda_c\) (s\(^{-1}\)) | TTL (1% floor) | Operational meaning |
|---|---|---|---|
| Config / policy | 0.001 | 77 min | Survives any realistic partition; always transmit |
| OUTPOST perimeter reading | 0.01 | 7.7 min | Drop if still queued after 7 minutes |
| CONVOY position fix | 0.07 | 66 s | Drop if not delivered within 1 minute |
| RAVEN collision-avoidance | 0.14 | 33 s | Drop if not delivered within 33 seconds |
| Fire control solution | 1.00 | 4.6 s | Drop if not delivered within 5 seconds |
| Defibrillator timing | 10.0 | 0.46 s | Drop if not delivered within 0.5 seconds |

Physical translation: The TTL is the inverse of urgency. Do not transmit a fire control solution that was generated 10 seconds ago — it is worse than sending nothing, because it consumes bandwidth needed for the current solution. The reconnection storm after a long partition should not retransmit every queued message: only messages with \(\Delta_m < T_c^{\mathrm{TTL}}\) carry residual value. Transmitting the rest is channel pollution.

Reconnection storm TTL filter: When connectivity resumes after a partition of duration \(T_p\), apply a TTL pre-filter before reconciliation ( Definition 70 in Fleet Coherence Under Partition): discard all queued messages with \(\Delta_m \geq T_c^{\mathrm{TTL}}\). These messages have already expired by the time the reconnection window opens. For RAVEN , with \(T_p\) far exceeding the 33-second collision-avoidance TTL: every position fix and collision-avoidance vector in the queue is discarded — the fleet must re-acquire current position from fresh sensor readings, not from stale gossip.

For the RAVEN swarm — 47 drones at 1 Hz position update rate and 48-byte message size — without value-density routing, position fixes (\(\lambda_c = 0.14\)) compete equally with diagnostic telemetry (\(\lambda_c \approx 0.001\)). With value-density routing, position fixes carry a \(\nu\) ratio \(140\times\) higher per byte — they clear the queue first on every reconnection.

At a 250 kbps uplink — representative of CONVOY — with 15 stale position fixes queued (\(\Delta_m > 66\) s), the TTL filter discards all 15 before transmission, freeing the channel for the 3 messages with residual value (config updates, still within their 77-minute TTL).
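The TTL computation and the reconnection-storm filter fit in a few lines. A sketch using the decay rates from the table above (the queue contents mirror the CONVOY example and are illustrative):

```python
import math

def ttl(lam: float, floor_ratio: float = 0.01) -> float:
    """Criticality-aware TTL: AoI at which value falls to floor_ratio * U0."""
    return math.log(1.0 / floor_ratio) / lam

def storm_filter(queue: list[dict], now: float) -> list[dict]:
    """Drop queued messages whose AoI already exceeds their class TTL."""
    return [m for m in queue if (now - m["t_gen"]) < ttl(m["lam"])]

print(round(ttl(0.001) / 60))  # config/policy: ~77 minutes
print(round(ttl(0.14)))        # RAVEN collision-avoidance: ~33 seconds

# 15 stale CONVOY position fixes plus 3 config updates, all generated at t=0.
queue = [{"name": f"fix{i}", "lam": 0.07, "t_gen": 0.0} for i in range(15)]
queue += [{"name": "config", "lam": 0.001, "t_gen": 0.0} for _ in range(3)]
survivors = storm_filter(queue, now=120.0)  # reconnect after a 120 s partition
print(len(survivors))  # only the 3 config updates survive the filter
```

The 1% floor reproduces the table exactly because \(\ln(100)/\lambda_c \approx 4.6/\lambda_c\); a different \(V_{\min}/U_0\) ratio rescales every row uniformly.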

Causal Ordering Hazard in Physical-Time TTL Filtering

The TTL filter above uses the physical timestamp embedded in each message to compute AoI. Physical timestamps are only as reliable as the generating node’s local clock.

A node whose clock drifts forward by \(\varepsilon\) seconds produces timestamps that are \(\varepsilon\) seconds too large — its messages appear artificially newer than they are. A node whose clock drifts backward by \(\varepsilon\) produces timestamps that are \(\varepsilon\) seconds too old — its messages appear artificially staler, potentially crossing the TTL boundary and being discarded while causally later messages (with more accurate clocks) survive.

The consequence is causal inversion: effects arrive at the reconciling node without their causes, which the TTL filter has already discarded as stale.

Example: Node A (clock +500ms fast) detects a target at real time \(t = 0\) and stamps the detection \(E_1\) with physical time 500ms. Node B (accurate clock) receives \(E_1\), acts on it, and stamps “Target Neutralized” \(E_2\) with physical time 200ms. Node C, sorting by physical timestamp, orders \(E_2\) before \(E_1\) — effect before cause. The TTL filter exacerbates this: if the TTL for fire control events is tight, \(E_1\) may be discarded (500ms old) while \(E_2\) survives (200ms old), leaving the event log with a neutralization and no detection.

Definition 10 (HLC-Augmented Message Stamp with Dotted Version Vector). Each message \(m\) in the queue carries a compound causality stamp:

\[
\mathrm{stamp}(m) \;=\; \langle\, pt_m,\; hlc_m,\; dvv_m \,\rangle,
\]

where \(hlc_m\) is the Hybrid Logical Clock timestamp ( Definition 61 in Fleet Coherence Under Partition) and \(dvv_m\) is a Dotted Version Vector (DVV) — the set of dot pairs encoding every causal predecessor of \(m\). A dot \((i, n)\) means “I have seen all events from node \(i\) through sequence number \(n\).” The generating node assigns its own self-dot and inherits all dots from the causal predecessors it observed before generating \(m\) (following the dot-kernel model of Definition 73 in Fleet Coherence Under Partition).

Causal precedence: \(m_1 \prec m_2\) iff \(m_1\)’s self-dot \((i, n) \in dvv_{m_2}\). Under this relation, the reconciliation queue forms a partial order, not a linear sequence. Physical timestamps and HLC provide a total order for tie-breaking non-causally-related events; DVV is the ground truth for causally-related events.

At send time, \(pt_m\) is set to \(\max(\text{local physical clock},\, l_{\text{prev}})\), and the counter \(c\) is incremented if the logical component did not advance (HLC tie-break rule from Definition 61 ); the node appends its self-dot \((i, n_i)\) and inherits all dots from incoming messages that causally precede \(m\). Each stamp adds 10 bytes of overhead — 5 bytes HLC (4-byte microsecond timestamp + 1-byte counter) + 5 bytes DVV (1 origin ID byte + 4-byte sequence number) for single-dependency events; multi-predecessor DVVs add 5 bytes per additional ancestor.

Definition 11 (Clock Uncertainty Window). Given per-node maximum clock drift \(\varepsilon\) (seconds) relative to a reference, define the uncertainty window of message \(m\) as the interval:

\[
W(m) \;=\; [\,pt_m - \varepsilon,\; pt_m + \varepsilon\,].
\]

Two messages are uncertainty-concurrent iff their windows overlap:

\[
W(m_1) \cap W(m_2) \neq \emptyset \quad\Longleftrightarrow\quad |pt_{m_1} - pt_{m_2}| \leq 2\varepsilon.
\]
When two messages are uncertainty-concurrent, physical-time ordering is unreliable — either ordering is consistent with both clocks being within their drift bounds. The system must not act on the physical-time ordering alone; it must enter the Conflict Resolution Branch:

  1. Check DVV precedence: if \(m_1 \prec m_2\) (i.e., \(m_1\)’s self-dot \(\in dvv_{m_2}\)), the order is definitive — apply \(m_1\) first.
  2. Check HLC ordering: if DVV does not resolve the order (concurrent events, neither is an ancestor of the other), use \(hlc_m = (l, c)\) as the tiebreaker (lexicographic HLC comparison).
  3. If \(m_2\)’s DVV references a dot that is not yet in the local event log (a missing predecessor), hold \(m_2\) in a pending buffer until the predecessor arrives or the causal dependency times out.

Proposition 5 (Causal Anti-Inversion Guarantee). Under Definition 10 stamps and Definition 11 uncertainty windows, if event \(E_1\) (Target Detection) causally precedes event \(E_2\) (Target Neutralized) — i.e., \(E_1 \prec E_2\) — then at every node in the fleet, \(E_1\) is applied to state before \(E_2\), regardless of clock drift \(\varepsilon\).

Even with a 500 ms clock error at OUTPOST , causal vector tracking ensures “Target Detected” is always processed before “Target Neutralized” at every node.

Proof: Since \(E_1 \prec E_2\), the DVV check in the Conflict Resolution Branch immediately resolves the order as \(E_1\) before \(E_2\). Physical-time ordering is overridden. If \(E_1\) has not yet arrived when \(E_2\) is received, \(E_2\) is held in the pending buffer until \(E_1\) arrives (bounded by its TTL). The pending buffer prevents \(E_2\) from being applied with a missing causal predecessor. \(\square\)

Note on TTL interaction: if \(E_1\) expires (\(\Delta_{E_1} > T^{\mathrm{TTL}}\)) before it arrives, \(E_2\) is also discarded from the pending buffer — both events are stale. A neutralization without a detectable cause is operationally invalid, and discarding both is safer than applying \(E_2\) alone with an unresolvable causal gap.

Empirical status: The \(\varepsilon = 500\) ms drift bound is calibrated to OUTPOST ’s TCXO oscillator (50 ppm over 10 minutes); deployments using cheaper oscillators or longer partitions will see larger drift, requiring a proportionally wider uncertainty window and tighter TTL budget.

Three-Node OUTPOST Scenario: 500ms Drift

Nodes A (sensor, +500ms fast clock), B (actuator, accurate clock), C (command, accurate clock). Drift bound \(\varepsilon = 500\) ms. Fire control TTL \(T^{\mathrm{TTL}} = 4.6\) s.

| Time | Event | Physical stamp | HLC stamp | DVV dot | Notes |
|---|---|---|---|---|---|
| real 0ms | A detects target, emitting \(E_1\) | 500ms (A fast) | (500ms, 0) | (A, 1) | A’s clock is 500ms ahead |
| real 50ms | B receives \(E_1\); B’s HLC advances | — | (500ms, 0) | — | B inherits A’s HLC on receipt |
| real 200ms | B neutralizes, emitting \(E_2\) | 200ms (B accurate) | (500ms, 1) | {(A,1), (B,1)} | HLC = max(200ms, 500ms prev) + counter; DVV records causal dependency on \(E_1\) |
| real 600ms | C reconnects; receives \(E_2\) first, then \(E_1\) | — | — | — | Network may deliver in any order |

Without HLC+DVV (physical-time only): C sorts by physical timestamp and applies \(E_2\) (200ms) before \(E_1\) (500ms) — effect before cause; with a tight fire-control TTL, \(E_1\) may be discarded outright, leaving a neutralization with no detection.

With HLC+DVV ( Proposition 5 ):

  1. C receives \(E_2\); notes \(dvv_{E_2} = \{(A,1), (B,1)\}\); checks local log for dot (A, 1) — not present.
  2. \(E_2\) enters pending buffer; pending timer set to the fire control TTL (4.6 s).
  3. C receives \(E_1\); dot (A, 1) satisfies the pending dependency for \(E_2\).
  4. Conflict Resolution Branch: \(|500 - 200| = 300\,\text{ms} \leq 2\varepsilon = 1000\,\text{ms}\) — uncertainty-concurrent; physical order unreliable.
  5. DVV check: \((A, 1) \in dvv_{E_2}\); therefore \(E_1 \prec E_2\) definitively.
  6. HLC order confirms: \((500\,\text{ms}, 0) < (500\,\text{ms}, 1)\) — consistent.
  7. C applies \(E_1\) then \(E_2\): Target Detected then Target Neutralized. \(\square\)

The 500ms clock error on Node A is entirely absorbed by the HLC’s \(l = \max(l_{\text{local}}, l_{\text{received}})\) rule: B’s HLC advances to match A’s, making \(E_2\)’s HLC timestamp strictly greater than \(E_1\)’s. Physical-time inversion is impossible under this construction for any drift within the TTL budget — the tolerable bound is not 500ms but over two seconds (\(\varepsilon < T^{\mathrm{TTL}}/2 = 2.3\) s) for fire control events.

Watch out for: the guarantee requires the DVV dependency to be explicitly recorded at the time the causal event is generated; if a node emits \(E_2\) before receiving acknowledgment that \(E_1\) was applied — for instance, under a concurrent-write optimization that skips DVV population when the predecessor is “assumed delivered” — the pending buffer has no entry to wait on, and causal inversion is not prevented.
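The three-node scenario above can be simulated end to end. A minimal sketch of the HLC receive rule and the DVV pending buffer (class and function names are illustrative; the TTL timeout and HLC tie-break delivery path are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class HLC:
    l: int = 0  # logical physical time, ms
    c: int = 0  # counter

    def tick(self, physical_ms: int) -> tuple[int, int]:
        """Advance on a local event: l = max(l, physical); bump counter on tie."""
        if physical_ms > self.l:
            self.l, self.c = physical_ms, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def merge(self, remote: tuple[int, int], physical_ms: int) -> None:
        """Receive rule: absorb the larger logical time."""
        self.l = max(self.l, remote[0], physical_ms)

@dataclass
class Msg:
    name: str
    hlc: tuple[int, int]
    self_dot: tuple[str, int]
    deps: frozenset  # dots of causal predecessors

def deliver(log: set, pending: list, msg: Msg, apply_order: list) -> None:
    """Apply a message only when every causal predecessor dot is in the local log."""
    pending.append(msg)
    progress = True
    while progress:
        progress = False
        for m in list(pending):
            if m.deps <= log:          # all predecessors seen
                log.add(m.self_dot)
                apply_order.append(m.name)
                pending.remove(m)
                progress = True

# Scenario: A's clock is +500 ms fast; B is accurate.
a, b = HLC(), HLC()
e1 = Msg("detected", a.tick(500), ("A", 1), frozenset())       # real t=0, stamped 500 ms
b.merge(e1.hlc, physical_ms=50)                                 # B inherits A's HLC
e2 = Msg("neutralized", b.tick(200), ("B", 1), frozenset({("A", 1)}))

# Node C receives e2 BEFORE e1; the pending buffer restores causal order.
log, pending, order = set(), [], []
deliver(log, pending, e2, order)  # held: dot (A,1) missing from the log
deliver(log, pending, e1, order)  # releases e2
print(order)        # ['detected', 'neutralized']
print(e1.hlc, e2.hlc)
```

The HLC stamps reproduce the table: \(E_1\) gets (500, 0), \(E_2\) gets (500, 1) because B's tick at physical 200 ms cannot advance past the inherited 500 ms and bumps the counter instead.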

Architectural Response: Hierarchical Edge Tiers

Knowing when partition-first architecture wins is necessary but not sufficient. The inversion thesis requires a concrete structural response: layered autonomy, where each tier operates independently when partitioned and contributes to fleet objectives when connected. Tiers differ by compute capacity, connectivity probability, and decision authority — and the architecture is explicitly designed so that the tiers making safety-critical decisions are never connectivity-dependent.

Read the diagram carefully: dashed links (T0 \(\to\) T1) represent opportunistic sync — they may not exist. Solid links (T2 \(\to\) T3) represent local mesh — always available. The architecture guarantees mission continuity without any dashed link ever firing.

    
```mermaid
graph TB
    subgraph "Tier 0: Cloud/Regional"
        C1["Regional Command<br/>Full compute, persistent storage<br/>Global optimization"]
    end
    subgraph "Tier 1: Edge Gateway"
        G1["Gateway Alpha<br/>Local coordination<br/>Tier 2 aggregation"]
        G2["Gateway Beta<br/>Local coordination<br/>Tier 2 aggregation"]
    end
    subgraph "Tier 2: Edge Cluster"
        E1["Cluster Lead<br/>Intra-cluster consensus"]
        E2["Cluster Lead<br/>Intra-cluster consensus"]
        E3["Cluster Lead<br/>Intra-cluster consensus"]
    end
    subgraph "Tier 3: Edge Node"
        N1["Node"]
        N2["Node"]
        N3["Node"]
        N4["Node"]
        N5["Node"]
        N6["Node"]
    end
    C1 -.->|"Opportunistic sync"| G1
    C1 -.->|"Opportunistic sync"| G2
    G1 -->|"Cluster coordination"| E1
    G1 -->|"Cluster coordination"| E2
    G2 -->|"Cluster coordination"| E3
    E1 -->|"Local mesh"| N1
    E1 -->|"Local mesh"| N2
    E2 -->|"Local mesh"| N3
    E2 -->|"Local mesh"| N4
    E3 -->|"Local mesh"| N5
    E3 -->|"Local mesh"| N6
    style C1 fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style G1 fill:#fff3e0,stroke:#f57c00
    style G2 fill:#fff3e0,stroke:#f57c00
    style E1 fill:#e8f5e9,stroke:#388e3c
    style E2 fill:#e8f5e9,stroke:#388e3c
    style E3 fill:#e8f5e9,stroke:#388e3c
    style N1 fill:#fce4ec,stroke:#c2185b
    style N2 fill:#fce4ec,stroke:#c2185b
    style N3 fill:#fce4ec,stroke:#c2185b
    style N4 fill:#fce4ec,stroke:#c2185b
    style N5 fill:#fce4ec,stroke:#c2185b
    style N6 fill:#fce4ec,stroke:#c2185b
```

Each tier has a different partition survival requirement. T0 assumes connectivity. T3 must survive indefinitely without it — disconnection is its default operating condition, not a degraded state.

| Tier | Compute | Storage | Authority Scope | Partition Tolerance |
|---|---|---|---|---|
| T0 | Unlimited | Petabytes | Global policy, historical analysis | None required |
| T1 | High (GPU clusters) | Terabytes | Regional coordination, model updates | Hours to days |
| T2 | Moderate (edge servers) | Gigabytes | Cluster consensus, task allocation | Minutes to hours |
| T3 | Limited (embedded) | Megabytes | Local action, immediate response | Indefinite |

The design rule this forces: never place a safety-critical decision at a tier that requires higher-tier contact to execute it. T3 nodes handling immediate threat response, collision avoidance, or power management cannot wait for T2 coordination — and the architecture must guarantee they never have to.

Game-Theoretic Extension: Dynamic Coalition Formation Under Partition

The tier architecture pre-assigns nodes to fixed clusters. Partition breaks those assignments. When cluster communication is severed, nodes must form operating coalitions without centralized coordination — a hedonic coalition formation game where each node acts in its own interest and the architecture must guarantee the outcome is still collectively safe.

The Problem: Pre-assigned clusters may be split by a partition into subgroups that can no longer communicate with each other. Each subgroup must decide independently: stay as-is, merge with another reachable subgroup, or operate alone? Making that decision incorrectly wastes resources (too large a coalition) or risks missing MVS (too small).

The Solution: Model each node’s preferences formally and find the Nash-stable coalition size — the size where no node has an incentive to defect. The optimal size is a function of expected partition duration: short partition \(\to\) larger coalition; long partition \(\to\) smaller self-sufficient unit.

The Trade-off: Larger coalitions deliver more aggregate capability but accumulate more state divergence \(D(t)\) and create costlier reconciliation at reconnection. Smaller coalitions are cheap to reconcile but may fall below the MVS threshold and lose mission-critical function.

Model: Each node \(i\) has preferences over coalitions \(S \ni i\) based on three competing factors: the marginal capability the coalition contributes toward MVS, the intra-coalition coordination cost of keeping \(|S|\) members synchronized, and the expected reconciliation cost at reconnection, which grows with accumulated divergence \(D(t)\).

A Nash-stable partition is one where no node prefers to join a different coalition or operate alone: no \(i\) with current coalition \(S_i\) prefers any \(S' \neq S_i\) containing \(i\).

Optimal coalition size — the size \(k\) that maximizes the difference between MVS achievability and cumulative messaging cost scaled by expected partition duration:

\[
k^* \;=\; \arg\max_{k} \Bigl[\, P_{\text{MVS}}(k) \;-\; c_{\text{msg}}(k)\, \mathbb{E}[T_p] \,\Bigr]
\]

This formula requires \(\mathbb{E}[T_p]\) from the Weibull partition duration model; see the formal definition later in this article. The derivation here proceeds with \(\mathbb{E}[T_p]\) as a parameter; its computation is addressed when the model is introduced.

Physical translation: The optimal coalition size trades coordination cost against the cost of idle redundancy. Longer expected partitions favor smaller, self-sufficient units, because divergence — and thus reconciliation cost — accumulates for the entire partition: a CONVOY vehicle should cluster with 3–4 neighbors for a 5-minute tunnel pass, but only 1–2 for a 30-minute mountain blackout. The formula balances these costs at the crossover point.

Practical implication: When partition duration forecasts predict a short partition, preserve existing clusters. When they predict extended isolation, allow cluster fragmentation into smaller self-sufficient units. The formal fragmentation criterion: fragment if and only if a sub-coalition exists that still satisfies MVS for a majority of its members.
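The size optimization can be sketched numerically. The capability and messaging-cost curves below are hypothetical stand-ins (a logistic MVS curve and quadratic full-mesh gossip cost), not the article's calibrated functions; only the argmax structure is taken from the text:

```python
import math

def p_mvs(k: int, k_mvs: int = 3) -> float:
    """Probability a size-k coalition meets MVS (logistic stand-in)."""
    return 1.0 / (1.0 + math.exp(-(k - k_mvs)))

def msg_cost(k: int, c: float = 1e-4) -> float:
    """Per-second coordination cost, quadratic in size (full-mesh gossip)."""
    return c * k * (k - 1) / 2

def optimal_size(expected_partition_s: float, k_max: int = 10) -> int:
    """k* = argmax_k [ P_MVS(k) - msg_cost(k) * E[T_p] ]."""
    return max(range(1, k_max + 1),
               key=lambda k: p_mvs(k) - msg_cost(k) * expected_partition_s)

print(optimal_size(300.0))    # short tunnel pass: a larger coalition pays off
print(optimal_size(1800.0))   # extended blackout: coordination cost dominates
```

With these stand-in curves, a 5-minute partition supports a mid-size coalition while a 30-minute one collapses to operating alone, matching the fragmentation criterion's direction.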

Capability Level Transition: Multi-Objective Decision Problem

Every connectivity change forces a decision: which capability level should the node target next? Three objectives compete and cannot all be maximized simultaneously — this is the core tension in every capability transition.

Competing Objectives: The node selects the target level \(\mathcal{L}^*\) by jointly optimizing mission value delivered, stability (expected time before a forced downgrade), and resource feasibility at the current state.

The core tension: \(\mathcal{L}_4\) maximizes mission value but is the least stable — one connectivity fluctuation below \(C = 0.9\) forces an immediate downgrade. \(\mathcal{L}_0\) is maximally stable (no connectivity required) but delivers minimal mission value. The optimal level sits between these extremes, weighted by how volatile the current connectivity regime is.

Single-objective simplification (when stability dominates): Select the capability level that maximizes expected accumulated value over a planning horizon \(\tau\), given the current system state \(\Sigma_t\) (connectivity estimate, health vector, resource levels):

\[
\mathcal{L}^* \;=\; \arg\max_{\mathcal{L}} \; \mathbb{E}\!\left[\, \int_t^{t+\tau} V\!\bigl(\mathcal{L},\, C(s)\bigr)\, ds \;\middle|\; \Sigma_t \right],
\]

where \(V(\mathcal{L}, C)\) awards capability value only when connectivity supports it (\(C \geq C_{\min}(\mathcal{L})\)).

Physical translation: The integral accumulates value over the planning horizon only during intervals when \(C(s)\) stays above the level’s minimum threshold. A level that nominally delivers value 4 but requires \(C \geq 0.9\) in a regime where connectivity drops frequently will accumulate less expected value than a level that delivers value 2 but requires only \(C \geq 0.3\). The Weibull connectivity model ( Definition 13 , below) feeds the \(C(s)\) distribution directly into this integral.

Constraint Set: Three conditions gate every capability transition: \(g_1\) — connectivity meets the level’s minimum (\(C \geq C_{\min}(\mathcal{L})\)); \(g_2\) — resources meet the level’s minimum fraction (\(R \geq R_{\min}(\mathcal{L})\)); \(g_3\) — transitions are single-step (\(|\mathcal{L}_{\text{next}} - \mathcal{L}_{\text{cur}}| \leq 1\)). All three must hold simultaneously for an upgrade; any single violation triggers an automatic downgrade.

Physical translation for \(g_3\): Single-step transitions prevent jumping from \(\mathcal{L}_0\) (survival) to \(\mathcal{L}_4\) (full optimization) in one move. Each step requires the previous level to be stable first — this is the architectural enforcement of the prerequisite ordering from the Formal Foundations section.

Capability Thresholds: The minimum link quality and resource fraction required to enter each level. A deployment where \(P(C \geq 0.9)\) is small should not architect \(\mathcal{L}_4\) as a primary operating mode.

| Level | Min connectivity \(C\) | Min resources | Functions enabled |
|---|---|---|---|
| \(\mathcal{L}_0\) | 0.0 | 5% | Survival, distress beacon |
| \(\mathcal{L}_1\) | 0.0 | 20% | Core mission, local autonomy |
| \(\mathcal{L}_2\) | 0.3 | 40% | Cluster coordination, gossip |
| \(\mathcal{L}_3\) | 0.8 | 60% | Fleet integration, hierarchical sync |
| \(\mathcal{L}_4\) | 0.9 | 80% | Full capability, streaming |

State Transition Model: Upgrades are deliberate; downgrades are automatic. This asymmetry is intentional — the system never hesitates to shed capability when constraints are violated, but requires explicit satisfaction of all three gates before assuming higher capability.

Physical translation: (1) All gates pass, upgrade requested \(\to\) step up one level. (2) Any gate fails for the current level \(\to\) immediate single-step downgrade, never below \(\mathcal{L}_0\). (3) Gates pass, no upgrade requested \(\to\) hold. The downgrade branch fires without delay — there is no grace period when connectivity or resources fall below threshold.
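The asymmetric transition policy can be sketched against the threshold table. Gate names and function signatures below are assumptions for illustration; the thresholds are the table's values:

```python
# level: (min connectivity C, min resource fraction) from the capability table
THRESHOLDS = {
    0: (0.0, 0.05), 1: (0.0, 0.20), 2: (0.3, 0.40), 3: (0.8, 0.60), 4: (0.9, 0.80),
}

def gates_pass(level: int, c: float, resources: float) -> bool:
    """g1 (connectivity) and g2 (resources); g3 is enforced by single steps below."""
    c_min, r_min = THRESHOLDS[level]
    return c >= c_min and resources >= r_min

def next_level(current: int, c: float, resources: float, upgrade_requested: bool) -> int:
    if not gates_pass(current, c, resources):
        return max(current - 1, 0)   # automatic single-step downgrade, floor L0
    if upgrade_requested and current < 4 and gates_pass(current + 1, c, resources):
        return current + 1           # deliberate single-step upgrade (g3)
    return current                   # hold

# A truck enters the ore pass at L3; C collapses to 0 for two control ticks.
level = 3
for c in [0.0, 0.0]:
    level = next_level(level, c, resources=0.7, upgrade_requested=False)
print(level)  # 1 -- two automatic downgrades cascade to local autonomy
```

Note the cascade: a single \(C = 0\) reading fails the \(\mathcal{L}_3\) gate, and the next tick fails \(\mathcal{L}_2\)'s, so the node settles at \(\mathcal{L}_1\), which has no connectivity requirement.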

Commercial Application: AUTOHAULER Mining Fleet

The AUTOHAULER fleet is a commercial proof of the inversion thesis in an environment without jammers. Thirty-four autonomous haul trucks navigate an open-pit copper mine spanning 8 kilometers. The terrain — steep ramps, ore crusher canyons, underground ore passes — creates RF shadows and complete connectivity blackouts lasting 2–79 minutes per truck cycle, purely from physics. No adversary required. The environment itself exceeds \(\tau^*\).

The tier architecture maps directly to the mine’s operational structure. Dashed links from T0 to pit controllers carry only shift-level plans every 8 hours — safety-critical decisions never depend on them. Solid links from T2 haul road segments to individual trucks carry real-time collision avoidance and route commands — always available via local mesh.

    
```mermaid
graph TB
    subgraph "T0: Mine Operations Center"
        MOC["Operations Center<br/>Fleet scheduling, shift planning<br/>Global optimization"]
    end
    subgraph "T1: Pit Controllers"
        PC1["North Pit Controller<br/>Zone coordination<br/>12 trucks"]
        PC2["South Pit Controller<br/>Zone coordination<br/>14 trucks"]
        PC3["Processing Controller<br/>Crusher queue management<br/>8 trucks in queue"]
    end
    subgraph "T2: Haul Road Segments"
        HR1["Segment Alpha<br/>Ramp traffic control"]
        HR2["Segment Beta<br/>Ore pass queuing"]
        HR3["Segment Gamma<br/>Dump coordination"]
    end
    subgraph "T3: Autonomous Trucks"
        T1["Truck 01"]
        T2["Truck 02"]
        T3["Truck 03"]
        T4["Truck 04"]
    end
    MOC -.->|"Shift plans every 8h"| PC1
    MOC -.->|"Shift plans every 8h"| PC2
    MOC -.->|"Demand signal"| PC3
    PC1 -->|"Route assignment"| HR1
    PC2 -->|"Route assignment"| HR2
    PC3 -->|"Queue position"| HR3
    HR1 -->|"Local mesh"| T1
    HR1 -->|"Local mesh"| T2
    HR2 -->|"Local mesh"| T3
    HR3 -->|"Local mesh"| T4
    style MOC fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    style PC1 fill:#fff3e0,stroke:#f57c00
    style PC2 fill:#fff3e0,stroke:#f57c00
    style PC3 fill:#fff3e0,stroke:#f57c00
    style HR1 fill:#e8f5e9,stroke:#388e3c
    style HR2 fill:#e8f5e9,stroke:#388e3c
    style HR3 fill:#e8f5e9,stroke:#388e3c
    style T1 fill:#fce4ec,stroke:#c2185b
    style T2 fill:#fce4ec,stroke:#c2185b
    style T3 fill:#fce4ec,stroke:#c2185b
    style T4 fill:#fce4ec,stroke:#c2185b
```

T0 (Operations Center) handles shift-level planning — which trucks service which loading points, maintenance scheduling, production targets. 8-hour decision horizon; tolerates hours of disconnection from the pit.

T1 (Pit Controllers) manage zone-level coordination — balancing truck allocation as ore grades vary, responding to equipment breakdowns, adjusting for weather. 15-minute horizon; operate autonomously when satellite links drop.

T2 (Haul Road Segments) coordinate local traffic — managing passing bays, controlling single-lane ramp traffic, sequencing trucks at dump points. 5-minute horizon; handle T1 disconnection routinely.

T3 (Trucks) make immediate decisions — collision avoidance, obstacle response, speed regulation, emergency stops. Zero connectivity dependency. The tier with the most safety-critical decisions has the least connectivity requirement.

Each tier exposes three interfaces: a state interface to its parent (position, load, estimated completion), a command interface from its parent (route, speed limits, destination), and a peer interface for same-tier coordination (precedence negotiation, passing). When parent connectivity fails, the tier activates delegated authority — bounded decision rights that enable continued operation without escalation.

Connectivity Model — derived from terrain geometry and RF propagation under assumption set \(\mathcal{A}\):

| Location | Connectivity \(C\) | Derivation |
|---|---|---|
| Open pit benches | \(\geq 0.9\) | Line-of-sight to base station |
| Haul road switchbacks | \(\approx 0.7\) | |
| Ore pass tunnel | \(= 0\) | Complete RF occlusion (\(A_2\)) |
| Crusher queue | \(\approx 0.85\) | Partial occlusion from equipment |

Blackout duration: Given tunnel segment length \(L_t\) and truck speed \(v\), blackout duration \(T_b = L_t / v\). For typical \(L_t \approx 400\text{–}580\) m and \(v \approx 0.8\) m/s (loaded through a narrow ore pass): \(T_b \in [8, 12]\) minutes.

What this means: A truck entering an ore pass tunnel at \(\mathcal{L}_3\) (fleet integration active) drops to \(C = 0\) — immediately triggering a downgrade to \(\mathcal{L}_1\) (local autonomy). For 8–12 minutes it must handle collision avoidance, speed control, and ore pass queuing without any external input. This is not a failure mode — it is the designed operating condition.

The tier architecture ensures continuity through the blackout: trucks maintain \(\mathcal{L}_1\) capability while disconnected, Segment controllers maintain \(\mathcal{L}_2\) coordination, and reconciliation occurs at \(\mathcal{L}_3\) when connectivity restores.

Tier Transition Protocol — when a tier loses parent connectivity, it activates delegated authority in five steps:

  1. Detect — no acknowledgment from the parent within the configured detection timeout
  2. Broadcast — partition event to siblings with timestamp and last-known parent state
  3. Assume authority — inherit bounded decision rights from parent (T2 clusters make T1-level allocation decisions within pre-authorized bounds)
  4. Log — all delegated-authority decisions with causality chain for later merge
  5. Reconnect — exponential backoff

Maximum retry interval cap: \(T_{\text{retry}}(n) = \min\!\bigl(T_0 \cdot 2^{\,n},\, T_\text{cap}\bigr)\), where \(T_0\) is the initial retry interval and \(T_\text{cap} = 3600\,\text{s}\) (1 hour), beyond which the retry interval is held constant at \(T_\text{cap}\). At the Weibull P95 partition duration of approximately 27 hours, this limits the maximum inter-retry wait to 1 hour regardless of partition length.

Why Step 4 matters: Logging causality is what makes reconciliation cheap at reconnection. A truck that made 15 autonomous decisions during a 12-minute blackout produces a compact, ordered log that the T2 segment can replay and merge in seconds. Without it, reconciliation requires full state comparison — the \(D(t)\) divergence cost grows quadratically.
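The capped backoff of Step 5 is a one-liner. A sketch with an assumed 1-second initial interval (the article does not specify \(T_0\); the cap is the stated 3600 s):

```python
def retry_interval(attempt: int, t0: float = 1.0, t_cap: float = 3600.0) -> float:
    """Exponential backoff with a hard cap: min(t0 * 2^n, t_cap)."""
    return min(t0 * 2 ** attempt, t_cap)

waits = [retry_interval(n) for n in range(16)]
print(waits[:5])   # [1.0, 2.0, 4.0, 8.0, 16.0]
print(waits[-1])   # 3600.0 -- capped at one hour
```

With \(T_0 = 1\) s, the cap engages at attempt 12 and every later retry waits exactly one hour, whatever the partition length.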

Quantifying Edge-ness

The inversion thesis tells you whether a system needs edge architecture. The Edge-ness Score tells you how much — giving a single scalar that classifies any deployment and drives architectural regime selection.

The Edge-ness Score \(E \in [0,1]\) quantifies edge characteristics across four independently measurable dimensions:

\[
E \;=\; w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4, \qquad \sum_i w_i = 1.
\]

What each term measures: \(x_1\) — partition probability \(P(C=0)\); \(x_2\) — irreversibility of local decisions (fraction that cannot be undone after reconnection); \(x_3\) — timing mismatch between local decision deadlines and round-trip coordination latency; \(x_4\) — the adversarial fraction of connectivity loss.

Weight derivation — each weight \(w_i\) is proportional to the marginal impact of its dimension on system utility at the critical operating point:

\[
w_i \;=\; \frac{\left|\partial U / \partial x_i\right|}{\sum_j \left|\partial U / \partial x_j\right|}.
\]

Partition probability \(w_1 = 0.35\) dominates because utility is discontinuous at \(\tau^*\) — a small increase near the threshold causes a large utility collapse. Reversibility \(w_2 = 0.25\) and timing ratio \(w_3 = 0.25\) contribute equally — both affect decision quality linearly. Adversarial fraction \(w_4 = 0.15\) has lower weight because adversarial scenarios are a subset of partition scenarios already captured by \(w_1\). Adjust weights for your deployment via the same marginal-impact normalization.

Interpretation thresholds:

| \(E\) range | Regime | Architectural implication |
|---|---|---|
| \(E < 0.3\) | Cloud-viable | Cloud-native patterns work; edge patterns optional |
| \(0.3 \leq E < 0.6\) | Hybrid | Edge patterns mandatory for safety-critical paths; cloud usable for coordination |
| \(E \geq 0.6\) | Full-edge | Cloud-native patterns will fail; full edge architecture required |

CONVOY calculation — \(P(C=0) = 0.21\), with the remaining dimension scores taken from the EW threat model:

\(E = 0.77\) — firmly full-edge. Active EW forces autonomous operation at every tier; no cloud-first assumption survives contact with the threat model.

AUTOHAULER calculation — \(P(C=0) = 0.13\), with the remaining dimension scores taken from the mine’s operating profile:

\(E = 0.55\) — hybrid zone. Edge patterns are mandatory for collision avoidance and tunnel operations; pit-top coordination can still use reliable connectivity for non-safety functions.

GRIDEDGE calculation — \(P(C=0) = 0.16\), high irreversibility (relay trips are permanent until manual reset), large timing mismatch (500ms fault response vs. 30s SCADA polling), low adversarial fraction:

\(E = 0.48\) — hybrid zone, but driven by an unusual combination. Partition probability alone would classify it cloud-viable; irreversibility and timing mismatch override it. This is why all four dimensions must be evaluated — single-metric classification would misclassify GRIDEDGE .
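The score and regime lookup can be sketched in a few lines. The weights are the article's; the dimension scores passed in below are hypothetical inputs, not the calibrated CONVOY / AUTOHAULER / GRIDEDGE values:

```python
# Weights: partition probability, irreversibility, timing mismatch, adversarial.
WEIGHTS = (0.35, 0.25, 0.25, 0.15)

def edge_ness(partition_p: float, irreversibility: float,
              timing_mismatch: float, adversarial: float) -> float:
    """Weighted sum of the four normalized dimensions, each in [0, 1]."""
    dims = (partition_p, irreversibility, timing_mismatch, adversarial)
    assert all(0.0 <= d <= 1.0 for d in dims)
    return sum(w * d for w, d in zip(WEIGHTS, dims))

def regime(e: float) -> str:
    """Map a score to the interpretation-threshold regime."""
    if e < 0.3:
        return "cloud-viable"
    return "hybrid" if e < 0.6 else "full-edge"

# Hypothetical GRIDEDGE-like inputs: modest partitions, high irreversibility.
e = edge_ness(0.16, 0.9, 0.8, 0.1)
print(round(e, 2), regime(e))
```

This illustrates the GRIDEDGE point: a low \(x_1\) alone would read cloud-viable, but irreversibility and timing mismatch push the weighted sum into the hybrid band.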


Detailed positioning against fog computing, edge-cloud continuum, and multi-agent system paradigms is in the Reference section at the end of this article.

The inversion thesis establishes when edge autonomy outperforms cloud coordination. Formalizing how often and how long connectivity fails requires a stochastic model of connectivity states — the Connectivity Spectrum introduces this model and grounds the analysis in measurable parameters.

The Contested Connectivity Spectrum

Not all disconnection is equal. Reduced bandwidth demands different protocols than an adversary injecting false packets. We define four connectivity regimes — Degraded, Intermittent, Denied, Adversarial — each with distinct characteristics and required countermeasures:

| Regime | Characteristics | Example Scenario | Architectural Response |
|---|---|---|---|
| Degraded | Reduced bandwidth, elevated latency, increased packet loss | CONVOY in mountain terrain with intermittent line-of-sight | Prioritized sync, compressed protocols, delta encoding |
| Intermittent | Unpredictable connectivity windows, unknown duration | RAVEN beyond relay horizon, periodic satellite passes | Store-and-forward, opportunistic burst sync, prediction models |
| Denied | No connectivity for extended periods, possibly permanent | OUTPOST under sustained jamming, cable cut | Full autonomy, local decision authority, self-contained operation |
| Adversarial | Connectivity exists but is compromised or manipulated | Man-in-the-middle, replay attacks, GPS spoofing | Authenticated channels, Byzantine fault tolerance, trust verification |

Markov Model of Connectivity Transitions

The continuous connectivity state \(C(t) \in [0,1]\) ( Definition 5 ) is discretized into regimes via a state quantization mapping \(q: [0,1] \rightarrow S\) with thresholds \(\theta_N < \theta_I < \theta_D < \theta_F\).

For CONVOY , the thresholds are \(\theta_N = 0\), \(\theta_I = 0.1\), \(\theta_D = 0.3\), \(\theta_F = 0.8\) — calibrated from operational telemetry. Mesh connectivity below 10% is effectively Denied; below 30% it limits coordination; below 80% it prevents synchronized maneuvers.

Definition 12 (Connectivity Semi-Markov Process). Let \(S = \{\mathcal{C}, \mathcal{D}, \mathcal{I}, \mathcal{N}\}\) denote the connectivity regime space. The regime process \(\Xi(t)\) is modeled as a semi-Markov process with two components:

1. Embedded Markov chain \(P = (p_{ij})\), where \(p_{ij}\) is the probability of transitioning to regime \(j\) upon leaving regime \(i\), with \(p_{ii} = 0\). Derived from operational telemetry by normalizing each row of the rate matrix: \(p_{ij} = q_{ij}/\nu_i\), where \(\nu_i = \sum_{j \neq i} q_{ij}\) is the total exit rate from regime \(i\).

2. Sojourn distribution \(T_i \sim \mathrm{Weibull}(k_i, \lambda_i)\) in regime \(i\), where \(k_i > 0\) is the shape parameter and \(\lambda_i > 0\) is the scale parameter:

\[F_i(t) = 1 - \exp\!\left(-\left(t/\lambda_i\right)^{k_i}\right)\]

The stationary distribution of the semi-Markov process is:

\[\pi_i = \frac{\mu_i\,\mathbb{E}[T_i]}{\sum_{j \in S} \mu_j\,\mathbb{E}[T_j]}, \qquad \mathbb{E}[T_i] = \lambda_i\,\Gamma\!\left(1 + 1/k_i\right)\]

Plain English. Each regime’s share of total time equals its “frequency of visits” times its “average stay.” A regime that is visited rarely but lasts a long time can still dominate the stationary distribution. This is why Denied — though not the most common destination — accounts for 21% of CONVOY ’s operating time: partitions last much longer than transitions suggest.

where \(\mu\) satisfies \(\mu P = \mu\) with \(\sum_i \mu_i = 1\). Special case: when \(k_i = 1\) for all \(i\), each \(T_i\) is exponential with mean \(\lambda_i = 1/\nu_i\), \(\mathbb{E}[T_i] = 1/\nu_i\), and \(\pi\) solves \(\pi Q = 0\) exactly — the original continuous-time Markov chain is recovered.

We separate which regime comes next (governed by \(P\)) from how long we stay there (governed by Weibull). The original CTMC assumed both were memoryless exponentials; the semi-Markov process lets each regime have its own sojourn distribution.

Why Weibull, not exponential? Exponential sojourn times assume a constant hazard rate: the probability of recovery in the next minute is the same whether you’ve been Denied for 5 minutes or 5 hours. Operational data from tactical networks contradicts this. Denied periods show decreasing hazard rates — the longer a partition has lasted, the less likely recovery becomes in the next instant. Weibull with \(k < 1\) captures exactly this behavior [11] . The exponential model (\(k = 1\)) systematically underestimates how long the worst partitions last.
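The decreasing-hazard claim can be checked numerically. A minimal sketch in plain Python (the helper name `weibull_hazard` is hypothetical): with the CONVOY Denied-regime fit \(k = 0.62\), \(\lambda = 4.62\), the instantaneous recovery rate falls as the partition ages, while the exponential model holds it constant.

```python
import math

def weibull_hazard(t, k, lam):
    """Instantaneous recovery rate h(t) = (k/lam) * (t/lam)^(k-1) at partition age t (hours)."""
    return (k / lam) * (t / lam) ** (k - 1)

k, lam = 0.62, 4.62          # CONVOY Denied-regime fit (illustrative)
exp_rate = 1 / 6.67          # exponential model: constant hazard, same mean

for age in (0.5, 5.0, 50.0):
    print(f"age {age:5.1f} h  weibull hazard {weibull_hazard(age, k, lam):.4f}/h"
          f"  exponential hazard {exp_rate:.4f}/h")
```

The printed hazards fall monotonically with partition age for \(k < 1\), which is exactly the "the longer it has lasted, the less likely it ends now" behavior described above.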

For CONVOY , the rate matrix from operational telemetry (rows and columns ordered \(\mathcal{C}, \mathcal{D}, \mathcal{I}, \mathcal{N}\); entries in transitions per hour, matching the transition diagram below) is:

\[Q = \begin{pmatrix} -0.15 & 0.08 & 0.05 & 0.02 \\ 0.12 & -0.22 & 0.07 & 0.03 \\ 0.06 & 0.10 & -0.24 & 0.08 \\ 0.02 & 0.04 & 0.09 & -0.15 \end{pmatrix}\]

The embedded chain \(P\) is derived by row-normalizing the off-diagonal entries (\(p_{ij} = q_{ij}/\nu_i\)):

\[P \approx \begin{pmatrix} 0 & 0.533 & 0.333 & 0.133 \\ 0.545 & 0 & 0.318 & 0.136 \\ 0.250 & 0.417 & 0 & 0.333 \\ 0.133 & 0.267 & 0.600 & 0 \end{pmatrix}\]

Fitting Weibull parameters to partition telemetry from 120 missions (regimes \(\mathcal{C}\), \(\mathcal{D}\), \(\mathcal{I}\) retain \(k=1\) — their exponential fit is adequate; only the Denied regime shows a heavy tail requiring \(k < 1\)):

| Regime | \(k_i\) | \(\lambda_i\) (hr) | \(\mathbb{E}[T_i]\) (hr) | Sojourn model |
|---|---|---|---|---|
| \(\mathcal{C}\) | 1.00 | 6.67 | 6.67 | Exponential (\(k=1\)) |
| \(\mathcal{D}\) | 1.00 | 4.55 | 4.55 | Exponential (\(k=1\)) |
| \(\mathcal{I}\) | 1.00 | 4.17 | 4.17 | Exponential (\(k=1\)) |
| \(\mathcal{N}\) | 0.62 | 4.62 | 6.67 | Weibull heavy-tail |

The Weibull model is calibrated to preserve the CTMC mean sojourn times exactly. The Denied-regime scale parameter \(\lambda_\mathcal{N} = 4.62\) hr with \(k_\mathcal{N} = 0.62\) gives \(\mathbb{E}[T_\mathcal{N}] = 4.62\,\Gamma(1 + 1/0.62) \approx 6.67\) hr, matching the CTMC mean \(1/\nu_\mathcal{N} = 1/0.15 = 6.67\) hr exactly. The other three regimes use \(k=1\), so their means also equal the CTMC inverses.

Because all \(\mathbb{E}[T_i]\) match the CTMC values, the stationary distribution equals the CTMC solution of \(\pi Q = 0\) exactly: the semi-Markov formula reduces to the standard CTMC result when the embedded chain is derived from row-normalizing \(Q\).

For CONVOY , \(\pi = (\pi_\mathcal{C}, \pi_\mathcal{D}, \pi_\mathcal{I}, \pi_\mathcal{N}) = (0.32, 0.25, 0.22, 0.21)\) — the system spends only 32% of operating time in the Connected regime. Any architecture assuming full connectivity as baseline fails to match operational reality more than two-thirds of the time.

Why \(\pi\) matches yet the models differ. The stationary fractions are identical by calibration — but individual partition durations are not. Under Weibull (\(k_\mathcal{N} = 0.62\)), the coefficient of variation is CV = 1.69 versus CV = 1.00 for the exponential; the P95 extends from 20.0 hr (CTMC) to 27.1 hr (Weibull). An architecture sized for a 20-hour self-sufficiency window will fail to cover the 5% of actual partitions that last up to 27 hours. The CTMC systematically underestimates the tail.

| Tail metric | CTMC (\(k=1\)) | Weibull (\(k=0.62\)) | Underestimate |
|---|---|---|---|
| \(\mathbb{E}[T_\mathcal{N}]\) | 6.67 hr | 6.67 hr | 0% |
| SD\([T_\mathcal{N}]\) | 6.67 hr | 11.26 hr | \(-69\%\) |
| CV | 1.00 | 1.69 | — |
| P95 | 20.0 hr | 27.1 hr | \(-35\%\) |

Physical translation. The means match by calibration — but the tails diverge. An architecture sized for a 20-hour self-sufficiency window (CTMC P95) will fail to cover 5% of actual partitions that last up to 27 hours. The 7-hour gap is not a rounding error; it is the difference between a system that survives extended jamming and one that runs out of local state before connectivity returns.
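The P95 gap can be reproduced from the closed-form Weibull quantile \(t_p = \lambda(-\ln(1-p))^{1/k}\). A short sketch under the illustrative parameters above (`weibull_quantile` is a hypothetical helper name):

```python
import math

def weibull_quantile(p, k, lam):
    """t such that P(T <= t) = p for Weibull(shape k, scale lam)."""
    return lam * (-math.log(1 - p)) ** (1 / k)

mean = 6.67                          # calibrated mean sojourn in Denied, hours
p95_ctmc = mean * math.log(20)       # exponential P95: -mean * ln(0.05)
p95_weib = weibull_quantile(0.95, 0.62, 4.62)

print(f"CTMC P95    : {p95_ctmc:5.1f} h")   # ~20.0
print(f"Weibull P95 : {p95_weib:5.1f} h")   # ~27.1
```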

Regime Transition Rates and Recovery Paths (rates per hour, edge thickness = frequency, node size = stationary probability): The diagram shows all twelve regime-to-regime transition rates for CONVOY ; the key pattern is that recovery from Denied (N) flows first through Intermittent rather than directly back to Connected, while the Connected-to-Degraded edge carries the highest outbound rate.

    
    stateDiagram-v2
    direction LR

    C: Connected (pi=0.32)
    D: Degraded (pi=0.25)
    I: Intermittent (pi=0.22)
    N: Denied (pi=0.21)

    C --> D: 0.08/hr
    C --> I: 0.05/hr
    C --> N: 0.02/hr

    D --> C: 0.12/hr
    D --> I: 0.07/hr
    D --> N: 0.03/hr

    I --> C: 0.06/hr
    I --> D: 0.10/hr
    I --> N: 0.08/hr

    N --> C: 0.02/hr
    N --> D: 0.04/hr
    N --> I: 0.09/hr

    note right of C
        Capability: L4
        Full coordination
    end note

    note right of D
        Capability: L2
        Priority queuing
    end note

    note right of I
        Capability: L1
        Store-and-forward
    end note

    note right of N
        Capability: L0-L1
        Full autonomy
    end note

Interpreting the diagram: CONVOY transitions most frequently between adjacent states (Connected-Degraded, Degraded-Intermittent, Intermittent-Denied); direct jumps to Denied are rare (0.02-0.03/hr). Recovery from Denied follows a gradual path - the Denied-to-Intermittent rate (0.09/hr) exceeds Denied-to-Connected (0.02/hr). Partition recovery architectures must anticipate phased restoration, not instant full connectivity.

Partition Event Timeline:

A typical 8-hour CONVOY operation might experience the following connectivity pattern:

    
    gantt
    title CONVOY Connectivity Timeline (8-hour mission)
    dateFormat HH:mm
    axisFormat %H:%M
    tickInterval 1h

    section Connectivity
    Connected (L4)                :f1, 00:00, 01:30
    Degraded (L2)                  :d1, after f1, 45m
    Intermittent (L1)              :i1, after d1, 30m
    Denied - Partition (L0-L1)    :crit, n1, after i1, 75m
    Intermittent Recovery         :i2, after n1, 20m
    Degraded                      :d2, after i2, 40m
    Connected                     :f2, after d2, 60m
    Degraded                      :d3, after f2, 30m
    Denied - Jamming              :crit, n2, after d3, 45m
    Intermittent                  :i3, after n2, 25m
    Connected                     :f3, after i3, 20m

    section Authority
    Central Coordination          :active, a1, 00:00, 90m
    Delegated Authority           :a2, after a1, 75m
    Local Autonomy Active         :crit, a3, after a2, 75m
    Delegated Recovery            :a4, after a3, 60m
    Central Coordination          :active, a5, after a4, 60m
    Delegated Authority           :a6, after a5, 30m
    Local Autonomy Active         :crit, a7, after a6, 45m
    Delegated Recovery            :a8, after a7, 25m
    Central Coordination          :active, a9, after a8, 20m

    section State Sync
    Continuous Sync               :s1, 00:00, 90m
    Priority Sync                 :s2, after s1, 45m
    Buffering                     :s3, after s2, 30m
    Local State Only              :crit, s4, after s3, 75m
    Reconciliation                :done, r1, after s4, 20m
    Priority Sync                 :s5, after r1, 40m
    Continuous Sync               :s6, after s5, 60m
    Priority Sync                 :s7, after s6, 30m
    Local State Only              :crit, s8, after s7, 45m
    Reconciliation                :done, r2, after s8, 25m
    Continuous Sync               :s9, after r2, 20m

CONVOY experiences two partition events totaling 120 minutes (25% of mission time in Denied state). The architecture handles authority transitions, state buffering, and reconciliation — automatically, without human intervention.

Cognitive Map — Section 11. Connectivity regimes (Table) \(\to\) Semi-Markov model separating transition probabilities from sojourn durations ( Definition 12 ) \(\to\) Weibull captures heavy tails that exponential misses \(\to\) Stationary distribution shows CONVOY is Denied 21% of the time \(\to\) P95 underestimate (20 hr vs. 27 hr) drives the self-sufficiency design target.


Definition 13 (Weibull Partition Duration Model). The sojourn time \(T_\mathcal{N}\) of the Denied regime in Definition 12 is modeled as \(T_\mathcal{N} \sim \mathrm{Weibull}(k, \lambda)\) [Weibull, 1951] with shape \(k = 0.62\) and scale \(\lambda = 4.62\) hr. The expected partition duration, variance, and planning quantiles are:

\[\mathbb{E}[T_\mathcal{N}] = \lambda\,\Gamma(1 + 1/k), \qquad \mathrm{Var}[T_\mathcal{N}] = \lambda^2\left(\Gamma(1 + 2/k) - \Gamma(1 + 1/k)^2\right), \qquad t_p = \lambda\left(-\ln(1 - p)\right)^{1/k}\]

In particular, \(t_{0.95} = \lambda(\ln 20)^{1/k}\).

Physical translation: The 95th-percentile partition duration — how long 95 in 100 blackouts will last. With Weibull shape k < 1 (heavy-tailed, typical in contested environments), this value is far larger than the mean duration. Exponential models catastrophically underestimate it; the Weibull shape parameter k is the difference between “probably back in 5 minutes” and “plan for 2 hours.”

MCU implementation: \(\Gamma(1 + 1/k)\) and \(\Gamma(1 + 2/k)\) are pre-computed offline and stored in a static 8-entry look-up table (LUT) for \(k \in \{0.30, 0.40, \ldots, 1.00\}\); values between table entries are linearly interpolated. The \(t_{0.95}\) formula requires one pow() call on the constant \(\ln 20 \approx 2.996\) — the only floating-point primitive needed at runtime.

| \(k\) | \(\Gamma(1+1/k)\) | \(\Gamma(1+2/k)\) | CV |
|---|---|---|---|
| 0.30 | 9.260 | 2.59 \(\times 10^3\) | 5.41 |
| 0.40 | 3.323 | 120.0 | 3.14 |
| 0.50 | 2.000 | 24.00 | 2.24 |
| 0.60 | 1.505 | 9.261 | 1.76 |
| 0.70 | 1.266 | 5.029 | 1.46 |
| 0.80 | 1.133 | 3.323 | 1.26 |
| 0.90 | 1.052 | 2.479 | 1.11 |
| 1.00 | 1.000 | 2.000 (exact: \(\Gamma(3)=2\)) | 1.00 |

For \(k < 0.30\), \(\Gamma(1 + 2/k)\) grows rapidly; use the median \(t_{0.5} = \lambda(\ln 2)^{1/k}\) for mission planning rather than the mean.
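The LUT-plus-interpolation scheme can be expressed compactly. This illustrative Python version (function names are hypothetical; an MCU port would replace `math.gamma` with the offline-generated table) precomputes the 8-entry grid and evaluates the mean and P95, the latter with the single runtime pow():

```python
import math

# Pre-computed offline: Gamma(1 + 1/k) for k = 0.30 .. 1.00 in 0.10 steps
K_GRID = [0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]
G1_LUT = [math.gamma(1 + 1 / k) for k in K_GRID]

def gamma1_lut(k):
    """Linear interpolation in the 8-entry LUT (clamped at the grid ends)."""
    if k <= K_GRID[0]:
        return G1_LUT[0]
    if k >= K_GRID[-1]:
        return G1_LUT[-1]
    i = int((k - K_GRID[0]) / 0.10)
    frac = (k - K_GRID[i]) / 0.10
    return G1_LUT[i] + frac * (G1_LUT[i + 1] - G1_LUT[i])

LN20 = 2.995732273553991          # ln(20), stored as a constant

def mean_and_p95(k, lam):
    """Expected partition duration and P95; the quantile needs one pow()."""
    return lam * gamma1_lut(k), lam * LN20 ** (1 / k)
```

At \(k = 0.62\) the linear interpolation overestimates \(\Gamma(1+1/k)\) by roughly 1%, which is typically acceptable for buffer-sizing heuristics; a denser grid tightens it.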


Definition 14 (Adaptive Weibull Shape Parameter). The shape parameter \(k\) is not static; it is maintained by an EXP3-IX [Neu, 2015] multi-armed bandit ( Definition 81 ) with arms indexed over a discrete grid of candidate shape values (the LUT grid \(k \in \{0.30, 0.40, \ldots, 1.00\}\)). The environment score at partition end combines three normalized components:

\[r(t) = w_1 f_\text{mission}(t) + w_2 f_\text{stability}(t) + w_3 f_\text{efficiency}(t), \qquad w_1 + w_2 + w_3 = 1\]

where \(f_\text{mission}(t)\) is mission-progress score, \(f_\text{stability}(t)\) is link-stability score, and \(f_\text{efficiency}(t)\) is energy-efficiency score, each normalized to \([0,1]\). The per-arm reward signal at partition end is the normalized timing penalty:

\[r_k = -\max\!\left(0,\; \frac{T_\text{obs} - t_{0.95}(k)}{t_{0.95}(k)}\right)\]

where \(T_\text{obs}\) is the observed partition duration and \(t_{0.95}(k)\) is the P95 predicted under arm \(k\).

Physical translation: The reward function is a weighted scoreboard: mission progress, stability, and energy efficiency each contribute. Setting \(w_3 = 0.5\) for a power-constrained deployment tells the bandit that energy efficiency is worth as much as mission and stability combined — the system will voluntarily accept slower mission progress to avoid running out of power.

A partition shorter than \(t_{0.95}(k)\) gives \(r_k = 0\); one longer than expected gives a negative reward proportional to the normalized excess. Arms with smaller \(k\) (heavier tails) are penalized less by unexpectedly long partitions, so the bandit shifts \(k\) downward as the node accumulates evidence of heavy-tail behavior — mathematically bracing itself for longer, deeper denied periods.

Prior: heavy-tailed \(k \approx 0.6\) (matching the Definition 13 calibration) for tactical environments ( RAVEN , CONVOY , OUTPOST ); a lighter-tailed prior near \(k = 1\) for commercial environments ( AUTOHAULER , GRIDEDGE ). The bandit requires \(\approx 18\) partition events to converge; \(k\) is frozen at the prior during warm-up.
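A minimal EXP3-IX sketch, assuming the arm grid matches the LUT grid and rewards lie in \([-1, 0]\) (both assumptions, not prescribed by the text; the class name `Exp3IX` and its parameters are illustrative):

```python
import math, random

class Exp3IX:
    """Minimal EXP3-IX sketch over a discrete grid of Weibull shapes k.
    Rewards are assumed in [-1, 0] (timing penalty) and mapped to losses in [0, 1]."""
    def __init__(self, arms, eta=0.1, gamma=0.05):
        self.arms, self.eta, self.gamma = arms, eta, gamma
        self.loss_sum = [0.0] * len(arms)      # cumulative loss estimates

    def probs(self):
        # softmax over negative scaled losses, shifted for numerical stability
        m = min(self.loss_sum)
        w = [math.exp(-self.eta * (s - m)) for s in self.loss_sum]
        z = sum(w)
        return [x / z for x in w]

    def select(self):
        p = self.probs()
        r, acc = random.random(), 0.0
        for i, pi in enumerate(p):
            acc += pi
            if r <= acc:
                return i
        return len(p) - 1

    def update(self, i, reward):
        loss = -reward                          # reward in [-1, 0] -> loss in [0, 1]
        p = self.probs()[i]                     # sampling probability of the played arm
        self.loss_sum[i] += loss / (p + self.gamma)   # implicit-exploration estimator
```

Each partition end calls `select()` to pick the working shape \(k\) and `update()` with the realized timing penalty; during the \(\approx 18\)-event warm-up the node would return the prior arm instead of sampling.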


Definition 15 (Partition Duration Accumulator). The partition duration accumulator \(A_\mathcal{N}(t)\) tracks contiguous time in the Denied regime \(\mathcal{N}\), updated at each MAPE-K tick of length \(\Delta t\):

\[A_\mathcal{N}(t + \Delta t) = \begin{cases} A_\mathcal{N}(t) + \Delta t & \text{if } \Xi(t) = \mathcal{N} \\ 0 & \text{otherwise} \end{cases}\]

Reset condition: the accumulator returns to zero when \(\Xi\) transitions out of \(\mathcal{N}\) (partition ends). The accumulator is the input to both the time-varying anomaly threshold ( Proposition 9 in the self-measurement article) and the circuit breaker ( Proposition 37 ).

Reset rule. The accumulator runs only during partition (\(\Xi(t) = \mathcal{N}\)). It resets to zero on partition end (first successful round-trip to any peer). It does NOT reset on capability-level transitions; a node degrading from L3 to L1 accumulates continuously through all level changes until connectivity resumes. This ensures that Proposition 9 ’s time-varying threshold (Self-Measurement Without Central Observability) remains valid and monotonically tightening across mode changes during a single partition episode.
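The accumulator logic is small enough to show in full. A sketch (class name hypothetical) that accumulates through capability-level changes and resets only when the regime leaves Denied:

```python
DENIED = "N"

class PartitionAccumulator:
    """Contiguous time in the Denied regime; survives capability-level changes."""
    def __init__(self):
        self.hours = 0.0

    def tick(self, regime, dt_hours):
        """Call once per MAPE-K tick with the current regime."""
        if regime == DENIED:
            self.hours += dt_hours        # accumulates through L3 -> L1 etc.
        else:
            self.hours = 0.0              # reset on first successful round-trip

acc = PartitionAccumulator()
for regime in ["C", "N", "N", "N", "C"]:
    acc.tick(regime, 0.25)                # 15-minute ticks
print(acc.hours)   # 0.0 — partition ended, accumulator reset
```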


Definition 16 (Gilbert-Elliott Bursty Channel Model). A 2-state hidden Markov chain operating at packet timescale models bursty RF interference on each link \((i, j)\).

Channel states: \(G\) (Good) and \(B\) (Bad) with per-slot transition matrix:

\[\Pi = \begin{pmatrix} 1 - p_{GB} & p_{GB} \\ p_{BG} & 1 - p_{BG} \end{pmatrix}\]

where \(p_{GB}\) is the burst-onset rate and \(p_{BG}\) is the recovery rate.

Packet error rates: \(e_G \approx 0\) (near-zero in Good state) and \(e_B \approx 1\) (near-total loss in Bad state).

Connectivity signal: The regime-level connectivity metric \(C(t)\) is derived from the GE output via a sliding-window moving average over \(W\) packet slots:

\[C(t) = 1 - \frac{1}{W} \sum_{\tau = t - W + 1}^{t} \ell(\tau)\]

where \(\ell(\tau) \in \{0, 1\}\) is the packet loss indicator at slot \(\tau\). The GE model operates at millisecond-to-second packet timescale; \(C(t)\) smoothed over \(W\) feeds into the regime-transition process of Definition 12 , which operates at minutes-to-hours timescale. This two-timescale coupling preserves the semi-Markov structure of Definition 12 while capturing bursty loss patterns that a memoryless Markov chain cannot represent.

Physical translation: The Good state models a clear radio channel with occasional dropped packets; the Bad state models active jamming or severe multipath fading where almost nothing gets through. The transition rates between states are calibrated to the environment — RAVEN ’s contested airspace has more frequent Bad-state entry and shorter Good-state dwell than OUTPOST ’s static mesh.
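A quick simulation illustrates the two-timescale coupling: per-slot Gilbert-Elliott losses are smoothed over a \(W\)-slot window into \(C(t)\). Parameter values here are illustrative, not calibrated to any of the named systems:

```python
import random

def simulate_ge(n_slots, p_gb, p_bg, e_g=0.01, e_b=0.95, window=200, seed=1):
    """Gilbert-Elliott packet channel; returns the sliding-window
    connectivity signal C(t) = 1 - mean packet loss over `window` slots."""
    rng = random.Random(seed)
    state = "G"
    losses, c_trace = [], []
    for _ in range(n_slots):
        # per-slot hidden-state transition
        if state == "G" and rng.random() < p_gb:
            state = "B"          # burst onset
        elif state == "B" and rng.random() < p_bg:
            state = "G"          # recovery
        # packet loss indicator ell(tau) for this slot
        loss = rng.random() < (e_b if state == "B" else e_g)
        losses.append(1 if loss else 0)
        if len(losses) >= window:
            c_trace.append(1 - sum(losses[-window:]) / window)
    return c_trace

c = simulate_ge(5_000, p_gb=0.01, p_bg=0.10)
print(f"windows: {len(c)}, mean C(t): {sum(c) / len(c):.2f}")
```

Sweeping `p_gb` up and `p_bg` down reproduces the contested-airspace versus static-mesh contrast described above.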


Definition 17 (Spatial Jamming Correlation). Let \(\mathrm{Nb}(i)\) denote the set of neighbors of node \(i\) visible via gossip ( Definition 24 ). The neighborhood denial fraction at time \(t\) is:

\[f_N(t) = \frac{1}{|\mathrm{Nb}(i)|} \sum_{j \in \mathrm{Nb}(i)} \mathbb{1}\!\left[\Xi_j(t) = \mathcal{N}\right]\]

The spatial jamming correlation factor \(\rho \in [0, 1]\) modifies the transition rates of the embedded Markov chain in Definition 12 . For any node \(i\) currently in a connected or degraded state, the rate toward the Denied regime is amplified and the recovery rate is suppressed:

\[\tilde{q}_{i\mathcal{N}}(t) = q_{i\mathcal{N}}\left(1 + \rho f_N(t)\right), \qquad \tilde{q}_{\mathcal{N}i}(t) = \frac{q_{\mathcal{N}i}}{1 + \rho f_N(t)}\]

When \(\rho = 0\) the model reduces to the spatially independent case ( Definition 12 ); \(0 < \rho < 1\) models partial area denial; \(\rho = 1\) models full coordinated jamming where a node’s own survival probability is strongly coupled to its neighbors’ fates — with the whole neighborhood denied (\(f_N = 1\)), denial onset rate doubles and recovery rate halves.

The modified rates \(\tilde{q}\) replace the static \(q\) values in the embedded chain, making the semi-Markov process time-inhomogeneous. The Weibull sojourn distributions ( Definition 13 ) remain; only the jump probabilities change with \(f_N(t)\).


Proposition 6 (Mode-Switching Hysteresis). To prevent oscillation when the connectivity signal \(C(t)\) fluctuates at a regime boundary, each transition in the embedded chain of Definition 12 uses asymmetric Schmitt-trigger thresholds.

A CONVOY vehicle near the Connected/Degraded boundary will not thrash between architectural modes during a burst — the dead band absorbs the flicker and the refractory lock caps transition rate.

Let \(\theta_k\) be the nominal regime boundary between adjacent regimes \(k\) and \(k+1\), and let \(\delta_h\) be the hysteresis half-width. The transition rule is: a downward transition (toward the more degraded regime) fires only when \(C(t) < \theta_k - \delta_h\); the corresponding upward transition fires only when \(C(t) > \theta_k + \delta_h\).

Once a transition fires, the trigger is locked out for the refractory period ( Definition 46 ) before the opposite threshold becomes active. A signal flickering within the dead band produces at most one transition per refractory window.

Corollary: Under the Gilbert-Elliott model ( Definition 16 ), a burst of duration \(L_B\) packets produces a transient dip in \(C(t)\) of expected magnitude \(\approx \min(L_B, W)/W\). Setting \(\delta_h > \mathbb{E}[L_B]/W\) ensures that a single burst cannot cross the downward threshold unilaterally, eliminating spurious regime transitions during burst events.

Reasoning: The dead band absorbs transient \(C(t)\) dips without triggering architectural mode changes. The refractory lockout ( Definition 46 ) ensures that even a sustained boundary-straddling signal cannot produce unbounded transition chatter; combined, these two mechanisms bound the switching rate independently of jamming intensity.

The Schmitt-trigger hysteresis here operates on regime-level \(C(t)\) at minutes timescale; Definition 47 operates on sensor-level MAPE-K thresholds at seconds timescale. They are complementary, not redundant.
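A sketch of the Schmitt trigger with refractory lockout (class name and tick-based lockout are illustrative):

```python
class SchmittRegimeTrigger:
    """Hysteresis around one regime boundary theta with half-width delta_h
    and a refractory lockout (in ticks) after each transition."""
    def __init__(self, theta, delta_h, refractory):
        self.theta, self.delta_h, self.refractory = theta, delta_h, refractory
        self.above = True          # start on the connected side of the boundary
        self.lockout = 0

    def step(self, c):
        """Feed one C(t) sample; returns True if above the boundary."""
        if self.lockout > 0:
            self.lockout -= 1      # transitions suppressed during refractory window
            return self.above
        if self.above and c < self.theta - self.delta_h:
            self.above, self.lockout = False, self.refractory
        elif not self.above and c > self.theta + self.delta_h:
            self.above, self.lockout = True, self.refractory
        return self.above

trig = SchmittRegimeTrigger(theta=0.8, delta_h=0.08, refractory=5)
# a signal flickering inside the dead band [0.72, 0.88] never fires
flicker = [0.79, 0.81, 0.78, 0.82, 0.80]
states = [trig.step(c) for c in flicker]
print(states)   # [True, True, True, True, True]
```

A sustained excursion below 0.72 fires exactly one downward transition, after which the lockout absorbs further chatter for the refractory window.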

Empirical status: The hysteresis half-widths \(\delta_h = 0.08\) (illustrative value) and \(\delta_h = 0.05\) (illustrative value) and the burst fraction \(\pi_B = 0.12\) (illustrative value) are calibrated from CONVOY field measurements; deployments with different burst statistics or regime boundary positions require recalibration — set \(\delta_h \geq\) measured \(\pi_B\) at each boundary.

Watch out for: \(\delta_h\) is calibrated under the assumption that burst statistics are stationary; if an adversary shifts from random jamming to correlated burst patterns, the measured \(\pi_B\) from peacetime telemetry underestimates the effective burst fraction during attack, and transitions that the dead band should absorb will cross the downward threshold, producing mode-switching during the exact scenario the hysteresis was designed to prevent.


Proposition 7 (Architectural Regime Boundaries). Under stated assumptions, the stationary distribution \(\pi\) provides guidance for architectural choices:

CONVOY ’s measured connectivity distribution places it decisively past all three thresholds: centralized control is impractical, local authority is required, and opportunistic sync beats scheduled sync.

(i) Centralized coordination may become impractical when \(\pi_\mathcal{C} < 0.8\)

(ii) Local decision authority becomes beneficial when \(\pi_\mathcal{N} > 0.1\)

(iii) Opportunistic synchronization may outperform scheduled synchronization when \(\pi_\mathcal{D} + \pi_\mathcal{I} > 0.25\)

Reasoning: Boundary (i) follows from coordination message complexity analysis - centralized protocols require \(O(n)\) messages per decision, achievable only when coordinator reachability is high. Boundary (ii) follows from decision latency constraints - waiting for central authority when denial probability exceeds 10% increases expected decision delay. Boundary (iii) derives from sync window analysis - intermittent connectivity above 25% makes scheduled synchronization less reliable.

Uncertainty note: These boundaries are approximate. The actual transition points depend on specific system parameters (message complexity, latency tolerance, sync period). Use as heuristics, not hard rules. Systems near boundaries warrant empirical evaluation. The specific fractions (0.8 (theoretical bound), 0.1 (theoretical bound), 0.25 (theoretical bound)) are derived from CONVOY ’s message-complexity profile and may shift by \(\pm 0.10{-}0.15\) (illustrative value) for systems with different coordination overhead — calibrate from your own connectivity telemetry.

Corollary 1. CONVOY with \(\pi = (0.32, 0.25, 0.22, 0.21)\) (illustrative value) falls decisively in the contested edge regime: \(\pi_\mathcal{C} = 0.32 < 0.8\) precludes centralized coordination, and \(\pi_\mathcal{N} = 0.21 > 0.1\) mandates local decision authority.

Under the Weibull semi-Markov model ( Definition 12 ), \(\pi\) is computed via the stationary formula; for CONVOY with \(k_\mathcal{N} = 0.62\), \(\lambda_\mathcal{N} = 4.62\) (illustrative values) calibrated to preserve \(\mathbb{E}[T_i]\), the regime fractions are unchanged. However, the P95 self-sufficiency requirement extends from 20 hr (illustrative value) to 27.1 hr (theoretical bound under illustrative Weibull parameters) ( Definition 13 ) — this must be reflected in resource buffer sizing even though the boundary conditions themselves remain satisfied.

Watch out for: \(\pi\) is estimated as the long-run stationary distribution, which requires the connectivity process to be ergodic over the observation window; if the environment has a diurnal or seasonal cycle (e.g., terrain masking that varies with convoy route), a short observation period will yield a biased \(\pi\) that overstates one regime and understates another, placing the architecture in the wrong tier.
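The three boundary checks reduce to a few comparisons. This sketch hard-codes the CONVOY-derived thresholds as defaults and reads boundary (iii) as a test on the combined Degraded-plus-Intermittent fraction — an interpretation, so recalibrate per the uncertainty note above:

```python
def classify_architecture(pi_c, pi_d, pi_i, pi_n,
                          th_central=0.8, th_local=0.1, th_opportunistic=0.25):
    """Apply the Proposition 7 heuristics; thresholds are calibration-dependent."""
    findings = []
    if pi_c < th_central:
        findings.append("centralized coordination impractical")       # boundary (i)
    if pi_n > th_local:
        findings.append("local decision authority required")          # boundary (ii)
    if pi_d + pi_i > th_opportunistic:
        findings.append("opportunistic sync preferred")               # boundary (iii)
    return findings

# CONVOY's measured stationary distribution trips all three boundaries
print(classify_architecture(0.32, 0.25, 0.22, 0.21))
```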

Architectural Response: Fog Computing Layers

The connectivity spectrum suggests a natural architectural pattern: process data at the earliest viable point given current connectivity. This fog computing model [7] distributes computation along the data path, with each layer adapted to the connectivity regime it typically experiences.

Problem: Forwarding all raw sensor data to the cloud for processing requires a reliable uplink. When connectivity is Intermittent or Denied, unprocessed data queues until reconnection — creating dangerous decision lag precisely when decisions matter most.

Solution: Place computation as close to the data source as hardware allows. Each layer processes and reduces data before forwarding; higher layers receive structured insights, not raw streams, so they can function even if lower-layer feeds are delayed.

Trade-off: Each processing hop discards information. A fog node reducing 2.4 Gbps to 10 kbps preserves detection events but loses raw pixels. If the classifier was wrong, there is no recovery path. Fog processing trades reversibility for connectivity resilience — an explicit design choice, not an oversight.

    
    graph LR
    subgraph "Device Layer"
        D1["Sensor
Raw data generation"] D2["Actuator
Physical action"] end subgraph "Fog Layer" F1["Fog Node
Filtering, aggregation
Local inference"] F2["Fog Node
Filtering, aggregation
Local inference"] end subgraph "Edge Layer" E1["Edge Server
Complex inference
Multi-node correlation"] end subgraph "Cloud Layer" C1["Cloud
Training, archival
Global analytics"] end D1 -->|"100 kbps
raw"| F1 D2 <-->|"Commands"| F2 F1 -->|"10 kbps
filtered"| E1 F2 -->|"10 kbps
filtered"| E1 E1 -.->|"1 kbps
events"| C1 C1 -.->|"Model
updates"| E1 E1 -->|"Policy
updates"| F1 E1 -->|"Policy
updates"| F2 style D1 fill:#fce4ec,stroke:#c2185b style D2 fill:#fce4ec,stroke:#c2185b style F1 fill:#e8f5e9,stroke:#388e3c style F2 fill:#e8f5e9,stroke:#388e3c style E1 fill:#fff3e0,stroke:#f57c00 style C1 fill:#e3f2fd,stroke:#1976d2

Read the diagram carefully. Solid arrows carry data and commands that must always succeed; dashed arrows carry model updates that flow down when connectivity permits. The architecture is designed so every solid-arrow path works independently — cloud unavailability degrades precision but never stops operation.

The connection to the Markov model is direct: each layer operates in a different connectivity regime . The Device-to-Fog link typically experiences Full or Degraded connectivity (local mesh). The Fog-to-Edge link experiences Intermittent connectivity (cluster boundaries). The Edge-to-Cloud link experiences Denied or Adversarial regimes under contested conditions. The architecture matches processing capability to expected connectivity.

Data Reduction Cascade: Each layer applies transformations that reduce data volume while preserving decision-relevant information:

\[B_\text{out} = B_\text{raw} \prod_i r_i\]

where \(r_i < 1\) is the reduction ratio at layer \(i\). For RAVEN with \(r_\text{fog} = 0.1\) and \(r_\text{edge} = 0.1\):

\[B_\text{out} = 2.4\ \text{Gbps} \times 0.1 \times 0.1 = 24\ \text{Mbps}\]

A \(100\times\) reduction makes satellite backhaul feasible even during Degraded regime. But each reduction stage must preserve information sufficient for its downstream consumers. The fog layer discards raw pixels but preserves detection events. The edge layer discards individual detections but preserves track hypotheses.

Physical translation. Multiply all the reduction ratios together to find what fraction of raw data reaches the cloud. RAVEN ’s two-stage \(10\times\) reduction at fog and edge leaves 1% of source bandwidth — 24 Mbps from 2.4 Gbps — which fits a satellite uplink. Losing a layer means losing the decisions that depended on it: if the fog node fails, 2.4 Gbps arrives at the edge with no preprocessing budget to absorb it.

Fog processing pipeline:

  1. Validate: Check integrity, timestamp freshness, source authentication
  2. Filter: Apply domain filters ( RAVEN : motion detection, background subtraction, ROI extraction)
  3. Infer: Run lightweight classifiers, producing structured detections rather than raw imagery
  4. Aggregate: Combine across time windows, suppress duplicates, compute confidence
  5. Forward: Transmit based on novelty, confidence threshold, or heartbeat interval
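The five stages can be sketched as a single function. Everything here is illustrative: the frame schema, the `Detection` record, and the caller-supplied `classify` model stand in for the real pipeline components:

```python
from dataclasses import dataclass
import time

@dataclass
class Detection:
    label: str
    confidence: float
    t: float

def fog_pipeline(frames, classify, conf_threshold=0.6, max_age_s=5.0):
    """Sketch of the five-stage fog pipeline: validate -> filter -> infer
    -> aggregate -> forward. `classify` is a caller-supplied lightweight model."""
    now = time.time()
    detections = []
    for frame in frames:
        # 1. Validate: integrity + timestamp freshness (source auth omitted here)
        if frame.get("crc_ok") is not True or now - frame["t"] > max_age_s:
            continue
        # 2. Filter: drop frames with no motion
        if not frame.get("motion"):
            continue
        # 3. Infer: emit a structured detection, not raw pixels
        label, conf = classify(frame["pixels"])
        detections.append(Detection(label, conf, frame["t"]))
    # 4. Aggregate: keep the highest-confidence detection per label
    best = {}
    for d in detections:
        if d.label not in best or d.confidence > best[d.label].confidence:
            best[d.label] = d
    # 5. Forward: only confident events leave the fog node
    return [d for d in best.values() if d.confidence >= conf_threshold]
```

The output of `fog_pipeline` is what crosses the Fog-to-Edge link — structured detections at a fraction of the raw bandwidth, consistent with the reduction cascade above.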

Commercial Application: GRIDEDGE Power Distribution

GRIDEDGE manages a power distribution network at a scale that makes cloud-dependent architecture immediately untenable: 180,000 customers served by 847 transformers, 156 reclosers, 43 capacitor banks, and 12 substations. The 500 ms fault-isolation mandate is not a performance goal — it is a physical constraint imposed by upstream breaker trip times. Fog processing is the only architecture that meets it.

Power distribution faces a unique connectivity challenge: the very events that require coordination - storms, equipment failures, vegetation contact - are the same events that damage communication infrastructure. A storm that causes a line fault likely also damages the cellular tower serving that feeder.

The Markov connectivity model for GRIDEDGE captures this correlation. Compared to CONVOY , note the elevated Intermittent-to-Denied rate and the faster recovery from Denied, reflecting correlated storm-driven outages that are severe but finite in duration.

The elevated \(q_{\mathcal{I}\mathcal{N}}\) and \(q_{\mathcal{D}\mathcal{N}}\) rates reflect fault-communication correlation: grid disturbances that push connectivity from Full to Degraded or Intermittent frequently cascade to Denied as the underlying cause affects both systems.

Solving \(\pi Q = 0\) for GRIDEDGE yields the long-run fraction of time spent in each regime; the result below shows that GRIDEDGE is predominantly connected but cannot rely on that — one in five hours is in Denied state.

GRIDEDGE spends 46% of time in Full connectivity — substantially better than tactical environments, but still insufficient for cloud-dependent architecture.

What this tells you. GRIDEDGE looks connected on average. But the 19% Denied fraction coincides with the highest-consequence decisions: fault isolation, load shedding, and protective relay coordination. The fog architecture must be designed for the 19%, not sized for the 46%.

The fog computing architecture for GRIDEDGE implements hierarchical protection. The diagram below shows data and command flows between each layer; dashed arrows indicate SCADA polling that may be unavailable during a storm, while solid arrows carry protection commands that must always succeed.

    
    graph LR
    subgraph "Device Layer"
        S1["Smart Meter
Voltage, current
15-min intervals"] S2["Line Sensor
Fault detection
Sub-cycle response"] R1["Recloser
Fault isolation
60ms operation"] end subgraph "Fog Layer" F1["Feeder Controller
Protection coordination
Fault location"] F2["Feeder Controller
Protection coordination
Fault location"] end subgraph "Edge Layer" SUB["Substation
SCADA integration
Multi-feeder coordination"] end subgraph "Cloud Layer" RCC["Regional Control
Load forecasting
Outage management"] end S1 -->|"96 reads/day"| F1 S2 -->|"Events only"| F1 R1 <-->|"Trip/close
commands"| F1 F1 -->|"Feeder status"| SUB F2 -->|"Feeder status"| SUB SUB -.->|"SCADA
polling"| RCC RCC -.->|"Settings
updates"| SUB SUB -->|"Coordination
settings"| F1 SUB -->|"Coordination
settings"| F2 style S1 fill:#fce4ec,stroke:#c2185b style S2 fill:#fce4ec,stroke:#c2185b style R1 fill:#fce4ec,stroke:#c2185b style F1 fill:#e8f5e9,stroke:#388e3c style F2 fill:#e8f5e9,stroke:#388e3c style SUB fill:#fff3e0,stroke:#f57c00 style RCC fill:#e3f2fd,stroke:#1976d2

Device Layer sensors generate continuous telemetry but have minimal local intelligence. Smart meters report 15-minute interval data; line sensors report event-triggered fault signatures; reclosers execute protection logic but don’t coordinate independently.

Fog Layer feeder controllers implement the critical protection coordination. When a fault occurs, the feeder controller must:

  1. Detect fault location from sensor signatures (within 100ms)
  2. Determine isolation strategy - which switches to open (within 200ms)
  3. Coordinate with adjacent feeders to prevent upstream trips (within 300ms)
  4. Execute switching sequence (within 500ms total)

This 500ms budget is the survival constraint - slower response causes upstream breaker trips, expanding outages from tens to thousands of customers. The fog controller cannot wait for substation or regional center involvement.

Edge Layer substations coordinate multi-feeder response: if Feeder A trips, can Feeder B absorb transferred load? This decision requires 2-5 seconds and can tolerate intermittent fog-to-edge connectivity.

Cloud Layer (Regional Control Center) handles non-time-critical functions: outage reporting, crew dispatch, load forecasting, rate optimization. These tolerate minutes to hours of disconnection.

Data reduction through fog processing: A single 12kV feeder with 1,200 smart meters, 45 line sensors, and 8 reclosers generates approximately 11 MB/day of raw telemetry. The fog layer reduces this to 400 KB/day of processed events and status summaries - a \(27\times\) reduction.

Learning Transition Rates Online

Static estimates of \(Q\) are insufficient for systems that must adapt to changing environments. An anti-fragile system learns its connectivity dynamics online, updating estimates as new transitions are observed.

Define \(N_{ij}(t)\) as the count of observed transitions from state \(i\) to state \(j\) by time \(t\), and \(T_i(t)\) as total time spent in state \(i\). The maximum likelihood estimate of transition rates is:

\[\hat{q}_{ij}(t) = \frac{N_{ij}(t)}{T_i(t)}\]

But raw MLE is unstable with sparse observations. Placing a \(\mathrm{Gamma}(\alpha^0_{ij}, \beta^0_i)\) prior over each rate — parameterized by prior pseudo-count \(\alpha^0_{ij}\) and prior time \(\beta^0_i\) — and then updating with observed transition counts \(N_{ij}\) and dwell time \(T_i\) yields a posterior that shrinks toward the prior when data are sparse:

\[q_{ij} \mid \text{data} \sim \mathrm{Gamma}\!\left(\alpha^0_{ij} + N_{ij},\; \beta^0_i + T_i\right), \qquad \mathbb{E}\left[q_{ij} \mid \text{data}\right] = \frac{\alpha^0_{ij} + N_{ij}}{\beta^0_i + T_i}\]

The prior hyperparameters \(\alpha^0, \beta^0\) encode baseline expectations from similar environments. The posterior concentrates around observed rates as data accumulates.
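The conjugate update is one line of arithmetic. A sketch with hypothetical prior values — a Gamma prior encoding \(\alpha^0\) pseudo-transitions observed over \(\beta^0\) pseudo-hours in similar deployments — showing shrinkage under sparse data and convergence toward the MLE as dwell time accumulates:

```python
def posterior_rate(n_ij, t_i, alpha0=2.0, beta0=10.0):
    """Posterior mean and standard deviation of q_ij under a
    Gamma(alpha0, beta0) prior (pseudo-counts over pseudo-hours)."""
    alpha, beta = alpha0 + n_ij, beta0 + t_i
    mean = alpha / beta                    # posterior mean rate (per hour)
    sd = mean / alpha ** 0.5               # Gamma sd = sqrt(alpha) / beta
    return mean, sd

# sparse data: the estimate shrinks toward the prior mean 0.2/hr
print(posterior_rate(n_ij=1, t_i=3.0))
# abundant data: the estimate tracks the MLE n/t with a narrow posterior
print(posterior_rate(n_ij=50, t_i=200.0))
```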

This is where models meet their limits. The Bayesian update assumes transitions are Markovian - future connectivity depends only on current state, not history. Real adversaries learn and adapt. A jamming system that observes CONVOY ’s movement patterns may change its transition rates to maximize disruption. The model provides a useful baseline, but engineering judgment must recognize when adversarial adaptation has invalidated the model’s assumptions.

Semi-Markov Extension for Realistic Dwell Times

The basic CTMC assumes exponentially distributed dwell times in each state. Operational data often shows non-exponential patterns - jamming may have a characteristic duration, or network recovery may follow a heavy-tailed distribution.

The semi-Markov extension replaces exponential dwell times with general distributions \(F_i(t)\) for each state \(i\), giving the semi-Markov kernel:

\[K_{ij}(t) = \Pr\left[\Xi_{n+1} = j,\; T_n \le t \mid \Xi_n = i\right] = p_{ij}\,F_i(t)\]

For CONVOY , operational telemetry suggests: the Connected regime follows an Exponential distribution with rate \(0.15\)/hour (memoryless), the Degraded regime follows a Log-normal with \(\mu = 0.5\), \(\sigma = 0.8\) (terrain-dependent), the Intermittent regime follows a Weibull with \(k = 1.5\), \(\lambda = 2.0\) (jamming burst patterns), and the Denied regime follows a Pareto with \(\alpha = 1.2\), \(x_m = 0.5\) (heavy-tailed adversarial denial; empirically calibrated to adversarial denial durations measured in RAVEN red-team exercises; the heavy tail captures coordinated jamming events where denial periods cluster at multiples of base duration).

Note: The Pareto model here is an illustrative fit for adversarially-constrained denial scenarios (extended jamming). The authoritative analytical model for CONVOY ’s Denied regime is Weibull(\(k=0.62\), \(\lambda=4.62\)) per Definition 13 . The Pareto and bimodal-mixture models in this section serve as sensitivity examples for heavy-tailed conditions; they do not replace Definition 13 as the canonical representation.

The semi-Markov stationary distribution weights each state by how long the system actually stays there, not just how often it visits: a state visited rarely but for long periods gets more probability mass than a state visited often but briefly.

\[\pi_i = \frac{\mu_i\,\mathbb{E}[T_i]}{\sum_j \mu_j\,\mathbb{E}[T_j]}\]

where \(\mu\) is the embedded Markov chain stationary distribution and \(\mathbb{E}[T_i]\) is the mean sojourn time in state \(i\).
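The weighting formula can be verified against the CONVOY numbers. This pure-Python sketch derives the embedded chain from \(Q\), finds its stationary distribution by power iteration, and weights by mean sojourn times; with CTMC-consistent means it recovers the familiar \(\pi\):

```python
def semi_markov_stationary(Q, mean_sojourn):
    """pi_i proportional to mu_i * E[T_i], where mu is the stationary
    distribution of the embedded jump chain (power iteration, pure Python)."""
    n = len(Q)
    exit_rates = [sum(Q[i][j] for j in range(n) if j != i) for i in range(n)]
    P = [[(Q[i][j] / exit_rates[i]) if j != i else 0.0 for j in range(n)]
         for i in range(n)]
    mu = [1.0 / n] * n
    for _ in range(10_000):                               # iterate mu P = mu
        mu = [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]
    weights = [mu[i] * mean_sojourn[i] for i in range(n)]
    z = sum(weights)
    return [w / z for w in weights]

# CONVOY generator (rows/cols: C, D, I, N) and CTMC mean sojourns 1/exit-rate
Q = [[-0.15,  0.08,  0.05,  0.02],
     [ 0.12, -0.22,  0.07,  0.03],
     [ 0.06,  0.10, -0.24,  0.08],
     [ 0.02,  0.04,  0.09, -0.15]]
pi = semi_markov_stationary(Q, [1 / 0.15, 1 / 0.22, 1 / 0.24, 1 / 0.15])
print([round(p, 2) for p in pi])   # ~[0.32, 0.25, 0.22, 0.21]
```

Swapping in the heavier-tailed mean sojourns from the Pareto or log-normal fits changes only the `mean_sojourn` vector, not the embedded chain — exactly the separation the semi-Markov model provides.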

Adversarial Adaptation Detection

When an adversary adapts to our connectivity patterns, the transition rates become non-stationary. We detect this through change-point analysis on the rate estimates.

Define the CUSUM statistic for detecting an increase in the estimated transition rate \(\hat\lambda_t\) above the baseline rate \(\lambda_0\):

\[ S_t = \max\!\left(0,\; S_{t-1} + \left(\hat\lambda_t - \lambda_0\right) - \frac{\delta}{2}\right), \qquad S_0 = 0 \]

where \(\delta\) is the minimum detectable shift. An alarm triggers when \(S_t > h\) for threshold \(h\).
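A minimal CUSUM sketch over a stream of rate estimates; the \(\lambda_0\), \(\delta\), and \(h\) values are assumed tuning parameters, not the article's:

```python
# CUSUM change-point sketch for detecting an upward shift in the
# estimated transition rate into Denied. lambda0, delta, h are assumed.
def cusum_alarm(rate_estimates, lambda0=0.5, delta=0.3, h=2.0):
    """Return the first index where S_t exceeds h, or None if no alarm."""
    s = 0.0
    for t, lam_hat in enumerate(rate_estimates):
        # Accumulate evidence of a shift larger than delta/2 above baseline.
        s = max(0.0, s + (lam_hat - lambda0 - delta / 2))
        if s > h:
            return t
    return None

baseline = [0.5, 0.6, 0.4, 0.55]            # consistent with lambda0
jammed = baseline + [1.4, 1.5, 1.6, 1.5]    # adversary raises the rate
print(cusum_alarm(baseline), cusum_alarm(jammed))
```

The statistic resets to zero under baseline behavior and accumulates quickly once the rate shifts, trading detection delay against false-alarm rate via \(h\).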

Adversarial indicators (any triggers investigation; thresholds are configurable):

  1. Transition rates into the Denied state increase significantly from baseline (e.g., >50%)
  2. Dwell time in the Connected state decreases significantly (e.g., >30%)
  3. Correlation between own actions and subsequent transitions is positive and significant
  4. Recovery times from Denied state follow bimodal distribution (adversary sometimes releases, sometimes persists)

When adversarial adaptation is detected, the CTMC's core stationarity assumption breaks down:

Structural inconsistency with the adversarial game model. The CTMC formulation above treats the generator matrix \(Q\) as stationary — the transition rates \(q_{ij}\) are fixed properties of the environment, and the stationary distribution \(\pi\) is well-defined.

This assumption is incompatible with an adaptive adversary. In the adversarial Markov game ( Definition 80 in the anti-fragile decision-making article), the adversary’s strategy \(\sigma_A\) controls exactly these rates. Under an adaptive adversary, \(Q\) is a function of the joint defender-adversary policy: \(Q = Q(\sigma_D, \sigma_A)\). The stationary distribution \(\pi_N \approx 0.17\) derived from the CTMC is therefore invalid when an adversary is present — the system never reaches stationarity.

Correct interpretation: The CTMC model applies in non-adversarial partitioned environments (physical obstacles, atmospheric conditions, hardware faults). For adversarially contested environments, the CTMC provides an optimistic baseline that bounds performance under no adversary; actual performance under an adaptive adversary requires the game-theoretic analysis in the anti-fragile decision-making article.

The adversarial indicators above are the operational bridge: they signal when to switch from the CTMC regime assumption to the adversarial game regime.

Cognitive Map — Section 12. Connectivity spectrum \(\to\) fog computing places computation at the earliest viable layer \(\to\) four-layer architecture with \(100\times\) bandwidth reduction \(\to\) GRIDEDGE shows commercial fog: 500 ms budget forces fog-layer protection logic \(\to\) stationary distribution (19% Denied) drives design to the tail, not the mean \(\to\) Bayesian online learning updates Q as conditions change \(\to\) adversarial adaptation detection signals when to switch from CTMC to game-theoretic analysis.


Illustrative Connectivity Profiles

The connectivity analysis so far has used CONVOY as the worked example. These profiles extend the same framework to commercial environments — confirming that the inversion thesis is not a tactical anomaly, and extracting design rules that apply across environments.

Representative Parameterizations by Environment

Methodological note: These profiles are illustrative examples showing plausible parameter ranges. Actual deployments would derive parameters from operational data.

Industrial IoT: Connectivity ranges from near-cloud (clean rooms) to contested-edge (underground mining), driven by EMI, physical obstacles, and environmental extremes.

Drone Operations: Terrain dominates - flat terrain yields the highest availability, while mountainous terrain degrades it sharply. Combined adverse conditions approach tactical contested levels.

Connected Vehicles: Urban dense deployments achieve high availability, but tunnels create deterministic denied states. Mountain passes and urban canyons degrade reliability despite cellular coverage.

Latency Distribution Analysis

Beyond the connectivity state itself, the latency distribution within each regime determines operational viability. A raw “connectivity” percentage hides the tail: p99 latency in Degraded is already \(10\times\) the median, and Intermittent delivers occasional latencies \(25\times\) higher than p95. Designing for median or p95 latency fails for any capability where tail events matter.

Representative Latency Distributions by Regime: The diagram shows how median latency grows from 12 ms in full connectivity to unbounded in the Denied regime; the critical pattern to observe is the extreme tail growth — p99 is already \(10\times\) the median in Degraded, making tail-based design mandatory.

    
    graph LR
    subgraph "Full Connectivity"
        F_LAT["Median: 12ms
p95: 45ms
p99: 120ms"]
    end
    subgraph "Degraded"
        D_LAT["Median: 180ms
p95: 850ms
p99: 2.4s"]
    end
    subgraph "Intermittent"
        I_LAT["Median: 3.2s
p95: 18s
p99: 45s
(when connected)"]
    end
    subgraph "Denied"
        N_LAT["Latency: unbounded
Queue until
reconnection"]
    end
    F_LAT --> D_LAT --> I_LAT --> N_LAT
    style F_LAT fill:#c8e6c9,stroke:#388e3c
    style D_LAT fill:#fff3e0,stroke:#f57c00
    style I_LAT fill:#ffcdd2,stroke:#c62828
    style N_LAT fill:#e0e0e0,stroke:#757575

Statistical characterization: Full connectivity typically follows log-normal distribution. Degraded follows gamma. Intermittent exhibits heavy-tailed Pareto — occasional latencies orders of magnitude higher than median. Designing for p95 latency fails on the tail.

Latency-Capability Mapping:

Different capabilities have different latency tolerance. We define the viability threshold \(\tau_c\) for capability \(c\): the capability is viable in regime \(\Xi\) only if

\[ P(\text{latency} \le \tau_c \mid \Xi) \ge 0.95 \]

Physical translation. Capability \(c\) is viable in a regime only if at least 95% of requests in that regime complete within threshold \(\tau_c\). A synchronized coordination system with \(\tau_c = 500\) ms is not viable in Intermittent — not because you’re disconnected, but because more than 5% of requests take longer than 500 ms even when connected. The capability table below shows exactly where each function loses viability.

The table below applies the viability condition to five capability types: a “Yes” entry means the p95 latency in that regime falls within the capability’s threshold \(\tau_c\), making it viable there; “No” means latency exceeds the threshold at least 5% of the time.

| Capability | \(\tau_c\) | Full | Degraded | Intermittent | Denied |
| --- | --- | --- | --- | --- | --- |
| Real-time video streaming | 100 ms | Yes | No | No | No |
| Synchronized coordination | 500 ms | Yes | Yes | No | No |
| State reconciliation | 5 s | Yes | Yes | Yes | No |
| Opportunistic sync | 60 s | Yes | Yes | Yes | No |
| Store-and-forward | \(\infty\) | Yes | Yes | Yes | Yes |

This matrix determines which capabilities can be offered in each regime. An architecture that assumes synchronized coordination (500 ms threshold) will fail most of the time in the Intermittent regime - not because connectivity is zero, but because latency exceeds the viability threshold (a p95 latency of 18 s in the illustrative Intermittent distribution far exceeds 500 ms).
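The viability condition reduces to a percentile check. A sketch over a synthetic latency sample (not the article's measured distributions):

```python
# Viability threshold check: capability c is viable in a regime iff at
# least 95% of requests there complete within tau_c. The latency sample
# below is synthetic, chosen only to show the heavy-tail effect.
def viable(latencies_ms, tau_c_ms, quantile=0.95):
    xs = sorted(latencies_ms)
    idx = min(len(xs) - 1, int(quantile * len(xs)))
    return xs[idx] <= tau_c_ms

degraded_sample = [120, 150, 180, 200, 240, 300, 420, 600, 850, 2400]  # heavy tail
print(viable(degraded_sample, 5000))  # state reconciliation (5 s)
print(viable(degraded_sample, 500))   # synchronized coordination (500 ms)
```

The median of the sample (270 ms) passes the 500 ms threshold easily; the tail is what kills viability, which is the point of the matrix.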

Probabilistic Partition Models

The Markov model predicts long-run behavior but doesn’t answer operational questions: What is the probability of a partition lasting more than 1 hour? If we’re currently in Degraded state, how long until we likely enter Denied?

First Passage Time Analysis:

The first passage time from state \(i\) to state \(j\) has distribution determined by the generator matrix \(Q\). For the absorbing case (time to first reach Denied), restrict \(Q\) to the transient states to form the sub-generator \(Q_T\); the vector of mean first passage times is:

\[ \mathbf{m} = -Q_T^{-1} \mathbf{1} \]

For CONVOY's illustrative generator parameters:

| Starting State | Mean Time to Denied | Std Dev | p95 |
| --- | --- | --- | --- |
| Full | 8.2 hours | 6.4 hours | 19.1 hours |
| Degraded | 5.1 hours | 4.8 hours | 13.2 hours |
| Intermittent | 2.8 hours | 3.1 hours | 8.4 hours |
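Mean first-passage times follow from solving a small linear system on the transient sub-generator. A sketch with a hypothetical 4-state generator (per-hour rates; the resulting numbers will not match CONVOY's table):

```python
import numpy as np

# Mean first-passage time to Denied: make Denied absorbing, restrict the
# generator to transient states, and solve m = -Q_T^{-1} 1.
# Q below is a hypothetical generator, not CONVOY's calibrated one.
Q = np.array([
    [-0.30,  0.20,  0.08,  0.02],   # Full
    [ 0.25, -0.50,  0.18,  0.07],   # Degraded
    [ 0.10,  0.30, -0.55,  0.15],   # Intermittent
    [ 0.05,  0.10,  0.25, -0.40],   # Denied
])
QT = Q[:3, :3]                        # transient block: Full, Degraded, Intermittent
m = np.linalg.solve(-QT, np.ones(3))  # mean hours until first entry into Denied
print(dict(zip(["Full", "Degraded", "Intermittent"], m.round(2))))
```

The ordering (Full waits longest, Intermittent shortest) matches the qualitative pattern in the table: states closer to Denied reach it sooner.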

Partition Duration Distribution:

How long do partitions (Denied state) last? The survival function \(P(T_N > t)\), the probability that a partition lasts longer than \(t\) hours, is modeled as a mixture of two exponentials with rates \(\mu_1\) (short bursts, 70% of partitions) and \(\mu_2\) (extended outages, 30% of partitions):

\[ P(T_N > t) = 0.7\, e^{-\mu_1 t} + 0.3\, e^{-\mu_2 t} \]

The bimodal mixture captures two partition types: short partitions (70%) average 29 minutes from terrain shadowing or temporary interference, while long partitions (30%) average 6.7 hours from equipment failure or extended RF denial.

Physical translation. Short partitions are terrain-driven transients — handle with store-and-forward buffering. Long partitions are equipment failures or extended RF denial — they require full local decision authority. Designing only for the short partition type leaves the system without a plan for 30% of events. The mixture model is the mathematical statement that one architecture cannot serve both; you need layers, with each layer sized for the partition type it must survive.

The two partition types require two different architectures. Store-and-forward handles brief interruptions; full local decision authority is required for extended autonomy.
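The mixture survival function is easy to evaluate directly. A sketch with the rates backed out from the stated means (29 minutes and 6.7 hours; the 0.7/0.3 weights are from the text):

```python
import math

# Survival function for partition duration as a two-component
# exponential mixture. Rates derive from the article's means:
# short partitions average ~29 min, long partitions ~6.7 h.
W_SHORT, W_LONG = 0.7, 0.3
MU1 = 1 / (29 / 60)   # ~2.07 per hour (short bursts)
MU2 = 1 / 6.7         # ~0.15 per hour (extended outages)

def p_partition_longer_than(t_hours: float) -> float:
    return W_SHORT * math.exp(-MU1 * t_hours) + W_LONG * math.exp(-MU2 * t_hours)

# Probability a partition outlasts 1 hour: dominated by the long mode.
print(round(p_partition_longer_than(1.0), 3))
```

At the 1-hour mark the short mode has almost entirely decayed; nearly all remaining probability mass belongs to the long-outage mode, which is why extended autonomy, not just buffering, must be designed in.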

Conditional Partition Probability: Longer dwell in degraded states increases partition probability. Semi-Markov models capture this non-Markovian behavior. When degraded dwell time exceeds a threshold, proactive measures — state sync, authority delegation — reduce reconciliation cost at reconnection.

Module Placement Strategies

Placement Optimization Formulation:

Let \(M\) be the set of modules and \(L\) be the set of placement locations (device, fog, edge, cloud). The binary decision variable \(x_{ml} = 1\) if module \(m\) is assigned to location \(l\), and 0 otherwise. The objective minimizes total expected latency across all modules:

\[ \min \sum_{m \in M} \sum_{l \in L} x_{ml} \, E[\text{latency}(m,l)] \]

The three constraints below enforce that each module is placed exactly once, and that CPU and memory capacity at each location is not exceeded:

\[ \sum_{l \in L} x_{ml} = 1 \;\; \forall m \in M, \qquad \sum_{m \in M} x_{ml}\, \text{cpu}_m \le \text{CPU}_l, \qquad \sum_{m \in M} x_{ml}\, \text{mem}_m \le \text{MEM}_l \quad \forall l \in L \]

Expected latency is the connectivity-weighted average of latency across all regimes, where the stationary probabilities \(\pi_\Xi\) serve as weights:

\[ E[\text{latency}(m,l)] = \sum_{\Xi} \pi_\Xi \, \text{latency}(m,l,\Xi) \]

Cloud-dependent modules in the Denied regime contribute a prohibitive but finite latency penalty rather than a literal infinity, so the solver can find the least-bad feasible placement instead of declaring infeasibility.

Placement Heuristics by Module Type:

| Module Type | Optimal Placement | Rationale | Example |
| --- | --- | --- | --- |
| Safety-critical | Device/Fog | Must function in Denied | Collision avoidance |
| Time-critical (<100ms) | Fog | Latency budget excludes cloud | Fault detection |
| Coordination | Edge | Needs multi-node visibility | Formation control |
| Learning/adaptation | Cloud (cached at edge) | Compute-intensive, tolerates delay | Model training |
| Archival/audit | Cloud | Not time-sensitive | Log storage |

Connectivity-Aware Placement Algorithm:

The placement proceeds in phases, respecting the constraint sequence:

Phase 1 (Survival): Place all modules that must function in Denied regime at device or fog layer. These are non-negotiable - if \(\pi_N > 0.05\), any cloud-dependent safety function is architectural malpractice.

Phase 2 (Time-critical): For modules with latency threshold \(\tau < 500\) ms, verify that the total probability mass of regimes in which the module can respond within \(\tau\) is at least 95%:

\[ \sum_{\Xi \,:\, \text{latency}(m,l,\Xi) \le \tau} \pi_\Xi \ge 0.95 \]

If cloud placement fails this test, move to edge. If edge fails, move to fog.

Phase 3 (Optimization): Remaining modules placed to minimize cost subject to latency SLO. Cloud preferred for compute cost; edge/fog preferred for latency.
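The three phases can be sketched as a greedy pass; the layer latencies, costs, and module tuples below are hypothetical, and the real formulation is the ILP above:

```python
# Greedy sketch of the three placement phases. All numbers are
# illustrative assumptions, not AUTOHAULER/GRIDEDGE data.
P95_MS = {"device": 1, "fog": 20, "edge": 120, "cloud": 900}  # per-layer p95 latency
COST = {"device": 4, "fog": 3, "edge": 2, "cloud": 1}         # relative compute cost

def place(module):
    name, tau_ms, must_survive_denied = module
    # Phase 1 (survival): denied-critical modules pinned to device/fog.
    if must_survive_denied:
        return "device" if tau_ms < P95_MS["fog"] else "fog"
    # Phases 2-3: cheapest layer whose p95 latency meets the threshold.
    feasible = [layer for layer, p95 in P95_MS.items() if p95 <= tau_ms]
    return min(feasible, key=lambda layer: COST[layer])

modules = [("collision_avoidance", 10, True),
           ("fault_detection", 100, False),
           ("demand_forecasting", 3_600_000, False)]
print({m[0]: place(m) for m in modules})
```

Under these assumed numbers the greedy pass reproduces the heuristics table: safety-critical lands at device, the sub-100 ms module at fog, and delay-tolerant compute at cloud.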

GRIDEDGE Placement Example:

| Module | Latency Requirement | Placement | Rationale |
| --- | --- | --- | --- |
| Fault detection | <100ms | Fog (feeder controller) | p95 latency to edge = 340ms |
| Protection coordination | <500ms | Fog | Must function during storm |
| Load balancing | <5s | Edge (substation) | Multi-feeder visibility needed |
| Demand forecasting | <1 hour | Cloud | Compute-intensive ML |
| Regulatory reporting | <24 hours | Cloud | Not time-sensitive |

Redundancy Planning Framework

Connectivity regimes determine redundancy requirements. The goal: maintain capability despite connectivity loss and component failure.

Redundancy Dimensions:

  1. Compute redundancy: Multiple nodes capable of running critical modules
  2. Data redundancy: State replicated across connectivity boundaries
  3. Path redundancy: Multiple communication paths to higher tiers
  4. Authority redundancy: Backup decision-makers when primary unreachable

Redundancy Factor Calculation:

The required redundancy factor \(R\) for capability \(c\) with availability target \(A_c\) satisfies \(1 - (1-a)^R \ge A_c\), giving:

\[ R \ge \frac{\ln(1 - A_c)}{\ln(1 - a)} \]

where \(a\) is single-component availability. For \(a = 0.95\) and \(A_c = 0.999\), \(R \ge \ln(0.001)/\ln(0.05) \approx 2.31\), so three replicas suffice.

Connectivity-Adjusted Redundancy:

Component availability varies by connectivity regime. Effective availability is the regime-weighted average of per-regime availability \(a_r\), using the stationary distribution \(\pi_r\) as weights:

\[ a_{\text{eff}} = \sum_r \pi_r \, a_r \]

For a cloud-dependent component under the CONVOY connectivity profile, \(a_{\text{eff}} \approx 0.71\). To achieve 99.9% availability with \(a_{\text{eff}} = 0.71\):

\[ R \ge \frac{\ln(1 - 0.999)}{\ln(1 - 0.71)} = \frac{\ln(0.001)}{\ln(0.29)} \approx 5.6 \]

Six redundant cloud instances are needed — versus three if connectivity were reliable.

Physical translation. With reliable connectivity, three cloud replicas achieve 99.9% availability. Multiply the component availability by the connectivity availability (0.71 effective), and you need six replicas for the same target. The two extra replicas are the direct cost of unreliable connectivity — a cost invisible to architects who treat connectivity as a baseline assumption.
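The replica arithmetic is a two-line function. A sketch reproducing both targets above:

```python
import math

# Replica count R so that 1 - (1 - a_eff)^R >= A_target, where a_eff
# folds connectivity availability into component availability.
def replicas_needed(a_eff: float, a_target: float) -> int:
    return math.ceil(math.log(1 - a_target) / math.log(1 - a_eff))

print(replicas_needed(0.95, 0.999))  # reliable connectivity
print(replicas_needed(0.71, 0.999))  # CONVOY effective availability
```

The jump from 3 to 6 replicas is the quantified cost of connectivity unreliability described above.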

Hierarchical Redundancy Architecture: The diagram shows how redundancy factor increases from R=1 at the device layer to R=3 at the cloud layer, reflecting that cloud components depend on connectivity probability multiplied with component availability — the further from the device, the more replicas are needed to reach the same effective availability.

    
    graph TD
    subgraph "Device Layer (R=1)"
        D1["Sensor"]
        D2["Sensor"]
    end

    subgraph "Fog Layer (R=2)"
        F1["Fog Node A
(primary)"]
        F2["Fog Node B
(standby)"]
    end
    subgraph "Edge Layer (R=2)"
        E1["Edge Server 1
(active)"]
        E2["Edge Server 2
(active)"]
    end
    subgraph "Cloud Layer (R=3)"
        C1["Region A"]
        C2["Region B"]
        C3["Region C"]
    end
    D1 --> F1
    D1 -.-> F2
    D2 --> F1
    D2 -.-> F2
    F1 --> E1
    F1 -.-> E2
    F2 --> E1
    F2 -.-> E2
    E1 -.-> C1
    E1 -.-> C2
    E2 -.-> C2
    E2 -.-> C3
    style F2 fill:#fff3e0,stroke:#f57c00
    style E2 fill:#fff3e0,stroke:#f57c00
    style C2 fill:#e3f2fd,stroke:#1976d2
    style C3 fill:#e3f2fd,stroke:#1976d2

Redundancy decreases toward device layer because device-layer components must function independently (R=1 is acceptable if the device itself is the unit of survival). Redundancy increases toward cloud layer because cloud availability is multiplied by connectivity probability.

State Replication Strategy:

For CRDT -based state, replication factor determines reconciliation complexity:

| State Type | Replication | Rationale |
| --- | --- | --- |
| Safety-critical | 3+ (fog layer) | Must survive any single failure |
| Coordination | 2 (edge layer) | Cluster-level redundancy |
| Archival | 3 (cloud, geo-distributed) | Durability over availability |

Cross-Boundary Replication:

State that must survive connectivity loss should be replicated across connectivity boundaries, so that no pair of replica locations shares a common failure mode:

\[ \forall\, l_i, l_j: \quad P(\text{fail}(l_i) \wedge \text{fail}(l_j)) \approx P(\text{fail}(l_i))\, P(\text{fail}(l_j)) \]

For CONVOY, placing one replica at each vehicle (T3), one at the platoon controller (T2), and one at the convoy coordinator (T1) ensures state survives any single connectivity boundary failure.

Physical translation. Replicas that fail together provide no redundancy. The constraint requires that each pair of replica locations has negligibly correlated failure probability. Jamming one vehicle does not guarantee jamming another. Placing all replicas in the cloud violates this constraint: a connectivity denial takes all cloud replicas simultaneously, regardless of how many there are.

Redundancy Cost-Benefit Analysis: The table below shows how effective availability and cost scale together as redundancy increases, using the CONVOY baseline; the Break-even column gives the minimum per-hour downtime cost (in normalized cost units, c.u.; calibrate to actual downtime cost for your system) that justifies each additional replica.

| Redundancy Level | Compute Cost | Storage Cost | Availability | Break-even |
| --- | --- | --- | --- | --- |
| R=1 | 1x | 1x | 71% | Baseline |
| R=2 | 2x | 2x | 92% | If downtime >50 c.u./hr |
| R=3 | 3x | 3x | 98% | If downtime >200 c.u./hr |
| R=4 | 4x | 4x | 99.4% | If downtime >800 c.u./hr |

The economic break-even depends on downtime cost. Safety-critical systems (infinite downtime cost) justify maximum redundancy; informational systems may accept R=1.

Synthesis: From Connectivity Analysis to Architecture

The illustrative profiles, latency distributions, and redundancy calculations converge on architectural guidance:

For CONVOY -like environments (\(\pi_N > 0.2\), partition duration potentially hours):

For AUTOHAULER -like environments (\(\pi_N \approx 0.13\), partitions short but frequent):

For GRIDEDGE -like environments (\(\pi_N \approx 0.19\), fault-connectivity correlation):

The connectivity regime analysis transforms abstract architectural principles into quantified design decisions.

Cognitive Map — Section 13. Illustrative profiles calibrate the framework to real environments \(\to\) latency distributions show tail behavior is the actual constraint \(\to\) viability matrix converts latency to per-capability regime viability \(\to\) probabilistic partition models answer “how long until Denied?” \(\to\) placement optimization formalizes fog-vs-cloud decisions \(\to\) redundancy framework quantifies how unreliable connectivity multiplies replica count \(\to\) synthesis translates all of this into environment-specific design rules.

Component Interactions: From Theory to Implementation

The theoretical framework - Markov connectivity, capability hierarchy, tiered architecture - manifests in concrete component interfaces. Understanding these interactions clarifies how autonomic behavior emerges from well-defined contracts between system layers.

AUTOHAULER Component Interfaces

Each tier exposes three interface categories to adjacent tiers:

The table below enumerates every inter-tier message type in AUTOHAULER , including the cadence (Message Pattern) and the information carried; the peer coordination row is the only one that operates without any parent involvement.

| Tier | Interface Type | Message Pattern | Content |
| --- | --- | --- | --- |
| T3 to T2 | Status Report | Every 2s + on-event | Position, speed, load state, obstacle detections |
| T2 to T3 | Route Command | On-change | Waypoint sequence, speed limits, priority |
| T3 peer | Peer Coordination | Ad-hoc | Precedence negotiation, passing intention |
| T2 to T1 | Segment Status | Every 30s + on-event | Truck positions, queue lengths, incidents |
| T1 to T2 | Zone Policy | Every 5min + on-change | Traffic rules, speed limits, restricted areas |
| T1 to T0 | Production Data | Every 15min | Tonnes moved, cycle times, equipment status |
| T0 to T1 | Shift Plan | Every 8h | Route assignments, maintenance windows |

The delegated authority model governs what happens when connectivity fails. The diagram below shows how each tier’s responsibilities expand when its parent tier becomes unreachable; read left-to-right to compare normal operation with the two partition cases.

    
    flowchart LR
    subgraph "Normal Operation"
        T3N["Truck: Execute route
Report status"]
        T2N["Segment: Coordinate traffic
Forward commands"]
        T1N["Pit: Optimize allocation
Handle exceptions"]
    end
    subgraph "T2 Disconnected from T1"
        T3D["Truck: No change
Continue route"]
        T2D["Segment: ELEVATED
+Route reassignment
+Exception handling"]
    end
    subgraph "T3 Disconnected from T2"
        T3I["Truck: AUTONOMOUS
+Obstacle response
+Peer-only coord
+Safe stop if needed"]
    end
    T3N --> T3D
    T2N --> T2D
    T3N --> T3I
    style T3N fill:#e8f5e9,stroke:#388e3c
    style T2N fill:#e8f5e9,stroke:#388e3c
    style T1N fill:#e8f5e9,stroke:#388e3c
    style T2D fill:#fff3e0,stroke:#f57c00
    style T3D fill:#e8f5e9,stroke:#388e3c
    style T3I fill:#fce4ec,stroke:#c2185b

When a truck enters an ore pass tunnel (T3 disconnected from T2), it activates autonomous mode: local obstacle response, peer-only coordination, and safe stop if needed.

GRIDEDGE Message Flows

Protection coordination requires precise timing. The fog-layer feeder controller implements a state machine with strict timing bounds:

The table below defines the feeder controller state machine; the Max Duration column is the hard time budget for each state — exceeding it causes upstream protection to operate and enlarges the outage.

| State | Max Duration | Inputs | Outputs |
| --- | --- | --- | --- |
| MONITORING | Indefinite | Sensor telemetry at 4 samples/cycle | Aggregated status to substation |
| FAULT_DETECTED | 50ms | Fault signatures from line sensors | Block signal to upstream, location estimate |
| ISOLATING | 150ms | Confirmation from adjacent controllers | Trip commands to sectionalizing switches |
| VERIFYING | 200ms | Switch position feedback | Isolation complete message to substation |
| RESTORATION | 5s | Substation authorization (if available) | Reclose commands, load transfer requests |

The critical path - from fault detection to isolation - must complete within 200ms with zero upstream communication. The fog controller makes the isolation decision locally, using pre-configured coordination tables that define which switches to open for each fault location.

Inter-feeder coordination uses a simple protocol: when Feeder A detects a fault, it broadcasts a block signal containing fault location and estimated magnitude. Adjacent feeders receiving this signal suppress their own upstream trip for a coordination window (100ms), allowing Feeder A to isolate the fault before upstream protection operates. If Feeder A fails to isolate within the window, adjacent feeders proceed with their own protection logic.

This protocol is connectivity-agnostic: if the inter-feeder link is available, coordination improves selectivity (smaller outage scope); if unavailable, each feeder protects independently with wider isolation (larger outage scope but still safe). The system degrades gracefully rather than failing catastrophically.
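A sketch of the adjacent feeder's decision rule during the 100 ms coordination window described above (function and field names are assumptions):

```python
# Adjacent-feeder reaction to a block signal: suppress the local trip
# for the coordination window, then protect independently if the
# originating feeder has not isolated the fault.
COORD_WINDOW_MS = 100

def adjacent_feeder_action(ms_since_block: float, fault_isolated: bool) -> str:
    if fault_isolated:
        return "resume_monitoring"   # originating feeder cleared the fault
    if ms_since_block < COORD_WINDOW_MS:
        return "suppress_trip"       # give the originator time to isolate
    return "trip_independently"      # window expired: protect anyway

print(adjacent_feeder_action(40, False))
print(adjacent_feeder_action(120, False))
print(adjacent_feeder_action(60, True))
```

Note that every branch ends in a safe action; the inter-feeder link only improves selectivity, never gates protection, which is the connectivity-agnostic property the text describes.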

API Patterns Common to Both Systems

Both AUTOHAULER and GRIDEDGE implement three API patterns that emerge from the theoretical framework:

  1. Heartbeat with Capability: Regular status messages include not just health indicators but the node’s current capability level. A truck reporting “L1 capability” signals it can follow routes but cannot coordinate with peers. A feeder controller reporting “L2 capability” signals it can coordinate with adjacent feeders but cannot optimize with substation.

  2. Command with Deadline: All commands include an expiration timestamp. A route command expiring in 30 seconds gives the truck time to request clarification; a route command expiring in 3 seconds indicates urgency. Expired commands are discarded, not executed - preventing stale instructions from causing harm after connectivity restoration.

  3. State Merge on Reconnection: When connectivity restores, components exchange state digests (compact representations of decisions made during disconnection). Conflicts are resolved using domain-specific rules: for AUTOHAULER , completed actions are facts (a truck that already dumped cannot un-dump); for GRIDEDGE , switch positions are facts but restoration sequences can be adjusted.
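The command-with-deadline pattern reduces to a single comparison at execution time. A sketch; the message field names are assumptions:

```python
import time

# Command-with-deadline: expired commands are discarded, never
# executed, preventing stale instructions from causing harm after
# connectivity restoration. Field names are illustrative.
def actionable(command: dict, now: float = None) -> bool:
    now = time.time() if now is None else now
    return command["expires_at"] > now

cmd = {"type": "route", "waypoints": ["WP1", "WP2"], "expires_at": 1000.0}
print(actionable(cmd, now=970.0))   # fresh: execute
print(actionable(cmd, now=1030.0))  # stale after reconnection: discard
```

The check belongs at the executor, not the sender: a command that was fresh when queued may be stale by the time a store-and-forward proxy finally delivers it.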


Why Mobile Offline-First Doesn’t Transfer

Offline-first mobile apps and tactical edge systems share one surface feature — both operate without guaranteed connectivity. That surface similarity is a trap. Three structural differences make every mobile offline-first pattern either insufficient or dangerously misleading when applied to edge systems.

Scale of Autonomous Decision Authority

The mobile offline model defers commitment: cache locally, sync when connected, let the user resolve conflicts. This works because the user is in the loop. Tactical edge systems cannot defer — the drone cannot display a spinner, the convoy cannot pause, the sensor mesh cannot wait for headquarters to resolve a merge conflict.

Mobile offline-first caches user data locally for eventual synchronization. The app can show a spinner, display stale content, or prompt the user to retry later. No permanent decisions are made without eventual confirmation.

Tactical edge systems must make irrevocable decisions without central coordination. The RAVEN swarm cannot display a spinner while waiting to confirm target classification. The CONVOY cannot defer route selection until connectivity resumes. The OUTPOST cannot pause defensive response pending approval from headquarters.

Define decision reversibility \(R(d)\) as the probability that decision \(d\) can be undone, given reconnection within time horizon \(T\):

\[ R(d) = P\big(d \text{ can be undone} \,\big|\, \text{reconnection within } T\big) \]

For mobile applications, \(R(d) \approx 1\) for most decisions. Cached writes can be reconciled. Optimistic updates can be rolled back. Conflicts can be resolved by user intervention.

For tactical edge systems, \(R(d) \ll 1\) for critical decisions:

The table below ranks five tactical decision types by reversibility \(R(d)\); a value of 0.0 means the action cannot be undone under any reconnection scenario, while a value of 0.7 means a later central authority can usually correct the outcome.

| Decision Type | R(d) | Consequence of Error |
| --- | --- | --- |
| Physical intervention | 0.0 | Physical actions cannot be recalled |
| Route commitment | 0.1 | Fuel consumed, position revealed, time lost |
| Resource expenditure | 0.2 | Power, fuel, consumables depleted |
| Formation change | 0.4 | Coordination state diverged, reconvergence costly |
| Priority adjustment | 0.7 | Opportunity cost, suboptimal allocation |

Irreversibility adds regret cost to the decision function. Total decision cost is the sum of the immediate cost and a regret term scaled by \((1 - R(d))\): decisions that cannot be undone carry their full worst-case loss, while reversible decisions carry none.

\[ C_{\text{total}}(d) = C_{\text{immediate}}(d) + (1 - R(d)) \cdot C_{\text{regret}}(d) \]

where \(C_{\text{regret}}(d)\) is the worst-case loss from decision \(d\) if it cannot be undone and proves incorrect.

Physical translation. For a reversible decision (\(R = 1\)), regret cost disappears — if you’re wrong, you can undo it. For an irreversible decision (\(R = 0\)), you carry the full worst-case loss regardless of outcome. A RAVEN drone committing to a target track (\(R \approx 0.1\)) cannot be given the same decision budget as a mobile app saving a draft (\(R \approx 1\)). The formula forces this asymmetry to be explicit in system design: high-regret decisions require higher confidence thresholds before acting.
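The regret-weighted cost is one line of arithmetic; the numbers below are illustrative, not calibrated values:

```python
# Regret-weighted decision cost: irreversible decisions carry their
# worst-case loss scaled by (1 - R(d)). Inputs are illustrative.
def total_cost(immediate: float, reversibility: float, worst_case: float) -> float:
    return immediate + (1.0 - reversibility) * worst_case

draft_save = total_cost(immediate=1, reversibility=1.0, worst_case=100)    # R = 1
route_commit = total_cost(immediate=1, reversibility=0.1, worst_case=100)  # R = 0.1
print(draft_save, route_commit)
```

With identical immediate costs, the near-irreversible route commitment is two orders of magnitude more expensive in expectation, which is exactly why it demands a higher confidence threshold.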

Adversarial Environment

Mobile offline-first assumes the network fails randomly. Contested edge must assume the network fails intentionally — an adversary that selectively jams to disrupt coordination while monitoring the response, partitions strategically to isolate high-value nodes, injects false data to poison state during reconnection, and times attacks to trigger partition at maximum-consequence moments.

Every protocol must consider “what if the network is being used against us.” CONVOY in mountain transit: vehicle 2’s position updates conflict with vehicle 3’s direct observation. Software bug? GPS multipath? Adversary spoofing?

Mobile apps trust platform identity infrastructure. Tactical edge must verify peer identity continuously, detect compromise anomalies, and isolate corrupted nodes without fragmenting the fleet.

Fleet Coordination Requirements

Mobile devices operate independently; state divergence between phones is tolerable. Edge fleets must maintain coordinated behavior across partitioned subgroups. When RAVEN fragments into three clusters, each cluster must avoid conflicting with the others, preserve formation discipline, and keep surveillance coverage complete without duplicated effort.

The core challenge. Coordination without communication is the defining problem of tactical edge architecture. Mobile offline-first never faces it: phones that diverge during partition simply show different cached content. Drones that diverge during partition may collide, break formation, or surveil the same area twice while leaving a gap elsewhere.

Cognitive Map — Section 14. Mobile offline-first and tactical edge share only the surface problem \(\to\) three structural breaks: irrevocable decisions, adversarial intent, fleet coordination requirements \(\to\) irreversibility adds regret cost to every decision function \(\to\) adversarial environment requires continuous peer identity verification, not just platform trust \(\to\) fleet coordination under partition is the problem mobile offline-first never solved.


The Edge Constraint Triangle

Three fundamental constraints compete in every edge communication decision; the diagram below shows the triangle structure where each edge label names the mechanism by which improving one vertex degrades an adjacent one.

    
    graph TD
    B["Bandwidth
(bits per second)"] ---|"FEC overhead
reduces throughput"| R["Reliability
(delivery probability)"] R ---|"retransmissions
add delay"| L["Latency
(end-to-end delay)"] L ---|"faster = less
error correction"| B style B fill:#e3f2fd,stroke:#1976d2 style L fill:#fff3e0,stroke:#f57c00 style R fill:#e8f5e9,stroke:#388e3c

The Edge Triangle Theorem (informal): You cannot simultaneously maximize bandwidth, minimize latency, and ensure reliability in a contested communication environment. Improving any one dimension requires sacrificing at least one other.

Problem: Every communication decision requires choosing between bandwidth (move more bits), reliability (lose fewer packets), and latency (deliver faster). No protocol can maximize all three simultaneously on a constrained physical channel.

Solution: Parameterize the trade-off explicitly using \(\alpha\) (power allocation fraction). Different message classes operate at different points on the Pareto frontier — alerts at high reliability, sensor streams at high bandwidth, coordination at low latency.

Trade-off: The \(\alpha\) parameter is not a dial you set once. Mission-critical messages may switch operating point within a single operation as conditions change. Proxy mesh infrastructure is what makes per-message-class switching feasible at runtime.

Mathematical Formalization

Define the achievable operating point as a vector \((B, R, 1/L) \in \mathbb{R}^3_+\), where higher is better in every dimension. The achievable region is bounded by fundamental constraints:

Shannon-limited bandwidth-reliability tradeoff:

For a channel with raw capacity \(C\) bits/second and bit error rate \(p_e\), the achievable reliable information rate is bounded by:

\[ R_{\text{info}} \le C \,\big(1 - H_2(p_e)\big) \]

where \(H_2(p) = -p \log_2 p - (1-p)\log_2(1-p)\) is the binary entropy. Lower error rates (higher reliability) require more redundancy, reducing effective throughput.

Physical translation. Adding redundancy to catch bit errors reduces the information you can carry. A channel that delivers 9,600 bps raw can only carry about 8,800 bps of useful data at 1% bit error rate, because 8% of capacity goes to error detection. Pushing for 0.1% error rate costs even more capacity. You cannot get reliable bits without spending bandwidth to achieve that reliability.
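The arithmetic above can be checked directly, treating the link as a binary symmetric channel with the stated raw rate and error probability:

```python
import math

# Useful capacity of a raw link with bit-error probability pe:
# at most raw_bps * (1 - H2(pe)) bits/s of reliable information,
# where H2 is the binary entropy.
def h2(p: float) -> float:
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def useful_rate(raw_bps: float, pe: float) -> float:
    return raw_bps * (1 - h2(pe))

# 9,600 bps raw at 1% bit errors: H2(0.01) ~= 0.081, so roughly 8% of
# capacity goes to redundancy, leaving ~8,800 bps of useful data.
print(round(useful_rate(9600, 0.01)))
```

This reproduces the "about 8,800 bps useful" figure from the text (8,824 bps, with ~8% overhead).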

Latency-reliability tradeoff (ARQ protocols):

With per-packet success probability \(p_s\), the expected number of transmissions until success follows a geometric distribution:

\[ E[N] = \frac{1}{p_s} \]

To guarantee delivery reliability \(R_{\text{target}}\) with bounded retries, the required attempt count \(n\) satisfies \(1 - (1 - p_s)^n \ge R_{\text{target}}\), yielding:

\[ n \ge \frac{\ln(1 - R_{\text{target}})}{\ln(1 - p_s)} \]

The required retry budget grows without bound as \(R_{\text{target}} \rightarrow 1\).

Physical translation. Each retransmission adds a full round-trip delay. At 70% per-packet success (\(p_s = 0.7\)), you expect 0.43 extra round-trips on average. At 50% success under heavy jamming, you expect a full extra round-trip per packet — doubling base latency before the packet gets through. High reliability under bad channel conditions is expensive in time, not just in bandwidth.
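A sketch of the retry arithmetic; the 0.999 delivery target is an assumed example:

```python
import math

# ARQ cost of reliability: E[transmissions] = 1/ps, and the retry
# budget needed for delivery target R grows as ln(1-R)/ln(1-ps).
def expected_tx(ps: float) -> float:
    return 1.0 / ps

def retries_needed(ps: float, r_target: float) -> int:
    return math.ceil(math.log(1 - r_target) / math.log(1 - ps))

print(round(expected_tx(0.7), 2), retries_needed(0.7, 0.999))  # moderate channel
print(round(expected_tx(0.5), 2), retries_needed(0.5, 0.999))  # heavy jamming
```

Halving the per-packet success rate from 0.7 to 0.5 pushes the 99.9% delivery budget from 6 attempts to 10, with each attempt carrying a full round-trip of latency.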

Power-constrained bandwidth: The Shannon capacity bound gives the maximum achievable bit rate as a function of transmit power \(P\), path gain \(G\), noise spectral density \(N_0\), and channel bandwidth \(W\); increasing transmit power yields diminishing returns due to the logarithmic relationship:

\[ B_{\max} = W \log_2\!\left(1 + \frac{P\,G}{N_0\, W}\right) \]

The Pareto Frontier

These constraints define a Pareto frontier - the set of achievable operating points where no dimension can be improved without degrading another. The frontier surface can be parameterized by the power allocation \(\alpha \in [0,1]\) between error correction (improving \(R\)) and raw transmission (improving \(B\)):

where \(g_c(\alpha)\) is the error correction coding gain and the remaining term is the latency overhead of forward error correction.

Concrete example: OUTPOST’s channel parameters trace out this frontier as \(\alpha\) sweeps from 0 to 1. The optimal operating point depends on mission requirements. For OUTPOST alert distribution, reliability dominates (\(\alpha \rightarrow 1\)). For RAVEN sensor streaming, bandwidth dominates (\(\alpha \rightarrow 0\)). For CONVOY coordination, latency dominates (minimize \(L\) subject to a reliability floor).

Physical translation. At \(\alpha = 0\) (no FEC), OUTPOST transmits at maximum rate but accepts 1% bit errors — fine for sensor telemetry where occasional bad readings are tolerable. At \(\alpha = 1\) (maximum reliability), OUTPOST achieves near-perfect delivery but carries zero payload — useful only for presence beacons. The operational setting sits between: alerts near \(\alpha = 0.5\) (near-perfect delivery, acceptable latency), sensor streams near \(\alpha = 0.1\) (high throughput, tolerable error rate).
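A sketch of the \(\alpha\) parameterization; the coding-gain and overhead curves here are assumed functional forms, not OUTPOST's calibrated values:

```python
import math

# Sweep the power-allocation parameter alpha over an assumed frontier:
# payload bandwidth falls linearly, reliability rises with an assumed
# exponential coding-gain model, FEC adds processing latency.
def operating_point(alpha: float, raw_bps: float = 9600,
                    base_ber: float = 0.01, base_latency_ms: float = 80):
    bandwidth = raw_bps * (1 - alpha)                  # capacity left for payload
    reliability = 1 - base_ber * math.exp(-5 * alpha)  # assumed coding-gain curve
    latency = base_latency_ms * (1 + alpha)            # assumed FEC delay overhead
    return bandwidth, reliability, latency

for alpha in (0.0, 0.5, 1.0):
    b, r, l = operating_point(alpha)
    print(f"alpha={alpha}: B={b:.0f} bps, R={r:.4f}, L={l:.0f} ms")
```

The endpoints reproduce the narrative: \(\alpha = 0\) is full-rate with residual errors, \(\alpha = 1\) is near-perfect delivery with zero payload; every message class picks a point in between.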

Architectural Response: Distributed Proxy Mesh

The edge constraint triangle suggests that different message types require different operating points. A distributed proxy mesh pattern addresses this by placing intelligent intermediaries throughout the network that can dynamically select operating points per-message-class.

    
    graph TB
    subgraph "Application Tier"
        A1["App Instance"]
        A2["App Instance"]
        A3["App Instance"]
    end

    subgraph "Proxy Mesh"
        P1["Proxy
Local queue
Protocol bridge"]
        P2["Proxy
Local queue
Protocol bridge"]
        P3["Proxy
Local queue
Protocol bridge"]
        P4["Proxy
Local queue
Protocol bridge"]
    end

    subgraph "Backend Services"
        S1["Service A"]
        S2["Service B"]
    end

    A1 --> P1
    A2 --> P2
    A3 --> P3
    P1 <--> P2
    P2 <--> P3
    P3 <--> P4
    P1 <--> P4
    P4 --> S1
    P4 --> S2
    P2 -.->|"Failover
path"| S1

    style A1 fill:#fce4ec,stroke:#c2185b
    style A2 fill:#fce4ec,stroke:#c2185b
    style A3 fill:#fce4ec,stroke:#c2185b
    style P1 fill:#e8f5e9,stroke:#388e3c
    style P2 fill:#e8f5e9,stroke:#388e3c
    style P3 fill:#e8f5e9,stroke:#388e3c
    style P4 fill:#e8f5e9,stroke:#388e3c
    style S1 fill:#e3f2fd,stroke:#1976d2
    style S2 fill:#e3f2fd,stroke:#1976d2

Read the diagram. Solid arrows carry primary traffic; the dashed arrow is a failover path that activates only when P4’s primary connection to Service A is unavailable. This is not load balancing — it is the proxy mesh absorbing a topology change without application involvement.

Each proxy navigates the constraint triangle via four responsibilities:

  1. Queue management: Persistent outbound queue, accumulating when downstream unreachable
  2. Protocol translation: Bridge between verbose protocols (gRPC, HTTP/2) on local links and compact protocols (CBOR over CoAP) on tactical links
  3. Route discovery: Maintain topology, compute paths, shift to alternates when primary routes fail
  4. Load distribution: Shed load by priority during congestion - critical messages proceed, bulk defers

Message routing phases:

  1. Resolve: Look up destination in routing table; flood discovery to neighbors (TTL-limited) if not found
  2. Select path: Evaluate candidate paths by path cost; choose the minimum-cost path
  3. Transmit: Send on selected path, start ack timer
  4. Handle outcome: On ack: complete; on timeout: retry or mark path degraded, return to step 2
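The four phases above can be sketched as a retry loop. Everything here is hypothetical scaffolding: the routing-table shape, the cost-doubling degradation rule, and the transport callback are illustrative choices, not the article's protocol:

```python
# Sketch of the four routing phases: resolve, select path,
# transmit, handle outcome. Discovery flooding is stubbed out.

def route_message(msg, routing_table, transport, max_attempts=3, ack_timeout_s=1.0):
    """Deliver msg via the lowest-cost known path, degrading failed paths."""
    # Phase 1: resolve the destination.
    paths = routing_table.get(msg["dest"])
    if not paths:
        raise LookupError("destination unknown; would flood TTL-limited discovery")
    for _ in range(max_attempts):
        # Phase 2: pick the minimum-cost path still on record.
        path = min(paths, key=lambda p: p["cost"])
        # Phase 3: transmit and wait for the ack.
        if transport(path, msg, ack_timeout_s):
            return path                      # Phase 4: acked, complete
        path["cost"] *= 2.0                  # Phase 4: timeout, mark degraded
    raise TimeoutError("all attempts exhausted")

# Usage with a stub transport whose first path fails once:
table = {"node-9": [{"via": "P2", "cost": 1.0}, {"via": "P4", "cost": 1.5}]}
calls = []
def stub_transport(path, msg, timeout):
    calls.append(path["via"])
    return path["via"] == "P4"               # P2 times out, P4 acks

chosen = route_message({"dest": "node-9"}, table, stub_transport)
print(chosen["via"])                         # P4, after P2 is marked degraded
```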

OUTPOST Power Optimization Problem

The OUTPOST remote monitoring station operates with severe power constraints. Solar panels and batteries provide 50W average for communications. The mesh network must support three mission-critical functions:

  1. Sensor fusion: Aggregating data from 100+ perimeter sensors
  2. Command relay: Maintaining contact with CONVOY and RAVEN when possible
  3. Alert distribution: Ensuring threat warnings reach all defended positions

Three communication channels are available:

| Channel | Power | Bandwidth | Reliability | Vulnerability |
|---|---|---|---|---|
| HF Radio | 15W | 4.8 kbps | 0.92 | Low (beyond line-of-sight jamming) |
| SATCOM | 25W | 256 kbps | 0.75 | High (contested orbital environment) |
| Mesh WiFi | 8W | 54 Mbps | 0.98 | Medium (local jamming effective) |

Define decision variables \(x_i \in [0,1]\) as the allocation fraction for channel \(i\), and let a binary indicator variable mark whether channel \(i\) is designated for critical alerts. The optimization problem:

where \(w_i\) are importance weights and \(L_i\) is latency for channel \(i\). The alert reliability constraint requires sufficient channel diversity; the latency constraint bounds worst-case alert delivery time.

Solution structure: At optimum, OUTPOST allocates Mesh WiFi for bulk sensor fusion (high bandwidth, local reliability), HF Radio for alert distribution (unjammable, acceptable latency), and SATCOM opportunistically for external coordination when available and not contested.

Physical translation. This is the Pareto frontier applied operationally: each channel occupies a different corner of the triangle. Mesh WiFi wins on bandwidth (54 Mbps, low power per bit). HF Radio wins on survivability (unjammable at range). SATCOM wins on reach but loses on power cost and adversarial vulnerability. The optimization assigns each function to the channel whose position on the triangle best matches that function’s requirements — not to the single “best” channel.

Model limits: Reliability estimates \(R_i\) assume steady-state. An adversary observing OUTPOST's allocation can adapt — jamming relied-upon channels, backing off abandoned ones. The system must periodically test channel assumptions, not merely optimize on stale estimates.
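The solution structure can be reproduced with a brute-force assignment of mission functions to channels. Only the power/bandwidth/reliability figures come from the table; the survivability scores and per-function weights below are hypothetical calibrations, so this is a sketch of the mechanism, not the article's optimizer:

```python
from itertools import permutations
from math import log10

# Channels from the OUTPOST table; 'surv' (survivability) scores are
# hypothetical numeric stand-ins for the Vulnerability column.
CHANNELS = {
    "HF Radio":  dict(bw=4.8e3, rel=0.92, surv=0.9),
    "SATCOM":    dict(bw=256e3, rel=0.75, surv=0.3),
    "Mesh WiFi": dict(bw=54e6,  rel=0.98, surv=0.6),
}
# Per-function weights (bandwidth, reliability, survivability) - invented.
FUNCTIONS = {
    "sensor fusion":      (1.0, 0.2, 0.1),
    "alert distribution": (0.0, 1.0, 1.0),
    "command relay":      (0.3, 0.3, 0.2),
}

def score(weights, ch):
    w_bw, w_rel, w_surv = weights
    # log-scale bandwidth so 54 Mbps does not swamp everything else
    return w_bw * log10(ch["bw"]) / 8 + w_rel * ch["rel"] + w_surv * ch["surv"]

def best_assignment():
    funcs, chans = list(FUNCTIONS), list(CHANNELS)
    return max(
        (dict(zip(funcs, perm)) for perm in permutations(chans)),
        key=lambda a: sum(score(FUNCTIONS[f], CHANNELS[c]) for f, c in a.items()),
    )

print(best_assignment())
```

With these weights the maximizer lands on the solution structure stated above: Mesh WiFi for sensor fusion, HF Radio for alerts, SATCOM for command relay.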

Cognitive Map — Section 15. Three constraints form a triangle: bandwidth \(\leftrightarrow\) reliability \(\leftrightarrow\) latency \(\to\) improving one degrades at least one other \(\to\) Shannon bound formalizes bandwidth-reliability \(\to\) ARQ formalizes latency-reliability \(\to\) Pareto frontier parameterized by \(\alpha\) lets each message class pick its optimal point \(\to\) proxy mesh makes per-class switching feasible at runtime \(\to\) OUTPOST applies the full framework: three channels, three mission functions, optimization assigns each function to its best corner.


Latency as Survival Constraint

In cloud systems, a slow response is a UX problem. In tactical edge systems, a slow response can be a mission-ending event. The adversary’s decision loop does not pause while you wait for network acknowledgment.

Adversarial Decision Loop Model

Define the adversary’s Observe-Decide-Act (ODA) loop time as \(T_A\), and our own ODA loop time as \(T_O\). The decision advantage is \(\Delta = T_A - T_O\); maintaining advantage requires \(\Delta > 0\).

For RAVEN conducting surveillance of a mobile threat:

Physical translation. Sensor acquisition and local classification are fixed by hardware physics — they cannot be optimized in software. Coordinated response time is fixed by formation geometry. Only swarm notification is architecture-controllable. The entire communication architecture of RAVEN exists to minimize a single variable: swarm notification time.

| Component | Time | Notes |
|---|---|---|
| Sensor acquisition | 50ms | Radar/optical capture, fixed by physics |
| Local classification | 100ms | On-node ML inference, hardware-limited |
| Swarm notification | Variable | Depends on connectivity regime |
| Coordinated response | 200ms | Formation adjustment, task allocation |

Total ODA: \(T_O = 50 + 100 + T_{\text{notify}} + 200 = 350\,\text{ms} + T_{\text{notify}}\), where \(T_{\text{notify}}\) is the swarm notification time.

Intelligence estimates adversary anti-drone system response at \(T_A \approx 800\,\text{ms}\). For RAVEN to maintain decision advantage, \(T_O < T_A\), which requires \(T_{\text{notify}} < 800 - 350 = 450\,\text{ms}\).

This 450ms coordination budget is the binding constraint on RAVEN's communication architecture.

Latency Distribution Analysis

Mean latency tells only part of the story. For survival-critical systems, the tail distribution determines whether occasional slow responses become fatal delays.

Assume coordination latency follows an exponential distribution with rate \(\mu\) under normal conditions, but exhibits heavy tails under jamming. The composite distribution is the mixture:

\[f(t) = (1-p)\,\mu e^{-\mu t} + p\,\mu_j e^{-\mu_j t}\]

where \(p\) is the probability of encountering jamming conditions and \(\mu_j \ll \mu\) is the coordination rate under jamming.

For RAVEN with \(\mu = 10/\text{s}\) (mean 100ms), \(\mu_j = 1/\text{s}\) (mean 1000ms), and \(p = 0.3\):

\[\Pr[T > 0.45\,\text{s}] = 0.7\,e^{-4.5} + 0.3\,e^{-0.45} \approx 0.20\]

The heavy tail means roughly 20% of coordination attempts will miss the 450ms deadline, potentially causing RAVEN to lose decision advantage during those windows.
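The 20% figure follows from the mixture tail. A sketch using the stated parameters (rate 10/s normal, 1/s under jamming, jamming probability 0.3); the quantile inversion by bisection is an implementation choice:

```python
import math

# Tail of the two-regime exponential mixture:
# P(T > t) = (1-p) e^{-mu t} + p e^{-mu_j t}.

def tail_prob(t: float, mu: float = 10.0, mu_j: float = 1.0, p: float = 0.3) -> float:
    """Probability a coordination attempt takes longer than t seconds."""
    return (1 - p) * math.exp(-mu * t) + p * math.exp(-mu_j * t)

def quantile(q: float, lo: float = 0.0, hi: float = 60.0) -> float:
    """Invert the tail numerically: smallest t with P(T > t) <= 1 - q."""
    for _ in range(80):                       # bisection on a decreasing tail
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if tail_prob(mid) > 1 - q else (lo, mid)
    return hi

print(round(tail_prob(0.45), 3))   # ≈ 0.199: ~20% miss the 450 ms deadline
print(round(quantile(0.95), 2))    # ≈ 1.79 s: the 4x-over-budget p95
```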

Physical translation. 30% jamming probability sounds manageable. But the heavy tail means 20% of coordination attempts miss the 450ms deadline entirely — not by a small margin, but by \(4\times\) (1800ms vs 450ms at p95). In a 50-drone swarm running at 10 coordination cycles per minute, that is one missed deadline every 30 seconds. Architecture that ignores the tail is not degraded-resilient; it just has not been tested under the right conditions yet.

Design implications: either reduce \(p\) through better anti-jamming, or accept frequent degraded-mode operation.

Queueing Theory Application

Model swarm notification as a message distribution problem. When a node detects a threat, it must propagate this detection to \(n-1\) peer nodes. In contested environments, not all nodes are reachable directly.

Under full connectivity, epidemic (gossip) protocols achieve logarithmic propagation time \(T_{\text{gossip}} \approx t_r \log_k n\), where \(k\) is the fanout and \(t_r\) is the round time. This follows from the logistic dynamics of information spread: each informed node informs \(k\) peers per round, leading to exponential growth until saturation. For tactical parameters (\(n = 50\), \(k = 6\), \(t_r = 20\,\text{ms}\)), this yields \(T_{\text{gossip}} \approx 44\,\text{ms}\), well within coordination budgets, versus \(n \cdot t_r = 1000\,\text{ms}\) for linear broadcast.

Physical translation. Gossip achieves 44ms because each round multiplies the informed set by the fanout — a detection that reaches 6 nodes in round 1 reaches 36 in round 2, 216 in round 3. The logarithm in the propagation-time bound is the mathematical signature of this exponential growth. Linear broadcast (one-by-one) takes 1000ms for the same 50 nodes. The \(23\times\) speedup from gossip is what keeps swarm notification inside the 450ms coordination budget.
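The 44ms and 1000ms figures are reproducible from the continuous approximation, assuming the 20ms round time implied by the quoted numbers:

```python
import math

# Gossip vs. linear broadcast propagation time for the tactical
# parameters in the text: n = 50 nodes, fanout k = 6, 20 ms rounds.

def gossip_ms(n: int, k: int, round_ms: float = 20.0) -> float:
    """Continuous approximation: log_k(n) rounds of round_ms each."""
    return round_ms * math.log(n, k)

def linear_broadcast_ms(n: int, round_ms: float = 20.0) -> float:
    """One peer notified per slot: n slots."""
    return round_ms * n

g = gossip_ms(50, 6)
l = linear_broadcast_ms(50)
print(round(g), round(l), round(l / g))   # 44 1000 23
```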

Under partition, the swarm fragments. If jamming divides RAVEN into three clusters of sizes \(n_1 = 20\), \(n_2 = 18\), \(n_3 = 9\), intra-cluster gossip completes quickly, but inter-cluster propagation requires relay through connectivity bridges - if any exist.

Define \(p_{\text{bridge}}\) as the probability that at least one node maintains connectivity across cluster boundaries. If \(p_{\text{bridge}} = 0\), clusters operate independently with no shared awareness, and fleet-wide coordination time becomes undefined (or infinite).

The optimization problem: Choose swarm geometry (inter-node distances, altitude distribution, relay positioning) to maximize \(p_{\text{bridge}}\) while maintaining surveillance coverage.

This is a multi-objective optimization with competing constraints: spread for coverage implies larger inter-node distances; clustering for relay reliability implies smaller inter-node distances; altitude variation for bridge probability increases power consumption. The Pareto frontier of this tradeoff is not analytically tractable. Numerical optimization with mission-specific parameters yields operational guidance. But once again, the model assumes a static adversary. An adaptive jammer that observes swarm geometry can target bridge nodes specifically. The anti-fragile response: vary geometry stochastically, making bridge node identity unpredictable.

Cognitive Map — Section 16. Latency is a survival constraint, not a UX metric \(\to\) adversary ODA loop sets the hard deadline (\(T_A \approx 800\)ms) \(\to\) \(T_O\) decomposition reveals that only swarm notification time is architecture-controllable \(\to\) heavy-tail jamming distribution pushes p95 to \(4\times\) the 450ms budget \(\to\) gossip achieves \(23\times\) speedup over linear broadcast \(\to\) partition fragments the swarm, making bridge node placement the critical architecture decision.


Central Coordination Failure Modes

Cloud architectures assume central coordinators exist and are reachable. Load balancers, service meshes, and orchestrators all depend on some node having global visibility and authority. Tactical edge architectures cannot make this assumption [1, 2] — and the math confirms it: when coordinator reachability drops below 67%, distributed coordination is cheaper than centralized, independent of fleet size.

We identify three coordination failure modes:

| Failure Mode | Cause | Detection Challenge | Required Response |
|---|---|---|---|
| Coordinator Unreachable | Partition between coordinator and nodes | Distinguish coordinator failure from network failure | Elect local coordinator or operate autonomously |
| Coordinator Compromised | Adversary has taken control | Coordinator issues plausible but malicious instructions | Byzantine fault tolerance, instruction verification |
| Coordinator Overloaded | Too many nodes requesting coordination | Increased latency indistinguishable from degraded connectivity | Load shedding, priority queuing, hierarchical delegation |

Distributed Coordination Cost Analysis

Compare the cost of centralized versus distributed coordination for achieving consistent state across \(n\) nodes.

Centralized coordination cost: each round requires one request and one response per node,

\[C_{\text{cent}} = 2n\]

But in contested environments, we must account for reachability probability \(p_r\). If the coordinator is unreachable, nodes retry. Expected message cost:

\[\mathbb{E}[C_{\text{cent}}] = \frac{2n}{p_r}\]

Distributed coordination cost (consensus protocols): each node exchanges messages with \(f\) replicas per round,

\[C_{\text{dist}} = f \cdot n\]

The crossover condition determines when distributed coordination becomes more efficient:

\[\frac{2n}{p_r} > f n \iff p_r < \frac{2}{f}\]

The crossover is independent of fleet size \(n\) — it depends only on reachability and fault tolerance. For Byzantine fault tolerance requiring \(f = 3\) replicas (to tolerate 1 Byzantine failure per the \(3f+1\) bound), the threshold is \(p_r < 2/3 \approx 0.67\). Derivation: Byzantine agreement requires \(n \geq 3f + 1\), so with \(f = 1\) tolerated failure, we need \(n \geq 4\) replicas and \(f = 3\) in our cost formula. Thus distributed coordination dominates when coordinator reachability falls below \(2/3\).

Physical translation. The crossover point is 67% reachability — independent of how many nodes are in the fleet. A 12-vehicle CONVOY and a 127-sensor OUTPOST both switch to distributed coordination at the same threshold. In contested environments where \(p_r\) ranges 0.3–0.5, you are already far below crossover. Design for distributed coordination as primary mode, with centralized coordination as an optimization when high reachability is sustained.
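The crossover arithmetic is small enough to verify inline. This sketch assumes the distributed cost scales as \(f \cdot n\), the form consistent with the \(2n/p_r\) centralized cost and the \(p_r < 2/f\) crossover quoted above:

```python
# Expected message cost comparison: centralized 2n/p_r vs.
# distributed f*n, with crossover reachability 2/f.

def centralized_cost(n: int, p_r: float) -> float:
    """Expected messages per round with retries against an
    intermittently reachable coordinator."""
    return 2 * n / p_r

def distributed_cost(n: int, f: int) -> float:
    """Messages per consensus round, f replicas per node."""
    return f * n

def crossover_reachability(f: int) -> float:
    """Reachability below which distributed coordination is cheaper."""
    return 2 / f

n, f = 50, 3
print(round(crossover_reachability(f), 3))                 # 0.667, any fleet size
print(centralized_cost(n, 0.5) > distributed_cost(n, f))   # True: contested p_r favors distributed
```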

Hysteresis-Based Coordination Mode Selection

Naive mode switching at the crossover point causes oscillation: reachability briefly exceeds threshold, system switches to centralized, latency increases during transition, reachability appears to drop, system switches back. This thrashing wastes resources and creates inconsistent behavior.

We introduce hysteresis with distinct thresholds for mode transitions:

\[\text{switch to CENTRALIZED when } p_r > p^* + \epsilon_h, \qquad \text{switch to DISTRIBUTED when } p_r < p^* - \epsilon_h\]

where \(p^* = 2/f\) is the crossover point and \(\epsilon_h\) is the hysteresis margin (typically 0.1–0.15). The system remains in its current mode when \(|p_r - p^*| \leq \epsilon_h\).

Physical translation. Without hysteresis, a system at the crossover point (\(p_r \approx 0.67\)) oscillates: reachability briefly exceeds the threshold, triggering a costly transition to centralized mode, which increases coordination latency, which makes reachability appear to drop, which triggers a switch back. With \(\epsilon_h = 0.1\), the system only switches to centralized at \(p_r > 0.77\) and only switches back to distributed at \(p_r < 0.57\) — creating a 20-point wide stable band that absorbs transient fluctuations.

Coordination mode selection:

  1. Compute smoothed reachability via EWMA: \(\bar{p}_r(t) = \beta\, p_r(t) + (1-\beta)\,\bar{p}_r(t-1)\)
  2. Detect adversarial gaming: If variance >0.04, fall back to distributed (high variance suggests connectivity manipulation)
  3. Apply hysteresis with stability requirement:
    (writing \(p^*\) for the crossover reachability and \(\epsilon_h\) for the hysteresis margin)

| Current Mode | Condition | Action |
|---|---|---|
| CENTRALIZED | \(\bar{p}_r < p^* - \epsilon_h\) | Switch to DISTRIBUTED |
| DISTRIBUTED | \(\bar{p}_r > p^* + \epsilon_h\) AND stable for 30s | Switch to CENTRALIZED |
| Either | Otherwise | Maintain current mode |

The stability check prevents switching on transient connectivity spikes - centralized mode is only entered after sustained high reachability.
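The hysteresis-plus-dwell logic can be sketched as a small state machine. The 0.67 crossover, 0.1 margin, and 30s stability window come from the text; the EWMA gain and the class shape are hypothetical implementation choices:

```python
# Hysteresis-based coordination mode selection. Defaults encode the
# 0.67 crossover, 0.1 margin, and 30 s dwell from the text; beta is
# an assumed smoothing gain.

class ModeSelector:
    def __init__(self, p_star=0.67, margin=0.10, dwell_s=30.0, beta=0.2):
        self.p_star, self.margin, self.dwell_s, self.beta = p_star, margin, dwell_s, beta
        self.mode = "DISTRIBUTED"          # safe default under contest
        self.p_smooth = 0.0
        self.stable_since = None

    def update(self, p_r: float, now_s: float) -> str:
        # EWMA smoothing of raw reachability samples.
        self.p_smooth = self.beta * p_r + (1 - self.beta) * self.p_smooth
        hi, lo = self.p_star + self.margin, self.p_star - self.margin
        if self.mode == "CENTRALIZED":
            if self.p_smooth < lo:
                self.mode, self.stable_since = "DISTRIBUTED", None
        else:
            if self.p_smooth > hi:
                if self.stable_since is None:
                    self.stable_since = now_s        # start the dwell clock
                elif now_s - self.stable_since >= self.dwell_s:
                    self.mode = "CENTRALIZED"        # sustained high reachability
            else:
                self.stable_since = None             # spike ended; reset
        return self.mode

# Two minutes of sustained high reachability eventually switches modes;
# a brief spike would not.
sel = ModeSelector()
for t in range(120):
    mode = sel.update(0.95, float(t))
print(mode)
```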

Mode transition costs must also be considered. Total transition cost decomposes into three components: the cost of synchronizing state between nodes, the cost of electing a new leader, and the cost of recovering a consistent view after the mode switch:

\[C_{\text{trans}} = C_{\text{sync}} + C_{\text{elect}} + C_{\text{recover}}\]

For CONVOY, \(C_{\text{trans}} \approx 8\) seconds of reduced capability. The algorithm only switches when the expected benefit exceeds this cost over a planning horizon (typically 5 minutes).

Cognitive Map — Section 17. Three failure modes for central coordinators (unreachable, compromised, overloaded) \(\to\) cost analysis shows centralized has \(2n/p_r\) expected message cost \(\to\) crossover at \(p_r < 2/f\) is fleet-size independent \(\to\) for Byzantine tolerance with \(f = 3\), threshold is 67% — well above typical contested \(p_r\) \(\to\) design primary mode as distributed \(\to\) hysteresis prevents oscillation at the crossover point \(\to\) transition cost (8s for CONVOY ) further penalizes frequent mode switching.


Degraded Operation as Primary Design Mode

The inversion thesis implies that architects should optimize explicitly for the partition case rather than treating it as an edge condition. When more than half of operating time is spent in Intermittent or Denied regimes, “degraded” is the primary operating mode and “connected” is the bonus state. The formal design objective follows: find the architecture policy \(\pi\) that maximizes expected capability level conditioned on the system being in the Denied regime, subject to maintaining at least basic-mission capability (L1).

Physical translation. This is not “design for failure.” It is “design for the primary mode.” When 43% of CONVOY ’s operating time is Intermittent or Denied, the Denied regime is not an edge case — it is a first-class operating mode that must be optimized on its own terms. The objective says: maximize what you can do when disconnected, not just how gracefully you degrade.

When the fraction of operating time spent in the Intermittent or Denied regimes exceeds one half, “degraded” is the primary operating mode.

Capability Hierarchy Framework

Define capability levels from basic survival to full integration:

| Level | Name | Description | Threshold \(\theta_i\) | Marginal Value \(\Delta V_i\) |
|---|---|---|---|---|
| L0 | Survival | Avoid collision, maintain safe state | 0.0 | 1.0 (baseline) |
| L1 | Basic Mission | Continue patrol, maintain formation | 0.0 | 2.5 |
| L2 | Local Coordination | Synchronized maneuver within cluster | 0.3 | 4.0 |
| L3 | Fleet Coordination | Cross-cluster task allocation | 0.8 | 6.0 |
| L4 | Full Integration | Real-time coordination, full sensor streaming | 0.9 | 8.0 |

Unit definition: \(\Delta V_i\) values are dimensionless mission utility scores, normalized so that maximum full-integration performance = 21.5 points. For RAVEN, each level's \(\Delta V_i\) was calibrated from its marginal contribution to mission coverage, measured over 200 simulation runs. The span from \(\Delta V_4 = 8.0\) to \(\Delta V_0 = 0\) (coverage contribution) reflects the RAVEN mission structure: L0 alone provides no operational coverage, while L4 enables real-time cross-cluster coordination that saturates the coverage function (the table assigns \(\Delta V_0 = 1.0\) as a survival-credit baseline, not a coverage score). AUTOHAULER weights L3 at 9.0 (vs. RAVEN's 6.0) because hauling throughput is dominated by fleet-level task allocation. These values are scenario-specific inputs, not universal constants.

Read this table as a design contract. Each level’s threshold \(\theta_i\) is not a measurement — it is the minimum connectivity fraction at which that capability must become reliably available. Each \(\Delta V_i\) quantifies the marginal mission value gained by achieving that level. The entire column is an architecture budget: you decide which capabilities to invest in for which connectivity thresholds.

Each level requires minimum connectivity \(\theta_i\) and contributes marginal value \(\Delta V_i\). Total capability is the sum of achieved levels: a system at L3 achieves \(1.0 + 2.5 + 4.0 + 6.0 = 13.5\) out of a maximum 21.5.

Capability level evaluation (continuous, per-node):

  1. Measure: Estimate \(C(t)\) via EWMA: \(\hat{C}(t) = \alpha\, C_{\text{obs}}(t) + (1-\alpha)\,\hat{C}(t-1)\)

    (The gain \(\alpha = 0.3\) is not a universal constant. The optimal gain is \(\alpha^* \approx 1 - e^{-\lambda_{\max} \Delta t}\), where \(\lambda_{\max}\) is the fastest connectivity transition rate worth tracking and \(\Delta t\) is the measurement interval. At \(\Delta t = 1\)s and RAVEN's observed transition rate of ~0.35 transitions/min \(\approx 0.006\)/s: \(\alpha^* \approx 0.006\) — very slow adaptation. At 0.35 transitions/s: \(\alpha^* \approx 0.30\). The value 0.3 is calibrated to a system that transitions regimes roughly once every 3 seconds; slower-changing environments should use smaller \(\alpha\).)

  2. Determine level: Find highest \(L_i\) where the smoothed estimate satisfies \(C(t) \geq \theta_i\)

  3. Check peer consensus: For L2+, verify peers report same capability; downgrade if mismatch

  4. Apply hysteresis: Maintain level unless threshold crossed for s

  5. Execute: Activate selected level’s behaviors, deactivate higher levels
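The evaluation loop above can be sketched end to end, including a gain calibration of the form \(1 - e^{-\lambda_{\max}\Delta t}\) (an assumed form consistent with the numbers quoted in step 1). Peer consensus and hysteresis are stubbed out; the helper names are illustrative:

```python
import math

# Sketch of the per-node capability evaluation loop. Thresholds come
# from the L0-L4 table; consensus/hysteresis steps are omitted.

THETA = [0.0, 0.0, 0.3, 0.8, 0.9]           # theta_i for levels L0..L4

def optimal_gain(lambda_max: float, dt: float) -> float:
    """EWMA gain matched to the fastest transition rate worth tracking."""
    return 1.0 - math.exp(-lambda_max * dt)

def ewma(prev: float, observed: float, alpha: float) -> float:
    return alpha * observed + (1.0 - alpha) * prev

def select_level(c_hat: float) -> int:
    """Highest level whose connectivity threshold is met."""
    return max(i for i, th in enumerate(THETA) if c_hat >= th)

alpha = optimal_gain(0.35, 1.0)              # ~0.30, the calibrated gain
c_hat = 0.0
for obs in (0.9, 0.85, 0.8, 0.2, 0.1):       # connectivity degrading over time
    c_hat = ewma(c_hat, obs, alpha)
    print(round(c_hat, 2), "-> L%d" % select_level(c_hat))
```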

Hardened Hierarchy: Dependency Isolation

The capability table above lists five levels. A critical implementation constraint governs the relationship between them: no level may depend on any level above it at runtime. Without this constraint, an L4 model-update service that L0 boot logic imports creates a circular failure — when the autonomic stack degrades, it may take L0 survival with it.

Definition 18 (Dependency Isolation Requirement). A capability stack satisfies the dependency isolation requirement if the runtime dependency set of each level is confined to equal or lower levels:

\[\text{dep}(L_i) \subseteq \bigcup_{j \leq i} L_j \quad \text{for all } i\]

The L0 constraint is the binding one: \(\text{dep}(L_0) \subseteq L_0\) — survival code may depend on nothing above itself.

Operationally, L0 code must:

  - run with no network stack and no dependence on upper-layer services
  - avoid dynamic memory allocation and any runtime provided by higher levels
  - boot and re-initialize on bare hardware without assistance from L1+
  - carry zero upward symbol references, verifiable statically at link time

Proposition 8 (Hardened Hierarchy Fail-Down). If a system satisfies Definition 18, then failure of level \(L_i\) cannot cause failure of any level \(L_j\) with \(j < i\):

\[\text{fail}(L_i) \not\Rightarrow \text{fail}(L_j), \quad j < i\]

A crashed L3 analytics service cannot take down CONVOY ’s L0 survival heartbeat — as long as every lower level has zero upward symbol dependencies, verified statically at link time.

Proof: By Definition 18, every component at level \(j < i\) satisfies \(\text{dep}(L_j) \subseteq \bigcup_{k \leq j} L_k\). Since \(j < i\), we have \(L_i \notin \bigcup_{k \leq j} L_k\), so \(L_i \notin \text{dep}(L_j)\). Failure of \(L_i\) therefore creates no failed dependency in \(L_j\). By induction over all \(j < i\), the entire stack below \(L_i\) remains operational. \(\square\)

Corollary: The CONVOY failure case in the constraint sequence article — L3 fleet analytics built before L0 survival was validated — violates Definition 18 . The analytics service imported the health-monitoring framework (L1+), which in turn depended on dynamic memory allocation not present in the bare-metal boot environment. When the stack degraded, L0 could not re-initialize because its required allocator was in a crashed L1 process. Definition 18 , verified statically before deployment, would have caught this at link time.

Physical translation. L0 must boot and run on a bare microcontroller with no network, no allocator, and no runtime from upper layers. If a crashed L4 analytics service can prevent L0 from restarting, your survival layer is not actually a survival layer. The dependency isolation requirement converts a design principle into a statically checkable property: zero upward symbol references at link time.
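The link-time check is mechanical. A sketch over a hypothetical symbol-to-level map (a stand-in for a real linker symbol table); the component names are invented:

```python
# Statically checkable dependency isolation (Definition 18): every
# component's dependencies must sit at its own level or below.

LEVEL = {"l0_boot": 0, "l0_heartbeat": 0, "l1_health": 1,
         "l3_analytics": 3, "l4_model_update": 4}

DEPS = {                                   # runtime symbol dependencies
    "l0_boot":         [],
    "l0_heartbeat":    ["l0_boot"],
    "l1_health":       ["l0_heartbeat"],
    "l3_analytics":    ["l1_health"],
    "l4_model_update": ["l3_analytics"],
}

def upward_violations(deps, level):
    """All (component, dependency) pairs that reference a higher level."""
    return [(c, d) for c, ds in deps.items() for d in ds
            if level[d] > level[c]]

assert upward_violations(DEPS, LEVEL) == []   # stack is fail-down safe

# The CONVOY failure shape: L0 boot importing an L1 framework.
BAD = dict(DEPS, l0_boot=["l1_health"])
print(upward_violations(BAD, LEVEL))          # [('l0_boot', 'l1_health')]
```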

Multi-failure note: When power degradation (L0), connectivity partition, and clock drift coincide, Proposition 8 applies in sequence: the hardware veto (Definition 108 in The Constraint Sequence and the Handover Boundary) fires first, freezing all actuator commands; the MAPE-K loop then operates in read-only diagnostic mode until power recovers above the L1 threshold.

Watch out for: the proposition requires that dependency isolation is verified statically at link time; if any L0 component uses dynamic dispatch — function pointers resolved at runtime, vtables, or plugin interfaces — the static symbol-dependency graph cannot confirm the absence of upward references, and the fail-down guarantee is unverifiable regardless of whether the code actually contains an upward dependency.

Expected Capability Under Contested Connectivity

The expected capability under the stationary connectivity distribution takes the form:

\[\mathbb{E}[V] = \sum_{i} \Delta V_i \cdot \Pr[C \geq \theta_i]\]

Expected capability is the convolution of connectivity distribution \(\pi\) (environment-determined) with capability thresholds \(\theta_i\) (design-determined). The architect controls \(\theta_i\) but not \(\pi\).

For CONVOY's stationary distribution \(\pi = (0.32, 0.25, 0.22, 0.21)\) (illustrative value), we compute expected capability by mapping states to connectivity thresholds. Full connectivity (F) exceeds all thresholds; Degraded (D) exceeds \(\theta_2 = 0.3\) (illustrative value) but not \(\theta_3 = 0.8\) (illustrative value); Intermittent (I) and Denied (N) exceed only \(\theta_0 = 0\):

\[\mathbb{E}[V] = 1.0 + 2.5 + 4.0(0.57) + 6.0(0.32) + 8.0(0.32) \approx 10.3 \;\text{of}\; 21.5 \approx 48\%\]

What 48% means operationally. CONVOY achieves L4 (full integration) only 32% (illustrative value) of the time — when in the Connected regime. L3 is available 57% (illustrative value) of the time (Connected + Degraded). L0 and L1 are always available. The capability gap is not a defect; it is the mathematical consequence of a contested environment. The architect’s job is to maximize mission effectiveness within that gap — not to eliminate it.
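The 48% figure is reproducible by convolving the stationary distribution with the capability table. The per-regime connectivity levels assigned to D, I, and N below are assumptions chosen to match the threshold mapping described above:

```python
# CONVOY expected capability: sum of Delta V_i weighted by the
# probability that connectivity meets threshold theta_i.

PI = {"F": 0.32, "D": 0.25, "I": 0.22, "N": 0.21}        # stationary distribution
CONNECTIVITY = {"F": 1.0, "D": 0.5, "I": 0.1, "N": 0.0}  # assumed regime levels
THETA   = [0.0, 0.0, 0.3, 0.8, 0.9]                      # L0..L4 thresholds
DELTA_V = [1.0, 2.5, 4.0, 6.0, 8.0]                      # marginal values

def expected_capability() -> float:
    total = 0.0
    for theta, dv in zip(THETA, DELTA_V):
        # probability the connectivity regime meets this level's threshold
        p_meet = sum(p for s, p in PI.items() if CONNECTIVITY[s] >= theta)
        total += dv * p_meet
    return total

ev = expected_capability()
print(round(ev, 2), f"{ev / 21.5:.0%}")   # 10.26 48%
```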

Commercial System Capability Hierarchies (domain-specific instantiation of framework):

| System | Level | Capability | \(\theta_i\) | \(\Delta V_i\) |
|---|---|---|---|---|
| AUTOHAULER | L0-L1 | Collision avoidance, route following | 0.0 | 2.8 |
| | L2 | Segment coordination | 0.3 | 3.2 |
| | L3 | Pit optimization | 0.5 | 4.5 |
| | L4 | Fleet optimization | 0.8 | 5.5 |
| GRIDEDGE | L0-L1 | Local protection, feeder isolation | 0.0 | 3.2 |
| | L2 | Adjacent coordination | 0.25 | 3.8 |
| | L3 | Substation optimization | 0.5 | 4.2 |
| | L4 | System coordination | 0.85 | 5.0 |

Expected capability (applying the framework):

Critical insight: L0-L1 capabilities require \(\theta = 0\) - safety functions operate at zero connectivity because fog-layer controllers have complete local authority.

Capability variance: \(\sigma \approx 6.2\) for CONVOY (\(\pm 30\%\) swings) drives the graceful degradation requirement.

Threshold Optimization Problem

The \(\theta_i\) thresholds are design variables, not fixed constants. The optimization problem balances capability against implementation cost:

\[\max_{\theta} \; \sum_i \Delta V_i \left(1 - F_C(\theta_i)\right) - \sum_i c_i(\theta_i)\]

where \(F_C\) is the connectivity CDF and \(c_i(\theta_i)\) captures the cost of achieving capability level \(i\) at connectivity threshold \(\theta_i\). Lower thresholds require more aggressive error correction protocols, weaker consistency guarantees, and more complex failure handling logic.

Physical translation. Lowering \(\theta_i\) means the system attempts to provide that capability at lower connectivity — which requires more aggressive protocols, weaker consistency, and more complex failure handling. Each point of threshold reduction has an implementation cost. The optimization finds the thresholds where marginal capability gain (from the connectivity CDF) outweighs implementation cost. Place thresholds in the CDF’s flat regions — where small threshold changes produce small probability changes — to get the most mission value per unit of implementation effort.

The cost function \(c_i\) is typically convex and increasing as \(\theta_i \to 0\), reflecting the exponentially increasing difficulty of maintaining coordination at lower connectivity levels.

Optimal threshold placement depends on the connectivity CDF derivative. Place thresholds where \(dF_C/d\theta\) is small — in the distribution tails where small threshold changes cause small probability changes.

Anti-fragility through threshold learning: A system that learns to lower its thresholds under degraded connectivity becomes more capable under stress. Adapting \(\theta_i\) based on operational experience yields measurable capability gains — a manifestation of positive anti-fragility expressed as a performance improvement ratio after stress exposure (formally defined as Anti-Fragility in Definition 79 , Anti-Fragile Decision-Making at the Edge).

Cognitive Map — Section 18. Partition is the primary mode, not an edge case \(\to\) capability hierarchy L0–L4 defines what’s achievable at each connectivity threshold \(\to\) dependency isolation requirement ( Definition 18 ) guarantees lower levels survive upper-level failures — statically verifiable \(\to\) expected capability formula convolves environment (connectivity distribution) with design choices (thresholds) \(\to\) CONVOY achieves 48% of theoretical max: the gap is the environment’s cost, not a design failure \(\to\) threshold optimization balances capability gain against implementation cost \(\to\) anti-fragile systems lower thresholds through operational learning, improving under stress.


The Edge Constraint Sequence

Which architectural problems do you solve first? The answer is not a matter of taste — dependency structure forces an order. A team that builds fleet-wide coordination before individual node survival has inverted the sequence. When the stack degrades, nothing survives.

Proposed Sequence for Edge Architecture

This sequence follows from the dependency structure of edge capabilities: each node is a prerequisite for the next, and the dashed annotations give the diagnostic question that verifies each level is satisfied before proceeding.

    
    graph TD
    A["1\. Survival Under Partition"] --> B["2\. Local Cluster Coherence"]
    B --> C["3\. Fleet-Wide Consistency"]
    C --> D["4\. Optimized Connected Operation"]

    A -.- A1["Can each node operate independently?"]
    B -.- B1["Can nearby nodes coordinate?"]
    C -.- C1["Can partitioned groups reconcile?"]
    D -.- D1["Can we exploit full connectivity?"]

    style A fill:#e8f5e9,stroke:#388e3c,stroke-width:3px
    style B fill:#fff3e0,stroke:#f57c00
    style C fill:#e3f2fd,stroke:#1976d2
    style D fill:#fce4ec,stroke:#c2185b
    style A1 fill:#fff,stroke:#ccc,stroke-dasharray: 5 5
    style B1 fill:#fff,stroke:#ccc,stroke-dasharray: 5 5
    style C1 fill:#fff,stroke:#ccc,stroke-dasharray: 5 5
    style D1 fill:#fff,stroke:#ccc,stroke-dasharray: 5 5

Priority 1: Survival Under Partition Every node must be capable of safe, autonomous operation when completely disconnected. This is the foundation on which all other capabilities build. If a RAVEN drone cannot avoid collision, maintain safe altitude, and preserve itself when alone, no amount of coordination capability matters.

Priority 2: Local Cluster Coherence When nodes can communicate with neighbors but not the broader fleet, they should be able to coordinate local actions. CONVOY vehicles in line-of-sight should synchronize movement even if the convoy commander is unreachable.

Priority 3: Fleet-Wide Eventual Consistency When partitions heal, the system must reconcile divergent state. Actions taken by isolated clusters must be merged into a coherent fleet state. This is technically challenging but not survival-critical - the fleet operated safely while partitioned.

Priority 4: Optimized Connected Operation Only after the foundation is solid should we optimize for the connected case. Centralized algorithms, global optimization, real-time streaming - these enhance capability but depend on connectivity that may not exist.

Mathematical Justification

Define the dependency graph \(G = (V, E)\), where \(V\) is the set of capability milestones (survival under partition, local cluster coherence, fleet-wide consistency, optimized connected operation) and a directed edge \((A, B) \in E\) means A is a prerequisite for B.

The constraint sequence is a topological sort of \(G\), weighted by priority:

\[\text{priority}(v) = \Pr[\text{constraint } v \text{ binds}] \times \text{cost}(\text{violating } v)\]

For survival under partition, the violation cost is platform loss, which we treat as unbounded:

\[\text{priority}(\text{survival}) = \Pr[\text{partition}] \times \infty = \infty\]

Survival is infinitely prioritized — solve it first regardless of frequency.

Physical translation. Infinite priority is not hyperbole — it is the mathematical statement that no finite capability benefit justifies deferring survival. A 10% improvement in fleet coordination efficiency does not compensate for a 0.001% probability of platform loss. Solve survival first, unconditionally.

For optimized connected operation, the binding probability is low and the violation cost is finite, yielding a finite, modest priority score that confirms this problem should be addressed last, after the higher priorities are solved.

Constraint Sequence Validation

Constraint sequence validation checks:

  1. Survival independence: Disable all network interfaces on a single node; verify safe operation for \(10\times\) typical partition duration
  2. Cluster degradation: Partition fleet into isolated clusters; each cluster must maintain coordinated operation at L2
  3. Reconvergence correctness: Restore connectivity; divergent state must merge with no lost updates or conflicts
  4. Connected enhancement: With full connectivity, centralized optimizations activate and exceed cluster-only performance

The Limits of Abstraction

Every model in this framework is an approximation. The Markov connectivity model, the threshold optimization, the queueing analysis — all are useful precisely because they make assumptions. Recognizing when those assumptions break is not an afterthought; it is part of the architect’s core discipline.

Model Validation Methodology

Before trusting model predictions, we must continuously validate that model assumptions hold. The Model Health Score \(H_M \in [0,1]\) aggregates four validation checks (Markovianity, stationarity, independence, and coverage), each described below.

Physical translation: A model health score below 0.5 means the anomaly detector’s outputs should be treated as advisory only. A RAVEN drone acting on a 0.3-score detector is flying on a compass with a known deviation it has not corrected for — useful for rough orientation, not for precision maneuvering.

Physical translation. Each component tests a different model assumption. The Markovianity test checks whether history still doesn't matter. The stationarity test checks whether the environment is still the same. The independence test checks whether failures are still uncorrelated. The coverage test checks whether rare states are still being observed. When any component drops below threshold, the model's predictions in that dimension are unreliable — fall back to conservative modes rather than trusting stale estimates.

Markovianity test: The future should depend only on the present state. Compute the lag-1 autocorrelation of the transition indicators.

If the lag-1 autocorrelation exceeds its threshold, history matters - consider Hidden Markov or semi-Markov models.

Stationarity test: Transition rates should be stable over time. Apply a Kolmogorov-Smirnov test between early and late observation windows.

If the KS statistic exceeds its threshold, rates are drifting - trigger model retraining or adversarial investigation.

Independence test: Different nodes’ transitions should be independent (or the correlation must be modeled explicitly). Compute the pairwise correlation of transition times.

If the pairwise correlation exceeds its threshold, transitions are correlated - likely coordinated jamming affecting multiple nodes.

Coverage test: Observations should span the state space. Track the time since the last visit to each state.

If the time since last visit exceeds its threshold, rare states are under-observed - confidence intervals on those transition rates are unreliable.
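The four tests above can be sketched as pure-Python checks; the thresholds (0.3, 0.5) and the min-based aggregation into \(H_M\) are illustrative assumptions, since the article does not fix exact values:

```python
import math

def lag1_autocorr(xs):
    """Lag-1 autocorrelation of transition indicators (Markovianity test)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs)
    if var == 0:
        return 0.0
    return sum((xs[i] - mu) * (xs[i + 1] - mu) for i in range(n - 1)) / var

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic (stationarity test):
    max gap between empirical CDFs of early vs. late windows."""
    a, b = sorted(a), sorted(b)
    def cdf(s, x):
        return sum(1 for v in s if v <= x) / len(s)
    return max(abs(cdf(a, x) - cdf(b, x)) for x in sorted(set(a) | set(b)))

def pearson(xs, ys):
    """Pairwise correlation of transition times (independence test)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def coverage(visited_states, all_states):
    """Fraction of the state space actually observed (coverage test)."""
    return len(set(visited_states)) / len(all_states)

def model_health(h_markov, h_stat, h_indep, h_cover):
    # Conjunctive aggregation: the weakest assumption bounds trust in the
    # model (an assumption - the article leaves the aggregator open).
    return min(h_markov, h_stat, h_indep, h_cover)

# Example: convert each raw statistic into a [0,1] health component by
# comparing against an (illustrative) threshold, then aggregate.
trans = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1]
h = model_health(
    1.0 - min(1.0, abs(lag1_autocorr(trans)) / 0.3),
    1.0 - min(1.0, ks_statistic(trans[:5], trans[5:]) / 0.5),
    1.0 - min(1.0, abs(pearson([1, 2, 3, 4], [2, 4, 6, 8])) / 0.3),
    coverage(["C2", "C1"], ["C2", "C1", "C0"]),
)
```

The min aggregator makes validity a conjunction: one violated assumption (here the perfectly correlated transition times) drags \(H_M\) down regardless of the other components.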

Operational guidance when \(H_M < 0.5\): treat the model’s outputs as advisory only, fall back to conservative modes, and trigger reestimation.

When Models Fail

Adversarial adaptation: Our Markov connectivity model assumes transition rates are stationary. An adaptive adversary changes rates in response to our behavior. The model becomes a game, not a stochastic process.

Novel environments: The optimization for OUTPOST power allocation assumed known channel characteristics. Deploy OUTPOST in a new RF environment with different propagation, and the optimized allocation may be catastrophically wrong.

Emergent interactions: The queueing model for RAVEN coordination analyzed message propagation in isolation. Real systems have interactions: high message load increases power consumption, which triggers power-saving modes, which reduce message transmission rates, which increases coordination latency beyond model predictions.

Black swan events: Capability hierarchies assign finite costs to failures. Some failures - complete fleet loss, mission compromise, cascading system destruction - have costs that no model adequately captures.

Concrete failure examples from deployed systems:

  1. CONVOY model failure: Transition rates estimated during summer operations proved wrong in winter. Ice-induced link failures occurred \(4\times\) more frequently than modeled, and the healing time constants doubled. The fleet operated in L1 (basic survival) for 6 hours instead of the designed 45 minutes before parameters could be retuned.

  2. RAVEN coordination collapse: A firmware bug caused gossip messages to include stale timestamps. The staleness-confidence model interpreted all peer data as unreliable, causing each drone to operate in isolation. Fleet coherence dropped to zero despite 80% actual connectivity.

  3. OUTPOST cascade: Solar panel degradation followed an exponential (not linear) curve after year 2. The power-aware scheduling model underestimated nighttime power deficit by 40%, causing sensor brownouts that corrupted the anomaly detection baseline, which then flagged normal readings as anomalies, which triggered unnecessary alerts, which depleted batteries further.

These failures were not edge cases - they were model boundary violations that operational testing should have caught.

The Engineering Judgment Protocol

When models reach their limits, the edge architect falls back to first principles:

  1. What is the worst case? Not the expected case, not the likely case - the worst case. What happens if every assumption fails simultaneously?

  2. Is the worst case survivable? If not, redesign until it is. No optimization justifies catastrophic risk.

  3. What would falsify my model? Identify the observations that would indicate model assumptions have been violated. Build monitoring for those observations.

  4. What is the recovery path? When the model fails - not if - how does the system recover? Fallback behaviors, degradation paths, human intervention triggers.

  5. What did we learn? Every model failure is data for the next model. The anti-fragile system improves its models from operational stress.

Cognitive Map — Section 19. Dependency structure forces a build order: survival first, cluster coherence second, fleet consistency third, connected optimization last \(\to\) priority formula confirms this mathematically (survival has infinite priority) \(\to\) model health score monitors four model assumptions in parallel \(\to\) adversarial adaptation, novel environments, emergent interactions, and black swans are the four failure modes that break each assumption \(\to\) engineering judgment protocol provides the fallback when models reach their limits.


Comparative Analysis: Edge vs. State-of-the-Art Frameworks

How do the principles developed here compare with established frameworks in edge and distributed systems?

| Framework | Partition Assumption | Decision Model | Coordination | Adversarial Handling |
|---|---|---|---|---|
| Cloud-Native (K8s) | Transient, recoverable | Central orchestrator | Service mesh | None (trusted network) |
| DTN (RFC 4838) [3] | Expected, store-forward | Per-hop decisions | Opportunistic contacts | Integrity checks |
| MANET Protocols | Dynamic topology | Distributed routing | Local broadcast | Limited (DoS resilience) |
| This Framework | Default state | Hierarchical autonomy | Capability-adaptive | Byzantine + adversarial [12] |

How to use this table. The comparison shows where each framework’s design assumptions diverge. Cloud-Native (Kubernetes) excels when partition is transient and the network is trusted — the exact opposite of contested edge conditions. DTN handles store-and-forward but has no concept of capability levels or adversarial handling. MANET handles dynamic topology but lacks the authority delegation model. This framework occupies the intersection of contested partition, Byzantine tolerance, and graded capability — a quadrant the other three don’t address.

Key differentiators:

  1. Capability hierarchy: Unlike DTN’s flat store-forward or MANET’s routing-focused approach, we define explicit capability levels tied to connectivity regimes. This enables graceful degradation with quantified trade-offs.

  2. Adversarial modeling: Most edge frameworks assume benign failures. Our Markov model explicitly incorporates adversarial state transitions and adaptation detection - essential for contested environments.

  3. Decision authority distribution: Cloud-native assumes central authority with delegation on failure. MANET assumes peer equality. Our hierarchical tier model provides structured authority with bounded autonomy at each level.

  4. Reconvergence focus: DTN optimizes for eventual delivery; MANET optimizes for route discovery. We optimize for coherent state merge after partition - ensuring that actions taken in isolation produce consistent combined outcomes.


Self-Diagnosis: Is Your System Truly Edge?

Before applying edge architecture patterns, run this five-test diagnostic. Applying partition-first design to a system that doesn’t need it adds coordination overhead, state management complexity, and Byzantine fault tolerance code — with zero benefit. The diagnostic protects against over-engineering as much as under-engineering.

| Test | Edge System (PASS) | Distributed Cloud (FAIL) |
|---|---|---|
| Partition frequency | >10% of operating time disconnected | <1% disconnection, always eventually reachable |
| Decision authority | Must make irrevocable decisions locally | Can always defer to central authority |
| Adversarial environment | Active attempts to disrupt/deceive | Failures are accidental, not malicious |
| Human escalation | Operators may be unreachable for hours/days | Operators always reachable within minutes |
| State reconciliation | Complex merge of divergent actions | Simple last-writer-wins or conflict-free |

Decision Rule: If your system passes \(\geq 3\) of these tests, edge architecture patterns apply. If you pass \(\leq 2\), standard distributed systems patterns may suffice.
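The decision rule reduces to counting booleans; a sketch (parameter names are shorthand for the table rows, and the cutoffs mirror the PASS column):

```python
# Five-test self-diagnosis from the table above: >=3 passes means edge
# architecture patterns apply; <=2 means standard distributed patterns
# may suffice.

def edge_diagnosis(partition_frac, local_irrevocable, adversarial,
                   operator_unreachable_hours, complex_merge):
    tests = {
        "partition_frequency": partition_frac > 0.10,
        "decision_authority": local_irrevocable,
        "adversarial_env": adversarial,
        "human_escalation": operator_unreachable_hours >= 1,
        "state_reconciliation": complex_merge,
    }
    passed = sum(tests.values())
    return passed >= 3, passed, tests

# Tactical drone swarm under jamming: passes all five tests.
apply_edge, n, _ = edge_diagnosis(0.35, True, True, 12, True)
# Retail IoT on reliable cellular: passes none of them.
apply_cloud, m, _ = edge_diagnosis(0.005, False, False, 0, False)
```

Returning the per-test breakdown alongside the verdict lets a review board see which assumptions drove the classification, not just the yes/no answer.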

The distinction matters because edge patterns carry costs: increased local storage and compute for autonomous operation, complex reconciliation logic for partition recovery, Byzantine fault tolerance for adversarial resilience, and reduced optimization efficiency from distributed coordination.

These costs are justified only when the operating environment demands them. A retail IoT deployment with reliable cellular connectivity does not need Byzantine fault tolerance. A tactical drone swarm operating under jamming does.


Model Scope and Failure Envelope

Each mechanism has bounded validity. When assumptions fail, so does the mechanism.

Markov Connectivity Model

Validity Domain: The Markov model applies to a deployment \(S\) only when four conditions hold simultaneously: stationary transition rates, memoryless (uncorrelated) transitions, distinguishable states, and sufficient observations. Violation of any one makes the model’s predictions unreliable.

Physical translation. The four conditions must all hold simultaneously. A contested environment where rates are stationary on average (A1 holds) but where the adversary causes correlated failures (A3 fails) still invalidates the model. Validity is a conjunction, not a majority vote: one violated assumption is enough to make predictions unreliable.


Failure Envelope: The table below maps each assumption to the specific failure mode that results when it is violated, and provides the detection signal and recommended mitigation.

| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Non-stationary adversary | Transition rates drift; predictions degrade | CUSUM on rate estimates | Windowed estimation; pessimistic bounds |
| Correlated jamming | State dependencies violate memorylessness | Lagged correlation > 0.3 | Semi-Markov extension |
| State indistinguishability | Misclassification; wrong policy | Confusion matrix analysis | Wider threshold margins |
| Sparse observations | High variance estimates | Confidence intervals | Bayesian priors; conservative defaults |
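The first detection signal in the table, CUSUM on rate estimates, can be sketched as a standard one-sided cumulative-sum detector; the reference rate, slack \(k\), and alarm threshold \(h\) are illustrative tuning parameters:

```python
# One-sided CUSUM: accumulate deviations of observed transition-rate
# estimates above a reference rate; alarm when the sum crosses h.

def cusum_alarm(rates, reference, k=0.05, h=0.5):
    """Return the index of the first alarm, or None if rates stay stable."""
    s = 0.0
    for i, r in enumerate(rates):
        s = max(0.0, s + (r - reference) - k)  # slack k absorbs noise
        if s > h:
            return i
    return None

stable = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10]
drifting = stable + [0.25, 0.30, 0.35, 0.40]   # adversary adapts

assert cusum_alarm(stable, reference=0.10) is None
idx = cusum_alarm(drifting, reference=0.10)    # alarm index 9, the
                                               # fourth drifting sample
```

In practice the reference rate comes from the windowed estimator, and \(k\) and \(h\) are tuned against the deployment’s false-alarm budget.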

Counter-scenario: A sophisticated adversary observes CONVOY movement patterns and varies jamming to maximize disruption. Transition rates become time-dependent and correlated with CONVOY actions. The Markov model’s predictions diverge from reality. Detection: correlation between CONVOY actions and subsequent transitions exceeds 0.4. Response: switch to pessimistic bounds, increase randomization.

Capability Hierarchy

Validity Domain: The graded-degradation capability hierarchy applies to deployment \(S\) when three structural conditions hold: capabilities are separable, degradation is monotonic, and resources are divisible. If any is violated, the hierarchy collapses and graceful degradation does not occur.


Failure Envelope: The table below maps each assumption to the failure mode and recommended mitigation.

| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Capability coupling | Cannot shed one without losing another | Dependency analysis shows cycles | Redesign interfaces; accept coupled shedding |
| Non-monotonic degradation | Intermediate states worse than extremes | Bimodal performance distribution | Skip intermediate levels |
| Indivisible resources | All-or-nothing transitions | Resource granularity > capability granularity | Pre-allocate at boundaries |

Counter-scenario: A system where coordination (\(\mathcal{L}_2\)) is required to achieve basic mission capability (\(\mathcal{L}_1\)) because sensors are distributed across nodes. Losing connectivity loses coordination, which loses mission capability entirely. The hierarchy collapses to binary: full operation or survival only. The graded degradation model does not apply.

Inversion Threshold (\(\tau^* \approx 0.15\))

Validity Domain: The inversion threshold result holds for deployment \(S\) when three conditions are satisfied: the mission continues under partition, decision latencies are bounded, and no external abort trigger exists. The most commonly violated is \(C_1\): systems with external abort triggers never enter the long-partition operating mode the threshold assumes.


Uncertainty Bounds:

The threshold \(\tau^* = 0.15\) is derived from tactical deployments with specific characteristics. For different contexts:

| Context | \(\tau^*\) Range | Basis |
|---|---|---|
| Tactical military | 0.12-0.18 | Decision latency constraints |
| Industrial IoT | 0.08-0.20 | Safety-criticality variation |
| Consumer applications | 0.05-0.15 | User tolerance for degradation |

Counter-scenario: Urban IoT deployment with fiber backhaul where partition probability is 0.005 and mean partition duration is 30 seconds. Cloud-native patterns perform well. Applying partition-first design adds coordination overhead (estimated 15-25% latency increase) and state management complexity without commensurate benefit. The inversion is not justified.

Edge-ness Score Limitations

Validity Domain: The Edge-ness Score \(E\) assumes calibrated weights, approximately independent metrics, and stable classification thresholds.

Known Limitations: The three limitations below are structural — they cannot be eliminated by parameter tuning, only mitigated by the guidance given in the right-hand column.

| Limitation | Impact | Guidance |
|---|---|---|
| Weight selection is domain-dependent | Score may misrank systems across domains | Calibrate weights to domain-specific deployments |
| Metric independence violated | Interaction effects ignored | Use as heuristic, not deterministic classifier |
| Threshold sensitivity | Systems near boundaries may be misclassified | Add margin (\(\pm 0.05\)) to boundary decisions |

Authority Delegation Model

Validity Domain: The authority delegation model is valid for deployment \(S\) when three conditions hold: operational scenarios are anticipated at design time, delegated authority suffices for required actions, and cluster leads are trustworthy. \(D_1\) (scenario anticipatability) is the hardest to guarantee in practice, since novel operational situations cannot be fully predicted at design time.


Failure Envelope: The table below lists each condition’s failure mode and mitigation.

| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Unanticipated scenario | System blocks or violates authority | Action blocked by policy | Conservative fallback; defer decision |
| Insufficient delegation | Required action exceeds authority | Authority check fails | Staged delegation; emergency override |
| Byzantine cluster lead | Delegation misused | Anomalous decision pattern | Multi-party delegation; audit trail |

Counter-scenario: Novel threat type emerges during partition - not covered by delegation rules. The cluster lead faces a dilemma: take unauthorized action or accept mission degradation. Neither outcome is covered by the model. This is a fundamental limitation: delegation frameworks cannot anticipate all scenarios. Residual risk must be accepted or addressed through broader delegation bounds (with associated risk).

Summary: Claim-Assumption-Failure Table

How to use this table. For each claim you rely on in your architecture, find its row. Check whether your deployment matches the “Valid When” column. If it matches “Fails When” instead, the claim does not apply and the framework’s recommendations in that area should not be followed without modification. This is not a limitation — it is the framework being epistemically honest about its own scope.

The table below consolidates the five major claims of this article, the key assumptions each depends on, the operational contexts where the claim holds, and the conditions that cause it to break down.

| Claim | Key Assumptions | Valid When | Fails When |
|---|---|---|---|
| Connectivity follows CTMC | Stationary rates, memoryless transitions | Adversary behavior stable; natural connectivity | Adversary adapts; correlated jamming |
| Inversion threshold \(\tau^*\) | Mission continues; decisions bounded | Tactical, industrial contexts | Consumer IoT; always-connected scenarios |
| Capability hierarchy degrades gracefully | Capabilities separable, monotonic | Well-architected systems | Tightly coupled systems; binary capabilities |
| Authority delegation enables autonomy | Scenarios anticipated, cluster leads trusted | Known operational envelope | Novel scenarios; Byzantine compromise |
| Edge-ness Score classifies architecture need | Metrics independent, weights calibrated | Domain where weights derived | Cross-domain comparison; metric correlation |

Cognitive Map — Section 20. Framework comparison shows contested-partition + adversarial + graded-capability is a distinct quadrant \(\to\) self-diagnosis test determines whether edge patterns apply to your system \(\to\) each model has a validity domain: all conditions must hold simultaneously \(\to\) the summary table maps every major claim to its failure conditions \(\to\) use the table to verify your deployment before trusting the framework’s recommendations.


Irreducible Trade-offs

No design eliminates these tensions. The architect selects a point on each Pareto front.

How to use this section. For each trade-off: (1) identify which objective your mission prioritizes, (2) find the table row matching your operating regime, (3) accept that you cannot do better than the Pareto frontier. These are physical and information-theoretic constraints, not engineering limitations.

Trade-off 1: Autonomy vs. Coordination Efficiency

| Design Point | Autonomy | Coordination | Complexity | Optimal When |
|---|---|---|---|---|
| Cloud-native | Low | High | Low | \(P(C=0) < 0.05\) |
| Hybrid tier | Medium | Medium | Medium | \(P(C=0) \in [0.05, 0.20]\) |
| Full partition-first | High | Low | High | \(P(C=0) > 0.20\) |


Physical translation. The table gives the crossover point: when \(P(C=0) > 0.20\), the coordination efficiency you sacrifice for autonomy is smaller than the coordination you would lose to connectivity failures anyway. CONVOY’s partition probability sits just above the crossover — partition-first is the correct choice, not a preference.
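The crossovers in the table can be encoded directly; a sketch (boundary handling at exactly 0.05 and 0.20 follows the table’s interval notation):

```python
# Select the design point on the autonomy/coordination Pareto front
# from the partition probability P(C=0), per the crossovers in the table.

def design_point(p_disconnected):
    if p_disconnected < 0.05:
        return "cloud-native"
    if p_disconnected <= 0.20:
        return "hybrid-tier"
    return "partition-first"

assert design_point(0.01) == "cloud-native"
assert design_point(0.12) == "hybrid-tier"
assert design_point(0.22) == "partition-first"  # CONVOY-like regime
```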

Trade-off 2: Responsiveness vs. Consistency (CAP)

Under partition: choose availability over consistency. CRDTs provide eventual consistency without coordination. As \(C(t) \rightarrow 0\), consistency approaches zero while responsiveness remains high.
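As a concrete instance of coordination-free eventual consistency, a state-based grow-only counter (G-Counter) merges divergent replicas by elementwise maximum; this is a minimal sketch of the CRDT idea, not the specific data type a fleet would deploy:

```python
# State-based G-Counter CRDT: each node increments its own slot;
# merge takes the elementwise max, so concurrent updates during a
# partition combine without coordination and without lost increments.

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}               # node_id -> local count

    def increment(self, n=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def merge(self, other):
        for node, c in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), c)

    def value(self):
        return sum(self.counts.values())

# Two drones diverge during a partition, then reconverge.
a, b = GCounter("raven-07"), GCounter("raven-21")
a.increment(3)                          # observed 3 contacts in isolation
b.increment(5)                          # observed 5 contacts in isolation
a.merge(b); b.merge(a)                  # merge is commutative + idempotent
assert a.value() == b.value() == 8      # no updates lost, no conflict
```

Commutativity, associativity, and idempotence of the merge are exactly what make reconvergence order-independent: it does not matter who merges with whom first.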

Trade-off 3: Capability Level vs. Resource Consumption

Multi-objective formulation: Raising capability level increases mission value but simultaneously increases compute, power, and bandwidth costs; the table below captures all four dimensions so no single objective is invisibly sacrificed.

| Capability | Compute (%) | Power (W) | Bandwidth (Kbps) | Mission Value |
|---|---|---|---|---|
| \(\mathcal{L}_0\) (Survival) | 5 | 2 | 0 | 0.1 |
| \(\mathcal{L}_1\) (Basic) | 20 | 8 | 1 | 0.4 |
| \(\mathcal{L}_2\) (Coordination) | 40 | 15 | 10 | 0.6 |
| \(\mathcal{L}_3\) (Fleet) | 60 | 25 | 50 | 0.8 |
| \(\mathcal{L}_4\) (Full) | 80 | 40 | 200 | 1.0 |

Resource Shadow Prices

The shadow price \(\zeta_i\) quantifies the marginal value of relaxing constraint \(g_i\) (the symbol is \(\zeta_i\) rather than the conventional \(\lambda_i\) to avoid collision with the Weibull scale \(\lambda_i\)).

| Resource | RAVEN \(\zeta\) (c.u.) | CONVOY \(\zeta\) (c.u.) | GRIDEDGE \(\zeta\) (c.u.) | Interpretation |
|---|---|---|---|---|
| Bandwidth | 3.20/Mbps-hr | 2.40/Mbps-hr | 0.30/Mbps-hr | Sync capacity value |
| Compute | 0.08/GFLOP | 0.12/GFLOP | 0.05/GFLOP | Local decision value |
| Battery/Power | 12.00/kWh | 4.50/kWh | N/A | Extended operation |
| Latency | 0.50/ms | 0.30/ms | 25.00/ms | Response speed |

(Shadow prices in normalized cost units (c.u.) — illustrative relative values; ratios between rows convey resource scarcity ordering. GRIDEDGE compute (0.05 c.u./GFLOP) is the smallest unit; all others express how many times more valuable that resource is per unit. Calibrate to platform-specific costs.)

Investment implication: A high shadow price indicates a binding constraint where investment yields the highest returns. RAVEN’s high bandwidth shadow price justifies compression and priority queuing investment. GRIDEDGE’s extreme latency shadow price justifies sub-cycle response hardware.
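Ranking resources by shadow price yields the investment order directly; a sketch using the illustrative c.u. figures from the RAVEN column (cross-resource comparison assumes the normalized units are commensurable, per the table note):

```python
# Rank binding constraints by shadow price (normalized cost units).
# Higher price = more binding = invest there first. The values are the
# illustrative figures from the table, not calibrated measurements.

raven_prices = {
    "bandwidth": 3.20,   # c.u. per Mbps-hr
    "compute": 0.08,     # c.u. per GFLOP
    "battery": 12.00,    # c.u. per kWh
    "latency": 0.50,     # c.u. per ms
}

investment_order = sorted(raven_prices, key=raven_prices.get, reverse=True)
print(investment_order)  # battery and bandwidth dominate for RAVEN
```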

Trade-off 4: Tier Depth vs. Coordination Overhead

| Tiers | Granularity | Overhead (msgs/decision) | Latency (hops) |
|---|---|---|---|
| 2 | Low | \(O(n)\) | 1 |
| 3 | Medium | \(O(n \log n)\) | 2 |
| 4 | High | \(O(n \log^2 n)\) | 3 |

Cost Surface: Coordination Under Connectivity Regimes

Physical translation. Coordination cost grows from \(O(n)\) in Full connectivity to \(O(n^2)\) in Intermittent — a quadratic penalty for designs that assume fleet-wide coordination under uncertain links. In Denied, coordination cost becomes infinite: you cannot coordinate with unreachable nodes. Design for Denied means eliminating fleet-wide coordination dependencies at L0 and L1.
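The cost surface can be sketched as messages per fleet-wide decision; the growth rates follow the regime discussion above, and the exact constants are illustrative:

```python
import math

# Coordination messages per fleet-wide decision, by connectivity regime.
# Full: O(n); Intermittent: O(n^2) from retries and re-discovery;
# Denied: unbounded - fleet-wide coordination is impossible.

def coordination_cost(n, regime):
    if regime == "full":
        return n
    if regime == "intermittent":
        return n * n
    if regime == "denied":
        return math.inf
    raise ValueError(regime)

n = 47  # RAVEN fleet size
assert coordination_cost(n, "full") == 47
assert coordination_cost(n, "intermittent") == 2209
assert coordination_cost(n, "denied") == math.inf
```

The infinite cost in Denied is the formal reason L0 and L1 must carry no fleet-wide coordination dependency: no budget covers an unbounded message count.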

Irreducible Trade-off Summary

| Trade-off | Objectives in Tension | Cannot Simultaneously Achieve |
|---|---|---|
| Autonomy-Coordination | Independent operation vs. fleet optimization | Both maximized under partition |
| Response-Consistency | Fast local decisions vs. fleet-wide agreement | Both under partition (CAP) |
| Capability-Resources | High capability vs. low consumption | High capability with low resources |
| Tier Depth-Overhead | Fine authority vs. low coordination cost | Both with large fleets |

These trade-offs are irreducible: no design eliminates them. The operating environment — specifically the partition probability \(p\), mission criticality, and fleet size — determines which point on each Pareto front is correct; the framework’s value is making those trade-offs explicit and quantifiable rather than implicit.

Cognitive Map — Section 21. Four irreducible trade-offs exist: autonomy vs. coordination efficiency, responsiveness vs. consistency (CAP), capability vs. resources, tier depth vs. overhead \(\to\) each has a Pareto frontier that cannot be escaped \(\to\) shadow prices quantify which resource is most binding for each scenario \(\to\) cost surface shows coordination grows quadratically in Intermittent and becomes infinite in Denied \(\to\) accept the trade-off; design for the Pareto point that matches your mission priority.


Reference: Paradigm Positioning

How does this framework relate to fog computing, mobile edge computing, and distributed intelligence? Where do existing paradigms fail, and what gaps remain?

Fog Computing: Overlaps and Divergences

Fog computing [7] (Cisco 2012, IEEE 1934-2018) places compute closer to data sources to reduce latency and bandwidth. The comparison table shows the key structural difference: fog assumes cloud remains authoritative and reachable; autonomic edge assumes cloud may be permanently unreachable.

| Dimension | Fog Computing | Autonomic Edge Architecture |
|---|---|---|
| Primary motivation | Latency reduction, bandwidth optimization | Operation under contested/denied connectivity |
| Connectivity assumption | Degraded but available; cloud reachable | Partition is normal; cloud may be unreachable indefinitely |
| Hierarchy purpose | Computation offloading, data aggregation | Delegation of authority, autonomous operation |
| Failure model | Graceful degradation to cloud | Graceful promotion to local authority |
| State management | Cache/sync with authoritative cloud | CRDT-based eventual consistency, no authoritative source |
| Decision authority | Cloud-delegated, fog-executed | Locally originated, cloud-informed |

The unique contribution here is formalizing what fog leaves implicit: who decides when cloud is unreachable, how divergent state is merged after extended partition, and whether the system improves from disconnection events.

Edge-Cloud Continuum: Shared Foundations, Different Emphases

The edge-cloud continuum (HORIZON Europe, GAIA-X, Linux Foundation Edge) treats compute resources as a spectrum from device to cloud. The table shows where the frameworks share assumptions and where they diverge structurally.

| Dimension | Edge-Cloud Continuum | Autonomic Edge Architecture |
|---|---|---|
| Resource model | Fluid placement along continuum | Fixed placement with partition tolerance |
| Orchestration | Centralized (Kubernetes, OpenStack) | Distributed with delegated authority |
| Workload mobility | Dynamic migration based on conditions | Static deployment with behavioral adaptation |
| Network model | Variable latency, generally available | Contested, intermittent, potentially adversarial |
| Optimization target | Resource efficiency, cost, latency | Mission completion, survival, coherence |
| Failure recovery | Restart elsewhere, stateless preferred | Local healing, stateful by necessity |

The continuum’s recognition that “edge is not a location but a capability profile” aligns with the Edge-ness Score. The structural gap: continuum orchestrators (KubeEdge, Azure IoT Edge, AWS Greengrass) require connectivity to a control plane and favor stateless workloads — neither assumption holds for tactical or industrial edge systems where physical deployment is fixed and state is mission-critical. This architecture layers partition protocols and CRDT coherence atop continuum stacks rather than replacing them.

Distributed Intelligence Frameworks: Complementary Perspectives

Distributed intelligence encompasses frameworks for distributed AI/ML, multi-agent systems, and swarm intelligence. These paradigms share our interest in decentralized decision-making but approach it from different angles.

Multi-Agent Systems (MAS)

MAS theory provides formal models for autonomous agents with local perception, communication, and action. Contract net protocols and BDI architectures are directly applicable to MAPE-K knowledge base design and task allocation during partition.

| Dimension | Multi-Agent Systems | Autonomic Edge Architecture |
|---|---|---|
| Agent model | BDI, reactive, hybrid | MAPE-K autonomic loop |
| Communication | Message passing, assumed reliable | Gossip protocols, partition-tolerant |
| Coordination | Auctions, voting, negotiation | Hierarchical authority, CRDT merge |
| Learning | Reinforcement learning, imitation | Anti-fragile adaptation, bandit algorithms |
| Failure model | Agent crash/recovery | Byzantine tolerance, adversarial |

The gap: MAS assumes reliable message delivery and focuses on agent-level reasoning without system-level convergence guarantees. CRDT-based state reconciliation adds those guarantees for contested environments where MAS coordination mechanisms would otherwise produce divergent state.

Federated learning extends to contested environments through staleness-weighted aggregation and Byzantine-tolerant aggregation. Swarm intelligence inspires gossip protocols and formation maintenance, but lacks the formal convergence bounds, authority hierarchies, and reconciliation protocols that constrained-connectivity systems require.
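Staleness-weighted aggregation can be sketched as exponential down-weighting of stale peer updates, with a trimmed mean standing in for a full Byzantine-tolerant aggregator; the decay constant and trim count are illustrative assumptions:

```python
import math

# Staleness-weighted aggregation: each peer update carries its model
# delta and its age (seconds since computed). Older updates get
# exponentially less weight; trimming extremes is a crude stand-in
# for a Byzantine-tolerant aggregator.

def aggregate(updates, tau=60.0, trim=1):
    """updates: list of (delta, age_s). Returns the aggregated delta."""
    # Drop the `trim` largest and smallest deltas before weighting.
    kept = sorted(updates, key=lambda u: u[0])[trim:len(updates) - trim]
    weights = [math.exp(-age / tau) for _, age in kept]
    total = sum(weights)
    return sum(w * d for (d, _), w in zip(kept, weights)) / total

updates = [
    (0.10, 5),     # fresh, honest
    (0.12, 10),    # fresh, honest
    (0.11, 400),   # honest but very stale -> tiny weight
    (9.00, 2),     # outlier (possibly Byzantine) -> trimmed away
    (-8.0, 3),     # outlier -> trimmed away
]
agg = aggregate(updates)
assert 0.09 < agg < 0.13   # dominated by the fresh honest updates
```

The two mechanisms address different threats: exponential decay handles partition-induced staleness, trimming handles adversarial outliers; a deployed aggregator needs both.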

Reference Architecture Comparison

How the architecture maps to major reference frameworks:

| Capability | OpenFog (IEEE 1934) | ETSI MEC [8] | Industrial IoT (IIC) | Autonomic Edge |
|---|---|---|---|---|
| Partition protocol | Not specified | Not specified | Partial (OPC-UA redundancy) | Formal (Definition 12, Markov model) |
| Authority delegation | Implicit | Application-defined | Device profiles | Explicit hierarchy (L0-L3) |
| State reconciliation | Sync to cloud | Stateless preferred | Historian-based | CRDT-based (conflict-free merge) |
| Self-healing | Platform restart | Orchestrator-driven | Redundancy failover | MAPE-K autonomic control loop |
| Anti-fragility | Not addressed | Not addressed | Not addressed | Core principle (learn from stress) |
| Adversarial model | Security perimeter | TLS/authentication | Defense-in-depth | Byzantine tolerance |

Positioning Summary

The autonomic edge architecture occupies a specific niche in the edge computing landscape; the diagram maps connectivity assumptions to paradigms and decision authority, showing that autonomic edge is the only paradigm that spans both Intermittently and Usually Disconnected assumptions while distributing authority all the way to fully autonomous.

    
    graph TD
    subgraph "Connectivity Assumptions"
        CA1["Always Connected"]
        CA2["Usually Connected"]
        CA3["Intermittently Connected"]
        CA4["Usually Disconnected"]
    end

    subgraph "Paradigms"
        P1["Cloud Native"]
        P2["Edge-Cloud Continuum"]
        P3["Fog Computing"]
        P4["Autonomic Edge"]
    end

    subgraph "Decision Authority"
        DA1["Centralized"]
        DA2["Delegated"]
        DA3["Distributed"]
        DA4["Autonomous"]
    end

    CA1 --> P1
    CA2 --> P2
    CA2 --> P3
    CA3 --> P3
    CA3 --> P4
    CA4 --> P4

    P1 --> DA1
    P2 --> DA2
    P3 --> DA2
    P4 --> DA3
    P4 --> DA4

    style P4 fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
    style CA4 fill:#ffcdd2,stroke:#c62828

The frameworks are complementary. KubeEdge or fog patterns handle orchestration and latency optimization when connected; autonomic edge protocols layer partition tolerance and CRDT coherence on top. The unique contribution here is formal treatment of contested connectivity — the Markov models, authority hierarchies, convergence guarantees, and anti-fragile adaptation that other paradigms assume away or delegate to application developers.

Read the positioning diagram. The two axes are connectivity assumption (top = always connected, bottom = usually disconnected) and decision authority (left = centralized, right = autonomous). The green node (Autonomic Edge) is the only paradigm that spans both intermittent and usually-disconnected assumptions while distributing authority all the way to fully autonomous. The red node (Usually Disconnected) marks the operating regime where no other paradigm applies.

Cognitive Map — Section 22. Fog computing overlaps in latency motivation but assumes cloud remains authoritative \(\to\) edge-cloud continuum handles resource fluid placement but requires control plane connectivity \(\to\) MAS provides agent models but lacks convergence guarantees under partition \(\to\) reference architecture table shows anti-fragility and formal partition protocol are unique contributions \(\to\) autonomic edge layers atop existing continuum stacks rather than replacing them.

Edge computing emerged as a paradigm to bring computation and storage closer to data sources, reducing latency and backhaul load [9] . Fog computing extended this model toward a distributed edge-cloud continuum [7] , while ETSI MEC standardised low-latency compute at the radio access network [8] .

The connectivity disruptions modelled here connect to the Delay-Tolerant Networking (DTN) literature [3] , which addresses store-and-forward operation under intermittent links. The CAP theorem [1, 2] establishes the fundamental consistency-availability trade-off that motivates partition-aware design. Eventual consistency under partition was formalised by Vogels [13] .

Autonomic self-management — the broader context for the MAPE-K loops used throughout this series — was articulated by Kephart and Chess [5] and systematised in the IBM architectural blueprint [6] . Dependability and fault taxonomy follow Avizienis et al. [12] . Partition duration modelling with heavy-tailed Weibull distributions draws on reliability statistics methods surveyed by Meeker and Escobar [11] .


Closing

This article established three interlocking results. The inversion thesis (Proposition 2) showed that when partition probability exceeds \(\tau^* \approx 0.15\), designing for disconnection as baseline outperforms designing for connectivity — not marginally, but categorically. The Markov connectivity model (Definition 12) provides a tractable quantitative framework: estimate transition rates from telemetry, compute the stationary distribution, derive architectural choices from that distribution. The capability hierarchy translates the stochastic connectivity picture into a graceful degradation contract: every capability level has an explicit connectivity threshold and resource requirement, and transitions between levels are governed by well-defined rules.

Applied to AUTOHAULER, the framework explains why ore-pass blackouts are non-events: the tier architecture places safety-critical decisions at the fog layer, which never requires upward connectivity. Applied to GRIDEDGE, it explains why fault isolation must complete within 500 ms with zero cloud dependency: the stationary distribution means the highest-consequence decisions correlate with lost connectivity.

The framework specifies what properties emerge under what assumptions. Every model has a validity domain — the Markov assumptions, the capability separability conditions, the inversion threshold’s bounded-latency requirement — and the Model Scope section mapped each claim to its failure envelope. Operational deployment requires verifying that those assumptions hold before trusting the prescriptions.

The architecture so far addresses the structural question: how should an edge system be organized? But a correctly structured system can still fail silently if it cannot tell whether it is healthy. A RAVEN drone operating autonomously has no external reference point — it cannot call home to check whether its sensor readings are anomalous, its battery model is drifting, or its peers’ state has diverged beyond safe bounds. The next problem in the constraint sequence is therefore measurement: how does an edge node assess its own health and the health of its cluster without central observability infrastructure? That question, and the gossip protocols, anomaly detection formulations, and Byzantine fault tolerance that answer it, is the subject of Self-Measurement Without Central Observability.

Series roadmap — each article solves the next constraint in the sequence:

  1. Why Edge Is Not Cloud Minus Bandwidth (this article) — contested connectivity model, inversion threshold, capability hierarchy
  2. Self-Measurement Without Central Observability — anomaly detection, gossip health propagation, Byzantine tolerance without a central collector
  3. Self-Healing Without Connectivity — MAPE-K autonomic control loops, gain scheduling under stochastic delay, mode-invariant stability
  4. Fleet Coherence Under Partition — CRDTs, vector clocks, authority tiers, NTP-free split-brain resolution
  5. Anti-Fragile Decision-Making at the Edge — adversarial Markov games, EXP3-IX bandit algorithms, stress-information duality
  6. The Constraint Sequence and the Handover Boundary — prerequisite graph, phase gates, certification completeness, zero-tax autonomic stack

References

[1] Brewer, E.A. (2000). “Towards Robust Distributed Systems.” Proc. PODC. ACM. [acm]

[2] Gilbert, S., Lynch, N. (2002). “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services.” ACM SIGACT News, 33(2), 51–59. [doi]

[3] Fall, K. (2003). “A Delay-Tolerant Network Architecture for Challenged Internets.” Proc. SIGCOMM, 27–34. ACM. [doi]

[4] Cerf, V., Burleigh, S., Hooke, A., Torgerson, L., Durst, R., Scott, K., Fall, K., Weiss, H. (2007). “Delay-Tolerant Networking Architecture.” RFC 4838, IETF. [doi]

[5] Kephart, J.O., Chess, D.M. (2003). “The Vision of Autonomic Computing.” IEEE Computer, 36(1), 41–50. [doi]

[6] IBM Research (2006). “An Architectural Blueprint for Autonomic Computing.” IBM White Paper, 4th Ed.

[7] Bonomi, F., Milito, R., Zhu, J., Addepalli, S. (2012). “Fog Computing and Its Role in the Internet of Things.” Proc. MCC Workshop on Mobile Cloud Computing, 13–16. ACM. [doi]

[8] ETSI GS MEC 003 V2.1.1 (2019). “Multi-access Edge Computing (MEC): Framework and Reference Architecture.” ETSI. [pdf]

[9] Satyanarayanan, M. (2017). “The Emergence of Edge Computing.” IEEE Computer, 50(1), 30–39. [doi]

[10] Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L. (2016). “Edge Computing: Vision and Challenges.” IEEE Internet of Things Journal, 3(5), 637–646. [doi]

[11] Meeker, W.Q., Escobar, L.A. (1998). Statistical Methods for Reliability Data. Wiley-Interscience. [wiley]

[12] Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C. (2004). “Basic Concepts and Taxonomy of Dependable and Secure Computing.” IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. [doi]

[13] Vogels, W. (2009). “Eventually Consistent.” CACM, 52(1), 40–44. [doi]
