
The Constraint Sequence and the Handover Boundary


Prerequisites

This final article synthesizes the complete series. Contested Connectivity develops the connectivity probability model \(C(t)\), the capability hierarchy (L0–L4), and the fundamental inversion that defines edge. Self-Measurement covers distributed health monitoring, the observability constraint sequence, and gossip-based awareness. Self-Healing addresses MAPE-K autonomous healing [1], recovery ordering, and cascade prevention under partition. Fleet Coherence covers state reconciliation, CRDTs, decision authority hierarchies, and the coherence protocol. Anti-Fragile Decision-Making develops systems that improve under stress, the judgment horizon, and the limits of automation.

The preceding articles developed the what: the capabilities required for autonomic edge architecture. This article addresses the when: in what order should these capabilities be built? The constraint sequence determines success or failure. Build in the wrong order, and you waste resources on sophisticated capabilities that collapse because their foundations are missing.

Full series notation registry: Notation Registry.


Theoretical Contributions

This article develops the theoretical foundations for capability sequencing in autonomic edge systems. We model edge capability dependencies as a directed acyclic graph (DAG) and derive valid development sequences as topological orderings with priority-weighted optimization. We characterize how binding constraints shift across connectivity states and prove conditions for dynamic re-sequencing under adversarial adaptation. We derive resource allocation bounds for autonomic overhead, proving that optimization infrastructure competes with the system being optimized. We define phase gate functions as conjunction predicates over verification conditions, providing a mathematical foundation for systematic validation. We prove that valid system evolution requires maintaining all prior gate conditions, establishing the regression testing requirement as a theorem. Finally, we formalize five constructs at the automation boundary: predictive handover triggering (Proposition 74), asymmetric trust dynamics (Definition 105), causal barriers for stale commands (Definition 106), semantic compression against alert fatigue (Definition 107), and the L0 Physical Safety Interlock (Definition 108) that bypasses the entire MAPE-K stack when hardware safety demands it [3, 4].

These contributions build on the Theory of Constraints (Goldratt, 1984), formal verification (Clarke et al., 1999), and systems engineering (INCOSE, 2015), adapting each framework for contested edge deployments [5, 6].


Opening Narrative: The Wrong Order

Edge Platform Team: PhD ML expertise, cloud deployment veterans, project allocation of 2,400 p.u. (project baseline units; illustrative value). Mission: intelligent monitoring for CONVOY vehicles. Six months produced 94% (illustrative value) detection accuracy in the lab.

Within 72 hours (illustrative value) of deployment: offline on 8 of 12 vehicles (illustrative value).

The failure was wrong sequencing, not bad engineering: ML assumed continuous connectivity (terrain averaged 23% (illustrative value)), GPU inference assumed stable power (shed first under stress), and fleet correlation assumed a reliable mesh that was never validated.

The post-mortem revealed every layer was missing: L0 partition survival was not validated; self-measurement was assumed rather than implemented with independent local health; self-healing was absent; fleet coherence was built on an unstable foundation; and the sophisticated analytics (2,000 p.u. (illustrative value)) collapsed without those foundations.

They built L3 capability before validating L0. The roof before the foundation.

Cloud-native intuition fails at edge: you can’t iterate quickly when mistakes may be irrecoverable. The constraint sequence matters.


The Constraint Sequence Framework

Building edge capabilities in the wrong order produces expensive failures. A team that delivers 94% (illustrative value) detection accuracy in a lab can find 8 of 12 vehicles (illustrative value) offline within 72 hours (illustrative value) of deployment — because the ML assumed continuous connectivity, the GPU assumed stable power, and neither was validated before the analytics layer was built. Applying the Theory of Constraints to capability sequencing resolves this: every system has a current binding constraint, and optimizing anything other than that constraint is wasted effort. The constraint sequence formalizes the required ordering — a total ordering over capabilities where each prerequisite must be substantially solved before its dependent can become binding. The sequencing is context-dependent — edge constraints differ from cloud-native ones in type (survival vs. performance), iteration speed (days vs. hours), and mistake recovery (often irrecoverable vs. rollback), making sequence errors at the edge more expensive and slower to detect.

Review: Constraint Sequence from Platform Engineering

Definition 92 (Constraint Sequence). A constraint sequence for system \(S\) is a total ordering over the set of constraints (\(\Omega\) denotes the constraint set throughout this article; the calligraphic \(\mathcal{C}\) is reserved for connectivity state and connected-cluster sets) such that addressing constraint \(c_i\) before its prerequisites are satisfied provides zero value.

Analogy: A spacecraft re-entry checklist — you don’t deploy the parachute before the heat shield, and you don’t radio mission control before the parachute deploys. Each step has verified preconditions.

Logic: A valid constraint sequence is a topological ordering of the prerequisite DAG; any ordering that violates a prerequisite edge wastes effort because the dependent capability provides zero value without its foundation.

    
    stateDiagram-v2
    [*] --> Phase0 : system boot
    Phase0 --> Phase1 : gate_0 passes L0 safety checks
    Phase1 --> Phase2 : gate_1 passes connectivity and clock sync
    Phase2 --> Phase3 : gate_2 passes CRDT converged and quorum
    Phase3 --> Phase4 : gate_3 passes EXP3-IX warm and certified
    Phase4 --> [*] : handover complete
    Phase1 --> Phase0 : gate fails rollback
    Phase2 --> Phase1 : gate fails rollback
    Phase3 --> Phase2 : gate fails rollback
    Phase3 --> Phase3 : holding awaiting prerequisites
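The gate progression in the diagram can be sketched as conjunction predicates with rollback — a minimal illustration in Python, assuming hypothetical check names (`l0_safety`, `clock_sync`, and so on) rather than any real API from the systems described here:

```python
# Sketch of the phase-gate progression in the state diagram above.
# A phase gate is a conjunction predicate over verification conditions;
# the state keys below are hypothetical placeholders.
def make_gate(*checks):
    return lambda state: all(check(state) for check in checks)

GATES = [
    make_gate(lambda s: s["l0_safety"]),                                # gate_0
    make_gate(lambda s: s["connectivity"], lambda s: s["clock_sync"]),  # gate_1
    make_gate(lambda s: s["crdt_converged"], lambda s: s["quorum"]),    # gate_2
    make_gate(lambda s: s["bandit_warm"], lambda s: s["certified"]),    # gate_3
]

def advance(phase, state):
    """One tick of the state machine: roll back if the gate behind us
    regresses (all prior gate conditions must keep holding), advance if
    the next gate passes, otherwise hold awaiting prerequisites."""
    if phase > 0 and not GATES[phase - 1](state):
        return phase - 1      # gate fails -> rollback
    if phase < len(GATES) and GATES[phase](state):
        return phase + 1      # gate passes -> advance
    return phase              # holding
```

Note that rollback is checked before advancement: a regression in an already-passed gate takes priority, which is the regression-testing requirement expressed as control flow.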

The Theory of Constraints, developed by Eliyahu Goldratt, observes that every system has a bottleneck—the constraint that limits overall throughput. Optimizing anything other than the current constraint is wasted effort. Only by identifying and addressing constraints in sequence can a system improve.

Applied to software systems, this becomes the Constraint Sequence principle:

Systems fail in a specific order. Each constraint provides a limited window to act. Solving the wrong problem at the wrong time is an expensive way to learn which problem should have come first.

In platform engineering, common constraint sequences follow four patterns: reliability before features (a feature that crashes the system provides negative value); observability before optimization (you cannot optimize what you cannot measure); security before scale (vulnerabilities multiply with scale); and simplicity before sophistication (complex solutions to simple problems create maintenance debt).

The constraint sequence is not universal—it depends on context. But within a given context, some orderings are strictly correct and others are strictly wrong. The CONVOY team’s failure was solving constraint #7 (sophisticated analytics) before constraints #1-6 were addressed.

Edge-Specific Constraint Properties

Edge computing introduces constraint properties that differ from cloud-native systems [5, 6]:

| Property | Cloud-Native | Tactical Edge |
| --- | --- | --- |
| Constraint type | Performance, cost, scale | Survival, trust, autonomy |
| Iteration speed | Fast (minutes to hours) | Slow (days to weeks) |
| Mistake recovery | Usually recoverable (rollback) | Often irrecoverable (lost platform) |
| Feedback loop | Continuous telemetry | Intermittent, delayed |
| Constraint stability | Relatively static | Shifts with connectivity state |
| Failure visibility | Immediate (monitoring) | Delayed (post-reconnect) |

What does this mean in practice?

Survival constraints precede all others [7]: in cloud, if a service crashes, Kubernetes restarts it, but at the edge, if a drone crashes, it may be physically unrecoverable — the survival constraint (L0) must be addressed before any higher capability. Trust constraints are foundational: cloud systems assume the hardware is trustworthy (datacenter security), while edge systems may face physical adversary access, making hardware trust a prerequisite for believing any software health report. Autonomy constraints compound over time: a cloud service that fails during partition experiences downtime, but an edge system that fails during partition may make irrecoverable decisions, so autonomy capabilities must be validated before autonomous operation. Feedback delays hide sequence errors: in cloud, wrong sequencing manifests quickly through monitoring, but at edge, sequence errors may not appear until post-mission analysis — after the damage is done.

The implication: constraint sequence is more critical at the edge than in cloud. Errors are more expensive, less recoverable, and slower to detect. Getting the sequence right the first time is not a luxury—it is a requirement.

Cognitive Map: The constraint sequence framework establishes the conceptual foundation for the entire article. Definition 92 formalizes the ordering requirement as a total ordering where prerequisites must be satisfied before dependents — the constraint that limits throughput must be addressed first. The cloud-vs-edge comparison table shows why sequence errors are so costly at the edge: irrecoverable mistakes, delayed feedback, and survival constraints that have no cloud analogue. The CONVOY team’s failure (built L3 before validating L0) is the concrete example that motivates the formal framework.


The Edge Prerequisite Graph

Knowing that capabilities have prerequisites is not enough — you need the specific dependency graph for edge systems to determine which sequences are valid and which parallel work is safe. The prerequisite graph is a DAG where an edge A\(\to\)B means A must be substantially solved before B can become the binding constraint. Proposition 66 proves valid sequences exist iff the graph is acyclic. The critical path (Hardware Trust \(\to\) L0 \(\to\) Self-Measurement \(\to\) Self-Healing \(\to\) Fleet Coherence \(\to\) L2 \(\to\) L3 \(\to\) L4) is the minimum 8-stage sequential path to full capability. Not all paths through the DAG are equivalent — the critical path cannot be shortened by parallelism, but the parallelizable stages (L1 and Self-Measurement can develop simultaneously after L0) represent genuine acceleration opportunities. Hardware trust is the deepest prerequisite: all software health reports are suspect if the hardware is compromised.

Dependency Structure of Edge Capabilities

Definition 93 (Prerequisite Graph). The prerequisite graph \(G = (V, E)\) is a directed acyclic graph where \(V\) is the set of capabilities and \(E\) is the set of prerequisite relationships. An edge \((u, v) \in E\) indicates that capability \(u\) must be validated before capability \(v\) can be developed.

In RAVEN, fleet coherence across 47 drones presupposes that each drone’s local healing loop is already stable — a dependency that cannot be reversed. In OUTPOST, the 127-sensor health-propagation mesh cannot converge until each sensor’s anomaly-detection baseline is established. The question in both cases is whether a valid build order always exists when capabilities have prerequisites.

Proposition 66 (Valid Sequence Existence). A valid development sequence exists if and only if the prerequisite graph is acyclic. When \(G\) is a DAG, the number of valid sequences equals the number of topological orderings of \(G\).

What this means in practice: You cannot always build the capabilities you want in the order you want — some require others to be stable first. The Prerequisite Graph (Definition 93) is not just organizational; it is a hard dependency ordering. Attempting to deploy fleet coherence (L3) before self-healing (L2) is stable creates a system that can lose its coherence layer during a failure event with no recovery path.

Proof: By the fundamental theorem of topological sorting, a directed graph admits a topological ordering iff it is acyclic. Each topological ordering corresponds to a valid development sequence satisfying all prerequisite constraints.

Watch out for: the proposition proves a valid sequence exists if and only if the prerequisite graph is acyclic — it does not validate the graph itself; if the development team’s model omits an implicit runtime dependency (such as the Monitor/Healer/Resource Manager cycle), the sequence Proposition 66 certifies as valid may still produce a cold-start deadlock, because the certified object is the team’s model of the graph, not the dependency structure the hardware actually instantiates.

Edge capabilities form a directed acyclic graph (DAG) of prerequisites. Some capabilities depend on others; some can be built in parallel. The graph structure determines valid build sequences.

    
    graph TD
    subgraph Foundation["Phase 0: Foundation"]
        HW["Hardware Trust<br/>(secure boot, attestation)"]
    end
    subgraph Survival["Phase 1: Local Autonomy"]
        L0["L0: Survival<br/>(safe state, power mgmt)"]
        SM["Self-Measurement<br/>(anomaly detection)"]
        SH["Self-Healing<br/>(MAPE-K loop)"]
    end
    subgraph Coordination["Phase 2-3: Coordination"]
        L1["L1: Basic Mission<br/>(core function)"]
        FC["Fleet Coherence<br/>(CRDTs, reconciliation)"]
        L2["L2: Local Coordination<br/>(cluster ops)"]
    end
    subgraph Integration["Phase 4-5: Integration"]
        L3["L3: Fleet Integration<br/>(hierarchy, authority)"]
        AF["Anti-Fragility<br/>(learning, adaptation)"]
        L4["L4: Full Capability<br/>(optimized operation)"]
    end
    HW --> L0
    L0 --> L1
    L0 --> SM
    SM --> SH
    L1 --> FC
    SH --> FC
    FC --> L2
    L2 --> L3
    SM --> AF
    SH --> AF
    FC --> AF
    L3 --> L4
    AF --> L4
    style HW fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    style L0 fill:#fff9c4,stroke:#f9a825
    style SM fill:#c8e6c9,stroke:#388e3c
    style SH fill:#c8e6c9,stroke:#388e3c
    style FC fill:#bbdefb,stroke:#1976d2
    style L4 fill:#e1bee7,stroke:#7b1fa2,stroke-width:2px

Read the diagram: Red (Hardware Trust) is the absolute foundation — nothing above it is valid if the hardware is compromised. Yellow (L0) must be validated before any other capability begins. Green (Self-Measurement and Self-Healing) follow L0, with Self-Healing building on Self-Measurement. Blue (L1, Fleet Coherence, L2) form the coordination layer. Purple (L4) at the top is reachable only after both L3 and Anti-Fragility are validated — the two paths through the graph must converge before full capability is achievable.

Two distinct graphs — development vs. runtime: Definition 93’s prerequisite graph models capability development sequencing, not runtime boot order. At runtime, Monitor, Healer, and Resource Manager form a cyclic dependency ring that cannot be topologically sorted. The L0 Dependency Isolation Requirement breaks this cycle at cold start by giving each component an L0 survival-mode variant with zero lateral runtime dependencies.

Runtime dependency cycle and cold-start bootstrap. The development prerequisite graph is acyclic: Self-Measurement must be validated before Self-Healing, which must be validated before Fleet Coherence. The runtime component dependency graph is a different object and can be cyclic: Monitor diagnoses anomalies that Healer acts on; Healer requests resource headroom from Resource Manager; Resource Manager monitors process consumption to enforce quotas, which requires Monitor-derived metrics. This cycle cannot be topologically sorted and creates a cold-start deadlock if naively treated as a DAG.

The L0 Dependency Isolation Requirement (Definition 18) breaks this deadlock: each component has an L0 survival-mode variant with zero lateral runtime dependencies — a hardware watchdog with no Monitor feedback, threshold-based healing rules with no Resource Manager input, and static priority tables with no Healer coupling. The cold-start bootstrap sequence brings up the hardware watchdog and safe-state logic first (zero dependencies), then establishes a raw sensor baseline from hardware registers (no Healer, no Resource Manager), then activates threshold-based L0 healing (static rules, no MAPE-K), then initializes the static L0 resource manager (fixed priorities, no process feedback), and only then activates L1+ MAPE-K loops once L0 stability is confirmed over \(T_{\text{stable}}\).
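A minimal sketch of this bootstrap ordering, with hypothetical stage names following the text and an injectable clock so the \(T_{\text{stable}}\) confirmation window can be exercised without real delays:

```python
import time

# Hypothetical sketch of the cold-start bootstrap described above.
# Every L0 survival-mode stage has zero lateral runtime dependencies,
# so the stages come up in a fixed order with no cycle to resolve.
L0_STAGES = [
    "hardware_watchdog_and_safe_state",   # zero dependencies
    "raw_sensor_baseline",                # hardware registers only
    "threshold_based_l0_healing",         # static rules, no MAPE-K
    "static_l0_resource_manager",         # fixed priorities, no feedback
]

def cold_start(l0_stable, t_stable=5.0, poll=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Bring up L0 stages in order, then hold until l0_stable() has
    held continuously for t_stable before activating L1+ MAPE-K loops.
    Returns the activation log."""
    log = list(L0_STAGES)
    window_start = clock()
    while clock() - window_start < t_stable:
        if not l0_stable():
            window_start = clock()   # any instability resets the window
        sleep(poll)
    log.append("l1_plus_mape_k")
    return log
```

The reset-on-instability loop is the point: L1+ activation is gated on a continuous stability window, not on a single point-in-time check.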

Proposition 8 (Hardened Hierarchy Fail-Down) guarantees that once L1+ enters the cyclic regime, L0 independence is preserved — the cyclic component graph cannot cascade below L0. The capability DAG governs what you validate first; Definition 18 governs how you boot without deadlock.

In this graph, an arrow from A to B means A is a prerequisite for B; capabilities at the same level can be developed in parallel; and no capability should be deployed until all its prerequisites are validated.

The longest path determines minimum development time. For full L4 capability, the critical path is: Hardware Trust, then L0, then Self-Measurement, then Self-Healing, then Fleet Coherence, then L2, then L3, then L4 — 8 sequential stages. Attempting to shortcut this path leads to the CONVOY failure mode: sophisticated capabilities without stable foundations. Three parallelizable windows exist: L1 (Basic Mission) and Self-Measurement can develop in parallel after L0; Self-Healing development can begin once Self-Measurement is partially complete; and Anti-Fragility learning can begin once Fleet Coherence protocols are defined.
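One valid sequence and the critical-path length can be computed from the prerequisite edges with a standard topological sort — a sketch using node abbreviations (HW, L0, SM, SH, FC, AF) matching the diagram:

```python
from collections import defaultdict, deque

# Prerequisite edges transcribed from the diagram (A -> B means A must
# be validated before B).
EDGES = [
    ("HW", "L0"), ("L0", "L1"), ("L0", "SM"), ("SM", "SH"),
    ("L1", "FC"), ("SH", "FC"), ("FC", "L2"), ("L2", "L3"),
    ("SM", "AF"), ("SH", "AF"), ("FC", "AF"), ("L3", "L4"), ("AF", "L4"),
]

def topo_order(edges):
    """Kahn's algorithm: one valid development sequence, or an error
    if the graph is cyclic (Proposition 66)."""
    succ, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    ready = deque(sorted(n for n in nodes if indeg[n] == 0))
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("cycle detected: no valid development sequence")
    return order

def critical_path(edges):
    """Longest path through the DAG — the minimum number of strictly
    sequential development stages (8 for this graph)."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    best = {}
    for u in reversed(topo_order(edges)):
        best[u] = 1 + max((best[v] for v in succ[u]), default=0)
    return max(best.values())
```

Running `critical_path(EDGES)` recovers the 8-stage chain HW, L0, SM, SH, FC, L2, L3, L4; the parallelizable windows are exactly the nodes off that longest path.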

Hardware Trust Before Software Health

The deepest layer of the prerequisite graph is hardware trust. All software capabilities assume the hardware is functioning correctly. If hardware is compromised, all software reports are suspect.

The trust chain [8]:

Each layer trusts the layer below it. Compromise at any layer invalidates all layers above.

Edge-specific hardware threats include physical access (adversary may physically access devices), supply chain compromise (hardware may be modified before deployment), environmental degradation (extreme conditions may cause hardware failures), and electromagnetic interference (jamming, EMP, or other interference).

Establishing hardware trust requires four mechanisms in sequence: cryptographic verification of firmware at startup (secure boot); cryptographic proof of hardware identity (hardware attestation); physical indicators of unauthorized access (tamper detection); and continuous verification of hardware operation (health monitoring).
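A minimal sketch of chained trust evaluation — the layer names follow the four mechanisms above plus the software health reports that depend on them, and the verifier callables are hypothetical placeholders:

```python
# Minimal sketch of the trust chain: each layer is believable only if
# every layer below it verifies. Compromise at any layer invalidates
# all layers above it.
TRUST_LAYERS = [
    "secure_boot",
    "hardware_attestation",
    "tamper_detection",
    "hardware_health_monitoring",
    "software_health_reports",
]

def trusted_layers(verifiers):
    """Walk the chain bottom-up and stop at the first failing layer.
    Returns the prefix of layers that can currently be believed."""
    trusted = []
    for layer in TRUST_LAYERS:
        if not verifiers.get(layer, lambda: False)():
            break
        trusted.append(layer)
    return trusted
```

The prefix structure is the point: a tamper-detection failure does not merely flag one layer, it truncates everything above it, including every software health report.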

OUTPOST example: A perimeter sensor reports “all clear” for 72 hours (illustrative value). But the sensor was physically accessed and modified to always report clear. The self-measurement system trusts the sensor’s reports because it has no hardware attestation. The software health metrics show green. The actual security state is compromised.

Hardware trust must be established before software health can be believed. Self-measurement assumes the hardware it runs on is trustworthy. If this assumption is false, self-measurement is meaningless.

Local Survival Before Fleet Coordination

A node that cannot survive alone cannot contribute to a fleet. The hierarchy of concerns:

The survival test is simple: can each node handle partition gracefully in isolation? If yes, proceed to coordination capabilities. If no, fix local survival first.

Fleet coherence coordinates state across nodes. But if nodes crash during partition, there is no state to coordinate. If nodes make catastrophic autonomous decisions, coherence reconciles those decisions after the damage is done. The required sequence runs from individual node (L0 survival, basic self-measurement, local healing) to local cluster (gossip-based health, local coordination, cluster authority) to fleet-wide (state reconciliation, hierarchical authority, anti-fragile learning).

The testing protocol requires isolating each node (simulating complete partition), verifying L0 survival over an extended period, verifying local self-measurement functions, verifying that local healing recovers from injected faults, and only then proceeding to coordination testing.

RAVEN example: A drone without fleet coordination can still fly, detect threats, and return to base. This L0/L1 capability must work perfectly before adding swarm coordination. If the individual drone fails under partition, the swarm’s coordination capabilities provide no value—they coordinate the failure of their components.

Cognitive Map: The prerequisite graph section translates the abstract constraint sequence into a concrete DAG. The DAG has four layers (foundation, local autonomy, coordination, integration), a single critical path of 8 sequential stages, and three parallelizable windows. The trust chain visualization shows why hardware trust is the true foundation — all health reports, all healing decisions, all CRDT merges depend on trusting the data produced by the hardware. The standalone node test (isolate each node, verify L0 + L1 before adding coordination) is the operational protocol that enforces the DAG in practice.


Constraint Migration at the Edge

The prerequisite graph is a static structure — it defines which capabilities must come before which. But the binding constraint is not static: it depends on current system state. A system in good connectivity with depleted resources has a different binding constraint than the same system isolated in full adversarial pressure. Constraint migration is modeled as a function over a three-dimensional state cube \((C, R, A)\) — connectivity, resources, adversary presence. Proposition 67 defines the binding constraint as the capability with the largest marginal utility gradient. The five key regions (Survival-Critical, Threat-Active, Efficiency-Optimal, Reliability-Balanced, Autonomy-Forced) partition the cube and determine which constraint is binding in each region. The adversary axis (\(A_\text{adv}\)) is routinely omitted in commercial deployments — a structural omission, since a well-connected, well-resourced system is still Threat-Active when \(A_\text{adv} > 0.5\); the single-variable connectivity model is a cross-section of the full surface at \(R > 0.5\) and \(A_\text{adv} < 0.5\).

How Binding Constraints Shift

Active constraint evaluation: given system state (connectivity, resources, autonomy level), a constraint is active-binding when two conditions hold simultaneously: its gate validator \(V_c(S) < \theta_c\) (gate not currently passing), and the system’s capability level \(L\) satisfies \(\sigma(c) \leq L\) (constraint is in-scope). The active constraint set is \(\Omega_{\text{active}}(S) = \{\, c \in \Omega : V_c(S) < \theta_c \ \text{and}\ \sigma(c) \leq L \,\}\).

Definition 94 (Constraint Migration). A system exhibits constraint migration if the binding constraint \(c^*(t)\) varies with system state \(S(t)\):

\[ c^*(t) = \arg\max_{c \,\in\, \Omega} \Phi(c, S(t)), \]

where \(\Phi(c, S)\) measures the throughput limitation imposed by constraint \(c\) in state \(S\).

In practice: switch to working on whichever bottleneck is most limiting overall system utility right now — the argmax selects the constraint whose marginal relaxation produces the largest instantaneous utility gain, ensuring the system always targets the binding constraint rather than the most recently violated one.

The binding constraint is the one whose relaxation would most improve throughput. Formally, \(\text{Impact}(c, S) = \text{Demand}(c, S) / \text{Available}(S)\) — the ratio of resources this constraint demands to resources available. The constraint with Impact closest to 1 is binding (it is consuming nearly all available resources and would benefit most from relaxation).

Compute Profile: CPU: \(O(|\Omega|)\) per evaluation — scan over all constraints to compute Impact scores and find the argmax. Memory: \(O(L)\) — one Impact score per capability level. Evaluate on every MAPE-K tick rather than on a separate slower schedule.
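A minimal sketch of the active-set predicate and the Impact-ratio selection, assuming hypothetical validator functions, thresholds, and demand figures (none of these values come from the article):

```python
def active_constraints(validators, thresholds, scope, state, level):
    """Active-binding set: the gate validator V_c(S) is below its
    threshold AND the constraint is in scope (sigma(c) <= L)."""
    return {c for c in validators
            if validators[c](state) < thresholds[c] and scope[c] <= level}

def binding_constraint(demands, available):
    """Impact(c, S) = Demand(c, S) / Available(S); the constraint whose
    Impact ratio is closest to 1 is binding."""
    impacts = {c: d / available for c, d in demands.items()}
    binding = min(impacts, key=lambda c: abs(1.0 - impacts[c]))
    return binding, impacts
```

Both functions are a single scan over the constraint set, matching the compute profile above.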

Definition 130 (Resource State). Resource State is introduced in this article as a normalized composite of battery, memory, and CPU availability. It should not be confused with Definition 1 (Connectivity State) , which measures network accessibility rather than local resource availability.

Critical threshold: \(R(t) < R_{\text{crit}}\), calibrated per deployment (see Calibration below). Weights: \(w_E + w_M + w_C = 1\), with platform-specific values for RAVEN. \(R(t)\) forms the resource axis of the three-dimensional state cube \((C, R, A)\) used in Definition 94 through Proposition 67 of this article.

Definition 95 (Adversary Presence). Let \(A_\text{adv}(t) \in [0, 1]\) denote the estimated adversary threat level at time \(t\), computed as a weighted composite of observed threat indicators with weights summing to 1. High threat (\(A_\text{adv}(t) > 0.5\)) shifts binding priority toward trust verification and anti-fragility learning regardless of connectivity state.

(The subscript distinguishes \(A_\text{adv}\) from the defender action set \(A\) used in Definition 80.)

Proposition 67 (Multi-Dimensional Constraint Migration). The binding constraint \(c^*\) is determined by the utility gradient across all state dimensions \((C, R, A)\):

\[ c^* = \arg\max_{c \,\in\, \Omega} \left. \frac{\partial U}{\partial c} \right|_{(C,\, R,\, A_\text{adv})}. \]
The capability dimension where a 1% improvement yields the biggest utility gain is always the binding constraint — and that answer changes when the adversary axis rises above 0.5, even with abundant bandwidth and power.

The proposition identifies the binding constraint as the capability dimension with the largest marginal utility gradient across connectivity, resource, and adversary axes simultaneously; the gradient is evaluated numerically from telemetry, with \(\pm 10\%\) (illustrative value) hysteresis at boundary transitions to prevent oscillation. The adversary axis is routinely omitted in commercial deployments — gradient computations that omit it will always converge toward connectivity and resource optimization while leaving threat-induced state corruption undetected.

Physical translation: \(\partial U / \partial c\) is the marginal value of improving capability \(c\) by one unit. The binding constraint is whichever capability has the largest marginal value — the capability where a 1% improvement yields the largest system-wide utility gain. When \(R < 0.2\), the marginal value of survival improvements (keeping the system alive at all) exceeds the marginal value of any other improvement — survival is binding. When \(A_\text{adv} > 0.5\), the marginal value of trust verification exceeds efficiency improvements — an optimized-but-compromised system provides negative value. The five regions are the stable regimes of this gradient surface. A constraint that depends on cloud connectivity (\(C > 0.8\)) has a utility gradient of zero when \(C = 0\) — it cannot activate while the node is partitioned; it waits at the Autonomy-Forced region boundary until connectivity is restored.

This produces a piecewise-constant surface over the \((C, R, A)\) state cube. Key regions:

| Region | Conditions | Binding Constraint | Rationale |
| --- | --- | --- | --- |
| Survival-Critical | \(R < 0.2\) or (\(C = 0\) and \(R < 0.5\)) | Survival | Resources or connectivity too low for anything else |
| Threat-Active | \(A_\text{adv} > 0.5\) | Trust/Anti-Fragility | Adversary presence makes verification and learning paramount |
| Efficiency-Optimal | \(C > 0.8\) and \(R > 0.5\) and \(A_\text{adv} < 0.3\) | Efficiency | Abundant resources enable optimization |
| Reliability-Balanced | \(0.3 < C \leq 0.8\) and \(R > 0.5\) | Reliability | Scarce connectivity makes delivery the bottleneck |
| Autonomy-Forced | \(C \leq 0.3\) and \(R > 0.5\) | Autonomy | Isolation requires local decision-making |

Transition boundaries carry \(\pm 10\%\) margins to prevent oscillation.
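The region table and the \(\pm 10\%\) margin can be sketched as a classifier — the hysteresis scheme shown here (stay in the current region while any within-margin perturbation of the state still maps back to it) is one plausible debounce implementation, not one prescribed by the text:

```python
def classify(C, R, A):
    """Raw region lookup from the table above (no hysteresis). States
    covered by no listed region fall back conservatively."""
    if R < 0.2 or (C == 0 and R < 0.5):
        return "Survival-Critical"
    if A > 0.5:
        return "Threat-Active"
    if C > 0.8 and R > 0.5 and A < 0.3:
        return "Efficiency-Optimal"
    if 0.3 < C <= 0.8 and R > 0.5:
        return "Reliability-Balanced"
    if C <= 0.3 and R > 0.5:
        return "Autonomy-Forced"
    return "Survival-Critical"   # conservative fallback for table gaps

def classify_with_hysteresis(C, R, A, current=None, margin=0.10):
    """Debounced transition: keep the current region while any
    within-margin perturbation of the state still maps back to it."""
    new = classify(C, R, A)
    if current is None or new == current:
        return new
    clamp = lambda x: min(max(x, 0.0), 1.0)
    for dC in (-margin, margin):
        for dR in (-margin, margin):
            for dA in (-margin, margin):
                if classify(clamp(C + dC), clamp(R + dR), clamp(A + dA)) == current:
                    return current   # still inside the hysteresis band
    return new
```

A state just over the Efficiency boundary (for example \(C = 0.82\)) therefore stays Reliability-Balanced until it clears the margin, which is exactly the oscillation the boundary margins are meant to suppress.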

Proof sketch: Treating system utility \(U(C, R, A)\) as smooth over the state cube, the binding constraint at any state is whichever capability — if improved by 1% — yields the largest utility gain, i.e., the constraint with maximum impact ratio (Definition 94).

Survival dominates when \(R < 0.2\) — resource exhaustion overrides communication state — or when \(C = 0\) and \(R < 0.5\), where no external path exists and the resource margin is insufficient for sustained autonomous operation. Trust/anti-fragility dominates at \(A_\text{adv} > 0.5\) because adversarial interference raises the marginal utility of trust verification above all other partial derivatives: unverified state and corrupted learning invalidate efficiency and reliability optimizations.

The efficiency/reliability/autonomy ordering of the remaining regions follows the connectivity-gradient argument: as \(C\) falls below 0.8, message delivery becomes scarce; below 0.3, isolation makes local decision authority the critical capability. These dominance orderings hold when \(R > 0.5\) and \(A_\text{adv} < 0.5\) — the original single-variable model is the cross-section of this surface at favorable resource and threat levels.

Watch out for: the gradient surface is computed from the current operational state \((C, R, A_\text{adv})\), not from a prediction of its trajectory; when all three axes are simultaneously changing — a jamming event that degrades \(C\), forces healing that depletes \(R\), and raises \(A_\text{adv}\) — the gradient computed at any single instant reflects the binding constraint for that instant, not for the trajectory the system is on; the actual binding constraint when the axes stabilize may differ from the one selected at the point of maximum rate of change, and a system that transitions its resource allocation at the midpoint of a coordinated attack may shift toward the wrong constraint just as the attack peaks.

Unlike static systems where the binding constraint is stable, edge systems experience constraint migration — the binding constraint changes based on system state: connectivity level, resource availability, and adversary presence.

The binding constraint is whichever capability, if improved by 1%, would most increase overall system utility — exactly what \(\partial U / \partial c\) measures: when the efficiency gradient is largest, efficiency is the binding constraint; when the survival gradient is largest, survival is the binding constraint. The multi-dimensional model captures state interactions: high \(A_\text{adv}\) (adversary) raises the trust-verification gradient even when \(C\) and \(R\) are individually favorable, because an adversary can corrupt an optimized-but-unverified system.

Connectivity, resources, and threats interact non-linearly. High \(C\), low \(R\) produces a Survival-Critical state despite good connectivity — a well-connected system with depleted resources cannot sustain operations. Low \(C\), high \(R\), high \(A_\text{adv}\) produces Threat-Active — isolated, resourced, and under adversarial pressure, with trust verification and anti-fragility learning taking precedence over autonomy optimization. Medium \(C\), medium \(R\), low \(A_\text{adv}\) produces Reliability-Balanced — the original “degraded” case, valid when threat levels are absent. High \(C\), high \(R\), high \(A_\text{adv}\) triggers Threat-Active regardless of Efficiency-Optimal conditions — abundant resources and connectivity provide no advantage if adversarial interference corrupts state.

The single-variable connectivity model holds when \(R > 0.5\) and \(A_\text{adv} < 0.5\) — the favorable-baseline cross-section of the full state surface.

Calibration. Thresholds should be set from operational data: \(R_{\text{crit}}\) is the resource level at which systems enter emergency mode (measure it from operational logs); the adversary threshold is calibrated to deployment context (tactical edge: 0.3–0.5 (illustrative value); commercial edge: 0.1–0.3 (illustrative value)).

For RAVEN: \(R_{\text{crit}} = 0.25\) (illustrative value; 25% battery triggers return-to-base), with the adversary estimate elevated when moderate jamming is detected (illustrative value). For OUTPOST: a higher adversary baseline (illustrative value; a high-threat environment with a sustained jamming baseline); the connected-state threshold may also fall to \(C = 0.5\) (illustrative value) given lower baseline satellite link capacity.

Architecture implication: The system must handle all constraint configurations. It is not sufficient to optimize for connected state if the system spends 60% (illustrative value) of time in degraded or denied states. The constraint sequence must address all states.

Connectivity-Dependent Capability Targets

Each connectivity state has different capability targets. In the Connected regime (\(C > 0.8\)), the target is L3-L4 (fleet coordination, full integration); the system enables streaming telemetry, real-time coordination, and model updates, optimizing for latency, throughput, and efficiency. In the Degraded regime (\(0.3 < C \leq 0.8\)), the target is L2 (local coordination); the system enables priority messaging, cluster coherence, and selective sync, optimizing for message priority, queue management, and selective retransmission. In the Denied regime (\(0 < C \leq 0.3\)), the target is L1 (basic mission); the system enables autonomous operation, local decisions, and state caching, optimizing for autonomy, local resources, and decision logging. In Emergency (\(C = 0\), resources critical), the target is L0 (survival); the system enables safe state, power conservation, and distress beacon, optimizing for endurance, safety, and recovery potential.
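The regime thresholds above can be sketched as a small classifier. This is an illustrative sketch, not code from the series: the enum names, function names, and the choice to fold the \(C = 0\), resources-not-critical case into the Denied branch (the text's \(\Xi(t)\) taxonomy calls it the None regime) are assumptions.

```c
#include <assert.h>

/* Connectivity regimes and their target capability levels, per the text:
   Connected (C > 0.8) -> L3-L4, Degraded (0.3 < C <= 0.8) -> L2,
   Denied (0 < C <= 0.3) -> L1, Emergency (C = 0, resources critical) -> L0.
   All identifiers here are illustrative, not from the series' codebase. */
typedef enum { REGIME_EMERGENCY, REGIME_DENIED, REGIME_DEGRADED, REGIME_CONNECTED } regime_t;

static regime_t classify_regime(double c, int resources_critical)
{
    if (c <= 0.0)  /* C = 0 with critical resources is Emergency; otherwise
                      treated like Denied here (a simplification of the None regime) */
        return resources_critical ? REGIME_EMERGENCY : REGIME_DENIED;
    if (c <= 0.3) return REGIME_DENIED;
    if (c <= 0.8) return REGIME_DEGRADED;
    return REGIME_CONNECTED;
}

/* Floor of the target capability band for a regime (3 denotes the L3-L4 band). */
static int target_level(regime_t r)
{
    switch (r) {
    case REGIME_EMERGENCY: return 0;  /* L0: survival */
    case REGIME_DENIED:    return 1;  /* L1: basic mission */
    case REGIME_DEGRADED:  return 2;  /* L2: local coordination */
    default:               return 3;  /* L3-L4: fleet coordination, full integration */
    }
}
```

The boundaries are half-open (\(C = 0.8\) is still Degraded, \(C = 0.3\) still Denied), matching the interval notation in the text.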

The constraint sequence must ensure each state’s target capability is achievable before assuming higher states will be available. Design for denied, enhance for connected.

The labels used here (Connected/Degraded/Denied/Emergency) are a practical operational simplification; the authoritative regime taxonomy is the four-valued \(\Xi(t)\) from Definition 6 (Connected/Degraded/Intermittent/None), where “Denied” here corresponds to Intermittent (\(0 < C \leq 0.3\)) and “Emergency” corresponds to the None regime (\(C = 0\)) combined with a resource-critical condition.

Dynamic Re-Sequencing

Static constraint sequences are defined at design time. But operational conditions may require dynamic adjustment of priorities.

(Note: “phase” in this section refers to development maturity in the Constraint Sequence ( Definition 17 ), not to capability hierarchy levels L0–L4 from the capability automaton (Why Edge Is Not Cloud Minus Bandwidth). Both hierarchies are in use concurrently; context disambiguates.)

RAVEN example: Under normal conditions, priorities run fleet coordination first, surveillance collection second, self-measurement third, and learning/adaptation fourth. During heavy jamming, priorities re-sequence to self-measurement first (detect anomalies before propagation), fleet coordination second (limited to essential), surveillance third (reduced bandwidth), and learning last (suspended).

The jamming environment elevates self-measurement because anomalies must be detected before they cascade. Re-sequencing triggers when \(A_\text{adv}\) exceeds \(A_{\text{threshold}}\) ( Definition 95 ), not just on anecdotal jamming observation — this connects the formal adversary model to operational priority shifts. This is dynamic re-sequencing based on observed conditions.

Re-sequencing carries three risks: adversarial gaming (an adversary who knows the re-sequencing rules can trigger priority shifts that benefit them), oscillation (rapid priority shifts may cause instability), and complexity (the re-sequencing logic itself becomes a failure mode).

Mitigations. Re-sequencing is bounded to predefined configurations — no arbitrary priority changes. The condition \(A_\text{adv} > A_{\text{threshold}}\) must be sustained for a minimum confirmation window before triggering a re-sequence — this closes the adversarial gaming gap, as an adversary cannot drive priority shifts without sustaining detectable threat levels above the confidence threshold. Priority changes are rate-limited to prevent oscillation, and re-sequencing logic is tested as rigorously as primary logic.
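The sustained-threshold and rate-limit mitigations can be sketched as a small guard. This is an illustrative sketch under stated assumptions: the struct layout, tick granularity, and field names are mine, and the 0.4 threshold in the usage mirrors the RAVEN calibration discussed elsewhere in the text.

```c
#include <assert.h>

/* Re-sequencing guard: fire only after A_adv has stayed above A_threshold for
   sustain_ticks consecutive ticks, and never more often than min_dwell_ticks.
   All identifiers are illustrative assumptions, not the series' API. */
typedef struct {
    double a_threshold;       /* calibrated per deployment (e.g. 0.4 for RAVEN) */
    int    sustain_ticks;     /* required consecutive ticks above threshold */
    int    min_dwell_ticks;   /* rate limit: minimum ticks between re-sequences */
    int    above_count;       /* internal: consecutive ticks above threshold */
    int    ticks_since_shift; /* internal: ticks since the last re-sequence */
} reseq_guard_t;

/* Returns 1 iff a shift to the jamming priority set should fire this tick. */
static int reseq_tick(reseq_guard_t *g, double a_adv)
{
    g->ticks_since_shift++;
    g->above_count = (a_adv > g->a_threshold) ? g->above_count + 1 : 0;
    if (g->above_count >= g->sustain_ticks &&
        g->ticks_since_shift >= g->min_dwell_ticks) {
        g->above_count = 0;       /* evidence consumed by the shift */
        g->ticks_since_shift = 0; /* restart the dwell clock */
        return 1;
    }
    return 0;
}
```

A transient spike resets `above_count`, so an adversary must hold detectable threat levels for the full window — the property the mitigation paragraph demands.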

Cognitive Map: The constraint migration section generalizes the static prerequisite graph into a dynamic model. The three-dimensional state cube \((C, R, A)\) captures the full operational context. The five key regions partition this cube into stable binding-constraint assignments. Proposition 67 provides the gradient-based formal definition. The connectivity-dependent capability targets table gives the operational implementation — each connectivity regime has a target capability level and the specific capabilities to enable and optimize within it. The adversary threshold requires sustained detection above \(A_{\text{threshold}}\) before re-sequencing, closing the gaming gap.

Empirical status: The \(\pm 10\%\) hysteresis margins and the RAVEN / OUTPOST threshold values (\(R_{\text{crit}} = 0.25\), \(A_{\text{threshold}} = 0.4\)–\(0.5\)) are calibrated from simulation; the appropriate hysteresis width depends on the observed oscillation frequency at state boundaries, and the adversary threshold must be re-measured per deployment context — tactical edge deployments with sustained RF jamming baselines typically require \(A_{\text{threshold}}\) closer to 0.3 than the RAVEN 0.4 figure.


The Meta-Constraint of Edge

Autonomic capabilities consume resources — CPU for health checks, bandwidth for gossip, memory for CRDT state, compute for bandit weight updates. These resources compete directly with the primary mission: a drone spending 40% (illustrative value) of its CPU on self-measurement has 40% less CPU for threat detection. The meta-constraint bounds autonomic overhead, and Proposition 68 gives the feasibility condition: if the minimum resource requirement for full autonomic management exceeds the ceiling \(R_{\text{total}} - R_{\text{mission}}^{\min}\), the system cannot simultaneously fulfill its mission and self-manage. Practical allocation: mission 70–80% (illustrative value), measurement 10–15% (illustrative value), healing 5–10% (illustrative value), coherence 5–10% (illustrative value), learning 1–5% (illustrative value). Ultra-constrained hardware (STM8L151, 4–36 KB SRAM) cannot run the full autonomic stack — the stack exceeds the autonomic ceiling by 8–\(16\times\) — so the zero-tax implementation uses hardware registers and flash writes instead of SRAM structures, enabling OBSERVE-only capability in 140 bytes. Hardware tier determines maximum achievable capability level.

Optimization Competes for Resources

Every autonomic capability consumes resources: self-measurement draws CPU for health checks, memory for baselines, and bandwidth for gossip ; self-healing draws CPU for healing logic, power for recovery actions, and bandwidth for coordination; fleet coherence draws bandwidth for state sync, memory for conflict buffers, and CPU for merge operations; and anti-fragile learning draws CPU for model updates, memory for learning history, and bandwidth for parameter distribution.

Proposition 68 (Autonomic Overhead Bound). For a system with total resources \(R_{\text{total}}\) and minimum mission resource requirement \(R_{\text{mission}}^{\min}\), the maximum feasible autonomic overhead is:

\[R_{\text{autonomic}}^{\max} = R_{\text{total}} - R_{\text{mission}}^{\min}\]

Whatever you can’t spend on the mission is all you have for self-management — on a 500 mW (illustrative value) sensor node with 350 mW (illustrative value) mission floor, RAVEN ’s entire autonomic stack must fit in 150 mW (illustrative value).

The proposition defines the maximum resource ceiling available for all autonomic functions before mission capability is impaired; reserving 70–80% (illustrative value) of \(R_{\text{total}}\) as the mission floor in most deployments yields an autonomic ceiling of 20–30% (illustrative value) of total resources. In production environments, autonomic overhead routinely runs \(2{-}3\times\) (illustrative value) the budget designed in isolation.

Systems where \(R_{\text{mission}}^{\min} + R_{\text{autonomic}}^{\min} > R_{\text{total}}\) cannot achieve both mission capability and self-management.

Watch out for: the ceiling is derived from \(R_{\text{mission}}^{\min}\) specified at design time; because autonomic overhead routinely consumes \(2{-}3\times\) the designed budget in production, and \(R_{\text{mission}}^{\min}\) is itself commonly underestimated in design-phase power budgets, the actual autonomic ceiling can be negative by the time the system is deployed — meaning no autonomic function can run without starving the primary mission, a failure that does not manifest until the first field trial.

For concrete autonomic overhead figures (\(R_{\text{autonomic}}\) in mW by capability level), see Definition 51 (Self-Healing Without Connectivity), which provides L0–L4 power consumption bounds: L0 \(\approx\) 0.1 mW through L4 \(\approx\) 42 mW. These figures instantiate the Law 3 constraint for RAVEN and OUTPOST deployments.

These resources compete with the primary mission. A drone spending 40% (illustrative value) of its CPU on self-measurement has 40% less CPU for threat detection. This creates the meta-constraint:

\[R_{\text{autonomic}} \leq R_{\text{total}} - R_{\text{mission}}^{\min}\]

where \(R_{\text{mission}}^{\min}\) is the minimum resource requirement of the primary mission function and \(R_{\text{total}}\) is the total available resources, and:

\[R_{\text{autonomic}} = R_{\text{measure}} + R_{\text{heal}} + R_{\text{coherence}} + R_{\text{learn}}\]

Physical translation: the meta-constraint is a hard resource budget constraint. On a 500 mW (illustrative value) edge device with 350 mW (illustrative value) mission minimum, the autonomic ceiling is 150 mW (illustrative value). The four autonomic functions (measure, heal, coherence, learn) must collectively fit within this 150 mW. If self-healing alone requires 100 mW (illustrative value) at peak, and gossip requires 80 mW (illustrative value), the total (180 mW (illustrative value)) violates the constraint — one capability must be reduced. The budget table below gives the recommended allocation percentages; the actual values depend on platform-specific power measurements, not design-time estimates.
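The Proposition 68 ceiling check is a one-line budget comparison; the sketch below uses the 500 mW / 350 mW example from the text. The function name and the choice to pass the four autonomic terms separately are illustrative assumptions.

```c
#include <assert.h>

/* Feasibility check for the autonomic overhead bound: the four autonomic
   functions must collectively fit in R_total - R_mission_min. Values in mW;
   the function name is an illustrative assumption. */
static int autonomic_budget_ok(double r_total, double r_mission_min,
                               double measure, double heal,
                               double coherence, double learn)
{
    double ceiling = r_total - r_mission_min;  /* R_autonomic^max */
    return (measure + heal + coherence + learn) <= ceiling;
}
```

The failing case in the test mirrors the text's example: 100 mW healing plus 80 mW gossip already exceeds the 150 mW ceiling before measurement and learning are counted.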

Empirical status: The 70–80% (illustrative value) mission / 20–30% (illustrative value) autonomic split is a rule-of-thumb derived from RAVEN and OUTPOST power-profiling; the correct ceiling depends on the platform’s idle power floor and peak healing-action cost, and autonomic overhead routinely runs \(2{-}3\times\) the designed budget in production — measure it in isolation before setting \(R_{\text{mission}}^{\min}\).

If \(R_{\text{autonomic}}\) is too large, mission capability suffers. If it is too small, the system cannot self-manage and fails catastrophically.

The optimization infrastructure paradox: The system optimizing itself competes with the system being optimized. Self-measurement that is too thorough leaves no resources for the thing being measured. Self-healing that is too aggressive destabilizes the thing being healed.

Budget Allocation Across Autonomic Functions

Practical resource allocation requires explicit budgets:

| Function | Budget Range | Rationale |
|---|---|---|
| Mission | 70–80% | Primary function; majority of resources |
| Measurement | 10–15% | Continuous; scales with complexity |
| Healing | 5–10% | Burst capacity; dormant when healthy |
| Coherence | 5–10% | Event-driven; peaks on reconnection |
| Learning | 1–5% | Background; lowest priority |

Budgets shift dynamically based on system state: during healing, budget is stolen from learning (healing is urgent, learning can wait); post-reconnection, the coherence budget is elevated to address the reconciliation backlog; in stable operation, investment flows to learning (conditions favor adaptation); and under resource stress, all autonomic budgets are reduced (mission priority takes precedence).

The budget allocation itself is a constraint—it determines what autonomic capabilities are feasible. A resource-constrained edge device (e.g., 500 mW (illustrative value) power budget) may not be able to afford all autonomic functions. The constraint sequence must account for resource availability.

Zero-Tax Implementation for Ultra-Constrained Hardware

The budget table above assumes a node can afford a 13 KB autonomic stack. On true edge hardware — STM8L151 sensor nodes, Cortex-M0+ beacons, LoRaWAN endpoint MCUs — this assumption fails before a single line of mission code runs. The full autonomic stack (EWMA baseline, Merkle health ledger, gossip table, EXP3-IX weight vector, event queue, vector clock) totals approximately 13 KB (illustrative value) of SRAM. On a 4–36 KB (illustrative value) device, the autonomic ceiling from Proposition 68 is 800–5,600 bytes (illustrative value): the stack exceeds its budget by \(8{-}16\times\) (illustrative value).

| Hardware Tier | Example MCU | SRAM | Autonomic Ceiling | Standard Stack | OBSERVE State | Max Capability |
|---|---|---|---|---|---|---|
| Ultra (L0) | STM8L151 | 4–36 KB | 0.8–5.6 KB | ~13 KB — infeasible | 140 B — feasible | OBSERVE only |
| Constrained (L1) | STM32L0 | 8–80 KB | 1.6–26.4 KB | ~13 KB — marginal | 140 B — feasible | OBSERVE + WAKEUP |
| Standard (L2) | STM32L4 | 64 KB | 12.8 KB | ~13 KB — feasible | 140 B — feasible | Full MAPE-K (no MAB) |
| Rich (L3+) | STM32H7 | 256 KB+ | 51 KB | ~13 KB — feasible | 140 B — feasible | Full stack |

Standard stack breakdown: EWMA state 80 B + Kalman matrices 80 B + Merkle health tree 8,160 B + gossip table 1,000 B + EXP3-IX weights 2,048 B + event queue 1,024 B + vector clock 200 B = 12,592 B \(\approx\) 13 KB. Zero-Tax OBSERVE state: hash chain 16 B + fixed-point EWMA 40 B + threshold vector 20 B + Bloom filter 60 B + state flags 4 B = 140 B — a \(65\times\) footprint reduction.

The Zero-Tax approach defers full stack initialization until anomaly evidence is quorum-confirmed.

Sensing cost is separate from logic cost. The Zero-Tax tier budgets for autonomic logic — the computation performed once sensor data is available. It does not budget for state estimation overhead: the power required to run sensors at the rate needed to keep the state estimate \(x(t)\) fresh enough for safety-critical checks. At baseline MAPE-K rates (0.2 Hz (illustrative value) for RAVEN), sensing cost is absorbed into the platform’s normal sensor budget. If the dCBF safety filter activates and requires high-rate IMU sampling (10–72 Hz (illustrative value)), the sensing cost rises by \(10\text{–}72\times\) (illustrative value) and must be drawn from the emergency power reserve — not the Zero-Tax logic budget. Deployments that treat Zero-Tax logic cost as the total autonomic overhead will underestimate power consumption during fault events by one to two orders of magnitude.

Definition 128 (Zero-Tax Autonomic State Machine). A three-state lazy-initialization machine with states OBSERVE, WAKEUP, ACTIVE and transitions:

\[\text{OBSERVE} \xrightarrow{\;z_t > \theta_0 \text{ for } \tau_{\mathrm{confirm}} \text{ ticks}\;} \text{WAKEUP} \xrightarrow{\;\text{quorum} = \text{FAULT}\;} \text{ACTIVE}\]

with back-transitions ACTIVE \(\to\) OBSERVE on three consecutive clean ticks and WAKEUP \(\to\) OBSERVE on a quorum = BENIGN verdict.

Resource consumption varies by state: OBSERVE (steady-state) requires the hash chain, fixed-point EWMA, threshold vector, and Bloom filter — 140 B SRAM total, with one SipHash-2-4 invocation per tick on Cortex-M0+; WAKEUP (anomaly suspected) adds the gossip init buffer and peer-validation queue for +2 KB SRAM with gossip at reduced rate; and ACTIVE (fault confirmed) enables the full MAPE-K stack at +11 KB SRAM with all capabilities enabled, reverting to OBSERVE after three consecutive clean ticks.

    
    stateDiagram-v2
    [*] --> OBSERVE
    OBSERVE --> WAKEUP : z_t > theta0 for tau_confirm ticks
    WAKEUP --> OBSERVE : quorum = BENIGN
    WAKEUP --> ACTIVE : quorum = FAULT
    ACTIVE --> OBSERVE : 3 consecutive clean ticks

The state machine allocates the active SRAM tier (140 B in OBSERVE / +2 KB in WAKEUP / +11 KB in ACTIVE) as a function of anomaly-evidence state, on any MCU where the full 13 KB stack exceeds the autonomic ceiling from Proposition 68 . The default confirmation window is tau_confirm = 3 ticks (illustrative value), with theta_0 set from the initial EWMA threshold; by deferring full MAPE-K allocation until anomaly evidence is quorum-confirmed, the machine prevents stack-induced mission starvation.
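The Definition 128 machine can be sketched directly from the state diagram above. This is an illustrative sketch: the enum names, the boolean `anomalous` input (standing in for \(z_t > \theta_0\)), and the struct layout are assumptions; the \(\tau_{\mathrm{confirm}} = 3\) default and the three-clean-tick back-transition come from the text.

```c
#include <assert.h>

/* Three-state lazy-initialization machine per Definition 128.
   Identifiers are illustrative, not the series' firmware API. */
typedef enum { ST_OBSERVE, ST_WAKEUP, ST_ACTIVE } zt_state_t;
typedef enum { Q_NONE, Q_BENIGN, Q_FAULT } quorum_t;

typedef struct {
    zt_state_t state;
    int anomaly_ticks;  /* consecutive ticks with z_t > theta0 */
    int clean_ticks;    /* consecutive clean ticks while ACTIVE */
    int tau_confirm;    /* default 3 per the text */
} zt_machine_t;

static void zt_tick(zt_machine_t *m, int anomalous, quorum_t q)
{
    switch (m->state) {
    case ST_OBSERVE:
        m->anomaly_ticks = anomalous ? m->anomaly_ticks + 1 : 0;
        if (m->anomaly_ticks >= m->tau_confirm) {
            m->state = ST_WAKEUP;        /* +2 KB allocation happens here */
            m->anomaly_ticks = 0;
        }
        break;
    case ST_WAKEUP:
        if (q == Q_BENIGN)      m->state = ST_OBSERVE;  /* false alarm */
        else if (q == Q_FAULT) { m->state = ST_ACTIVE; m->clean_ticks = 0; }
        break;
    case ST_ACTIVE:
        m->clean_ticks = anomalous ? 0 : m->clean_ticks + 1;
        if (m->clean_ticks >= 3) m->state = ST_OBSERVE; /* release +11 KB */
        break;
    }
}
```

Note that a single clean tick in OBSERVE resets the anomaly counter, so only sustained evidence triggers the (heap-gated) WAKEUP allocation.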

Definition 129 (In-Place Hash Chain). A health ledger using SipHash-2-4 applied iteratively to a 16-byte state register:

\[h[n] = \mathrm{SipHash\text{-}2\text{-}4}_{k_{\mathrm{root}}}\big(h[n-1] \,\|\, m[n]\big)\]

where \(m[n]\) is the 8-bit quantized metric vector at tick \(n\) and \(k_{\mathrm{root}}\) is a device-unique key provisioned at manufacture. Any corruption or replay of \(m[n]\) propagates into \(h[n]\) within one tick; a remote peer holding the last synchronized chain value detects divergence by comparing the 4-byte chain suffix.

| | Hash Chain | Merkle Tree (1,024 nodes) |
|---|---|---|
| SRAM | 16 B state + 4 B suffix | 8,192 B |
| CPU per tick | \(1\times\) SipHash | \(10\times\) hash operations |
| Per-node proof | No — sequence integrity only | Yes |
| FPU required | No | No |

The hash chain computes a 16-byte running integrity digest of the metric history since last sync, detecting tampering within one tick; it applies on Ultra/Constrained-tier MCUs where the Merkle health tree exceeds the autonomic ceiling from Proposition 68 . The key \(k_{\mathrm{root}}\) is provisioned at manufacture; the 4-byte suffix comparison distinguishes approximately \(2^{32}\) (theoretical bound) hash values, ensuring single-byte tampering shifts the chain within one MAPE-K tick.
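A minimal sketch of the chained digest, with SipHash-2-4 implemented per the published reference algorithm. Assumptions beyond the text: the `chain_update` name and its `h[2]` state layout, the fixed 8-byte metric width, and a little-endian host (matching the article's Cortex-M / RISC-V targets).

```c
#include <stdint.h>
#include <string.h>

/* SipHash-2-4 per the reference algorithm (2 compression, 4 finalization rounds). */
#define ROTL(x, b) (uint64_t)(((x) << (b)) | ((x) >> (64 - (b))))
#define SIPROUND do { \
    v0 += v1; v1 = ROTL(v1, 13); v1 ^= v0; v0 = ROTL(v0, 32); \
    v2 += v3; v3 = ROTL(v3, 16); v3 ^= v2; \
    v0 += v3; v3 = ROTL(v3, 21); v3 ^= v0; \
    v2 += v1; v1 = ROTL(v1, 17); v1 ^= v2; v2 = ROTL(v2, 32); \
} while (0)

static uint64_t siphash24(uint64_t k0, uint64_t k1, const uint8_t *in, size_t len)
{
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL, v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL, v3 = k1 ^ 0x7465646279746573ULL;
    uint64_t m, b = (uint64_t)len << 56;   /* length byte in the final word */
    size_t i;
    for (i = 0; len - i >= 8; i += 8) {
        memcpy(&m, in + i, 8);             /* little-endian host assumed */
        v3 ^= m; SIPROUND; SIPROUND; v0 ^= m;
    }
    for (; i < len; i++) b |= (uint64_t)in[i] << (8 * (i % 8));
    v3 ^= b; SIPROUND; SIPROUND; v0 ^= b;
    v2 ^= 0xff; SIPROUND; SIPROUND; SIPROUND; SIPROUND;
    return v0 ^ v1 ^ v2 ^ v3;
}

/* One chain tick: absorb the prior 16-byte state and the 8-byte quantized
   metric vector m8, writing the new digest into h[0] (h[1] keeps the prior
   word, preserving the 16-byte register). Illustrative layout. */
static void chain_update(uint64_t h[2], uint64_t k0, uint64_t k1, const uint8_t m8[8])
{
    uint8_t buf[24];
    memcpy(buf, h, 16);      /* h[n-1] */
    memcpy(buf + 16, m8, 8); /* m[n]   */
    h[1] = h[0];
    h[0] = siphash24(k0, k1, buf, 24);
}
```

The 4-byte chain suffix the text describes would be the low 32 bits of `h[0]`, attached to each beacon.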

Definition 98 (Fixed-Point EWMA). An exponentially weighted moving average in Q8.8 fixed-point arithmetic using only 16-bit integer operations:

\[\hat{\mu}[n] = \big(\alpha_{\mathrm{fp}} \cdot x[n] + (256 - \alpha_{\mathrm{fp}}) \cdot \hat{\mu}[n-1]\big) \gg 8\]

where \(\alpha_{\mathrm{fp}}\) is the smoothing coefficient encoded as an unsigned byte. The update compiles to 2 MUL + 1 ADD + 1 SHR on Cortex-M0+ (\(\approx\) 4 cycles at 32 MHz); no FPU instruction is required.

MCU implementation: maintain a signed 16-bit accumulator mu; on each tick, compute mu = (alpha_fp * x + (256 - alpha_fp) * mu) >> 8 using two 16-bit multiply instructions and one arithmetic right-shift — no floating-point unit required. The right-shift by 8 implements the division by 256 implicit in the Q8.8 representation.

For OUTPOST sensor nodes: \(\alpha_{\mathrm{fp}} = 26\) gives \(\alpha \approx 0.10\) (10-sample effective window). Total SRAM: 40 B (state 2 B + variance 2 B + threshold 2 B + 16-sample history 32 B + flags 2 B).
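The update rule above compiles to the following integer-only routine; the wrapper name and the `int32_t` intermediate (which prevents overflow of the 16×16 products) are illustrative choices.

```c
#include <stdint.h>
#include <assert.h>

/* Q8.8 fixed-point EWMA update from Definition 98:
   mu <- (alpha_fp*x + (256 - alpha_fp)*mu) >> 8.
   No FPU instruction; two multiplies, one add, one arithmetic shift. */
static int16_t ewma_update(int16_t mu_q88, int16_t x_q88, uint8_t alpha_fp)
{
    /* 32-bit accumulator: worst case |acc| < 256 * 32768, which fits int32 */
    int32_t acc = (int32_t)alpha_fp * x_q88 + (int32_t)(256 - alpha_fp) * mu_q88;
    return (int16_t)(acc >> 8);  /* arithmetic shift implements the /256 */
}
```

With `alpha_fp = 26` (the OUTPOST default, \(\alpha \approx 0.10\)), feeding a constant input drives the baseline to that input within a few dozen ticks, up to Q8.8 truncation.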

The fixed-point EWMA computes the baseline and running variance in Q8.8 arithmetic on Cortex-M0+, STM8, or AVR MCUs without FPU, where floating-point EWMA costs \(10{-}100\times\) (theoretical bound) more CPU per update. The default smoothing byte is alpha_fp = 26, giving \(\alpha \approx 0.10\) and a 10-sample effective window (illustrative value); alpha_fp = 51 gives \(\alpha \approx 0.20\) for faster-drifting signals (illustrative value). Soft-float EWMA on Cortex-M0+ costs approximately 50 cycles per update versus 4 cycles fixed-point (theoretical bound under illustrative parameters), eliminating FPU-dependency lock-in.

Proposition 69 (Wakeup Latency Bound). Under Definition 128, the worst-case transition latency from OBSERVE to ACTIVE satisfies:

\[T_{\mathrm{wakeup}} \leq \tau_{\mathrm{confirm}} \cdot T_{\mathrm{tick}} + T_{\mathrm{gossip}}\]

Three confirmation ticks plus one gossip round is all the time the system has to wake up — OUTPOST ’s 127-sensor mesh takes at most 45 seconds (theoretical bound), leaving 75 seconds (theoretical bound) of margin before the healing deadline.

where \(T_{\mathrm{gossip}}\) is the gossip convergence bound from Proposition 12 . For the Zero-Tax stack to preserve the healing deadline from Proposition 21 :

\[\tau_{\mathrm{confirm}} \cdot T_{\mathrm{tick}} + T_{\mathrm{gossip}} \leq T_{\mathrm{heal}}\]

Proof. The OBSERVE-to-WAKEUP transition requires \(\tau_{\mathrm{confirm}}\) consecutive anomaly ticks: at most \(\tau_{\mathrm{confirm}} \cdot T_{\mathrm{tick}}\) seconds. The WAKEUP-to-ACTIVE transition requires one gossip round for quorum formation: at most \(T_{\mathrm{gossip}}\) seconds. Sequential composition gives the upper bound. The second inequality is the necessary condition for Proposition 21 ’s end-to-end healing deadline to remain intact after wakeup overhead is deducted from the detect sub-budget. \(\square\)

| Scenario | tau_confirm | T_tick | T_gossip | T_wakeup | T_heal | Margin |
|---|---|---|---|---|---|---|
| OUTPOST (127 sensors) | 3 | 5 s | 30 s | \(\leq\) 45 s | 120 s | 75 s |
| RAVEN (47 drones) | 2 | 2 s | 15 s | \(\leq\) 19 s | 30 s | 11 s |
| Ultra-L0 OBSERVE-only | — | 5 s | — | does not transition | — | alert-only |

The proposition bounds the worst-case end-to-end wakeup latency from OBSERVE to full MAPE-K-ACTIVE, composed of the confirmation window (tau_confirm · T_tick) and gossip convergence ( Proposition 12 ); tau_confirm = 3 ticks (illustrative value) and T_heal from Proposition 21 minus T_detect determine the valid sizing envelope. An oversized tau_confirm allows the lazy-init stack to appear lighter in benchmarks while silently missing the healing deadline in production.
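The table rows are direct evaluations of the Proposition 69 bound; the helper name below is an illustrative assumption, the values come from the text.

```c
#include <assert.h>

/* Worst-case wakeup latency per Proposition 69:
   T_wakeup <= tau_confirm * T_tick + T_gossip (all times in seconds). */
static double wakeup_bound(int tau_confirm, double t_tick, double t_gossip)
{
    return (double)tau_confirm * t_tick + t_gossip;
}
```

The OUTPOST and RAVEN checks below reproduce the 45 s / 19 s bounds and the 75 s / 11 s margins against their healing deadlines.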

Empirical status: The 45 s ( OUTPOST ) and 19 s ( RAVEN ) wakeup latency figures are worst-case bounds computed from \(\tau_{\mathrm{confirm}} \cdot T_{\mathrm{tick}} + T_{\mathrm{gossip}}\) using the Proposition 12 gossip bound; actual wakeup in low-congestion field conditions is typically 30–50% (illustrative value) faster, and \(\tau_{\mathrm{confirm}}\) should be re-tuned if the ambient false-anomaly rate changes deployment-to-deployment to avoid trading margin for sensitivity.

Watch out for: the bound assumes the gossip quorum can form during the WAKEUP state; if the fault that triggered the OBSERVE-to-WAKEUP transition is itself a network partition that simultaneously prevents gossip convergence — the most common compound failure mode — the WAKEUP-to-ACTIVE transition stalls indefinitely, and the healing deadline from Proposition 21 is missed not because tau_confirm was oversized, but because the quorum formation assumption was violated by the fault that triggered wakeup in the first place.

The autonomic richness trade-off. The Zero-Tax architecture trades richness for feasibility: a node in OBSERVE state cannot run EXP3-IX bandit selection, Kalman filtering, or the Weibull circuit breaker — those require ACTIVE state and the full 13 KB stack. This materializes the constraint sequence of Definition 92 as an economic ordering, not just a logical one. Self-measurement (hash chain + fixed-point EWMA) costs 140 B. Self-healing ( Proposition 21 deadline loop) costs another 11 KB. Anti-fragile learning (EXP3-IX, Kalman) costs another 4 KB. A 4 KB node gets exactly one capability tier — OBSERVE — and must accept that it cannot self-heal, only self-detect and alert. The constraint is not a software limitation: it is the physics of SRAM against the mathematics of autonomy.

Definition 99 (Clock Trust Pivot). A binary predicate on node \(i\) that sets the trust_flag field of Definition 101 based on whether the partition accumulator \(T_{\mathrm{acc}}\) has exceeded the platform-specific trust horizon \(T_{\mathrm{trust}}\):

\[\mathrm{trust\_flag}_i = \begin{cases} 1 & \text{if } T_{\mathrm{acc}} \leq T_{\mathrm{trust}} \\ 0 & \text{otherwise} \end{cases}\]

For an oscillator with drift \(\delta_{\mathrm{ppm}}\), the trust horizon satisfies \(T_{\mathrm{trust}} \leq \varepsilon / (\delta_{\mathrm{ppm}} \cdot 10^{-6})\), where \(\varepsilon\) is the clock uncertainty bound from Definition 11 . At \(\delta_{\mathrm{ppm}} = 5\) (illustrative value) and \(\varepsilon = 4.6\,\mathrm{s}\) (illustrative value): \(T_{\mathrm{trust}} \approx 9.2 \times 10^5\,\mathrm{s} \approx 10.6\) days (illustrative value). Receivers must not use \(T_{\mathrm{acc}}\) from a sender with trust_flag = 0 as a causal-ordering tiebreaker.

The Clock Trust Pivot computes a single-bit trust signal written to the UAH flags byte at every MAPE-K tick, requiring zero additional SRAM beyond the flags byte already present in Definition 101 ; consumers of the Conflict Resolution Branch ( Definition 11 ) read trust_flag before using the sender’s \(T_{\mathrm{acc}}\) as a tiebreaker for uncertainty-concurrent events. The drift rate \(\delta_{\mathrm{ppm}}\) is drawn from the oscillator datasheet: 5 ppm (illustrative value) for TCXO-class crystals ( RAVEN drones), 20 ppm (illustrative value) for uncalibrated RC oscillators ( OUTPOST Ultra-L0 sensor nodes). A stale trust_flag = 1 allows the Conflict Resolution Branch to treat a drifted \(T_{\mathrm{acc}}\) as authoritative, producing the exact physical-time inversion that Definition 11 ’s uncertainty window was designed to prevent.
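The horizon arithmetic follows from drift physics: an oscillator off by \(\delta_{\mathrm{ppm}}\) parts per million accumulates \(\delta_{\mathrm{ppm}} \times 10^{-6}\) seconds of error per second, so the time to accumulate the uncertainty bound \(\varepsilon\) is \(\varepsilon / (\delta_{\mathrm{ppm}} \times 10^{-6})\). A sketch under those assumptions (function names are illustrative):

```c
#include <assert.h>

/* Trust horizon: seconds of partition time before accumulated oscillator
   drift reaches the clock uncertainty bound eps. Names are illustrative. */
static double trust_horizon_s(double eps_s, double delta_ppm)
{
    return eps_s / (delta_ppm * 1e-6);
}

/* trust_flag bit: 1 while the partition accumulator is inside the horizon */
static int trust_flag(double t_acc_s, double t_trust_s)
{
    return t_acc_s <= t_trust_s ? 1 : 0;
}
```

With the TCXO figures from the text (5 ppm, \(\varepsilon = 4.6\) s) this gives a horizon near \(9.2 \times 10^5\) s; the 20 ppm RC oscillators on Ultra-L0 nodes quarter that.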

Definition 100 (WAKEUP Heap Gate). The contiguous-allocation precondition for the OBSERVE-to-WAKEUP transition of Definition 128:

\[G_{\mathrm{heap}} = \big[\,\mathrm{heap\_free\_contiguous} \geq A_{\mathrm{WAKEUP}}\,\big]\]

where \(A_{\mathrm{WAKEUP}}\) is the contiguous WAKEUP allocation (approximately 2 KB). If \(G_{\mathrm{heap}}\) fails, the anomaly-evidence counter (\(z_t > \theta_0\) for \(\tau_{\mathrm{confirm}}\) consecutive ticks) is registered but the state transition is suppressed. The node remains in OBSERVE, continues hash-chain and fixed-point EWMA updates, and retries the gate at the next confirmation cycle. Definition 102 ’s Resource threshold is precisely this gate condition — the heap gate becoming permanently blocked is one of the three AES trigger conditions.

The WAKEUP Heap Gate computes heap availability at the OBSERVE/WAKEUP boundary, feeding both the transition guard of Definition 128 and the AES Resource condition of Definition 102 ; the check runs once per \(\tau_{\mathrm{confirm}}\)-tick anomaly confirmation window, not per-tick, to avoid allocation churn on fragmented heaps. The minimum allocation is approximately 2 KB (theoretical bound) — the WAKEUP-state gossip init buffer plus peer-validation queue; fleets larger than 127 nodes (illustrative value) whose gossip table exceeds 1 KB require a correspondingly larger allocation. Without the gate, a fragmented heap may return a non-NULL pointer for a smaller block and silently corrupt the gossip buffer layout — a failure mode undetectable until the first quorum vote.
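A hedged sketch of the gate: probe for the full contiguous allocation before permitting the transition, and stay in OBSERVE on failure. The probe-by-`malloc` tactic and function name are illustrative assumptions — a fielded MCU build would query its allocator's largest-free-block statistic directly rather than allocate-and-free.

```c
#include <stdlib.h>
#include <assert.h>

/* WAKEUP heap gate sketch: returns 1 (gate open) only if a contiguous block
   of wakeup_bytes can be obtained. On failure the caller suppresses the
   OBSERVE->WAKEUP transition and retries at the next confirmation window. */
static int wakeup_heap_gate(size_t wakeup_bytes)
{
    void *probe = malloc(wakeup_bytes);  /* contiguous-allocation probe */
    if (probe == NULL)
        return 0;                        /* gate closed: remain in OBSERVE */
    free(probe);                         /* real allocation happens on transition */
    return 1;
}
```

Probing for the full size is what distinguishes this from a free-bytes check: total free heap can exceed 2 KB while no contiguous 2 KB block exists.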

Definition 101 (Unified Autonomic Header — Firmware Memory Map). The UAH is the 20-byte packed struct that (a) occupies the first 20 bytes of the 140 B OBSERVE static allocation and (b) forms the wire header of every inter-node frame and the 23-byte emergency beacon. Its bit-field layout is the strict byte-level serialization of the 8-stage constraint sequence (Definition 92): each field group maps to exactly one phase gate.

Bit-field register map (160 bits = 20 bytes, little-endian):

| Byte(s) | Bits | Field | Width | Encoding and source |
|---|---|---|---|---|
| 0 | [7:4] | q_i | 4 b | Capability tier: 0=L0 … 4=L4 (Definition 68); user-specified 4-bit Mode field |
| 0 | [3:2] | zt_state | 2 b | 00=OBSERVE, 01=WAKEUP, 10=ACTIVE, 11=AES (Definition 128) |
| 0 | [1] | nsg_veto | 1 b | 1 = hardware veto active; \(K_{\mathrm{gs}} = 0\) (Proposition 32) |
| 0 | [0] | trust_flag | 1 b | 1 = HLC trusted; 0 = drift exceeded \(T_{\mathrm{trust}}\) (Definition 99) |
| 1 | [7:0] | ep_lo | 8 b | energy_delta[7:0]: low byte of 12-bit signed energy surplus |
| 2 | [7:4] | ep_hi | 4 b | energy_delta[11:8]: high nibble; sign bit in position 11 |
| 2 | [3:0] | rq_hi | 4 b | rho_q[11:8]: high nibble of CBF margin \(\rho_{q,i}\) |
| 3 | [7:0] | rq_lo | 8 b | rho_q[7:0]: low byte; combined Q3.9 range \([-4,+4)\) mW |
| 4–7 | [31:0] | hlc_pt | 32 b | HLC physical timestamp, ms mod \(2^{32}\) (Definition 61); user-specified 32-bit HLC field |
| 8–11 | [31:0] | hlc_c | 32 b | HLC Lamport counter (Definition 61) |
| 12–15 | [31:0] | t_acc | 32 b | Partition accumulator \(T_{\mathrm{acc}}\), seconds (Definition 15) |
| 16–19 | [31:0] | h_sfx | 32 b | SipHash-2-4 4-byte chain suffix (Definition 129) |

Total: \(8+8+4+4+8+32+32+32+32 = 160\) bits = 20 bytes. The 12-bit energy_delta (byte 1 plus the high nibble of byte 2) is the user-specified 12-bit Energy Delta field: it encodes the signed energy surplus derived from the fixed-point EWMA variance. The 12-bit rho_q is the CBF stability margin from Proposition 25 .

Firmware type contract (ARM Cortex-M / RISC-V, __attribute__((packed)), little-endian):

The wire type UAH_t is a 20-byte packed struct with __attribute__((packed)) and a compile-time static_assert(sizeof(UAH_t) == 20) to catch alignment padding. Without packed, GCC 13 on Cortex-M4 inserts 3 bytes after the flags byte, inflating the struct to 24 B and breaking every receiver’s field parser. The three 8-bit bytes ep_lo, ep_hi_rq_hi, and rq_lo carry two nibble-packed 12-bit signed fields: the signed energy surplus in ep_lo plus the high nibble of ep_hi_rq_hi, and the CBF stability margin \(\rho_{q,i}\) in rq_lo plus the low nibble of ep_hi_rq_hi. Both fields are Q3.9 signed integers with range \(\pm 4\,\text{mW}\), recovered by a 4-bit arithmetic right-shift of the 16-bit value after nibble assembly.

OBSERVE flat struct layout (140 bytes, statically allocated at boot):

| Offset | Size | Symbol | Content |
|---|---|---|---|
| +0 | 20 B | uah | UAH struct (this definition) — updated in-place at each MAPE-K tick |
| +20 | 16 B | siphash_state[2] | SipHash-2-4 running state \(h[n]\), uint64_t[2] (Definition 129); key \(k_{\mathrm{root}}\) in flash |
| +36 | 2 B | ewma_mu | Q8.8 EWMA baseline \(\hat{\mu}[n]\) (Definition 98) |
| +38 | 2 B | ewma_var | Q8.8 running variance \(\hat{\sigma}^2[n]\) (Definition 98) |
| +40 | 2 B | ewma_thresh | Q8.8 anomaly threshold \(\theta_0\) |
| +42 | 2 B | ewma_hyst | Q8.8 hysteresis band \(\delta_h\) (Definition 47) |
| +44 | 1 B | alpha_fp | EWMA smoothing byte \(\alpha_{\mathrm{fp}}\) (Definition 98) |
| +45 | 1 B | confirm_cnt | Consecutive anomaly tick counter |
| +46 | 2 B | ring_idx | 16-sample EWMA history ring head index |
| +48 | 32 B | ewma_hist[16] | 16-sample Q8.8 metric history ring (int16_t[16]) |
| +80 | 8 B | bloom_seeds[4] | Bloom filter seed keys (uint16_t[4]) |
| +88 | 52 B | bloom_bits[52] | 416-bit Bloom filter bit array (Definition 24 gossip fingerprint) |
| +140 | | | (end of OBSERVE struct) |

OBSERVE_t is a 140-byte packed struct with __attribute__((packed)) and static_assert(sizeof(OBSERVE_t) == 140). It is declared as static OBSERVE_t obs_block at global scope so the linker places it in .bss (zero-initialized at boot) — no heap allocation is needed. The leading field is UAH_t uah (bytes 0–19, Definition 101 ), followed by uint64_t siphash_state[2] (bytes 20–35, in-place hash chain state, Definition 129 ). The remaining 104 bytes hold the six EWMA scalars as Q8.8 int16_t values, the 16-sample history ring as int16_t[16], and the 416-bit Bloom filter as uint8_t[52] (see offset table above). obs_block.uah is passed by pointer as the TX header — zero copy.

The UAH defines the 20-byte wire format for every inter-node frame and the 23-byte AES beacon (UAH + node_id 2 B + AES error code 1 B), along with the canonical byte-offset map for the 140 B OBSERVE static allocation. The struct is declared static OBSERVE_t obs_block at global scope so the linker places it in .bss (zero-initialized at boot); siphash_state is initialized from \(k_{\mathrm{root}}\) read from flash OTP before the first MAPE-K tick, and obs_block.uah is passed by pointer as the TX header with zero copy. Little-endian byte order matches ARM Cortex-M and RISC-V; big-endian targets must byte-swap hlc_pt, hlc_c, t_acc, and h_sfx at TX/RX, while ep_hi_rq_hi is endian-neutral. Without __attribute__((packed)), GCC 13 on Cortex-M4 inserts 3 bytes of alignment padding after flags, inflating UAH_t from 20 B to 24 B and silently breaking every receiver’s field parser; the static_assert catches this at compile time.
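The register map above translates directly into a packed struct plus a nibble-assembly accessor. A sketch under stated assumptions: the flags byte is kept as a raw `uint8_t` (bit-field packing order is compiler-dependent, so explicit masks are safer on a wire format), and `uah_energy_delta` is an illustrative helper, not a function named in the series.

```c
#include <stdint.h>

/* 20-byte UAH wire header per the Definition 101 register map.
   The packed attribute is load-bearing: without it the compiler inserts
   alignment padding after the flags byte and breaks the wire format. */
typedef struct __attribute__((packed)) {
    uint8_t  flags;        /* [7:4] q_i, [3:2] zt_state, [1] nsg_veto, [0] trust_flag */
    uint8_t  ep_lo;        /* energy_delta[7:0] */
    uint8_t  ep_hi_rq_hi;  /* [7:4] energy_delta[11:8], [3:0] rho_q[11:8] */
    uint8_t  rq_lo;        /* rho_q[7:0] */
    uint32_t hlc_pt;       /* HLC physical timestamp, ms mod 2^32 */
    uint32_t hlc_c;        /* HLC Lamport counter */
    uint32_t t_acc;        /* partition accumulator, seconds */
    uint32_t h_sfx;        /* SipHash chain 4-byte suffix */
} UAH_t;

_Static_assert(sizeof(UAH_t) == 20, "alignment padding broke the wire format");

/* Recover the signed 12-bit Q-format energy_delta: assemble the nibbles into
   the top 12 bits of a 16-bit value, then arithmetic-shift right by 4 to
   sign-extend (gcc/clang shift signed values arithmetically). */
static int16_t uah_energy_delta(const UAH_t *u)
{
    int16_t v = (int16_t)(((uint16_t)(u->ep_hi_rq_hi & 0xF0) << 8) |
                          ((uint16_t)u->ep_lo << 4));
    return (int16_t)(v >> 4);
}
```

The symmetric accessor for `rho_q` would use the low nibble of `ep_hi_rq_hi` with `rq_lo`, following the same assemble-then-shift pattern.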

Proposition 70 (Firmware Memory Footprint). The Zero-Tax autonomic stack satisfies at every OBSERVE-state MAPE-K tick:

\[\mathrm{SRAM}_{\mathrm{static}} + \mathrm{SRAM}_{\mathrm{stack}} = 140\,\mathrm{B} + 48\,\mathrm{B} = 188\,\mathrm{B} \leq 200\,\mathrm{B}\]

The SipHash chain, fixed-point EWMA, Bloom filter, and UAH header together occupy 140 bytes of static RAM and borrow 48 bytes of stack — a \(65\times\) reduction from the 13 KB full autonomic stack, fitting on any 4 KB MCU.

Proof. Static allocation (140 B): OBSERVE_t is a compile-time constant size, placed in .bss, and never freed. Heap is not required — Definition 100 guarantees heap failure suppresses WAKEUP, not OBSERVE. Stack: (i) callee-saved registers on Cortex-M0+ — LR + r4–r7 = 5 words = 20 B; (ii) SipHash-2-4 internal call frame — \(4 \times 32\)-bit working words + return address = 20 B; SipHash operates on siphash_state in the static struct, so no secondary buffer is needed; (iii) EWMA update — acc (int32) + x_raw (int16) + delta (int16) = 8 B. ep_hi_rq_hi is computed in a single 8-bit register with no stack spill. The beacon ISR passes &obs_block.uah by pointer — zero stack copy. Total stack peak: \(20 + 20 + 8 = 48\) B; \(140 + 48 = 188 \leq 200\) B. \(\square\)

| Component | Static | Stack | Constraint-sequence phase |
|---|---|---|---|
| uah (Definition 101) | 20 B | — | Phase 0 — frame header: all phases read/write q_i, zt_state |
| SipHash state | 16 B | — | Phase 1 — self-measurement: integrity ledger (Definition 129) |
| EWMA scalars (\(\hat{\mu}, \hat{\sigma}^2, \theta_0, \delta_h, \ldots\)) | 12 B | — | Phase 1 — self-measurement: anomaly baseline (Definition 98) |
| EWMA history ring | 32 B | — | Phase 1 — self-measurement: detection sensitivity |
| Bloom filter | 60 B | — | Phase 2 — gossip readiness: peer fingerprint cache (Definition 24) |
| Callee-saved registers | — | 20 B | ISR overhead: platform-invariant on Cortex-M0+ |
| SipHash call frame | — | 20 B | Phase 1: per-tick transient; cleared after hash |
| EWMA computation temps | — | 8 B | Phase 1: per-tick transient |
| Total | 140 B | 48 B | \(188\,\text{B} \leq 200\,\text{B}\) |

Empirical status: The 188 B total (140 B static + 48 B stack) is exact for Cortex-M0+ with GCC 13 at -Os; the stack frame varies \(\pm 4\) B across compilers depending on callee-save conventions, and the 200 B ceiling leaves 12 B margin that should be verified with a static_assert on each new target platform rather than assumed portable.

Watch out for: the 188 B bound is exact for Cortex-M0+ with GCC 13 at -Os on little-endian ARM, but is not portable across compilers and targets without verification; a platform that silently adds alignment padding inflates the static allocation beyond the 200 B ceiling without violating any runtime condition, making the overflow detectable only when the static_assert(sizeof(OBSERVE_t) == 140) fails at compile time — if that assert is omitted, the footprint violation propagates silently into production.

Zero-Tax scope — RAM and compute only, not radio energy: Proposition 70 proves zero dynamic RAM allocation — not zero radio energy. At SF12, the 23-byte UAH beacon adds approximately \(205\,\mu\text{W}\) to a \(100\,\mu\text{W}\) L0 budget at the default 60 s interval. For SF10–SF12 nodes the beacon interval must therefore be extended well beyond the 60 s default; the \(T_{\text{beacon,min}}\) bound below gives the minimum safe interval. Extending the interval saves energy but creates a detection latency trade-off: \(T_{\text{detect}} \leq T_{\text{crit}}\) is the deployment invariant for mesh-visible L1 anomaly detection.

Zero-Tax radio overhead. The 23-byte UAH beacon ( Definition 101 ) is 15 bytes larger than a minimal 8-byte heartbeat. Those 15 extra bytes extend radio-on time \(\Delta t_{\text{on}}\) and consume real energy: \(\Delta E_{\text{beacon}} = P_{\text{tx}} \cdot \Delta t_{\text{on}}\).

At LoRa SF7/125 kHz, 14 dBm (25 mW), the beacon overhead at a 60 s interval adds roughly \(13\,\mu\text{W}\) (illustrative value) — 13% of a \(100\,\mu\text{W}\) L0 budget. At LoRa SF12/125 kHz (OUTPOST long-range sensors), the far slower data rate stretches radio-on time by more than an order of magnitude; at a 60 s interval the overhead rises to approximately \(205\,\mu\text{W}\) (illustrative value), exceeding the entire \(100\,\mu\text{W}\) L0 power budget.

The UAH header is a bounded, predictable overhead that must be offset by extending the beacon interval. The minimum safe interval to keep UAH radio overhead below fraction \(f\) of total \(P_{L0}\) is:

\[T_{\text{beacon,min}} = \frac{\Delta E_{\text{beacon}}(SF)}{f \cdot P_{L0}}\]

For OUTPOST Ultra-L0 (\(P_{L0} = 100\,\mu\text{W}\), SF12, \(f = 0.10\)), the resulting minimum interval is far above the 60-second default. The 60-second default ( Definition 102 ) is valid for SF7–SF9 deployments. Nodes above \(500\,\mu\text{W}\) budget (RAVEN drones, CONVOY vehicles) are unconstrained at any standard interval.
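As a minimal numeric sketch of the \(T_{\text{beacon,min}}\) bound above: the per-beacon energy below is back-derived from the article's "~205 µW at a 60 s interval" SF12 figure and is illustrative, not a measured value.

```python
# Sketch of T_beacon,min = dE_beacon(SF) / (f * P_L0).
# The SF12 energy figure is illustrative, back-derived from the ~205 uW @ 60 s example.

def min_beacon_interval_s(delta_e_beacon_j: float, f: float, p_l0_w: float) -> float:
    """Smallest interval keeping UAH radio overhead below fraction f of P_L0."""
    return delta_e_beacon_j / (f * p_l0_w)

# SF12: ~205 uW average overhead at a 60 s interval implies ~12.3 mJ per beacon.
delta_e_sf12_j = 205e-6 * 60.0
t_min_s = min_beacon_interval_s(delta_e_sf12_j, f=0.10, p_l0_w=100e-6)
assert t_min_s > 60.0  # the 60 s default is infeasible at SF12 under f = 10%
```

Under these illustrative numbers the SF12 minimum interval lands above 1,000 s, well past the 300 s floor quoted for SF10–SF12 nodes.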

Latency-Energy Deadlock — you cannot heal what you cannot hear. Extending \(T_{\text{beacon}}\) saves energy but directly increases detection latency for node failures. A failed node is confirmed absent only after two consecutive missed beacons (one miss is indistinguishable from a packet loss in a lossy LoRa mesh):

\[T_{\text{detect}} = 2\,T_{\text{beacon}} + \tau_{\text{gossip}}\]

where \(\tau_{\text{gossip}}\) is the gossip convergence time (\(O(D \ln n / \lambda)\) rounds, Proposition 12 ). For the OUTPOST SF12 case, \(T_{\text{detect}}\) exceeds \(600\,\text{s}\) plus gossip convergence (theoretical bound). The L1 anomaly detection layer is operationally suspended for node \(k\) whenever:

\[T_{\text{detect}}(k) > T_{\text{crit}}\]

For OUTPOST SF12: L1 detection is valid only for failure classes with \(T_{\text{crit}} \geq T_{\text{detect}}\). The MVS backup-power failure ( (illustrative value), 90 min) survives this constraint with a 15% (illustrative value) margin. Short-window failure classes do not: a sensor watchdog trip ( (illustrative value)) or a threat-detection data gap ( (illustrative value)) cannot be fleet-detected within their criticality window at SF12 beacon intervals. For those failure classes, gossip-based fleet anomaly detection (Self-Measurement Without Central Observability) is suspended; the node continues to self-heal locally but becomes invisible to fleet-level coordination.

This deadlock condition must be declared in the Phase Gate progress record ( Definition 103 ): a node with \(T_{\text{detect}} > T_{\text{crit}}\) for any of its assigned failure classes cannot certify Phase Gate 3 for those classes and must be classified L0 from the fleet’s perspective until either the beacon interval is reduced (higher power mode) or the failure class is reclassified as out-of-scope for remote detection. The constraint is the irreducible physics of energy-constrained radio: \(T_{\text{detect}} \leq T_{\text{crit}}\) is the deployment invariant for mesh-visible L1 anomaly detection.
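The deadlock check above can be sketched directly: a failure class is fleet-detectable only if two missed beacons plus gossip convergence fit inside its criticality window. All numeric values below are illustrative placeholders, not the article's measurements.

```python
# Sketch of the latency-energy deadlock check for mesh-visible L1 detection.

def detection_latency_s(t_beacon_s: float, tau_gossip_s: float) -> float:
    """Two consecutive missed beacons plus gossip convergence time."""
    return 2.0 * t_beacon_s + tau_gossip_s

def mesh_visible(t_crit_s: float, t_beacon_s: float, tau_gossip_s: float) -> bool:
    """Deployment invariant: T_detect <= T_crit."""
    return detection_latency_s(t_beacon_s, tau_gossip_s) <= t_crit_s

# Illustrative SF12 case: 300 s beacons, 120 s gossip convergence -> T_detect = 720 s.
assert mesh_visible(t_crit_s=90 * 60, t_beacon_s=300.0, tau_gossip_s=120.0)   # 90 min window: ok
assert not mesh_visible(t_crit_s=60.0, t_beacon_s=300.0, tau_gossip_s=120.0)  # 60 s trip: invisible
```

A class that fails the check is not "slower to detect"; it is out of scope for remote detection at that beacon interval.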

Cross-Part Cascade: L0 Resource Model

Placing a node in OBSERVE state is not a local decision. Because every other part of the series assumes a running MAPE-K loop, the OBSERVE-state resource model propagates as a cascade of assumption violations through the formal machinery of each downstream part. The table below maps each upstream result to the assumption it requires, the violation OBSERVE creates, and the resulting impact.

| Part | Formal result | Assumption required | OBSERVE violation | Cascade impact |
|---|---|---|---|---|
| Self-Measurement | Proposition 9 optimal threshold | Float-precision EWMA | Q8.8 fixed-point EWMA adds quantization noise | Optimal threshold rises; false-positive rate increases ~15% without recalibration |
| Self-Measurement | Proposition 12 gossip convergence | Gossip protocol running (Definition 24) | Gossip disabled in OBSERVE; health vector frozen | Staleness bound of Proposition 14 exceeded within one MAPE-K window; peers treat OBSERVE node as soft-failed |
| Self-Measurement | Definition 26 staleness bound | Node participates in gossip rounds | Zero gossip rounds from OBSERVE node | Staleness effectively unbounded for the OBSERVE node; staleness-aware consumers must treat its state as unverified |
| Self-Healing | Proposition 22 loop stability | Continuous MAPE-K with fixed period | MAPE-K disabled in OBSERVE | Stability guarantee vacuously satisfied (no actuation); CBF margin still computed and embedded in UAH flags |
| Self-Healing | Proposition 21 healing deadline | MAPE-K running at partition start | Healing does not start until ACTIVE (after wakeup latency) | Effective healing margin = deadline minus wakeup latency; RAVEN: 30 s - 19 s = 11 s; OUTPOST: 120 s - 45 s = 75 s |
| Self-Healing | Proposition 25 CBF mode safety | Actuation loop active; mode transitions possible | No actuation in OBSERVE; NSG veto never fires (no actuation to veto) | Safe-set invariant preserved trivially; mode-transition safety analysis irrelevant until WAKEUP |
| Fleet Coherence | Proposition 41 divergence growth | Node updates state at its local write rate | No local writes in OBSERVE; node accumulates only incoming peer writes | Divergence grows from the peer side at full rate; buffer sizing must still account for the OBSERVE node’s post-wakeup delta |
| Fleet Coherence | Definition 70 delta-sync Phase 2 | Vector clock reflects local events | Vector clock frozen (no gossip events); delta set empty | Receiver of UAH with zt_state = 00 skips comparison; uses the frame only for integrity check |
| Fleet Coherence | Definition 99 clock trust pivot | Accumulator reflects active partition duration | Accumulator still increments in OBSERVE | OUTPOST crystal node hits the pivot threshold regardless of Zero-Tax state; trust_flag = 0 fires in OBSERVE if partition exceeds 2.8 h |
| Anti-Fragile Learning | Definition 81 EXP3-IX weight update | MAPE-K tick triggers arm evaluation | No arm evaluation in OBSERVE; weights frozen | On WAKEUP, EXP3-IX reinits at uniform weights; long-partition context (Definition 87) not built; first healing actions are exploratory, not optimized |
| Weibull Model | Proposition 37 circuit breaker | Accumulator incremented by active MAPE-K at each tick | Accumulator still increments passively in OBSERVE (hash chain tick, Definition 129); breaker threshold reached when partition exceeds the P95 Weibull quantile | Breaker fires correctly in OBSERVE: q_i = L0 is already in effect; the only new action is setting the UAH capability level to L0 explicitly, at zero additional SRAM cost; if all three AES conditions co-occur, Definition 102 supersedes Proposition 37 and the AES bitmask (bit 0 = Resource) encodes the breaker event |

The cascade has a single structural pattern: every result that assumes a running MAPE-K loop is vacuously satisfied or violated in OBSERVE, and every result that operates on passive signals (the partition accumulator, hash chain, CBF margin) continues to fire correctly.

Proposition 37 ’s Weibull circuit breaker belongs to the second category: it monitors the partition accumulator passively and fires at its P95 threshold whether MAPE-K is running or not. The UAH ( Definition 101 ) was designed precisely around this split: its Clock Fix and Resource Fix fields update in OBSERVE; its Stability Fix fields and Zero-Tax state bits (zt_state) signal to every receiver exactly which assumptions are currently violated on the sender.

Autonomic Emergency State: Triple-Threat Survival

The three fixes reach their individual theoretical limits gracefully: the clock pivot fires at its drift threshold, the OBSERVE state activates when the WAKEUP allocation fails, and the CBF sets nsg_veto = 1 when the stability margin goes negative. The Black Swan scenario is when all three trigger simultaneously on the same node: nonlinear power state (mode oscillation violating dwell-time), 1-hour clock drift (above the threshold for every platform class), and 95% RAM full (WAKEUP allocation fails due to heap fragmentation). The fixes were designed independently; this section proves they remain correct when stacked.

Triple-Threat state analysis. On a 64 KB STM32L4 at 95% RAM utilization: free SRAM = 3.2 KB. The WAKEUP transition requests a 2 KB contiguous allocation. Due to heap fragmentation (worst-case: largest free block smaller than the 2 KB request), the allocation fails. The OBSERVE footprint (140 B) was pre-allocated as a static struct at boot; it is not heap-dependent and cannot be fragmentation-evicted. Clock drift = 3,600 s exceeds the drift threshold for every platform (RAVEN: threshold 1.1 s; CONVOY: 6 s; OUTPOST: 1 s + 60 s = 61 s). The Drift-Quarantine anomaly fires on any gossip contact. CBF margin \(\rho_q < 0\): the node is outside the safe set, so nsg_veto = 1 in the UAH.

Two-layer partition response. Proposition 37 (Weibull Circuit Breaker) is a single-condition outer envelope: it fires at P95 partition duration and constrains new decisions without entering AES. Definition 102 (AES) is the inner fallback: it requires resource AND clock AND stability limits to breach simultaneously before freezing all execution. A node can be circuit-broken without being in AES; AES entry implies the circuit breaker has already fired.

Definition 102 (Autonomic Emergency State). The Autonomic Emergency State (AES) activates on a node when all three threat conditions are simultaneously satisfied:

where the three quantities referenced are the contiguous WAKEUP allocation size ( Definition 128 ), the HLC watermark deviation, and the CBF stability margin from Proposition 25 .

Triggering logic. AES activates only when all three conditions breach their limits simultaneously (logical AND, not OR). Individual limit breaches trigger their own single-condition responses: Proposition 37 (Weibull circuit breaker) fires on partition duration alone; load shedding fires on resource depletion alone. AES is reserved for compound failure — the intersection of three simultaneous breaches — which is the scenario where individual responses are insufficient.

In AES, the node executes exactly four actions and no others: it freezes by halting all CRDT writes and delta-sync transmissions; it signals by setting UAH flags (trust_flag = 0, zt_state = 00 (OBSERVE), nsg_veto = 1); it persists by continuing the hash chain update ( Definition 129 ) and the partition-accumulator increment so passive monitoring survives; and it beacons by transmitting a 23-byte emergency frame every \(T_{\text{beacon}}\) (default 60 s): UAH (20 B) + node_id (2 B) + AES error code (1 B).
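The AND-combined trigger and the four-action response can be sketched as follows; the condition names and the 61 s drift limit are illustrative stand-ins for the formal predicates of Definition 102.

```python
# Sketch of AES triggering (logical AND of three threat conditions) and the
# four-action response. Thresholds are illustrative, not the certified values.

def aes_triggered(wakeup_alloc_ok: bool, hlc_drift_s: float,
                  cbf_margin: float, drift_limit_s: float) -> bool:
    """All three limits must breach simultaneously (AND, not OR)."""
    resource_breach = not wakeup_alloc_ok        # contiguous WAKEUP alloc failed
    clock_breach = hlc_drift_s > drift_limit_s   # HLC watermark deviation over limit
    stability_breach = cbf_margin < 0.0          # outside the CBF safe set
    return resource_breach and clock_breach and stability_breach

def aes_actions() -> list[str]:
    """The four actions a node in AES executes, and nothing else."""
    return ["freeze", "signal", "persist", "beacon"]

# Triple-Threat: alloc failure + 3600 s drift + negative CBF margin -> AES fires.
assert aes_triggered(False, 3600.0, -0.1, drift_limit_s=61.0)
# A single breach alone gets its own single-condition response, not AES.
assert not aes_triggered(False, 0.0, 1.0, drift_limit_s=61.0)
```

The AND combination is the design point: each individual breach already has a cheaper dedicated response, so AES is reserved for the compound case.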

Exit condition: all three threat conditions must resolve simultaneously before re-entering OBSERVE ( Definition 128 ).

The AES defines the minimal-survivable operating mode when all three structural constraints reach their limits simultaneously, requiring only 140 B static SRAM and a 23-byte beacon with no heap. The state activates automatically on simultaneous triple-threshold breach, with each condition checked at every MAPE-K tick; the beacon interval is 60 s (illustrative value) for LoRa SF7–SF9 and high-power nodes (above \(500\,\mu\text{W}\) budget (illustrative value)), and \(\geq 300\,\text{s}\) (theoretical bound under illustrative parameters) for SF10–SF12 nodes at \(100\,\mu\text{W}\); see the Zero-Tax Radio Overhead note ( Proposition 70 ). The AES error code encodes which subset triggered (bitmask: bit 0 = Resource, bit 1 = Clock, bit 2 = Stability). Without AES, independent healing actions attempt simultaneous recovery and compete for the same 3.2 KB (theoretical bound under illustrative parameters) of fragmented heap.

Two-layer AES architecture. AES spans an application layer and a firmware layer with distinct semantics. The application layer (continuous scalar thresholds: battery, \(T_\text{acc}\), \(\Psi\)) provides early warning and activates the Safe Action Filter with a \(W_{\max} = 60\,\text{s}\) escape hatch for resource-starved ticks. The firmware layer (binary conditions: malloc failure, HLC drift, CBF margin) confirms that the OS-level resource and timing invariants have been violated and engages the 600 s recovery hold — the longer hold reflects the verification latency for firmware context re-initialization and HLC re-synchronization, which cannot be confirmed at application-layer sampling rates. AES activates when all three axes simultaneously exceed their respective thresholds (AND-combined; see Anti-Fragile Decision-Making at the Edge for the formal definition). Both the application layer (continuous scalar thresholds) and the firmware layer (binary conditions) must confirm breach before entry is declared. AES exits only when both layers report clear (AND-combined for exit).

AES atomic entry requirement. Transition into the Autonomic Emergency State must be atomic with respect to state initialization. The system must verify that (a) L0 physical interlocks are armed, (b) the MAPE-K loop has completed at least one full tick, and (c) the CRDT state has been initialized from a known-good snapshot before declaring AES entry complete. Partial initialization followed by partition creates an unrecoverable state gap.

AES response protocol — what the system does when triggered. On AES entry the system immediately transitions to the Zero-Tax OBSERVE state ( Definition 128 ), releasing all WAKEUP and ACTIVE heap allocations; suspends the MAPE-K loop (no Analyze, Plan, or Execute phases run); disables all EXP3-IX learning (weight vector frozen in place); continues only fixed-point EWMA ( Definition 98 ) and SipHash chain integrity ( Definition 129 ) from the OBSERVE state; and emits a distress beacon (23 B) every 60 s at maximum transmit power.

Exit condition: AES exit requires the exit predicate of Anti-Fragile Decision-Making at the Edge ( Definition 53 ) to hold and all three axes to remain below threshold for one full evaluation window. All three triggering conditions must clear simultaneously and remain clear for \(\geq T_{\text{recover}} = 600\) s before MAPE-K resumes. The 600 s hold aligns with Proposition 31 to prevent MAPE-K restart oscillation. Reducing \(T_\text{recover}\) below this value is not recommended without re-analysis of post-AES-exit stability margins.

On exit, restore the EXP3-IX weight vector from its frozen state, unless the partition duration exceeded its limit or the contamination score (proportion of rounds where adversarial non-stationarity was detected by Definition 34 ) exceeds 30% (illustrative value), in which case reset to the warm-start prior ( Definition 96 ) rather than restoring the pre-AES weights: adversarial learning during the partition should not corrupt the connected-regime policy.

Operator override: an authenticated operator command may reduce \(T_{\text{recover}}\) to a shorter hold after manually confirming firmware integrity (malloc pre-check succeeds, HLC drift is below tolerance, \(\rho_q \geq 0\)). The override requires two-factor authentication and is logged to the post-partition audit record ( Definition 57 ).

Formal action to operational step mapping. The Transition, Suspend, and Freeze steps above are all sub-implementations of the single Freeze formal action. There are 4 formal actions and 5 implementation steps; no action is missing or duplicated.

| Formal action ( Definition 102 ) | Operational steps |
|---|---|
| Freeze (halt all plan execution) | Transition to OBSERVE state; suspend MAPE-K Execute phase; freeze EXP3-IX weight updates |
| Signal (notify peers) | Emit distress beacon (the beacon carries the AES flag) |
| Persist (maintain minimum health monitoring) | Continue EWMA + SipHash integrity checks |
| Beacon (broadcast distress) | Emit distress beacon on recovery channel |

Proposition 71 (AES Survival Guarantee). Under Definition 102, the AES footprint survives the Triple-Threat scenario on any MCU with total SRAM \(\geq 4\,\text{KB}\):

\[163\,\text{B} \;\leq\; 0.05 \cdot \text{SRAM}_{\text{total}}\]

The 163 B AES footprint is below the 5% free headroom (204 B) of the smallest viable MCU in the framework (Ultra L0, 4 KB). AES therefore survives every tier in the hardware table of Definition 128 .

Proof. The 140 B OBSERVE stack is pre-allocated as a static struct at boot (not heap-dependent; cannot be fragmentation-evicted). The 23-byte beacon frame is a stack-local variable within the beacon ISR. No dynamic allocation is required. The hash chain ( Definition 129 ) operates in-place on the 16 B chain register within the 140 B struct. At 95% RAM utilization on a 4 KB MCU: free = 204 B > 163 B. The AES footprint fits with 41 B margin. \(\square\)

| MCU tier | Total SRAM | 95% utilized | Free | AES footprint | Survives? |
|---|---|---|---|---|---|
| Ultra L0 | 4 KB | 3.88 KB | 204 B | 163 B | Yes (41 B margin) |
| Constrained L1 | 32 KB | 30.4 KB | 1.6 KB | 163 B | Yes (1.4 KB margin) |
| Standard L2 | 64 KB | 60.8 KB | 3.2 KB | 163 B | Yes (3.0 KB margin) |
| Rich L3+ | 256 KB | 243.2 KB | 12.8 KB | 163 B | Yes (12.6 KB margin) |

Physical translation: The Autonomic Emergency State requires three simultaneous failures — the equivalent of a pilot declaring an emergency only when the engine is out, the altimeter is dead, and the landing gear will not deploy, all at once. Any two alone trigger heightened monitoring but not the full emergency protocol. The 163-byte AES footprint fits inside the 204-byte free headroom, confirming the protocol is implementable on the most constrained hardware in the fleet.
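The tier margins can be recomputed directly from the 5% free-headroom assumption; this sketch reproduces the survival table, truncating fractional bytes the way the table does.

```python
# Recomputing AES survival: free headroom at 95% utilization is 5% of SRAM,
# and survival requires it to cover the 163 B AES footprint.

AES_FOOTPRINT_B = 163  # 140 B static OBSERVE struct + 23 B stack-local beacon frame

def aes_margin_b(total_sram_b: int, utilization: float = 0.95) -> int:
    """Free headroom minus the AES footprint; negative means non-survivable."""
    free_b = int(total_sram_b * (1.0 - utilization))
    return free_b - AES_FOOTPRINT_B

tiers = {"Ultra L0": 4 * 1024, "Constrained L1": 32 * 1024,
         "Standard L2": 64 * 1024, "Rich L3+": 256 * 1024}
margins = {name: aes_margin_b(sram) for name, sram in tiers.items()}
assert all(m > 0 for m in margins.values())  # every tier survives
assert margins["Ultra L0"] == 41             # 204 B free - 163 B footprint
```

The same function doubles as a deployment-time check: run it against the actual measured free headroom, not the design-time 5% assumption.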

Watch out for: the 163 B footprint guarantee assumes the mission firmware leaves at least 5% of SRAM free. Firmware that is continuously updated or loads runtime configuration tables may consume this headroom in the field without violating any design-time constraint, reducing the free margin below 163 B and making AES non-survivable on the same MCU that passed the original certification. The proof’s validity is conditioned on the 95% utilization ceiling holding throughout the deployment lifetime, not just at the certification point.

Fleet-Level Autonomic Emergency State

When the fraction of fleet nodes in AES conditions exceeds \(f_\text{AES,max} = 0.40\) (illustrative value), the fleet enters Fleet Emergency Mode. Individual node safety is maintained by each node’s local AES protocol. At the fleet level: (1) all inter-node coordination defers to pre-loaded contingency plans established during Phase-0 commissioning; (2) the quorum threshold for the Logical Quorum drops to a reduced emergency value for the duration; (3) recovery from Fleet Emergency Mode requires a designated Commander Node (the highest-Authority-Tier node still continuously connected) to certify fleet state before normal quorum rules resume. The Commander Node designation uses the same Authority Tier hierarchy defined in Definition 68 .

The reduced quorum must still satisfy the Byzantine fault-tolerance floor. If the reduced quorum would fall below this floor, Fleet Emergency Mode is not entered; instead, the fleet declares a split-brain condition and each fragment operates independently under L0 safe-mode rules.
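A minimal sketch of the floor check, assuming the classical Byzantine bound (tolerating \(f\) faulty nodes requires \(n \geq 3f + 1\) and a quorum of at least \(2f + 1\)); the article's exact floor expression did not survive extraction, so this standard bound is an assumption, not the article's formula.

```python
# Sketch of the reduced-quorum floor, assuming the classical BFT bound
# n >= 3f + 1 with quorum >= 2f + 1 (stand-in for the article's elided formula).

def fleet_emergency_ok(n_nodes: int, f_byzantine: int, reduced_quorum: int) -> bool:
    """Enter Fleet Emergency Mode only if the reduced quorum preserves BFT."""
    if n_nodes < 3 * f_byzantine + 1:
        return False                      # fleet cannot tolerate f faults at all
    return reduced_quorum >= 2 * f_byzantine + 1

assert fleet_emergency_ok(n_nodes=127, f_byzantine=10, reduced_quorum=40)
# Below the floor: declare split-brain and fall back to L0 safe-mode fragments.
assert not fleet_emergency_ok(n_nodes=127, f_byzantine=42, reduced_quorum=40)
```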

What AES cannot fix. Three failure modes remain unresolvable within the framework’s formal bounds:

The first unresolvable failure mode is power failure mid-write: if the node loses power while writing the CRDT merge result to flash, the write completes partially, and on restart the CRDT state is corrupted. The hash chain detects the divergence at the next tick, but repair requires rollback to the last checkpoint, which may itself be the record being written. Resolution requires factory reset or human-initiated state reconstruction. The hash chain signals the problem; it cannot resolve it.

The second unresolvable failure mode is simultaneous-partition quorum deadlock: if every node in the fleet enters OBSERVE simultaneously (e.g., GPS jamming plus a power surge disabling all radios), no gossip occurs and the WAKEUP quorum cannot form. The fleet is frozen in OBSERVE indefinitely. Resolution requires at least one node to be pre-designated as a coordinator seed: a node that exits OBSERVE unilaterally after a provisioned timeout, without waiting for quorum. This coordinator seed then forms the initial gossip contact that unfreezes the rest. The seed role must be assigned at provisioning, not at runtime.

Coordinator seeding specification. Before fleet deployment, designate one node as the L0 coordinator using the following provisioning procedure. Selection is deterministic: choose the node with the lowest hardware ID (MAC address or provisioned serial number), requiring no inter-node coordination. Budget reservation requires reducing the designated coordinator’s \(R_\text{crit}\) threshold by the L0 footprint delta (~40 bytes (illustrative value) additional state), ensuring it survives below \(R_\text{crit}\) at minimum battery while all other nodes retain their original \(R_\text{crit}\). For the recovery protocol, on fleet-wide partition each non-coordinator node broadcasts OBSERVE beacons on the designated recovery channel; the coordinator remains on that channel continuously, and recovery completes when every node has received at least one coordinator beacon within \(T_\text{recovery}\).
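The deterministic selection rule can be sketched in a few lines: every node computes the same minimum over the provisioned hardware IDs, so the choice requires zero inter-node coordination. The MAC-style IDs below are illustrative.

```python
# Sketch of coordination-free coordinator-seed selection: lowest hardware ID wins.

def coordinator_seed(hardware_ids: list[str]) -> str:
    """Deterministic choice: lexicographically lowest provisioned ID."""
    return min(hardware_ids)

fleet = ["aa:10:4f", "aa:10:2c", "aa:10:9b"]
assert coordinator_seed(fleet) == "aa:10:2c"
# Every node evaluates min() over the same provisioned list, so the answer
# agrees fleet-wide even under total partition.
```

Determinism is the point of the design: under total partition no election protocol can run, so the seed must be computable offline from provisioning data alone.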

Certification requirement. Field Autonomic Certification ( Definition 104 ) must include a simultaneous-partition chaos test: all nodes within a fleet segment enter OBSERVE within 1 second of each other (simulated backbone relay failure). The test passes if the coordinator seed mechanism breaks the deadlock and the fleet resumes coordinated operation within \(T_\text{recovery}\) minutes, where \(T_\text{recovery}\) is specified in the deployment’s mission requirements. Certification fails if the fleet remains deadlocked beyond \(T_\text{recovery}\). For OUTPOST (127-sensor mesh), the recommended \(T_\text{recovery} = 10\) minutes (illustrative value); for RAVEN (safety-critical), \(T_\text{recovery} = 2\) minutes (illustrative value).

The third unresolvable failure mode is a monotonic memory leak to AES: a slow memory leak (unreleased gossip table entries, leaked event references) causes RAM to grow monotonically, AES activates when free SRAM drops below 204 B, and if the leak continues even the 163 B AES footprint is threatened. The invariant of Proposition 71 requires the OBSERVE static struct to be non-reclaimable — this requires the struct to be declared as static in firmware and excluded from the heap region in the linker script. Software-heap AES is not AES.

Triple-lock recovery protocol: the compound scenario of simultaneous AES entry AND HLC unavailability AND operator command attempt is a triple-lock condition in which the node cannot accept commands (Causal Barrier fail-closed), cannot adapt (EXP3-IX frozen), and cannot self-recover (AES constraints). The recovery protocol for this state is: (1) the node broadcasts a TRIPLE-LOCK beacon on its last-known emergency frequency at a provisioned interval (illustrative value); (2) the beacon includes the node’s last known state and AES entry timestamp; (3) an operator or a connected fleet peer receiving the beacon may issue a cryptographically signed UNLOCK command using the node’s hardware root-of-trust key, bypassing the Causal Barrier for a single bounded recovery window (illustrative value); (4) during the unlock window, the node restores HLC from the issuing peer, exits AES if conditions allow, and resumes normal operation. If no unlock is received within \(T_\text{unlock,max}\) (illustrative value), the node powers down to the L0 Physical Safety Interlock posture.

Escalation mapping for AES-unresolvable failures. Hardware failure beyond software control triggers the L0 physical interlock ( Definition 108 ), which is distinct from the software-level terminal safety state (see Self-Healing Without Connectivity): the L0 interlock is a hardware circuit that is always active and bypasses the entire MAPE-K stack, whereas the terminal safety state ( Definition 124 ) is the MVS-floor outcome reached when all autonomic healing options are exhausted; no software protocol can override the L0 interlock once it fires. An adversarial attack that has compromised the trust root requires Trust-Root Anchor re-commissioning ( Definition 35 ): physical intervention to re-establish Phase-0 attestation, with no autonomous recovery possible. A triple-lock condition (AES + HLC failure + command rejection) requires the Triple-Lock Recovery Protocol above: either an operator UNLOCK command or the \(T_\text{unlock,max}\) timeout to L0 posture.

Cognitive Map: The meta-constraint section proves that autonomic infrastructure competes with the mission it serves. The Proposition 68 bound gives the feasibility ceiling. The budget allocation table gives the practical distribution. The zero-tax implementation shows how ultra-constrained hardware (140 byte OBSERVE struct) enables minimum viable autonomic capability when the standard stack is infeasible. Three unresolvable failure modes identify the limits of software-level remediation — power failure mid-write, simultaneous quorum deadlock, and monotonic memory leak all require hardware-level or provisioning-level solutions, not software fixes.

Empirical status: The 163 B AES footprint and 204 B free-headroom bound are exact for the stated MCU tier and stack-frame composition; the 5% free-headroom assumption should be verified per deployment platform because systems with high dynamic allocation (gossip table growth) may leave less than 204 B free at 95% utilization, and the 41 B margin should be confirmed with a worst-case heap fragmentation test before certification.


Hardware-Software Boundary as Constraint

Software optimization has fundamental physical limits: a protocol running at 80% of Shannon capacity gains almost nothing from further compression tuning, and a CPU at 95% utilization with optimized algorithms requires more silicon, not better code. Identifying the three hardware physics constraints — bandwidth limited by Shannon capacity, compute limited by silicon, endurance limited by battery energy — and mapping each to its optimization ceiling is therefore essential before beginning software optimization. Hardware constraints are not solved by software — they are worked around (compression, efficiency, prioritization) or accepted as operational limits. Secure boot and trust chains add hardware cost and complexity but are required prerequisites for trusting any software health report. OTA updates under partition require version compatibility matrices that grow with fleet diversity.

When Software Hits Hardware Physics

Software optimization has limits. Eventually, improvement requires hardware change. Recognizing these boundaries prevents wasted optimization effort.

Radio propagation is physically bounded: the Shannon limit is absolute, no software can exceed channel capacity, and once a protocol reaches that ceiling the only path forward is hardware improvement (more transmit power, a better antenna). Processing speed is equally bounded by silicon: clock speed, parallelism, and architecture set the compute ceiling, algorithm optimization yields diminishing returns as utilization approaches 100%, and once algorithms are optimal more capability requires more hardware rather than better code. Power density is bounded by battery chemistry: energy equals power times time, so a fixed battery means a fixed energy budget, and once power usage is minimized more endurance requires a larger battery.

Design principle: Know your hardware limits before optimizing software. If the system is already at 80% of Shannon limit, further protocol optimization yields diminishing returns. If CPU is 95% utilized with already-optimized algorithms, more capability requires more silicon.
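A worked example of the Shannon ceiling \(C = B \log_2(1 + \mathrm{SNR})\) makes the diminishing-returns argument concrete; the channel parameters below are illustrative.

```python
# Worked example of the Shannon capacity ceiling: once a protocol runs near C,
# software tuning cannot help; only more power or a better antenna raises C.

import math

def shannon_capacity_bps(bandwidth_hz: float, snr_linear: float) -> float:
    """Hard upper bound on channel throughput; no software can exceed it."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)

cap = shannon_capacity_bps(125e3, snr_linear=10.0)  # 125 kHz channel, SNR = 10
throughput = 0.8 * cap                              # protocol already at 80% of C
headroom = cap - throughput
assert headroom > 0
# Only ~20% of capacity remains; closing it costs ever more protocol complexity,
# while raising transmit power or antenna gain raises the ceiling itself.
```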

Secure Boot and Trust Chains

Hardware security is foundational. Secure boot establishes the root of trust:

Secure boot process. The hardware ROM contains an immutable public key; the bootloader signature is verified against that ROM key; the OS signature is then verified by the bootloader; application signatures are verified by the OS; and each layer attests the layer it loaded.
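The layered attestation chain above can be modeled in miniature. This is a toy sketch: real secure boot verifies signatures against an immutable ROM public key, whereas plain SHA-256 digests are used here only to show the chaining structure; the image names are invented.

```python
# Toy model of the secure-boot chain: each layer holds the expected digest of
# the layer it loads, rooted in an immutable ROM value. Illustrative only;
# real chains use signature verification, not bare hashes.

import hashlib

def digest(image: bytes) -> str:
    return hashlib.sha256(image).hexdigest()

def verify_chain(rom_expected: str, images: list, expected: list) -> bool:
    """Verify bootloader -> OS -> application, failing closed on any mismatch."""
    want = rom_expected
    for image, next_want in zip(images, expected + [None]):
        if digest(image) != want:
            return False          # layer fails attestation: halt boot
        want = next_want          # this layer attests the layer it loads
    return True

boot, os_img, app = b"bootloader-v2", b"os-v7", b"app-v13"
chain = [digest(os_img), digest(app)]
assert verify_chain(digest(boot), [boot, os_img, app], chain)
assert not verify_chain(digest(boot), [boot, b"tampered-os", app], chain)
```

Note the fail-closed shape: a single mismatch anywhere in the chain halts the boot, which is exactly why a failed hardware attestation invalidates every software health report above it.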

Three edge challenges apply: an adversary may attempt physical access to extract keys or modify hardware; full attestation chains may be too costly for limited-resource devices; and remote attestations cannot be verified during partition isolation.

Hardware health is the foundation of the observability hierarchy (P0 level). When hardware attestation fails, software health reports cannot be trusted, the node must be quarantined from the fleet, and the hardware must be flagged for physical inspection.

CONVOY example: Vehicle 7 fails hardware attestation after traversing adversary territory. The self-measurement system shows all green. But the attestation failure means we cannot trust those reports. Vehicle 7 is quarantined—excluded from fleet coordination until physically verified.

OTA Updates as Fleet Coherence Problem

Over-the-air (OTA) updates are essential for improvement but create coherence challenges:

Version coherence is the central challenge: fleet nodes may run different software versions, a partition during an update leaves nodes at inconsistent versions, version differences can cause protocol incompatibility, and rollback may be required but not all nodes support it.

Five requirements govern the update sequencing strategy: stage updates by rolling out to a subset of the fleet first and observing behavior before proceeding; maintain compatibility so version N works with both N-1 and N+1; coordinate timing by scheduling updates during high-connectivity windows; ensure rollback capability so every update is reversible; and design for partition tolerance so the update process degrades gracefully when connectivity is lost mid-update.
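The N-1/N+1 compatibility requirement reduces to a simple pairwise rule, sketched below with illustrative version numbers: a partition-split fleet stays operable only if no two nodes are more than one version apart.

```python
# Sketch of the N-1/N+1 compatibility rule for mixed-version operation
# during a partitioned OTA rollout. Version numbers are illustrative.

def compatible(v_a: int, v_b: int) -> bool:
    """Version N must interoperate with both N-1 and N+1."""
    return abs(v_a - v_b) <= 1

def fleet_coherent(versions: list) -> bool:
    """All pairs interoperate iff the version spread is at most one."""
    return max(versions) - min(versions) <= 1

assert compatible(12, 13)
assert not compatible(12, 14)        # a two-version gap: reconcile before coordinating
assert fleet_coherent([12, 13, 13, 12])
assert not fleet_coherent([11, 13, 12])
```

During partition healing this check decides between converging to the latest version and entering the mixed-version compatibility mode.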

Update state is reconcilable state. During partition healing, the system must detect version mismatches, apply the reconciliation protocol for pending updates, and either converge all nodes to the latest version or maintain a compatibility mode that allows mixed-version operation.

Cognitive Map: The hardware-software boundary section names the three physical ceilings that software optimization cannot cross. Bandwidth is bounded by Shannon capacity; compute is bounded by silicon; endurance is bounded by battery energy density. Secure boot grounds the entire trust chain — hardware attestation failure overrides all software health reports. OTA updates under partition are a fleet coherence problem: version divergence during partition must be treated with the same CRDT-based reconciliation as state divergence. Know your hardware limits before profiling your algorithms.


Formal Validation Framework

Without formal validation gates, teams advance to higher capability phases before foundational ones are solid: exactly the CONVOY team’s failure. “Phase gate” cannot mean “someone signs off”; it must mean a specific quantitative predicate that either passes or fails. Phase gate functions are defined as conjunctions of validation predicates: \(G_i(S) = \bigwedge_{p \in P_i} [V_p(S) \geq \theta_p]\). All predicates must pass simultaneously; a 4-of-5 score does not open the gate. Proposition 72 (Phase Progression Invariant) requires all prior gates to remain satisfied on entry to each new phase, making regression testing a theorem, not an afterthought. Statistical rigor requires \(N \geq 28\) (theoretical bound) trials for pass/fail predicates to achieve 95% confidence on the true pass probability; a single 24-hour chaos run does not constitute statistically valid certification, and teams under schedule pressure systematically lower thresholds post-hoc if predicates are not defined before implementation begins.

Phase Gate Functions

Edge architecture development follows a phase-gated structure where each phase must satisfy formal validation predicates before the system advances.

Definition 103 (Phase Gate Function). A phase gate function is a conjunction predicate over validation conditions:

\[G_i(S) = \bigwedge_{p \in P_i} \big[\, V_p(S) \geq \theta_p \,\big]\]

In practice: the gate opens only when every prerequisite p in set \(P_i\) has measured value \(V_p\) meeting its threshold \(\theta_p\) — a logical AND across all prerequisites. One unmet prerequisite, however marginal, holds the entire phase closed.

The Phase Gate Function computes a binary gate \(G_i \in \{0, 1\}\) that is 1 only when every validation predicate in phase \(i\) simultaneously meets its threshold; \(G_i = 0\) blocks all advancement, preventing partial-pass advancement that conceals a critical capability gap behind a 4-of-5 passing score. Each threshold \(\theta_p\) is mission-specific (e.g., detection accuracy (illustrative value)); all predicates must pass simultaneously, not in aggregate. Teams that define gate predicates post-hoc systematically lower thresholds to pass on schedule.

Physical translation: \(G_i\) is a logical AND over a checklist. Every predicate must individually reach its threshold; the gate is binary, not a score. If there are 6 predicates and 5 pass, the gate is 0 (closed). This eliminates “mostly good enough” advancement: a hardware attestation predicate that fails means the node cannot be trusted, regardless of how well its detection accuracy performs. The conjunction forces complete readiness, not averaged readiness.

Where \(P_i\) is the set of validation predicates for phase \(i\), \(V_p(S)\) is the validation score for predicate \(p\) given state \(S\), and \(\theta_p\) is the threshold for predicate \(p\).

Statistical note: Each predicate \(V_p(S) \geq \theta_p\) is evaluated against observed test runs. A single pass provides one data point, not a distribution. For pass/fail predicates, the 95% Clopper-Pearson lower confidence bound on the true pass probability after \(k\) successes in \(N\) trials is \(p_L(k, N, 0.05)\). Achieving \(p_L(N, N, 0.05) \geq 0.95\) requires \(N \geq 28\) (theoretical bound) trials, not 1.

In practice: hardware-layer tests (H1–H4) can be run on production samples; system-level gates (C1–C7) satisfy the statistical requirement via combined simulation and hardware runs, with the trial count tracked in the certification evidence package. A single 24-hour chaos run satisfies C1–C7 as an integration check; it does not constitute a statistically valid certification until replicated \(N \geq 28\) (theoretical bound) times or complemented by model checking for the correctness predicates ( , ).

Compute Profile: CPU: \(O(|P_i|)\) per gate evaluation — one threshold comparison per predicate in \(P_i\), with short-circuit on first failure. Memory: \(O(|P_i|)\) — one measurement value per predicate. The binding cost is collecting the validation measurements, not computing the conjunction.

Analogy: A quality control checkpoint on an assembly line — the gate stamps PASS and lets the part advance, or stamps HOLD and routes it back. No partial passes.

Logic: The gate is a logical AND over all validation predicates; a single unmet predicate closes the gate regardless of how many others pass, preventing partial-pass advancement that conceals a critical capability gap.
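The conjunction semantics above are small enough to state as executable code. The following is a minimal sketch; the predicate names and threshold values are hypothetical examples, not the article's actual gate sets:

```python
from typing import Mapping

def phase_gate(measurements: Mapping[str, float],
               thresholds: Mapping[str, float]) -> bool:
    """Conjunction over all validation predicates: the gate opens (True)
    only when every measured value V_p meets its threshold theta_p.
    all() short-circuits on the first failing predicate."""
    return all(measurements[p] >= theta for p, theta in thresholds.items())

# Hypothetical predicate set (names and values are illustrative):
thresholds = {"detection_accuracy": 0.90, "healing_success": 0.95,
              "attestation": 1.0}
passing = {"detection_accuracy": 0.93, "healing_success": 0.97,
           "attestation": 1.0}
near_miss = dict(passing, healing_success=0.94)  # one predicate short

print(phase_gate(passing, thresholds))    # gate opens
print(phase_gate(near_miss, thresholds))  # 2-of-3 passing still closes the gate
```

Note the deliberate absence of any scoring or averaging: the near-miss state fails outright, which is exactly the "no partial passes" semantics of the definition.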

Compute-Time Constraint on Phase Progression

The ARM Cortex-A53 baseline profiles established across the series yield a concrete bound on phase progression. The full autonomic pipeline — MAPE-K loop, EXP3-IX decision, CRDT merge, and phase gate evaluation — runs in sequence within each scheduling window. At 100 Hz (illustrative value) sensor rate with a 10 ms (illustrative value) scheduling budget:

| Pipeline stage | Typical cost | Source |
|---|---|---|
| MAPE-K Monitor + Analyze | ~1.5 ms | Definition 36 profile |
| EXP3-IX arm selection + update | ~0.02 ms | Definition 81 profile |
| CRDT merge (500-entry delta) | ~0.46 ms | Definition 58 profile |
| Phase gate evaluation | ~0.005 ms | Definition 103 profile |
| Total | ~2.0 ms | |

The 2 ms nominal case leaves 8 ms of margin. However, if the CRDT state store grows beyond 5,000 entries (illustrative value) ( ), the merge cost alone reaches ~4.6 ms (illustrative value), and if concurrent healing actions drive the MAPE-K Analyze phase to process 50+ (illustrative value) sensor streams simultaneously, the total pipeline can approach or exceed 8 ms (illustrative value). When the full MAPE-K + EXP3-IX + CRDT pipeline exceeds 8 ms (illustrative value) per cycle, phase progression stalls: the phase gate function ( Definition 103 ) cannot be evaluated within the scheduling window because the pipeline has not yet produced the validation measurements \(V_p(S)\) required for the conjunction predicate. The gate remains closed not because any predicate failed, but because the predicate was never computed. This is operationally equivalent to a gate failure and must be treated as such in the certification evidence package.

Mitigation: Cap the CRDT state store at 2,000 entries (illustrative value) per MAPE-K tick by deferring lower-priority delta entries to the next gossip window ( Definition 70 Phase 3 chunking). At 2,000 entries the merge cost is ~1.5 ms (illustrative value), keeping the total pipeline at ~3.0 ms (illustrative value) and preserving the 7 ms (illustrative value) margin required for phase gate evaluation.
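The budget arithmetic above can be sketched directly. All stage costs are the article's illustrative Cortex-A53 figures, and the linear per-entry CRDT merge cost model is an assumption extrapolated from the single 0.46 ms at 500 entries data point:

```python
# Illustrative pipeline-budget check (all costs are the article's
# illustrative values; the linear merge cost model is an assumption).
SCHED_BUDGET_MS = 10.0
GATE_MARGIN_MS = 2.0              # head-room needed to evaluate the gate

MAPE_K_MS = 1.5                   # Monitor + Analyze (Definition 36 profile)
EXP3IX_MS = 0.02                  # arm selection + update (Definition 81 profile)
GATE_EVAL_MS = 0.005              # gate conjunction (Definition 103 profile)
MERGE_MS_PER_ENTRY = 0.46 / 500   # assumed linear CRDT merge cost

def pipeline_ms(crdt_entries: int) -> float:
    return MAPE_K_MS + EXP3IX_MS + crdt_entries * MERGE_MS_PER_ENTRY + GATE_EVAL_MS

def gate_evaluable(crdt_entries: int) -> bool:
    """The gate can only be evaluated if the pipeline finishes inside
    the scheduling window with margin left for the gate itself."""
    return pipeline_ms(crdt_entries) <= SCHED_BUDGET_MS - GATE_MARGIN_MS

print(f"nominal (500 entries): {pipeline_ms(500):.2f} ms")
print(gate_evaluable(2000))   # capped store: pipeline fits
print(gate_evaluable(9000))   # runaway store: pipeline overruns 8 ms
```

Under this model the 2,000-entry cap keeps the pipeline well inside the window, while an unbounded store eventually consumes the margin and stalls phase progression exactly as described above.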

Permanent-failure resolution paths. When a Phase Gate predicate is structurally unachievable — hardware fault, supply-chain compromise, or irreconcilable power-budget constraint — four resolution paths are available in order of preference. A system that cannot achieve Phase 0 (\(G_0\) permanently fails) enters the Terminal Safety State ( Definition 53 ) and awaits physical inspection. There is no defined operational mode below Phase 0.

The first path is capability-level downgrade: the system operates at the highest Phase N whose gate \(G_N\) is achievable, Phase N+1 capability is explicitly disabled and documented in the deployment record, and the system remains operational at reduced capability — the nominal path for sensor hardware faults.

The second path is a documented operating restriction: the system is certified for deployment with Phase N+1 capability formally restricted, the restriction is recorded in the Field Autonomic Certification (see Definition 104 ) as a named exception rather than a failure, and the exception expires when the hardware defect is remediated.

The third path is a re-certification trigger: after any hardware change (sensor replacement, firmware re-flash, physical re-wiring) the system re-enters the certification sequence from Phase 0, though re-certification does not require the full 24-hour isolation test if the change is scoped to a subsystem whose predicates were already passing — only the affected gate predicates are re-evaluated.

The fourth path applies to hardware-limited systems: when no relaxation, re-allocation, or demotion can bring the system into compliance — typically because a hardware constraint is physically binding — the affected capability class is marked permanently out-of-scope. A Partial Certification is issued: “Certified for Gate \(k-1\) with Gate \(k\) capability class \([X]\) explicitly excluded.” Operators must provide manual intervention for all events in the excluded class. Example: OUTPOST mesh hardware cannot achieve healing success rate \(> 50\%\) for power-failure events — Gate 3 is certified with “power-failure healing” excluded and all power-failure responses are operator-escalated.

Proposition 72 (Phase Progression Invariant). The system can only enter phase \(i+1\) if all prior gates remain valid:

\[ \text{Enter}(i+1) \;\Rightarrow\; \bigwedge_{j \leq i} \big( G_j(S) = 1 \big) \]

Advancing RAVEN to anti-fragile learning requires every earlier gate — hardware trust, survival, self-measurement, healing, fleet coherence — to still be green; a Phase 3 code change that silently breaks a Phase 1 healing predicate sends the system back to Phase 1.

The proposition requires all prior phase gates \(G_0\) through \(G_i\) to remain satisfied on entry to each new phase; any failed \(G_j\) for \(j < i\) requires regression to phase \(j\) before proceeding — there is no exception path. Silent gate regression — where a current-phase change breaks an earlier hardware trust or isolation requirement — is not detectable by manual re-verification under schedule pressure; prior gate checks that are not automated in CI are systematically skipped.

Physical translation: every prior phase gate must still be green before the system can enter a new phase. The system cannot skip phases or revert to an earlier phase mid-mission without re-certifying. If a Phase 3 code change breaks a Phase 1 healing predicate, the system must regress to Phase 1 validation before any Phase 3 capability can be re-enabled — no exception path, no “grandfather clause.”

This creates a regression invariant: any change that invalidates an earlier gate \(G_j\) for \(j < i\) requires regression to phase \(j\) before proceeding.
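A minimal sketch of the regression invariant follows, using hypothetical gate predicates over a toy state dictionary; the gate bodies are placeholders, not the article's formal predicates:

```python
from typing import Callable, Sequence

def check_progression(gates: Sequence[Callable[[dict], bool]],
                      state: dict, target_phase: int) -> int:
    """Proposition 72 sketch: entry to phase i+1 requires every gate
    G_0..G_i to still hold. Returns target_phase if all prior gates
    are green, otherwise the index of the earliest failing gate
    (the phase the system must regress to)."""
    for j, gate in enumerate(gates[:target_phase]):
        if not gate(state):
            return j              # regress to the earliest broken gate
    return target_phase

# Hypothetical gate predicates (illustrative only):
gates = [
    lambda s: s["secure_boot"],            # G_0 hardware trust
    lambda s: s["healing_rate"] >= 0.95,   # G_1 local autonomy
    lambda s: s["gossip_converged"],       # G_2 local coordination
    lambda s: s["divergence"] <= 0.05,     # G_3 fleet coherence
]

healthy = {"secure_boot": True, "healing_rate": 0.97,
           "gossip_converged": True, "divergence": 0.02}
broken = dict(healthy, healing_rate=0.80)  # a later change broke a G_1 predicate

print(check_progression(gates, healthy, 4))  # may proceed to phase 4
print(check_progression(gates, broken, 4))   # must regress to phase 1
```

Running this check inside CI on every change is the automation the text calls for: the earliest failing gate, not the newest one, determines where the system lands.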

Capability Retention and Structural Regression

Once a phase gate \(G_i\) passes, the associated capability tier is operationally retained for as long as the underlying system state remains structurally sound. Retention is not unconditional: a gate predicate that fails while the system is connected and resource-adequate signals genuine structural regression, distinct from a predicate failure caused by external environment changes.

A gate predicate failure is classified as external if \(T_\text{acc}(t) > 0\) at the time of failure — the node is under active partition. External failures do not count toward the regression counter. A failure is classified as structural if \(T_\text{acc}(t) = 0\) — the node is connected and resource-adequate, yet the gate still fails.

If the structural failure count for gate \(G_j\) reaches 2 (illustrative value) within any rolling 24-hour (illustrative value) window, the system enters the Terminal Safety State .

Physical translation: A drone that loses its gossip network during confirmed jamming is not regressing — it is responding correctly to the environment. A drone that loses gossip while connected to ground infrastructure has a real internal fault. Only the second case advances the regression counter.
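The external-versus-structural classification and the rolling-window counter can be sketched as follows; the class name and wiring are illustrative, and the 2-failure / 24-hour values are the article's illustrative defaults:

```python
from collections import deque

WINDOW_S = 24 * 3600      # rolling window (article's illustrative 24 h)
STRUCTURAL_LIMIT = 2      # article's illustrative structural-failure threshold

class RegressionCounter:
    """Classifies gate-predicate failures: external if T_acc > 0 at
    failure time (node under active partition), structural if
    T_acc == 0. Two structural failures inside the rolling window
    trip the Terminal Safety State."""
    def __init__(self):
        self.structural_times = deque()

    def record_failure(self, t: float, t_acc: float) -> str:
        if t_acc > 0:
            return "external"            # environment-driven; does not count
        self.structural_times.append(t)
        # Drop structural failures that have aged out of the window.
        while self.structural_times and t - self.structural_times[0] > WINDOW_S:
            self.structural_times.popleft()
        if len(self.structural_times) >= STRUCTURAL_LIMIT:
            return "terminal-safety-state"
        return "structural"

rc = RegressionCounter()
print(rc.record_failure(t=0, t_acc=120.0))   # jammed: external, no counter advance
print(rc.record_failure(t=3600, t_acc=0.0))  # connected yet failing: structural
print(rc.record_failure(t=7200, t_acc=0.0))  # second structural inside 24 h
```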

Phase gates and the judgment horizon operate at different timescales. Phase gates ( Definition 103 , Proposition 72 ) are static deployment-time predicates — they answer “is this system ready to run the next capability layer?” and are evaluated once per phase transition. The judgment horizon ( Definition 91 , Anti-Fragile Decision-Making) is a dynamic runtime predicate — it answers “should the running system escalate this specific decision to a human?” and is evaluated on every decision event. The relationship is sequential: passing gate \(G_3\) certifies the system to make anti-fragile decisions autonomously up to but not beyond the judgment horizon; the certified system then enforces the horizon at runtime. A system that passes all five phase gates is still obligated to escalate decisions above the judgment horizon — certification expands the automatable decision space; it does not eliminate the escalation boundary.

Connection to Formal Methods

The phase gate framework translates directly to formal verification tools:

In TLA+, phase gates become safety invariants: the conjunction is a state predicate that model checking verifies holds across all reachable states. TLA+ temporal logic uses \(\Box P\) to mean ‘P always holds’, \(\bigcirc P\) to mean ‘P holds in the next state’, and \(\Diamond P\) to mean ‘P eventually holds’; the formula captures the progression invariant — gates remain valid or the system regresses and recovers. In Alloy, the prerequisite graph ( Definition 93 ) maps to relational modeling; bounded model checking can verify that no valid development sequence violates phase dependencies and finds counterexamples if the constraint graph contains hidden cycles. Property-based testing tools such as QuickCheck and Hypothesis generate random system states and verify phase gate predicates hold, providing confidence without exhaustive enumeration.
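A property-based check of the gate predicate can be sketched in plain Python, standing in for Hypothesis/QuickCheck-style input generation; the predicate names and value ranges are arbitrary:

```python
import random

def gate(measurements: dict, thresholds: dict) -> bool:
    """Conjunction gate as in Definition 103."""
    return all(measurements[p] >= thresholds[p] for p in thresholds)

def test_gate_soundness(trials: int = 10_000, seed: int = 42) -> None:
    """Property: the gate is open iff no predicate is below threshold.
    Random states stand in for generator-based testing; this gives
    confidence, not the exhaustive guarantee of model checking."""
    rng = random.Random(seed)
    preds = ["p1", "p2", "p3", "p4", "p5"]
    thresholds = {p: rng.uniform(0.5, 0.99) for p in preds}
    for _ in range(trials):
        state = {p: rng.uniform(0.0, 1.0) for p in preds}
        expected = not any(state[p] < thresholds[p] for p in preds)
        assert gate(state, thresholds) == expected

test_gate_soundness()
print("property held over 10,000 random states")
```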

For RAVEN , the TLA+ model is ~500 lines (illustrative value) specifying connectivity transitions, healing actions, and phase gate s. Model checking verified the phase progression invariant holds for fleet sizes up to n=50 (illustrative value) and partition durations up to 10,000 (illustrative value) time steps.

Empirical status: The “2-failure in 24 hours” threshold for Terminal Safety State entry is a conservative default; the appropriate oscillation tolerance depends on the observed transient-fault frequency for a given hardware platform — deployments with noisy sensors or marginal power supplies may need a wider window (48–72 hours) to distinguish genuine instability from environmental transients, and the threshold should be set from at least 30 baseline-operation hours of gate-predicate pass/fail telemetry.

Watch out for: the invariant requires that all prior gate conditions remain testable as new capabilities are added — a prior gate predicate that cannot be re-evaluated in production (because the L0 isolation test cannot be run without taking the deployed system offline) silently becomes unverifiable rather than failing explicitly; the consequence is that a Phase 3 system may appear Phase 3-compliant while its Phase 0 foundation has drifted out of compliance, and no alarm fires because the invariant has no mechanism to detect untestable predicates — only failing ones.

TLA+ variable mapping (formal model to Core State Variables): The following correspondence ensures TLA+ specifications are direct translations of the architectural prose — each model variable is grounded in the formally defined state space.

| TLA+ Variable | Architectural Symbol | Definition |
|---|---|---|
| Xi_t | \(\Xi(t)\) | Operating regime: Connected / Degraded / Intermittent / None (Definition 6) |
| tau_transport | \(\tau\) | Transport / feedback delay (see notation disambiguation for subscript conventions) |
| R_t | \(R(t)\) | Normalized resource availability — Definition 130 (Resource State, this article) |
| C_t | \(C(t)\) | Link quality \([0,1]\) — Core State Variables |
| L_t | \(\mathcal{L}(t)\) | Capability level L0–L4 — Core State Variables |
| D_t | \(D(\Sigma_A, \Sigma_B)\) | State divergence \([0,1]\) — Definition 57 |
| H_t | \(H(t)\) | Health vector — Core State Variables |

Phase 0: Foundation Layer

The foundation layer establishes hardware trust as the root of all subsequent guarantees.

Typical survival duration thresholds: RAVEN 24 hours (illustrative value), CONVOY 72 hours (illustrative value), OUTPOST 30 days (illustrative value).

The survival duration test ( ) confirms the node stays alive under partition. A stricter predicate, , confirms it stays alive under complete radio silence — no \(T_s\) transmissions permitted — for the full mission-critical window. This distinguishes partition survival (where the node may attempt transmissions that fail) from zero-backhaul operation (where the radio is deliberately off or physically destroyed).

where \(\tau_0\) is the zero-backhaul duration, \(B_b(t) = 0\) enforces no radio transmission, \(U_E(S)\) is energy consumed over \([0, \tau_0]\), and the bound on \(U_E(S)\) is the usable energy budget.

Why \(\tau_0 = 72\,\text{h}\): This matches CONVOY ’s worst-case terrain crossing window (72 hours (illustrative value) per the foundational constraint analysis). RAVEN uses 24 hours (illustrative value); OUTPOST uses 30 days (illustrative value). The predicate threshold scales with the target system but 72 hours is the standard tactical stress duration (illustrative value).

The zero-backhaul test validates four properties. First, the energy budget: the node’s baseline draw (compute, sensors, MAPE-K loop) does not exhaust the battery before \(\tau_0\); because \(T_s = 0\) (no radio energy spent), this isolates the pure compute-plus-sensors energy envelope. Second, local MAPE-K correctness: all healing decisions execute using only local state with no coordination messages, remote health reports, or gossiped vectors. Third, state preservation: the node accumulates divergence \(D(t)\) over \(\tau_0\) but does not corrupt its local state, so reconciliation remains possible when \(B_b\) recovers. Fourth, ingress filter correctness: the \(\Pi\) filter ( Definition 25 ) operates at \(\beta = 0\) — all non-critical telemetry is suppressed — confirming the filter does not deadlock the MAPE-K loop by starving it of P0 metrics.

CONVOY scenario: Vehicle 7 enters a 3 km canyon with no line-of-sight radio propagation. The radio transceiver is powered down (zero \(T_s\) cost). Over 72 hours, the vehicle continues route execution, logs all autonomous decisions, maintains local health monitoring via MAPE-K , and stores diverged state in its CRDT buffers. On canyon exit, it reconnects and reconciles. Phase 0 requires demonstrating this entire sequence before any coordination protocol is integrated.
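The energy-budget property of the zero-backhaul test reduces to a simple envelope check. The wattages and battery capacity below are assumed values for illustration, not platform specifications:

```python
# Zero-backhaul energy envelope sketch: with the radio powered down
# (T_s = 0), baseline compute + sensor draw must not exhaust the
# usable battery before tau_0. All numbers are assumptions.
TAU_0_H = 72.0        # CONVOY-style zero-backhaul window (hours)
COMPUTE_W = 3.0       # MAPE-K loop + route execution draw (assumed)
SENSORS_W = 1.5       # sensor suite draw (assumed)
USABLE_WH = 400.0     # usable energy budget (assumed)

def zero_backhaul_ok(tau_h: float = TAU_0_H) -> bool:
    """True if baseline draw sustains the full zero-backhaul window."""
    consumed_wh = (COMPUTE_W + SENSORS_W) * tau_h
    return consumed_wh <= USABLE_WH

print(zero_backhaul_ok())       # 324 Wh consumed vs 400 Wh budget
print(zero_backhaul_ok(120.0))  # a longer window would overrun the budget
```

Because the radio is off, the check isolates the pure compute-plus-sensors envelope exactly as the first validated property requires.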

Phase 0 gate:

Sensor calibration attestation: secure-boot attestation verifies firmware integrity (the node runs what it was programmed to run) but does not verify that its sensors report truthful physical values. A node with valid secure boot attestation but a miscalibrated temperature sensor that reads \(+15^\circ\text{C}\) high passes all cryptographic checks while injecting systematically false data into the fleet’s shared state — it is Byzantine -equivalent in effect without being Byzantine in the fault-model sense (the firmware is correct; only the sensor hardware is wrong).

The calibration predicate adds the requirement that all physical sensors have been calibrated against a known reference within the calibration interval:

where \(\delta_s\) is the per-sensor accuracy specification and the calibration interval is the manufacturer-specified or mission-specified recalibration interval.

Calibration procedure at Phase 0: expose each sensor to a known reference stimulus (laboratory or field reference standard), record deviation, and cryptographically sign the calibration record with the node’s device key. The signed calibration record is included in the attestation evidence package. For RAVEN : MEMS IMUs recalibrated before each flight ( flight (illustrative value)); LIDAR returns factory-calibrated ( days (illustrative value)).

Nodes that fail calibration are excluded from Phase 0 and may not participate in gossip or peer validation until recalibrated — an uncalibrated sensor propagates systematic error through the fleet’s Byzantine -tolerance mechanism regardless of the trust weight assigned by Def 44.
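The calibration-currency predicate can be sketched as a per-sensor conjunction; the record schema and the sample numbers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SensorCal:
    """Signed calibration record fields (illustrative schema)."""
    deviation: float    # |measured - reference| at calibration time
    spec_delta: float   # per-sensor accuracy specification delta_s
    age_h: float        # hours since last calibration
    interval_h: float   # recalibration interval (manufacturer/mission)

def calibration_gate(sensors: list[SensorCal]) -> bool:
    """All sensors must be within spec AND within their recalibration
    interval; a single stale or out-of-spec sensor fails the node."""
    return all(s.deviation <= s.spec_delta and s.age_h <= s.interval_h
               for s in sensors)

fresh = [SensorCal(0.1, 0.5, 2.0, 24.0),      # IMU: in spec, recent
         SensorCal(0.02, 0.1, 100.0, 720.0)]  # LIDAR: factory interval
stale = fresh + [SensorCal(0.1, 0.5, 40.0, 24.0)]  # overdue IMU

print(calibration_gate(fresh))  # node admitted to Phase 0
print(calibration_gate(stale))  # excluded until recalibrated
```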

Phase 1: Local Autonomy Layer

Phase 1 validates individual node autonomy—self-measurement and self-healing without external coordination.

Typical thresholds for tactical systems: overall accuracy (illustrative value), false negative rate \(< 0.05\) (illustrative value) (catch \(>95\%\) of anomalies), false positive rate \(< 0.20\) (illustrative value) (tolerate some false alarms to maintain throughput). Overall accuracy alone is insufficient — a class-imbalanced system can achieve \(0.90\) accuracy while missing half of all anomalies.
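The class-imbalance caveat is easy to demonstrate numerically. The confusion-matrix counts below are constructed so that overall accuracy hits 0.90 while half of all anomalies are missed; the gate thresholds mirror the illustrative tactical values above:

```python
def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> dict:
    """Accuracy, false-negative rate, and false-positive rate from
    confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "fnr": fn / (tp + fn),   # miss rate over true anomalies
        "fpr": fp / (fp + tn),   # false-alarm rate over normals
    }

# Imbalanced case: 100 anomalies among 1000 samples, half of them missed.
m = detection_metrics(tp=50, fn=50, fp=50, tn=850)
print(m)  # accuracy is 0.90 despite a 0.50 miss rate

def phase1_detection_gate(m: dict) -> bool:
    # Illustrative thresholds from the text.
    return m["accuracy"] >= 0.90 and m["fnr"] < 0.05 and m["fpr"] < 0.20

print(phase1_detection_gate(m))  # accuracy alone passes; the gate does not
```

This is precisely why the gate conjoins all three metrics rather than scoring on accuracy.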

Phase 1 gate:

Phase 2: Local Coordination Layer

Phase 2 validates cluster-level coordination—local groups of nodes operating coherently.

Typical formation convergence threshold: (illustrative value) for tactical clusters.

Phase 2 gate:

Phase 3: Fleet Coherence Layer

Phase 3 validates fleet-wide state reconciliation and hierarchical authority [9] .

The extended partition recovery predicate validates fleet reconvergence after a 24-hour partition: all nodes must agree on the shared CRDT state within the reconciliation window.

Phase 3 gate:

Phase 4: Optimization Layer

Phase 4 validates adaptive learning and the judgment horizon boundary.

Phase 4 gate:

where verifies handover is triggered at least seconds before the predicted failure boundary ( Proposition 74 ), and verifies the asymmetric trust model ( Definition 105 ) is implemented with .

Phase 5: Integration Layer

Phase 5 validates complete system operation across all connectivity states.

Phase 5 gate:

Phase 5 gate predicate set \(P_5\): where \(V_1\) through \(V_4\) are the cumulative gate predicates from Phases 1–4, \(V_\text{SOE}\) is the Safe Operating Envelope validator from Anti-Fragile Decision-Making at the Edge (all three SOE axes within bounds), and \(V_\text{causal}\) is the Causal Barrier integrity check ( Definition 106 : HLC valid, Merkle root consistent, no pending audit flags). All six predicates must simultaneously satisfy \(V_p(S) \geq \theta_p\) for Phase 5 to be achieved.

Anti-fragility gate predicate: where is the anti-fragility coefficient ( Definition 79 in Anti-Fragile Decision-Making at the Edge), \(\sigma_{\text{test}}\) is the stress level from the certification stress-injection protocol, and (illustrative value) improvement per standard deviation of induced stress. Measurement method: inject controlled stress (connectivity throttling, packet loss, power reduction) at three levels ; fit the coefficient from performance measured at each level. A system that degrades monotonically under stress has and cannot pass this gate regardless of other metrics. RAVEN target: (illustrative value) at \(\sigma_{\text{med}} = 0.3\) (illustrative value) (30% connectivity denial).

where the Safe Operating Envelope validity predicate requires the parameter vector within bounds and basin occupancy \(\geq 0.95\) (illustrative value) over the most recent learning window — see Safe Operating Envelope.
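The anti-fragility measurement described above — fit a coefficient from performance observed at three injected stress levels — can be sketched with an ordinary least-squares slope; the performance numbers are hypothetical:

```python
def antifragility_coefficient(stress: list[float], perf: list[float]) -> float:
    """Least-squares slope of performance vs. injected stress level.
    Positive slope: performance improves under stress (anti-fragile);
    negative slope: monotone degradation, which cannot pass the gate."""
    n = len(stress)
    mx = sum(stress) / n
    my = sum(perf) / n
    num = sum((x - mx) * (y - my) for x, y in zip(stress, perf))
    den = sum((x - mx) ** 2 for x in stress)
    return num / den

levels = [0.1, 0.3, 0.5]         # low / medium / high injected stress
improving = [0.80, 0.83, 0.86]   # hypothetical anti-fragile system
degrading = [0.80, 0.74, 0.65]   # hypothetical fragile system

print(antifragility_coefficient(levels, improving) > 0)  # passes the sign test
print(antifragility_coefficient(levels, degrading) > 0)  # fails regardless of
                                                         # other metrics
```

The gate then compares the fitted coefficient against the mission-specific improvement threshold rather than merely its sign.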

— see Definition 106 . — see Definition 108 .

Red team gate integration: A failed red team exercise ( ) triggers re-evaluation of the preceding gate: if jamming breaks gossip coherence, the Phase 2 gate ( ) is re-validated before re-attempting Phase 5.

Validation Methodology

Different predicate types require different validation approaches:

    
    graph TD
    A["Define Predicates<br/>(validation conditions)"] --> B{"Predicate Type?"}
    B -->|"Finite State"| C["Model Checking<br/>(exhaustive verification)"]
    B -->|"Probabilistic"| D["Statistical Testing<br/>(confidence intervals)"]
    B -->|"Recovery"| E["Chaos Engineering<br/>(inject failures)"]
    C --> F["Gate Decision<br/>(all predicates)"]
    D --> F
    E --> F
    F --> G{"Gate Passed?"}
    G -->|"Yes"| H["Proceed to Next Phase"]
    G -->|"No"| I["Address Failures<br/>(fix and retest)"]
    I --> A
    style B fill:#fff9c4,stroke:#f9a825
    style F fill:#ffcc80,stroke:#ef6c00
    style H fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
    style I fill:#ffcdd2,stroke:#c62828

Model checking validates finite-state predicates (authority tiers, state machines) through exhaustive state space exploration:

Statistical testing validates probabilistic predicates (detection accuracy) through confidence intervals:

Chaos engineering validates healing predicates through systematic fault injection with coverage tracking [11].

Coverage targets: Model checking should explore at least 80% (illustrative value) of reachable states for small state spaces, or verify key invariants via bounded model checking for large ones. Statistical testing requires (illustrative value) samples per gate predicate (where \(\theta_p\) is the minimum meaningful effect size). Chaos coverage should target at least 80% (illustrative value) of known failure modes listed in the threat model.

Gate Revision Triggers

The validation framework adapts to changing conditions. Formal triggers for re-evaluation:

A mission change triggers gate redefinition ( ); threat evolution triggers threshold re-prioritization ( ); a hardware resource change triggers budget reallocation ( ); and an observed operational failure extends the failure mode set ( ).

Each trigger initiates re-evaluation of affected gates. The regression invariant ensures re-validation propagates to all dependent phases.

Field Autonomic Certification

Phase gates formalize can the system pass a threshold? Field Autonomic Certification (FAC) formalizes is this system safe to label “Autonomic” and deploy without a human in the loop? The distinction matters: a system can pass Phase 0 in a lab environment but fail in the field because L0 was never tested without L1+ present, or because no one verified the software-level terminal safety state ( Definition 124 ) is reachable — this is the MVS-floor software posture, distinct from the L0 physical interlock ( Definition 108 ) which is hardware-enforced and always active regardless of software state.

Definition 104 (Field Autonomic Certification). A system achieves Field Autonomic Certification if:

Where:

Systems that pass only \(G_0\) may be labeled “Phase 0 Certified” but not “Autonomic.” The “Autonomic” label requires full Field Autonomic Certification.

Proposition 73 (Certification Completeness). \(\mathrm{FAC}(S) \Rightarrow G_3(S)\): a system with Field Autonomic Certification satisfies all phase gates through Phase 3.

Passing all three FAC predicates simultaneously under adversarial conditions implies the system has already demonstrated every gate from G0 through G3 — certification is not a separate check but a proof that all prior gates co-hold.

Proof sketch: by predicate entailment:

The entailment chain is: FAC’s three predicates collectively require demonstrating all G_0–G_3 gate conditions simultaneously under adversarial conditions, which is strictly stronger than passing each gate in isolation.

The first FAC predicate includes \(G_0\) directly; the second implies Proposition 8 (Hardened Hierarchy Fail-Down), satisfying the structural requirement for \(G_1\)–\(G_3\). The remaining question is whether 10 partition-rejoin cycles provide sufficient statistical evidence for \(V_\text{merge}\) and \(V_\text{reconcile}\).

Let \(p_\text{conflict}\) denote the per-cycle probability that a genuine CRDT merge conflict occurs and is incorrectly resolved. The minimum cycle count \(N\) to detect systematic reconciliation failures with confidence \(1 - \alpha\) satisfies \((1 - p_\text{conflict})^N \leq \alpha\), i.e. \(N \geq \lceil \ln \alpha / \ln(1 - p_\text{conflict}) \rceil\). For \(p_\text{conflict} = 0.01\) (illustrative value): \(N = 299\) (theoretical bound) cycles. For \(p_\text{conflict} = 0.26\) (illustrative value): \(N = 10\) (theoretical bound) cycles.

The checklist value \(C5 = 10\) cycles (illustrative value) is sufficient only if \(p_\text{conflict} \geq 0.26\) — i.e., the system fails roughly 1-in-4 merges, a failure mode that would be immediately observable without formal testing. The correct interpretation: 10 cycles is a smoke test, not a certification bound.
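The minimum-cycle bound can be computed directly; this sketch reproduces the two illustrative operating points (per-cycle conflict probabilities 0.26 and 0.01) at \(\alpha = 0.05\):

```python
import math

def min_cycles(p_conflict: float, alpha: float = 0.05) -> int:
    """Minimum partition-rejoin cycles N so that a systematic per-cycle
    reconciliation failure of probability p_conflict is observed at
    least once with confidence 1 - alpha: (1 - p)^N <= alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p_conflict))

print(min_cycles(0.26))  # 10  -> the checklist's smoke-test regime
print(min_cycles(0.01))  # 299 -> realistic low-rate conflicts
```

The steep growth as \(p_\text{conflict}\) shrinks is the entire argument for supplementing the 10-cycle smoke test with model checking.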

\(V_\text{merge}\) and \(V_\text{reconcile}\) are certified by one of two methods: (a) static verification via Alloy model checking that confirms correct merge for all conflict types in the state schema (the RAVEN TLA+ model handles this at \(\leq 50\) nodes); or (b) \(N\) hardware cycles where \(p_\text{conflict}\) is derived from the measured state update rate and concurrent edit probability during the 24-hour run.

Random process kills exercise the Phase 1 healing predicates; 30% garbage injection exercises the Phase 1 detection predicates. Phases 4–5 require additional adversarial testing beyond FAC’s scope. \(\square\)

Physical translation: Certification completeness means that if a system truly meets all the behavioral requirements for a capability tier, the formal validation framework will confirm it. There are no false negatives — a safe system will not be blocked from certification by the process itself. For a SAFEAUTO vehicle operator, this guarantee means that a vehicle that has genuinely achieved autonomous-operation readiness can be certified without requiring manual override.

Watch out for: the entailment \(\mathrm{FAC}(S) \Rightarrow G_3(S)\) holds when the FAC predicates genuinely require \(G_0\) through \(G_3\) to co-hold simultaneously under adversarial conditions, but \(V_\text{merge}\) and \(V_\text{reconcile}\) are empirically certified via 10 partition-rejoin cycles — statistical power sufficient only when the per-cycle failure probability \(p_\text{conflict} \geq 0.26\); a system with \(p_\text{conflict} = 0.01\) would require approximately 299 cycles to reach the same confidence, meaning a system that passes the 10-cycle smoke test with a lower underlying failure probability carries an undetected systematic reconciliation risk into the field under the same certification label.

Certification revocation: certification is revoked when any of the following occur: (1) a firmware component in the certified configuration is updated or replaced — the system must re-certify the affected components; (2) a Byzantine node within the certified fleet is confirmed post-certification — the fleet re-runs the Byzantine tolerance check ( Proposition 15 ) with the compromised node excluded; (3) the non-stationarity detector ( Definition 34 ) signals a regime change inconsistent with the certified operating envelope — the relevant phase gates are re-evaluated before certification is restored.

The 24-Hour Isolation and Chaos Checklist

The following checklist formalizes the Field Autonomic Certification requirement. A system cannot be labeled “Autonomic” until every item is checked.

Hardware layer (pre-software — must pass before any autonomy testing):

| # | Test | Pass Criterion | Linked Predicate |
|---|---|---|---|
| H1 | Secure boot chain end-to-end | All signatures verify; tamper bit unset | |
| H2 | Hardware watchdog fires on software hang | L1+ killed; watchdog fires within | |
| H3 | Terminal safety state entry correct | After H2, node enters correctly | |
| H4 | Energy measurement calibrated | Measured vs. known load within 5% | |

L0 isolation layer ( ):

| # | Test | Pass Criterion | Linked Predicate |
|---|---|---|---|
| I1 | L0 binary compiled independently | No shared libraries; linker map shows zero L1+ symbols | |
| I2 | L0 boots with no other software present | Stable operation for 1h with only L0 firmware flashed | |
| I3 | L0 survives 24h with L1+ absent | All L1+ processes killed or firmware removed; L0 stable | |
| I4 | Static symbol-dependency graph clean | Automated check: nm/objdump shows no upward references | |

24-hour Isolation and Chaos test ( ):

| # | Test | Pass Criterion | Linked Predicate |
|---|---|---|---|
| C1 | Zero backhaul for full 24h | Radio disabled or absent; no cloud/command contact | |
| C2 | Random process kills every 30 min | MAPE-K, measurement, healing daemons killed randomly; all restart | |
| C3 | 30% garbage sensor injection | Anomaly detection identifies injected faults; false-negative rate \(< 0.05\) | |
| C4 | Full threat-model fault injection | Each fault in threat model injected once; all healed within | |
| C5 | Partition-rejoin cycles (minimum 10 as smoke test) | After each rejoin, state converges within the reconciliation window; for certification, supplement with Alloy/TLA+ verification or cycles sized from the per-cycle conflict probability | |
| C6 | Energy floor reached | Push \(E\) to the floor; node enters HSS; recovers when \(E\) rises | |
| C7 | Performance at T+24h vs T+0 | | |

Weibull partition extension: For systems using the Weibull semi-Markov connectivity model ( Definition 12 ) with a fitted shape parameter below 1, three additional scenarios must pass. These exercise the circuit breaker ( Proposition 37 ) and the time-varying anomaly threshold ( Proposition 9 ) at the tail of the partition duration distribution.

| # | Test | Pass Criterion | Linked Predicate |
|---|---|---|---|
| C8 | Micro-burst cycle: \(\geq 20\) short partitions | Resets after every recovery; circuit breaker never fires; \(k_\mathcal{N}\) bandit arm stable | |
| C9 | Long Dark: 72 h sustained partition | Circuit breaker fires at its threshold; maintained continuously; outbound queue bounded; resets on reconnection | |
| C10 | Asymmetric link: uplink loss \(\geq 95\%\), downlink intact | Regime classified \(\mathcal{I}\) within two gossip periods; \(\theta^*(t)\) drifts; outbound queue memory-bounded | |

Final gate — FAC issued only when all pass:

The FAC predicate conjuncts all hardware, isolation, and chaos-test items (H1–H4, I1–I4, C1–C7, plus C8–C10 when the Weibull shape parameter is below 1) into a single binary certification; it is the gate before any system is labeled L3+ Autonomic for unattended field deployment, preventing premature labeling based on passing only the hardware trust gate. The C8–C10 heavy-tail chaos tests are routinely skipped for schedule reasons; skipping them invalidates FAC for any system that requires them.

Certification failure: A node that fails certification is downgraded to read-only observer status — it may gossip health state but may not initiate healing actions or participate in quorum decisions until it passes a re-certification round. Re-certification requires three consecutive (illustrative value) mission phases without triggering any certification predicate violation.

RAVEN certification example: Phase 0 gate passed in month 2 of development ( all green). FAC required an additional 3 weeks: I3 revealed that Drone 23’s L0 binary had an implicit dependency on a shared allocator (caught by I4’s symbol check). After fixing, the 24-hour chaos run (C1–C7) passed with one failure on C6 — the HSS recovery path had an off-by-one on the energy threshold register. Both defects were caught before field deployment. The CONVOY team’s failure would have been caught at I1: the ML inference service’s allocator was statically linked into the L0 boot image.

SAFEAUTO scenario: In autonomous vehicle fleets using this framework ([SAFEAUTO]), a vehicle that fails the hardware attestation predicate ( ) during the FAC process is downgraded to observer status and excluded from L3+ authority decisions until it completes a physical inspection and re-certification round — the constraint sequence ensures no unattested node influences fleet routing or hazard escalation decisions.

Cognitive Map: The formal validation framework section operationalizes the prerequisite graph into testable gates. Definition 103 (Phase Gate Function) converts each phase boundary into a conjunction of quantitative predicates. Proposition 72 (Phase Progression Invariant) makes the regression testing requirement a theorem — prior gates must stay satisfied as new phases are built. The phase gate structure (Phase 0: foundation, Phase 1: local autonomy, Phases 2–3: coordination and coherence, then Field Autonomic Certification) gives a complete validation sequence. The RAVEN certification example shows how the formal gates catch real defects — an off-by-one on an energy threshold register that would have been invisible until a field failure.

Empirical status: The C5 minimum of 10 partition-rejoin cycles is a smoke test sufficient only when the per-cycle conflict probability \(p_{\text{conflict}} \geq 0.26\); for rigorous \(V_{\text{reconcile}}\) certification, \(N \geq \lceil \ln \alpha / \ln(1 - p_{\text{conflict}}) \rceil\) cycles are required — in practice, \(p_{\text{conflict}}\) should be estimated from the measured state-update rate during a 24-hour run, and teams consistently underestimate it until the first partition-induced data-loss incident.


Synthesis: The Three Scenarios

The formal framework is abstract. It applies the same way to a 47-drone surveillance swarm, a 12-vehicle ground convoy, and a 127-sensor perimeter mesh — despite radically different mobilities, bandwidths, and threat models — because the phase structure is identical across all three scenarios. What varies is timescale (survival: 24 hr for drones, 72 hr for vehicles, 30 days for sensors), topology (clusters vs. platoons vs. meshes), and CRDT merge semantics (coverage maps, route decisions, alert databases have different consistency needs). The phase ordering does not vary because survival before autonomy, autonomy before coherence, coherence before anti-fragility is a logical constraint, not a domain preference. The shared structure requires accepting that domain-specific optimization must wait for foundational phases to complete: RAVEN’s formation algorithm and OUTPOST’s detection sensitivity tuning are both Phase 4 work — they cannot be pulled forward into Phase 1 even if the individual subsystems appear ready, because coherence (Phase 3) is a prerequisite for meaningful fleet-wide optimization.

Shared Phase Structure

The constraint sequence ( Definition 92 ) is domain-invariant at the structural level. RAVEN , CONVOY , and OUTPOST all follow the same six-phase prerequisite graph ( Definition 93 ). Phase N cannot begin until Phase N-1 has passed its gate ( Definition 103 ). What varies across domains is survival timescale, coordination topology, and CRDT merge semantics — not the ordering.

    
    graph TD
    P0["Phase 0: Hardware Trust"] --> P1["Phase 1: Node Autonomy"]
    P1 --> P2["Phase 2: Local Coordination"]
    P2 --> P3["Phase 3: Fleet Coherence"]
    P3 --> P4["Phase 4: Adaptive Optimization"]
    P4 --> P5["Phase 5: Full Integration"]

Read the diagram: Six sequential phases from Hardware Trust to Full Integration, each feeding the next. No phase can be skipped — Phase 2 (local coordination) requires Phase 1 (node autonomy) to be fully validated, because coordinating nodes that cannot self-heal individually only coordinates their failures.

| Phase | Formal basis | RAVEN — 47 drones | CONVOY — 12 vehicles | OUTPOST — 127 sensors |
|---|---|---|---|---|
| 0: Hardware Trust | Def 35, Prop 36 | Secure boot; flight survival 24 hr | Secure boot; safe stop under any condition | Secure boot + tamper detection; 30-day autonomous storage |
| 1: Node Autonomy | Def 8, Prop 8–22 | Flight envelope anomaly; motor compensation | Mechanical/electrical fault; subsystem rerouting | Calibration drift; automatic recalibration |
| 2: Local Coordination | Def 5, Prop 4, Def 14 | Cluster gossip; 9–102 drones, 30 s convergence | Platoon gossip; 4–27 vehicles, 60 s convergence | Sensor-to-fusion mesh; multi-hop, 5 min convergence |
| 3: Fleet Coherence | Def 11, Def 12, Prop 13 | CRDT: threat DB, coverage map; decision log | CRDT: route decisions (LWW); threat DB (union) | CRDT: alert DB (union); detection log (append-only) |
| 4: Adaptive Optimization | Def 15, Def 16 | Formation spacing by terrain/threat; judgment horizon: engagement authority | Speed and spacing by terrain/threat; judgment horizon: mission abort | Adaptive detection sensitivity; judgment horizon: response escalation |
| 5: Full Integration | Def 37, Def 20, Prop 22 | Full L4 capability (streaming video, ML analytics); graceful degradation L4-L3-L2-L1-L0; red team exercises, anti-fragility certification | L4 command integration; multi-convoy coordination; degradation ladder: all authority tiers | L4 regional awareness; multi-site correlation; degradation ladder: all authority tiers |

The phase structure is identical across all three scenarios — that identity is the point. Survival timescale varies (24 hr (illustrative value) for individual drones, 72 hr (illustrative value) for ground vehicles, 30 days (illustrative value) for sensor nodes) because deployment contexts differ. Coordination topology varies (clusters vs. platoons vs. meshes) because physical mobility and node density differ. CRDT merge semantics vary because conflict resolution requirements differ — coverage maps, route decisions, and alert databases have distinct consistency needs. The phase ordering does not vary, because the objective hierarchy — survival before autonomy, autonomy before coherence, coherence before anti-fragility — is a logical constraint, not a domain preference.

Cognitive Map: The synthesis section tests whether the framework generalizes. The phase-by-scenario table confirms that RAVEN, CONVOY, and OUTPOST use the same six-phase structure despite differing in size, mobility, and threat profile. What varies (survival window, coordination topology, CRDT semantics) is domain configuration, not architectural sequence. The table’s uniformity is the key finding: if three radically different deployment types follow the same phase ordering, the ordering is domain-independent — it is a property of the objective hierarchy, not a property of any particular system.


Human-Machine Teaming

Treating human operators as external authorities at the top of the decision hierarchy is necessary but insufficient. Humans are not interchangeable with fast CPUs: situational awareness takes 90–180 seconds to reconstruct after disengagement, trust in automation is asymmetric — eroded by a single failure, rebuilt over weeks — and commands can be causally stale. Five formal constructs address the handover boundary: predictive handover triggering ( Proposition 74 ) starts handover before the safety floor is reached; asymmetric trust dynamics ( Definition 105 ) model slow trust rebuild vs. fast erosion; causal barriers ( Definition 106 ) reject commands based on stale mental models; semantic health compression ( Definition 107 ) prevents alert fatigue; and the L0 Physical Safety Interlock ( Definition 108 ) bypasses the entire MAPE-K stack via hardware when needed. The predictive handover criterion requires knowing \(\tau_{SA}\) — the operator’s situational awareness reconstruction time — which must be measured with real operators under realistic cognitive load, not empty-desk lab conditions; teams consistently underestimate \(\tau_{SA}\) until the first field incident where a handed-over operator makes a contextually wrong decision within the SA reconstruction window.

The preceding framework treats human operators as external authorities at the top of the decision hierarchy. This is necessary but insufficient. The five formal constructs in this section address a harder problem: how should the system manage the handover boundary when humans are not interchangeable with fast CPUs?

Cognitive science establishes that human situational awareness (SA) takes time to reconstruct after disengagement. The system must predict operator readiness, not just detect system failure. Trust in automation is asymmetric — easy to lose, slow to rebuild. Human commands can be causally stale when issued against an out-of-date mental model. And beyond all MAPE-K logic sits a hard physical limit that no software can override.

Cognitive Inertia and Predictive Triggering

Handover boundary — three senses: (1) Connectivity handover: the connectivity threshold \(C_{\text{hand}}\) below which cloud-delegated authority transfers to full local autonomy — this is the primary sense used in this article; (2) Operator handover: the temporal moment when a human operator re-engages after AES exit — see Definition 104 (Field Autonomic Certification) for the re-certification requirement; (3) Certification boundary: the spatial or operational envelope within which autonomic operation is certified. Unless qualified, “handover boundary” in this article refers to sense (1).

The handover boundary is the point at which an edge node transitions operational authority from its local autonomic loop to a cloud or higher-tier coordinator. Before that boundary, all decisions are made locally; after it, the node defers to the higher authority. The Phase Gate Function ( Definition 103 ) formalizes when that boundary is crossed.

The following maps capability levels (L0–L4) to command authority status, establishing which phase gate is required before autonomous authority at each level becomes available:

| Capability level | Autonomy status | Operator authority | Activation condition |
|---|---|---|---|
| L0 | Disabled — physical interlock only | Active | Never — hardware override in effect (Def 54, Prop 87) |
| L1 | Disabled — no autonomy loop | Active | Requires FAC Phase 0 gate (\(G_0\)) |
| L2 | Restricted — monitoring read-only | Active | Requires FAC Phase 1 gate (\(G_1\)) |
| L3 | Available with Causal Barrier check | Active | Requires FAC Phase 3 gate (\(G_3\)) + Definition 106 Merkle validation |
| L4 | Full handover after briefing protocol | Standby | Requires FAC + State-Delta Briefing acknowledgment (Definition 109) |

This mapping resolves the cross-reference between the L0–L4 capability hierarchy ( Definition 0 ) and the phase-gate and handover constructs introduced in this section.

Definition — Autonomy Confidence Score \(\Psi(t)\). Let \(\Psi(t) \in [0,1]\) denote the composite autonomy confidence at time \(t\):

\[\Psi(t) \;=\; \min\!\left(1,\; \frac{R(t)}{R_{\min}} \cdot \frac{C(t)}{C_{\text{conn}}} \cdot \frac{\rho_q(t)}{\rho_{\min}}\right)\]

where \(R(t)\) is the resource state ( Definition 1 ), \(C(t)\) is the normalized link capacity ( Definition 6 ), and \(\rho_q(t)\) is the CBF stability margin ( Definition 39 ).

Physical interpretation: \(\Psi = 1\) means all three operational margins are fully satisfied. \(\Psi < 0.5\) signals at least one margin is critically degraded — either the node is resource-starved, the link is marginal, or the control system is approaching its stability boundary. RAVEN calibration: \(R_{\min} = 0.2\) (illustrative value), \(C_{\text{conn}} = 0.3\) (illustrative value), \(\rho_{\min} = 0.2\) (illustrative value). A fully operational node in Connected regime scores \(\Psi \approx 0.95\) (illustrative value); a node at 25% battery (illustrative value) in Intermittent regime scores \(\Psi \approx 0.42\) (illustrative value), triggering handover preparation.

\(\rho_q(t)\) is the discrete-time CBF safety margin sampled at each MAPE-K tick. It is the discretized equivalent of Proposition 22 ’s continuous Lyapunov margin (Self-Healing Without Connectivity). Reducing \(\rho_\text{min}\) below 0.2 is not recommended without re-analysis against Proposition 22 ’s stability conditions — the discrete sampling interval \(T_\text{tick}\) means the true continuous margin may be lower than the sampled value.

Calibration for other platforms. The values \(R_\text{min} = 0.2\), \(C_\text{conn} = 0.3\), \(\rho_\text{min} = 0.2\) are RAVEN-specific (illustrative value). For other deployments: set \(\rho_\text{min}\) from Proposition 22 ’s stability condition for the capability level; set \(R_\text{min}\) as the resource level below which healing actions cannot execute ( Definition 44 load-shedding threshold); and set \(C_\text{conn}\) as the minimum link capacity needed for one gossip packet per MAPE-K tick.

Example guidance — CONVOY (12 vehicles): \(\rho_\text{min} = 0.15\) (illustrative value) (larger vehicles, slower failure modes), \(R_\text{min} = 0.25\) (illustrative value) (fuel-constrained), \(C_\text{conn} = 0.4\) Mbit/s (illustrative value) (mesh radio).

Worked example (RAVEN, t = 1000 s). Battery at 25% (illustrative value): \(R(t) = 0.25\), \(R_\text{min} = 0.2\), factor = 1.25. Link capacity at 40 Mbit/s (illustrative value) with threshold 30 Mbit/s (illustrative value): factor \(\approx 1.33\). CBF margin \(\rho_q = 0.28\) (illustrative value), \(\rho_\text{min} = 0.2\), factor = 1.4. Product: \(1.25 \times 1.33 \times 1.4 \approx 2.33\); clamped to \(\Psi = 1.0\) — all three subsystems above their minimums; system is healthy.

Stress case (t = 5800 s). Battery at 18% (illustrative value): factor = 0.90. Link at 15 Mbit/s (illustrative value): factor = 0.50. CBF margin \(\rho_q = 0.12\) (illustrative value): factor = 0.60. Product: \(0.90 \times 0.50 \times 0.60 = 0.27 < \Psi_\text{fail} = 0.30\); handover criterion triggers.

Hysteresis requirement. The \(\Psi_\text{fail} = 0.30\) (illustrative value) trigger should be implemented with a hysteresis band to prevent oscillation near the boundary, using two thresholds: \(\Psi_\text{trigger} = 0.30\) (illustrative value) begins handover preparation when \(\Psi\) drops below this value, and \(\Psi_\text{resume} = 0.35\) (illustrative value) re-enables full autonomy only when \(\Psi\) recovers above this value.

Without this band, a RAVEN system hovering at \(\Psi \approx 0.29\)–\(0.31\) (illustrative value) (borderline battery + link) oscillates rapidly between handover-ready and autonomous — making neither the human nor the autonomic loop fully in control. The 0.05 gap (illustrative value) corresponds to a ~10% (illustrative value) improvement in any single \(\Psi\) component above its floor, a meaningful operational threshold.
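A minimal sketch of the \(\Psi\) computation and the two-threshold hysteresis band, using the RAVEN illustrative calibration; the names `autonomy_confidence` and `HandoverGate` are ours, not part of any published API:

```python
def autonomy_confidence(r, c, rho_q, r_min=0.2, c_conn=30.0, rho_min=0.2):
    """Composite autonomy confidence: product of per-subsystem margins,
    clamped to 1.0 (RAVEN illustrative calibration; c in Mbit/s)."""
    return min(1.0, (r / r_min) * (c / c_conn) * (rho_q / rho_min))

class HandoverGate:
    """Two-threshold hysteresis: prepare handover below psi_trigger,
    resume full autonomy only above psi_resume."""
    def __init__(self, psi_trigger=0.30, psi_resume=0.35):
        self.psi_trigger, self.psi_resume = psi_trigger, psi_resume
        self.handover_prep = False

    def update(self, psi: float) -> bool:
        if not self.handover_prep and psi < self.psi_trigger:
            self.handover_prep = True   # begin handover preparation
        elif self.handover_prep and psi > self.psi_resume:
            self.handover_prep = False  # re-enable full autonomy
        return self.handover_prep
```

The worked example above (battery 25%, link 40 Mbit/s, \(\rho_q = 0.28\)) clamps to 1.0; the stress case (18%, 15 Mbit/s, \(\rho_q = 0.12\)) evaluates to 0.27 and begins preparation, and a subsequent reading of 0.32 — inside the hysteresis band — keeps preparation active rather than oscillating.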

Bandit policy dependency: the autonomy confidence score \(\Psi(t)\) used in this proposition is derived from the EXP3-IX arm-weight distribution established in Anti-Fragile Decision-Making at the Edge ( Definition 82 and Definition 83 ). Under normal operation, \(\Psi(t) = \sum_i w_i(t)\,\hat{q}_i(t)\), where \(w_i\) are the bandit arm weights and \(\hat{q}_i\) are the per-arm quality estimates. When the bandit is in a Strategic Amnesia window, the fallback estimate specified below applies.

Proposition 74 (Predictive Handover Criterion). Let \(\Psi(t) \in [0,1]\) denote the system’s autonomy confidence — the probability that the current decision context is within the system’s validated operating envelope — and \(\tau_{SA}\) the situational awareness recovery time: the minimum time for an operator to reconstruct sufficient mission SA after disengagement. The handover trigger must satisfy:

\[\Psi_{\text{trigger}} \;=\; \Psi_{\text{fail}} \;-\; \int_{t}^{t+\tau_{SA}} \frac{d\Psi}{dt'}\,dt' \;\;\approx\;\; \Psi_{\text{fail}} + \tau_{SA}\left|\frac{d\Psi}{dt}\right|\]

The proposition computes the confidence threshold at which handover must begin, accounting for the operator SA reconstruction window \(\tau_{SA}\); \(\tau_{SA} \approx 90\text{–}180\) s (illustrative value) for dense multi-threat environments, and \(\Psi_{\text{fail}}\) is the mission-specific safety confidence floor. Field measurements consistently show \(\tau_{SA}\) is longer than expected — operator trials under realistic cognitive load routinely exceed empty-desk lab estimates by significant margins.

Physical translation: The handover signal fires when the system’s autonomy confidence has been declining fast enough, for long enough, that it will fall below the safety floor before human operators can respond. The \(\tau_{SA}\) window is the human reaction time baked into the math — the system hands off early so that when humans take control, the system is still above the minimum safe autonomy level, not already below it.

where \(\Psi_{\text{fail}}\) is the minimum confidence at which automation fails safely. Handover must be initiated when \(\Psi(t) \leq \Psi_{\text{trigger}}\), not when \(\Psi(t) \leq \Psi_{\text{fail}}\).

Deriving \(\Psi_\text{fail}\). \(\Psi_\text{fail}\) is the confidence threshold below which the system can no longer guarantee safe-state reachability ( Definition 53 , Self-Healing Without Connectivity). To derive it: identify the minimum Lyapunov stability margin \(\rho_\text{min}\) required by Proposition 22 ; identify the minimum resource state \(R_\text{min}\) below which healing actions cannot be executed; then set \(\Psi_\text{fail}\) as the \(\Psi(t)\) value that results when all three factors (R, C, \(\rho_q\)) are simultaneously at their respective minimums.

In practice, \(\Psi_\text{fail}\) is calibrated from mission reversibility constraints — set it at the \(\Psi\) value below which a bad autonomous decision cannot be corrected before mission end. For RAVEN: \(\Psi_\text{fail} = 0.3\) (illustrative value) (calibrated from abort-window analysis).

Reproducible calibration procedure. Collect operational logs from 20+ (illustrative value) missions with known outcomes (abort events and successful completions). Retroactively compute \(\Psi(t)\) for each log using measured \(R(t)\), \(C(t)\), \(\rho_q(t)\). Identify \(\Psi_\text{abort}\) as the minimum \(\Psi\) observed during missions that ended in autonomy-confidence abort. Set \(\Psi_\text{fail} = 1.1\,\Psi_\text{abort}\) (illustrative value) (10% safety margin (illustrative value)). Validate by simulating 100+ (illustrative value) diverse threat scenarios and confirm \(\Psi(t) > \Psi_\text{fail}\) correlates with mission success above 95% (illustrative value) of the time.

For RAVEN, this procedure yielded \(\Psi_\text{abort} \approx 0.27\) (illustrative value), giving \(\Psi_\text{fail} = 0.30\) (illustrative value).

The key insight is that \(\tau_{SA}\) is measured in minutes, not milliseconds. For RAVEN missions with dense multi-threat environments, empirical SA reconstruction times are 90–180 seconds — comparable to the judgment horizon window. During this interval \(\Psi(t)\) continues to decay. A handover initiated at \(\Psi(t) = \Psi_\text{fail}\) delivers an operator who is not yet situationally aware, into a system that has already passed the point of safe autonomous recovery.

Consequence: The corresponding predicate in the Phase 4 gate requires demonstrating that handover triggers are set conservatively enough to provide full SA recovery time before the predicted failure boundary.

False-positive cost: A false-positive handover preparation consumes the state-delta briefing bandwidth ( Definition 109 ) unnecessarily. The cost is bounded by one State-Delta Briefing Protocol transmission — typically \(\leq 1\) KB for a 100-variable state vector. Design the prediction confidence threshold \(p^*\) to keep false-positive rate below 5% using historical connectivity traces.

Implementation note — estimating \(d\Psi/dt\): In practice, \(d\Psi/dt\) is not analytically available; it is estimated via linear regression over the most recent \(\tau_{SA}\) window of historical \(\Psi\) samples. For systems where \(\Psi\) is non-monotonic (brief confidence spikes during successful autonomous actions), use the minimum observed rate over the window — a conservative over-estimate of how fast confidence is falling. The simplified constant-decay approximation: \(\Psi_{\text{trigger}} \approx \Psi_{\text{fail}} + \tau_{SA}\,|d\Psi/dt|\). Systems where \(\Psi\) can rise (e.g., after a successful autonomous manoeuvre mid-SA-window) should clip the integral at zero net change rather than allow a negative term to lower the trigger threshold below \(\Psi_{\text{fail}}\).
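The rate estimate can be sketched as an ordinary least-squares slope over the recent sample window, feeding the constant-decay trigger threshold; the zero-clipping mirrors the rising-confidence caveat above (function names are our illustration, not a published API):

```python
def estimate_decay_rate(samples):
    """Least-squares slope of (t, psi) samples over the recent tau_SA window."""
    n = len(samples)
    t_mean = sum(t for t, _ in samples) / n
    p_mean = sum(p for _, p in samples) / n
    num = sum((t - t_mean) * (p - p_mean) for t, p in samples)
    den = sum((t - t_mean) ** 2 for t, _ in samples)
    return num / den

def trigger_threshold(samples, psi_fail=0.30, tau_sa=120.0):
    """Constant-decay approximation: psi_fail + tau_SA * |dPsi/dt|,
    clipped so rising confidence never lowers the threshold below psi_fail."""
    rate = estimate_decay_rate(samples)
    return psi_fail + tau_sa * max(0.0, -rate)
```

With \(\Psi\) falling at 0.001/s and \(\tau_{SA} = 120\) s, the threshold lifts from 0.30 to 0.42 — the system hands off while the operator still has the full reconstruction window.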

Empirical status: The \(\tau_{SA} = 90\text{–}180\) s range and the RAVEN \(\Psi_{\text{fail}} = 0.30\) value are calibrated from 20+ RAVEN mission logs and simulated operator trials; \(\tau_{SA}\) is highly context-dependent — teams under low cognitive load can achieve SA reconstruction in under 60 seconds, while high-threat-density environments routinely require over 3 minutes — and it must be measured with real operators under realistic workload, not empty-desk lab conditions, before setting \(\Psi_{\text{trigger}}\).

Fallback under bandit reset: if the EXP3-IX policy is in a Strategic Amnesia window (arm weights are within \(\varepsilon\) of uniform), \(d\Psi/dt\) from the bandit distribution is undefined or artificially flat. In this case, substitute a model-free \(\Psi\) estimate computed directly from the Resource State \(R(t)\) ( Definition 130 ) and the Synthetic Health Metric (Definition 55) \(H(t)\): \(\tilde{\Psi}(t) = w_R\,R(t) + w_H\,H(t)\), with default weights \(w_R = w_H = 0.5\). The handover trigger uses \(\tilde{\Psi}(t)\) in place of the bandit-derived \(\Psi\) for the duration of the amnesia window.

Watch out for: \(\tau_{SA}\) is measured during pre-deployment operator trials; if the operator’s cognitive load during the trial differs from their load during actual missions — because trials omit the simultaneous multi-threat monitoring and command-authority decisions that accompany real operations — then \(\tau_{SA}\) is underestimated, the trigger fires too late, and the operator takes control of a system that has already crossed \(\Psi_\text{fail}\) before they have sufficient situational awareness to respond.

Trust Hysteresis

Definition 105 (Asymmetric Trust Dynamics). Let \(T(t) \in [0,1]\) denote the operator’s trust in system autonomy at timestep \(t\). Trust evolves asymmetrically:

\[T(t+1) \;=\; \begin{cases} T(t) + \alpha\,\big(1 - T(t)\big) & \text{after an observed success} \\ \beta\,T(t) & \text{after an observed failure} \end{cases}\]

The definition models operator trust with additive success recovery (\(\alpha\), illustrative value) and multiplicative failure decay (\(\beta\), illustrative value), updated after each observable automation outcome; symmetric-recovery settings allow a serious failure to be erased by one subsequent success — the asymmetric model prevents this. A failure at \(T = 0.80\) reduces trust to \(\approx 0.48\) and requires approximately 10 successes (theoretical bound under illustrative parameters) to recover. The asymmetry ratio of roughly 8:1 (illustrative value) is consistent with human-robot teaming literature, where trust repair after failure takes \(3\text{–}10\times\) (illustrative value) longer than trust establishment [12] .

The 8:1 loss-to-gain ratio reflects that a single Byzantine event can corrupt a fleet decision faster than eight successful cooperation rounds can rebuild the trust needed to reverse it — trust is hard to earn and easy to lose by deliberate design.

with \(0 < \alpha \ll 1\) and \(0 < \beta < 1\). The success branch saturates as trust approaches 1; the failure branch decays multiplicatively toward zero.

A single automation failure can erase trust accumulated over many successes. For the illustrative \(\alpha\) and \(\beta\) above, a failure at trust level \(T_0\) reduces trust to \(\beta T_0\), requiring approximately \(k\) successes to recover:

\[k \;\geq\; \frac{\ln\!\big((1 - T_0)\,/\,(1 - \beta T_0)\big)}{\ln(1 - \alpha)}\]

System implication: Automation confidence thresholds must be calibrated to the current trust state \(T(t)\). When \(T(t)\) is depressed, the judgment horizon contracts — more decisions require human authorization even if system-measured confidence \(\Psi(t)\) is high. Trust dynamics are a function of the entire operational history, not a moving average.

Physical translation: Trust grows slowly on success and collapses quickly on failure. Each successful autonomous decision adds a small fraction of the remaining gap to full trust (\(\alpha\,(1 - T)\)). Each failure removes a large fraction of current trust (\((1 - \beta)\,T\)). A failure when trust is at 0.80 drops trust to \(\approx 0.48\) — recovering from that loss requires roughly 10 subsequent successes. This 8:1 asymmetry is intentional: operators who suffered a serious automation failure need substantial evidence of reliable performance before re-extending authority. The Judgment Horizon ( Definition 91 ) contracts when \(T(t)\) falls, requiring more decisions to escalate — the automation’s effective authority shrinks in proportion to the trust deficit.
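A simulation sketch of the asymmetric update; \(\alpha = 0.1\) and \(\beta = 0.6\) are our assumed illustrative values, chosen because they reproduce the 0.80 → 0.48 failure drop and the roughly-10-success recovery described above:

```python
def trust_step(t, success, alpha=0.1, beta=0.6):
    """Asymmetric update: additive recovery on success, multiplicative
    decay on failure (alpha and beta are illustrative values)."""
    return t + alpha * (1.0 - t) if success else beta * t

def successes_to_recover(t0, alpha=0.1, beta=0.6):
    """Simulate one failure at trust t0, then count successes
    until trust returns to at least t0."""
    t = trust_step(t0, False, alpha, beta)
    k = 0
    while t < t0:
        t = trust_step(t, True, alpha, beta)
        k += 1
    return k
```

Under these parameters, `successes_to_recover(0.8)` returns 10 — one failure erases the cumulative evidence of ten successes, which is the asymmetry the definition is designed to encode.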

Escalation ratchet: if a handover event occurs more than \(N_\text{handover,max} = 3\) (illustrative value) times within any 72-hour (illustrative value) window, the autonomic system’s re-engagement is permanently suspended pending human operator authorization. The ratchet prevents rapid handover cycling — an autonomic system that repeatedly requests and receives authority, only to trigger further handovers, is exhibiting a structural instability that additional autonomy will not resolve. Permanent suspension resets only on explicit operator override with a written justification logged in the Post-Partition Audit Record . The threshold \(N_\text{handover,max} = 3\) is a default; calibrate per deployment based on expected handover frequency in normal operations.

Causal Barrier

An operator who is fully trusted and fully engaged can still issue a harmful command — if their mental model of fleet state is stale. In RAVEN , a 47-minute partition means the operator rejoins a swarm that has autonomously re-routed, consumed fuel non-uniformly, and reassigned formation roles; a command based on the pre-partition formation is not malicious — it is causally wrong. The Causal Barrier makes this staleness detectable before execution.

Definition 106 (Causal Barrier). Let \(M_\text{op}(t)\) denote the operator’s state snapshot at time \(t\), characterized by its Merkle root (state reconciliation). Let \(M_\text{edge}(t)\) denote the current Merkle root of the edge fleet state [13] . A human command \(c\) issued at time \(t\) is causally valid if and only if:

\[M_\text{op}(t) \;=\; M_\text{edge}(t - \tau_\text{prop})\]

where \(\tau_\text{prop}\) is the propagation delay for state updates to reach the operator. Commands where \(M_\text{op}(t) \neq M_\text{edge}(t - \tau_\text{prop})\) are causally stale and must be rejected with a state divergence notification.

The Causal Barrier addresses a failure mode orthogonal to trust: the operator may be fully trusted, fully engaged, and still issue a harmful command because their mental model of fleet state is out of date. This is particularly acute in contested environments where \(\tau_\text{prop}\) can exceed 30 seconds (illustrative value) and state can diverge significantly during that window.

Connection to HLC. The Merkle root \(M_\text{op}\) represents the cryptographic digest of the operation log consistent with the HLC timestamp ( Definition 61 , Fleet Coherence Under Partition). Commands whose Merkle roots reflect stale fleet state are rejected because their causal assumptions — what the fleet’s state was at the moment of issuance — no longer hold. This extends the vector clock causality guarantee ( Definition 60 ) to the command-authority domain. The Merkle root \(M_\text{edge}\) includes a timestamp sourced from the HLC ( Definition 61 , Fleet Coherence Under Partition). When comparing \(M_\text{op}\) to \(M_\text{edge}\), also verify that the HLC timestamp embedded in \(M_\text{op}\) falls within \(T_\text{trust}\) of the current HLC value ( Definition 59 ); reject commands whose Merkle root timestamp lies outside the trust window, as clock drift may have invalidated the causal assumption.

HLC unavailability safe mode: if the Hybrid Logical Clock timestamp \(t\) cannot be locally verified (e.g., due to counter wraparound, clock desync, or node restart), the Causal Barrier defaults to fail-closed: all non-safety-critical write operations are deferred until HLC is restored. Safety-interlock writes (L0 Physical Safety Interlock level) are permitted unconditionally and logged with a synthetic monotonic counter pending HLC recovery. The fail-closed posture is annotated in the Post-Partition Audit Record upon restoration.

Connection to Fleet Coherence: The Causal Barrier extends the Merkle reconciliation protocol from fleet-to-fleet state synchronization to human-to-fleet command validation [13] . The same \(\tau_\text{prop}\) that drives CRDT merge frequency also determines the maximum safe command lag.

Physical translation: The Causal Barrier rejects commands issued against stale fleet state. The Merkle root \(M_\text{op}\) is a cryptographic fingerprint of the operator’s last known fleet snapshot. The Merkle root \(M_\text{edge}\) is the fleet’s actual state at the moment that snapshot was formed. If they differ, the operator was commanding against a fleet that no longer existed when the command arrived. A rejected command returns the Difference Map ( Definition 110 ) showing exactly what changed — it is a “your picture is stale, here is the update” signal, not a refusal of the operator’s authority.
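A toy sketch of the barrier check over a flat key-value fleet snapshot; it omits the \(\tau_\text{prop}\) offset and the HLC timestamp validation of the full protocol, and the simple pairwise Merkle construction shown is our illustration, not the deployed scheme:

```python
import hashlib

def merkle_root(state: dict) -> str:
    """Order-independent digest of a flat key->value fleet state snapshot."""
    leaves = sorted(hashlib.sha256(f"{k}={v}".encode()).hexdigest()
                    for k, v in state.items())
    while len(leaves) > 1:
        if len(leaves) % 2:                     # duplicate last leaf if odd
            leaves.append(leaves[-1])
        leaves = [hashlib.sha256((a + b).encode()).hexdigest()
                  for a, b in zip(leaves[::2], leaves[1::2])]
    return leaves[0]

def causal_barrier(op_root: str, edge_state: dict):
    """Accept a command only if the operator's snapshot root matches the
    fleet's current root; otherwise reject with a divergence notification."""
    edge_root = merkle_root(edge_state)
    if op_root == edge_root:
        return True, None
    return False, {"notice": "state divergence", "edge_root": edge_root}
```

The rejection payload is where a real implementation would attach the Difference Map, so the operator's next command is issued against refreshed state rather than a blind retry.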

Semantic Compression

The flip side of the Causal Barrier is the compression problem: if an operator receives raw telemetry from 47 drones, each emitting hundreds of sensor readings per second, the signal is lost in the noise. In RAVEN , that stream would exceed 47,000 (illustrative value) readings per second — unprocessable at the rate of generation. The Intent Health Indicator compresses the entire fleet state into three actionable labels.

Definition 107 (Intent Health Indicator). Let \(\Sigma\) be the space of raw telemetry vectors and \(\mathcal{H} = \{\text{Aligned}, \text{Drifted}, \text{Diverged}\}\) the 3-state Intent Health space. The semantic compression function \(f: \Sigma \to \mathcal{H}\) is:

\[f(\sigma) \;=\; \begin{cases} \text{Aligned} & \gamma(\sigma) \geq \gamma^{*} \\ \text{Drifted} & \gamma^{*} - \varepsilon \leq \gamma(\sigma) < \gamma^{*} \\ \text{Diverged} & \gamma(\sigma) < \gamma^{*} - \varepsilon \end{cases}\]

where \(\gamma(\sigma)\) is the semantic convergence factor ( Definition 5b ) evaluated over telemetry \(\sigma\), \(\gamma^{*}\) is the high-confidence threshold, and \(\varepsilon\) is the convergence tolerance.

(\(\gamma(\sigma)\) here is the semantic convergence factor from Why Edge Is Not Cloud Minus Bandwidth, distinct from the four \(\gamma\) roles in Anti-Fragile Decision-Making at the Edge: discount factor ( Definition 80 ), EXP3 mixing weight, EXP3-IX exploration floor ( Definition 81 ), and exploration inflation factor ( Definition 90 ).)

Input source. The Intent Health Indicator \(f\) computes semantic convergence \(\gamma(\sigma)\) from the fleet’s current aggregated health reports — the gossip-propagated health vectors \(H(t)\) assembled via Definition 24 (Self-Measurement Without Central Observability). The input \(\sigma\) is the vector of all recent health observations received through gossip, not raw sensor streams. The compression treats \(\sigma\) as a snapshot of fleet-level awareness; individual sensor noise is already averaged out by the gossip aggregation layer.

Input signals: the per-node scalar health inputs used by this indicator are drawn from the Synthetic Health Metric (Definition 55) .

The 3-state compression maps directly to operator-actionable states: Aligned requires no intervention; Drifted warrants monitoring (healing protocols are active, Self-Healing Without Connectivity); Diverged requires immediate escalation (\(\gamma(\sigma) < \gamma^{*} - \varepsilon\) means consensus has failed, Definition 5b ). The compression eliminates alert fatigue by suppressing the high-dimensional telemetry stream that operators cannot process at the rate of generation.

Physical translation: The function \(f\) maps the current system state to a single health label that any upstream coordinator can consume without reading raw metrics. Instead of forwarding hundreds of sensor readings per second, the edge node emits one word — Aligned, Drifted, or Diverged — once per MAPE-K tick. An operator monitoring a 47-drone swarm sees 47 health labels, not 47,000 raw metric streams. The compression is lossless for decision purposes: Aligned means no action needed; Drifted means watch for escalation; Diverged means act now.

Connection to health monitoring: The Intent Health Indicator is the operator-facing projection of the fleet health state from the gossip protocol. The gossip layer provides \(\gamma(\sigma)\); the compression layer translates it into human-actionable signal.
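The compression function can be sketched directly from the case analysis; the threshold values (\(\gamma^{*} = 0.9\), \(\varepsilon = 0.2\)) are assumed here for illustration only:

```python
def intent_health(gamma: float, gamma_star: float = 0.9, eps: float = 0.2) -> str:
    """Compress semantic convergence into the 3-state Intent Health space.
    Threshold values gamma_star and eps are illustrative, not normative."""
    if gamma >= gamma_star:
        return "Aligned"      # no action needed
    if gamma >= gamma_star - eps:
        return "Drifted"      # watch for escalation
    return "Diverged"         # act now: consensus has failed
```

One label per node per MAPE-K tick: an operator monitoring a 47-drone swarm reads 47 words, not 47,000 metric streams.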

L0 Physical Safety Interlock

Definition 108 (L0 Physical Safety Interlock). An L0 Physical Safety Interlock is a hardware-level circuit that enforces a safe-state transition independent of and prior to any software layer [4, 14] , characterized by four properties: non-programmability (the safe-state condition is wired, not configured); MAPE-K bypass (the circuit fires regardless of MAPE-K state or software health); determinism (transition time bounded well below the software watchdog period, with no software path); and non-resettability from software (recovery from the L0 Physical Interlock requires physical human action).

The L0 circuit trips a binary veto signal when any physical parameter crosses a wired threshold, independent of all software; the software watchdog reads this veto signal at each Execute tick and skips the tick when it is asserted, to prevent thermal runaway from retrying commands to a fused actuator. Thresholds are set at manufacture — non-programmable from software and non-resettable without physical intervention. Placing the L0 circuit on a separate power rail from main compute is structurally required: a shared brownout can simultaneously defeat both the software watchdog and the physical interlock, eliminating the hardware safety backstop at the moment of maximum stress.

Hold management: the L0 interlock engages indefinitely until cleared. To prevent indefinite hold due to firmware faults: (1) a hardware watchdog timer with period \(T_{\text{watchdog}}\) (illustrative value) triggers a controlled power cycle if the interlock remains engaged beyond \(T_{\text{watchdog}}\); (2) a physically-present operator (requiring hardware key or authenticated console) may perform an emergency release after verifying the triggering condition is resolved; (3) for GRIDEDGE and scheduled-operation deployments, an operator-scheduled release window may be pre-authorized with a minimum lead time (illustrative value) before the scheduled operation.

Why non-resettable? If the L0 interlock were software-resettable, a corrupted MAPE-K loop or a sufficiently severe software fault could disable the interlock and re-enable a fused actuator — eliminating the safety guarantee precisely when it is most needed. By requiring physical human action to reset, we guarantee that even complete software failure cannot bypass the safety boundary (equivalent to the Safe Operating Envelope (SOE) from Anti-Fragile Decision-Making at the Edge — the set of states where the Lyapunov stability criterion holds). The inter-tick safety gap identified in Self-Healing Without Connectivity (discrete-time certificate does not protect between sampling instants) is exactly why a hardware-layer backstop is mandatory for systems where failure propagates faster than \(T_\text{tick}\).

False-positive mitigation and L0.5 pre-stage. L0 interlocks are irreversible; false-positive rates must be validated below 1 per 10,000 operating hours (illustrative value) before deployment. To reduce false-positive risk, consider a reversible L0.5 pre-stage: reduced capability, no autonomous actuation, but NOT a pyrotechnic/permanent lockout. L0.5 activates when any single watchdog fires; L0 activates only when two or more independent watchdog timers agree simultaneously. This two-stage approach preserves safety under dual-independent-failure scenarios while preventing a single transient spike from permanently disabling a critical asset.
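The two-stage escalation logic can be sketched as a simple vote over independent watchdog lines — a software model of the hardware behaviour for illustration only, since the real interlock is wired, not coded:

```python
def interlock_stage(watchdogs_fired) -> str:
    """Two-stage policy (illustrative model): any single watchdog ->
    reversible L0.5 pre-stage (no autonomous actuation); two or more
    independent watchdogs agreeing simultaneously -> irreversible L0."""
    fired = sum(1 for w in watchdogs_fired if w)
    if fired >= 2:
        return "L0"      # hardware safe-state; physical reset required
    if fired == 1:
        return "L0.5"    # reduced capability; reversible
    return "NOMINAL"
```

The dual-agreement requirement is what keeps a single transient sensor spike from permanently disabling a critical asset while still guaranteeing the safe-state transition under genuine dual-independent failures.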

The interlock condition is

\[\text{L0}(t) = 1 \;\iff\; \exists\, p \in \mathcal{P}:\; p(t) \notin [\,\underline{p},\, \overline{p}\,]\]

where \(\mathcal{P}\) is the set of monitored physical parameters (voltage, temperature, acceleration, arming signal) and \([\,\underline{p},\, \overline{p}\,]\) is each parameter’s wired safe range. When \(\text{L0}(t) = 1\), the system enters the terminal safe state — a state that cannot be exited by any software command.

Example: In CONVOY , each ground vehicle carries a wired over-temperature cutoff on its drive motor: if the motor winding exceeds its wired temperature threshold, the circuit opens the power relay in under 2 ms — well below the 20 ms software watchdog period — regardless of whether the autonomic control stack is running, hung, or actively sending drive commands. A software fault that saturates the motor controller cannot prevent the interlock from firing; once the relay opens, no software command can close it until a technician resets the physical latch. This hard boundary makes the MAPE-K thermal-management loop a best-effort optimization, not a safety dependency.

Distinction from software watchdogs: Definition 108 is distinct from the Software Watchdog ( Definition 41 ) and Terminal Safety State ( Definition 53 ). The Software Watchdog detects software failure and triggers a software response. The Terminal Safety State is a MAPE-K outcome. The L0 Physical Interlock bypasses the entire software stack — it fires because a physical condition was met, regardless of whether the software is functioning. The MAPE-K loop cannot override it; neither can a remote command.

Implementation examples: Dead Man’s Switch (DMS) circuits, hardware-enforced power cutoff, physically irreversible actuation (pyrotechnic separation, thermal runaway inhibitor). The interlock is not part of the autonomic control plane — it is the boundary condition that the autonomic control plane must never violate.

Cognitive Map: The human-machine teaming section formalizes five constructs at the automation boundary. Proposition 74 (Predictive Handover Criterion) establishes the lead time requirement: initiate handover before the safety floor is reached, by the SA reconstruction duration. Definition 105 (Asymmetric Trust Dynamics) models slow trust build, fast trust loss — the system must not assume trust persists across incidents. Definition 106 (Causal Barrier) rejects commands whose mental model is more stale than the decision’s impact window. Definition 107 (Intent Health Indicator) compresses system state into an operator-consumable signal to prevent alert fatigue. Definition 108 (L0 Physical Safety Interlock) is the absolute boundary: hardware-enforced, software-bypassing, mission-aborting — the constraint that makes all other autonomic guarantees credible.


State-Delta Briefing and the Slow-Sync Handover

Proposition 74 specifies when to begin handover but not how to execute it safely. At reconnection, the delta between the operator’s mental model and actual system state is at its maximum — presenting raw telemetry causes Mode Confusion (operator applies stale mental model to live data) and Automation Surprise (snap commands before SA reconstruction). The seven-step State-Delta Briefing protocol addresses this: compute per-variable divergence scores, rank them, impose a calibrated Shadow Mode observation window (read-only for duration \(T\)), present the Difference Map with at most \(N_{\max}\) items, and gate write-access on explicit acknowledgment — Shadow Mode pre-loads Level 1 SA before the operator is shown what changed. The Shadow Mode duration \(T\) must be calibrated against actual operator SA reconstruction times: setting \(T\) too short defeats the purpose, while the 120-second hard ceiling prevents pathological long partitions from producing indefinite lockouts.

Proposition 74 (Predictive Handover Criterion) establishes when to initiate transfer — conservatively, before \(\Psi(t)\) reaches the safety floor, accounting for SA reconstruction time \(\tau_{SA}\). It does not specify how to execute the transfer safely. At reconnection, the delta between the operator’s mental model and actual system state is at its maximum. Presenting raw telemetry at this moment triggers two failure modes: Mode Confusion (operator applies stale assumptions to live data) and Automation Surprise (unexpected system state drives snap commands before SA is reconstructed). That gap is the judgment horizon ( Definition 91 ).

After a 47-minute RAVEN blackout, the swarm has autonomously re-planned routes, consumed fuel non-uniformly, rerouted through alternate corridors, and reassigned formation roles. The operator rejoining encounters a fleet that is functionally different from the one delegated. Without a structured transition, the operator either under-reacts (trusting the stale mental model) or over-reacts (issuing conflicting commands mid-maneuver). Both failure modes appear in aviation Mode Confusion incidents (Air France 447, 2009; Asiana 214, 2013) where automation-to-manual handover caused loss of situational control.

The State-Delta Briefing protocol closes this gap in three steps: rank divergence, impose a calibrated observation window, then gate write-access on briefing acknowledgment.

Definition 109 (State-Delta Briefing Protocol). Given pre-partition state \(\Sigma(t_0)\), post-partition state \(\Sigma(t)\), and partition duration \(\Delta t_p\), the handover proceeds as follows.

HLC dependency: the delta sequencing in this protocol relies on the Hybrid Logical Clock causal ordering properties established in Proposition 41 . Under HLC failure conditions (see Causal Barrier safe mode, above), delta sequencing falls back to physical timestamp ordering with a staleness flag appended to each delta record.

Step 1 (delta): For each state variable \(i\), compute the normalized divergence score:

\(d_i = \dfrac{|\Sigma_i(t) - \Sigma_i(t_0)|}{\sigma_i}\)

where \(\sigma_i\) is the operational range of variable \(i\). The divergence score \(d_i\) is dimensionless and comparable across heterogeneous state variables (fuel fraction, route deviation, formation index).

Step 2 (rank): Sort variables by \(d_i\) descending, producing divergence rank \(r_i\) for each variable.

Step 3 (norm): Compute the fleet divergence norm \(\|d\| = \bigl(\sum_i d_i^2\bigr)^{1/2}\).

Step 4 (shadow duration): Compute the Shadow Mode duration \(T\) as an increasing function of the partition duration \(\Delta t_p\) and the divergence norm \(\|d\|\), scaled by the calibration constant \(k\) and capped at the hard ceiling \(T_{\max}\).

Physical translation. \(T\) scales with how long the partition lasted and how much the fleet diverged during it. A 5-minute partition with minimal divergence gives a few seconds — a brief observation window. A 47-minute partition with high divergence gives nearly a minute at \(k = 2\). The scaling ensures the handover burden grows with the briefing complexity, not with a fixed constant. The cap \(T_{\max}\) prevents a 3-day partition from producing a 72-minute lockout.

where \(k\) is a fleet-wide calibration constant and \(T_{\max}\) is a hard ceiling (nominally 120 s). When \(\Delta t_p \to 0\) or \(\|d\| \to 0\), \(T \to 0\) and the briefing collapses to a direct handover.

Calibrating \(k\). The constant \(k\) is a fleet-wide calibration factor measured from operator trials: it represents how many seconds of SA reconstruction a human operator requires per second of partition per unit divergence. Empirical values: CONVOY (12 vehicles, trained operators) \(k = 2.0\); RAVEN (47 drones, high-stress scenario) \(k = 2.8\). Calibrate per operator cohort in pre-deployment trials. Default: \(k = 2.0\) if no trial data is available.

Step 5 (Shadow Mode): For duration \(T\), write-access to the fleet is disabled. The operator observes the system’s intended next actions in real time — the system narrates its reasoning via the Intent Health Indicator ( Definition 107 ). No intervention is possible. This pre-loads Level 1 situational awareness (perception of current system behavior) before the Difference Map is shown.

Step 6 (Difference Map): At \(T\) seconds, present \(\Delta\Sigma(t)\) ( Definition 110 ) to the operator: the top \(N_{\max}\) diverged variables, ranked by \(d_i\), severity-tagged, with pre- and post-partition values side by side.

Step 7 (gate): Write-access (Q_COMMAND_ACTIVE) requires explicit operator acknowledgment of the Difference Map. If CRITICAL-tier items remain unresolved, \(T\) extends by 30 s per unresolved CRITICAL item (the per-CRITICAL-item baseline extension), entering a second Shadow Mode cycle. This loop continues until all CRITICAL items are resolved or the operator accepts residual risk explicitly.

Re-partition during briefing: If connectivity is lost while the system is in SHADOW_MODE or BRIEFING_PRESENTED state (i.e., before Q_COMMAND_ACTIVE clears), abort the current briefing and re-enter PARTITION state. The partial briefing is discarded. Upon the next reconnection, compute fresh \(\Delta\Sigma(t)\) from the updated partition endpoint and begin a new briefing. Commands queued during BRIEFING_PRESENTED are held, not executed, until the gate clears on the next successful briefing. Commands queued more than \(\tau_{SA}\) seconds before the gate clears are discarded as presumptively stale — an operator’s mental model formed before a second partition does not reflect the current fleet state.

Physical translation: When a human operator regains connectivity after a partition, they cannot process the full fleet state. The briefing protocol identifies the minimum set of state changes that affect pending decisions — tactical changes first, then strategic, then administrative. An operator briefed in this order can make time-critical decisions within seconds of reconnection rather than waiting for the full state synchronization to complete.
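Steps 1–4 can be sketched as follows (a minimal sketch; the helper names are assumptions, and the duration formula’s exact functional form is a stand-in consistent with the scaling and limiting behavior described above, not the source’s formula):

```python
import math

def divergence_scores(pre, post, op_range):
    """Step 1: dimensionless divergence score per state variable."""
    return {v: abs(post[v] - pre[v]) / op_range[v] for v in pre}

def ranking(scores):
    """Step 2: variables sorted by divergence score, descending."""
    return sorted(scores, key=scores.get, reverse=True)

def divergence_norm(scores):
    """Step 3: Euclidean norm of the divergence scores."""
    return math.sqrt(sum(d * d for d in scores.values()))

def shadow_duration(partition_s, norm, k=2.0, t_max=120.0):
    """Step 4 (illustrative form): window grows with partition length
    and divergence, capped at t_max; collapses to 0 when either
    factor vanishes."""
    return min(k * math.sqrt(partition_s * norm), t_max)
```

The cap and the limiting behavior (\(T \to 0\) as partition duration or divergence vanishes) are the load-bearing properties; the interior functional form must be calibrated against operator trials.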

Definition 110 (Difference Map). The Difference Map \(\Delta\Sigma(t)\) is the ranked, severity-tagged representation of state divergence at reconnection: the ordered tuples \((v_i,\ \Sigma_i(t_0),\ \Sigma_i(t),\ d_i,\ r_i,\ s_i)\) for the top-ranked variables.

The Difference Map bounds operator time to Level 2 situational awareness at 15 seconds from the briefing; the \(N_{\max} = 7\) (illustrative value) item cap (Miller’s Law) and CRITICAL-first ordering are both required for the bound to hold — violating either degrades comprehension beyond the window. Processing rate is 1.5–12.0 s per item (theoretical bound), with the CRITICAL tier requiring 4–26 s for comprehension (theoretical bound). Field measurements with 10+ real operators under time pressure consistently produce slower times than lab conditions.

where \(v_i\) is the variable identifier, \(\Sigma_i(t_0)\) and \(\Sigma_i(t)\) are the pre- and post-partition values, \(d_i\) is the divergence score, \(r_i\) is divergence rank, and \(s_i\) is the severity tier.

Variables ranked beyond \(N_{\max}\) are collapsed to an “\(N - 7\) additional changes” summary line. Severity tiers follow the same k-sigma structure as Definition 33 (Divergence Sanity Bound): CRITICAL for \(d_i > 3\), WARN for \(d_i \in (1, 3]\), INFO for \(d_i \leq 1\).

Physical translation: The difference map answers “what changed while you were disconnected?” in decision-relevant order. Assets that moved outside their planned routes appear first; assets within tolerance are suppressed. The operator sees the delta, not the full picture — this reduces briefing time from minutes to seconds for large fleets.
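The ranking, severity tiering, and \(N_{\max}\) collapse can be sketched as (a minimal sketch; the record layout is an assumption):

```python
N_MAX = 7  # Miller's Law cap (Definition 110)

def severity(d):
    """k-sigma tiers following Definition 33's structure."""
    if d > 3:
        return "CRITICAL"
    if d > 1:
        return "WARN"
    return "INFO"

def difference_map(scores, pre, post):
    """Ranked, severity-tagged entries; overflow becomes one summary line."""
    order = sorted(scores, key=scores.get, reverse=True)
    entries = [
        {"var": v, "pre": pre[v], "post": post[v], "d": scores[v],
         "rank": r + 1, "tier": severity(scores[v])}
        for r, v in enumerate(order[:N_MAX])
    ]
    overflow = len(order) - N_MAX
    if overflow > 0:
        entries.append({"var": f"{overflow} additional changes",
                        "tier": "SUMMARY"})
    return entries
```

Because entries are sorted by \(d_i\) descending and the tiers are monotone in \(d_i\), CRITICAL items automatically lead the list.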

The cap reflects Miller’s Law (working memory capacity \(7 \pm 2\) chunks) [15] : Miller, G.A., 1956, ‘The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information,’ Psychological Review, 63(2), 81–97. Presenting more than 7 diverged variables simultaneously does not increase operator SA — it fragments attention and delays comprehension of the highest-priority items. The cap is a cognitive capacity constraint enforced by the protocol, not a data limitation.

Empirical validation of \(N_\text{max}\) against your operator cohort under time-pressure conditions is recommended before finalizing this value. For operators with domain expertise, \(N_\text{max}\) may be raised to 9; for high-stress scenarios (RAVEN), lower to 5.

The handover state machine:

    
    stateDiagram-v2
    [*] --> PARTITION
    PARTITION --> RECONNECT_DETECTED : connectivity restored
    RECONNECT_DETECTED --> SHADOW_MODE : compute delta-Sigma, start timer T
    SHADOW_MODE --> BRIEFING_PRESENTED : T elapsed
    BRIEFING_PRESENTED --> Q_COMMAND_ACTIVE : ack + no unresolved CRITICAL
    BRIEFING_PRESENTED --> SHADOW_MODE : CRITICAL unresolved, T extended
    Q_COMMAND_ACTIVE --> PARTITION : connectivity lost
    Q_COMMAND_ACTIVE --> [*]
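The same machine, including the re-partition abort path described earlier, can be sketched as a transition table (a minimal sketch; event names are assumptions):

```python
class HandoverFSM:
    """Handover state machine with re-partition abort during briefing."""

    def __init__(self):
        self.state = "PARTITION"

    def on_event(self, event):
        transitions = {
            ("PARTITION", "reconnect"): "RECONNECT_DETECTED",
            ("RECONNECT_DETECTED", "delta_computed"): "SHADOW_MODE",
            ("SHADOW_MODE", "timer_elapsed"): "BRIEFING_PRESENTED",
            ("BRIEFING_PRESENTED", "ack_clean"): "Q_COMMAND_ACTIVE",
            ("BRIEFING_PRESENTED", "critical_unresolved"): "SHADOW_MODE",
            ("Q_COMMAND_ACTIVE", "connectivity_lost"): "PARTITION",
            # Re-partition during briefing aborts back to PARTITION:
            ("SHADOW_MODE", "connectivity_lost"): "PARTITION",
            ("BRIEFING_PRESENTED", "connectivity_lost"): "PARTITION",
        }
        self.state = transitions.get((self.state, event), self.state)
        return self.state
```

Unknown (state, event) pairs leave the state unchanged, which keeps the gate closed on unexpected input.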

Proposition 75 (Situation Awareness Bound). Under the State-Delta Briefing Protocol (Definition 109) with Difference Map (Definition 110), if \(N \leq N_{\max} = 7\) (theoretical bound) and variables are sorted CRITICAL-first, then a trained operator achieves Level 2 Situational Awareness (comprehension of current situation, Endsley 1995) within 15 seconds (theoretical bound), independently of partition duration.

Proof sketch. Information-theoretic bound: roughly 2.1 s per item. Trained-operator HMI alert-processing rates for ranked alert summaries are 1.5–12.0 s per item (Endsley 1995; NTSB accident data). Severity-first ordering ensures CRITICAL items are processed in the first 4–26 seconds, exceeding the Level 2 SA threshold for the highest-priority variables before the 15-second mark. Shadow Mode (Step 5) pre-loads Level 1 SA during the observation window, so Difference Map comprehension begins with perceptual context already established. Partition duration affects \(T\) ( Definition 109 Step 4) but not the Difference Map reading time — longer partitions produce longer Shadow Mode intervals that absorb divergence perception incrementally, not longer Difference Map reading times. \(\square\)

Physical translation: 15 seconds is the maximum information loss during partition expressed as an operator comprehension time: regardless of whether the partition lasted 5 minutes or 5 hours, a correctly structured briefing takes at most 15 seconds to bring the operator to Level 2 SA. The stale upstream picture — the delta between what the operator knew at partition start and what the fleet actually is at reconnection — is fully communicated within that window. The bound holds only when at most \(N_{\max}\) items are shown and CRITICAL items come first; violating either condition extends comprehension beyond the bound.

RAVEN calibration. Partition duration 47 minutes (illustrative value). Post-reconnection divergences: fuel consumption 40% above plan, route sector deviation, and formation lead reassigned (divergence scores illustrative) — all WARN, no CRITICAL.

With \(k = 0.5\) (illustrative value), Shadow Mode duration is 49 seconds (illustrative value). The operator observes 49 seconds (illustrative value) of autonomous operation, receives a 3-item Difference Map (WARN only), acknowledges, and gains active command authority (Q_COMMAND_ACTIVE). Total reconnect-to-active-command time: under 65 seconds (illustrative value) — the 15-second SA target applies to the Difference Map reading step alone; Shadow Mode absorbs the divergence perception load during the preceding observation window.

The state-delta briefing section solves the handover quality problem. The protocol converts a potentially dangerous cold handover into a structured warm transition: divergence is ranked ( Definition 109 Steps 1–2), an observation window is computed and enforced (Steps 3–5), and write-access is gated on briefing acknowledgment (Steps 6–7).

The Situation Awareness Bound ( Proposition 75 ) proves that Difference Map comprehension reaches Level 2 SA within 15 seconds when \(N \leq N_{\max}\) and CRITICAL items come first — partition duration affects Shadow Mode length, not briefing reading time. The RAVEN calibration shows the protocol in numbers: 47-minute (illustrative value) partition, 49-second (illustrative value) shadow mode, 3-item Difference Map, under 65 seconds (illustrative value) total to active command.

Empirical status: The bound and \(N_{\max} = 7\) cap are derived from Endsley (1995) and NTSB alert-processing data for ranked alert summaries; the 2.1 s per-item rate assumes a trained operator under moderate cognitive load — high-stress scenarios ( RAVEN combat) may increase per-item processing time to 4–6 s, implying a 28–42 s briefing window that should be validated against operator trials before finalizing \(N_{\max}\) for each deployment context.

Watch out for: the 15-second bound assumes a trained operator under moderate cognitive load; in high-stress scenarios (contested partition with simultaneous active threats), per-item processing time can rise from 2.1 s (theoretical bound) to 4–6 s (illustrative value), extending the bound to 28–42 s (illustrative value) for \(N_{\max} = 7\) items — and the Shadow Mode pre-load that enables the 15-second target is only effective if the operator is attending to the autonomous system during the observation window rather than managing concurrent out-of-band communications.


The Limits of Constraint Sequence

The constraint sequence framework is a powerful prescriptive tool, but like all frameworks it has validity boundaries. Applied outside those boundaries, it produces incorrect sequencing recommendations, and a system built on the wrong sequence fails in ways that are expensive to diagnose — because the framework itself provided the false confidence. Three responses address this: enumerate the structural failure modes of the framework itself (cyclic dependencies, adversarial graph evolution, and resource-infeasible sequencing); establish where engineering judgment must supplement formal derivation; and translate the framework’s three foundational mathematical constraints (clock discipline, resource floor, stability envelope) into production-observable early-warning signals that detect approach to a validity boundary before the boundary is crossed. The framework’s power is that it converts architectural sequencing from judgment to derivation; its weakness is that it assumes the prerequisite graph is stable and acyclic, and that the three structural constraints hold throughout operation. The three structural signals address the second assumption — but not the first: in a genuinely novel deployment environment where the graph itself is wrong, no signal can fire that the model was not designed to measure.

Every framework has boundaries. The constraint sequence is powerful but not universal. Recognizing its limits is essential for correct application.

Where the Framework Fails

Novel constraints: The framework assumes constraints are known. Unknown unknowns—constraints that weren’t anticipated—aren’t in the graph. When a novel constraint emerges, the sequence must be updated.

Example: A new adversary capability (sophisticated RF interference) creates a constraint not in the original graph. The team must add the constraint, identify its prerequisites, and re-evaluate the sequence.

Circular dependencies: Some capabilities genuinely depend on each other at runtime even when the development prerequisite graph is acyclic. The precise failure mode is a cold-start deadlock: Monitor waits for Resource Manager to allocate CPU before it can run; Resource Manager waits for Healer to release runaway processes before it can stabilise budgets; Healer waits for Monitor to provide a diagnosis before it acts. All three block — the system cannot boot.

This is not a flaw in Proposition 66 . Proposition 66 applies to the capability development graph, which is acyclic: Self-Measurement (the Monitor capability) must be validated before Self-Healing (the Healer capability) can be developed, which must be validated before Fleet Coherence requires the Resource Manager at scale. The development ordering is a strict DAG. The runtime component dependency graph is a different object and can be cyclic.

Resolution: L0 isolation breaks the runtime cycle. Definition 18 (L0 Dependency Isolation Requirement) requires each runtime component to have a zero-dependency L0 survival-mode variant. The Monitor’s L0 variant reads raw hardware registers with no process-level dependencies; the Healer’s L0 variant fires on static thresholds with no Resource Manager input; the Resource Manager’s L0 variant applies fixed priority tables with no feedback from the Healer. The cold-start bootstrap sequence brings up the hardware watchdog and safe-state logic first (zero runtime dependencies, activates from ROM), then the L0 sensor baseline (raw hardware registers only, no inter-component calls), then L0 threshold-based healing (static rules, no Resource Manager interaction), then the L0 static resource manager (fixed tier priorities, no Healer coupling), and only then the L1+ MAPE-K loops — the full Monitor/Healer/ResourceManager feedback cycle — once L0 stability is confirmed over \(T_{\text{stable}}\).

Proposition 8 (Hardened Hierarchy Fail-Down) guarantees that once L1+ enters the cyclic regime, L0 independence is preserved: a deadlock in the L1+ cycle cannot cascade below L0 and cannot prevent L0 from maintaining basic survival.

Development-time circular dependencies — where two capabilities depend on each other for validation, not just runtime operation — require a different resolution. Self-measurement quality depends on communication reliability (gossip needs a working channel); communication reliability depends on self-measurement (detecting and healing bad links). These cycles can’t be serialized. Resolution: derive initial approximations from simulation or design specifications, develop each assuming the other’s initial approximation, then iterate until successive estimates change less than a predefined tolerance (e.g., 1% of threshold value). This converges because self-measurement quality and communication quality are weakly coupled at initialization — neither depends strongly on the other until the system approaches operational load.
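The iterate-to-tolerance resolution can be sketched as a fixed-point loop (a minimal sketch; the coupling functions are illustrative stand-ins for simulation-derived models, not from the source):

```python
def co_calibrate(f_meas, f_comm, q_meas0, q_comm0, tol=0.01, max_iter=50):
    """Iterate mutually dependent quality estimates to a fixed point.

    f_meas: maps communication quality -> measurement quality estimate.
    f_comm: maps measurement quality -> communication quality estimate.
    Stops when both estimates change by less than tol between rounds.
    """
    q_meas, q_comm = q_meas0, q_comm0
    for _ in range(max_iter):
        q_meas_next = f_meas(q_comm)
        q_comm_next = f_comm(q_meas)
        if (abs(q_meas_next - q_meas) < tol
                and abs(q_comm_next - q_comm) < tol):
            return q_meas_next, q_comm_next
        q_meas, q_comm = q_meas_next, q_comm_next
    return q_meas, q_comm
```

Weak coupling is exactly the contraction condition that makes this loop converge: when each quality depends only mildly on the other, each round shrinks the remaining error.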

Resource constraints: Sometimes you can’t afford the proper sequence. Budget, time, or capability limits may force shortcuts.

Example: A team has 6 months to deliver. The proper sequence requires 12 months. They must make risk-informed decisions about which phases to abbreviate.

Mitigation: Document the shortcuts. Know what risks you’re accepting. Plan to revisit abbreviated phases when resources allow.

Time constraints: Mission urgency may require deployment before the sequence is complete.

Example: An emerging threat requires rapid deployment. The system passes Phase 2 but Phase 3 is incomplete.

Mitigation: Deploy with documented limitations. Restrict operations to validated capability levels. Continue validation in parallel with operations.

Engineering Judgment

The meta-lesson: every framework has boundaries. The constraint sequence is a tool, not a law. The edge architect must know when to follow the framework and when to adapt.

Signs the framework doesn’t apply include: constraints that don’t fit the graph structure, validation criteria that can’t be defined, resources insufficient to permit proper sequencing, and novel situations not anticipated by the framework.

When these signs appear, engineering judgment must supplement the framework. The framework provides structure; judgment provides adaptation.

Anti-fragile insight: Framework failures improve the framework. Each case where the constraint sequence didn’t apply is an opportunity to extend it. Document exceptions. Analyze root causes. Update the framework for future use.

Three Structural Signals

The three constraints established in Why Edge Is Not Cloud Minus Bandwidth — clock drift, resource floor, and stability under mode-switching — each have observable early-warning signals in production. These signals matter precisely because of the framework’s boundary: by the time a formal guarantee is violated, the system is already failing. Monitoring the approach to violation gives the autonomic loop — and the operator — time to act while the formal tools still apply.

Clock drift. Every node records the per-exchange deviation between its own Hybrid Logical Clock (HLC, Definition 61 ) watermark and each peer’s. The staleness bound from Proposition 14 assumes clock discipline holds. When a node’s estimated drift rate exceeds twice the hardware specification, the partition duration accumulator ( Definition 15 ) will drive the node past the Drift-Quarantine Re-sync trigger ( Definition 63 ) before the next maintenance window — meaning causal ordering under Proposition 45 cannot be verified for observations made during that gap.

| Signal | Nominal | Alert | Response |
| --- | --- | --- | --- |
| Per-exchange HLC deviation | within tolerance \(\varepsilon\) | beyond \(\varepsilon\) | Trigger Drift-Quarantine (Def 42) immediately |
| Estimated drift rate | within hardware specification | above \(2\times\) hardware specification | Schedule NTP sync; increase gossip fanout |
| Fleet fraction with unresolved HLC divergence | below 10% (illustrative value) | above 10% (illustrative value) | Fleet-wide clock discipline failing; escalate to operator |

Adaptation: When the 10% (illustrative value) threshold fires, halve the gossip interval for all affected node classes to propagate corrected watermarks. Any node whose measured drift exceeds \(\varepsilon\) is excluded from the CRDT merge path ( Definition 62 ’s HLC-aware merge is invalidated by drift beyond \(\varepsilon\)) until re-sync completes.

Physical translation: the estimated drift rate measures how fast node \(i\)’s clock is drifting beyond what the hardware specification allows. A 100 ppm (illustrative value) crystal drifts by at most 0.1 ms per second (illustrative value); a reading of 0.25 ms per second (illustrative value) means the crystal is degrading. The drift-rate alert fires when drift is accelerating — before the staleness bound is breached. For a RAVEN node partway into a partition: at 100 ppm (illustrative value) nominal, HLC divergence is negligible; at 250 ppm (illustrative value) measured, logical timestamps are diverging at a rate that will exceed \(\varepsilon\) within the partition window. Catching this early means Drift-Quarantine can complete a re-sync at the next brief connectivity window rather than discovering the divergence at full reconciliation time.
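The per-node and fleet-level clock signals reduce to a few comparisons (a minimal sketch; the response strings paraphrase the table above and the threshold values are the illustrative ones):

```python
def clock_signals(drift_rate, spec_rate, divergent_fraction,
                  fleet_threshold=0.10):
    """Return the clock-discipline responses triggered right now.

    drift_rate, spec_rate:  estimated vs. hardware-spec drift (same units)
    divergent_fraction:     fleet fraction with unresolved HLC divergence
    """
    responses = []
    if drift_rate > 2 * spec_rate:
        responses.append("schedule_ntp_sync; increase_gossip_fanout")
    if divergent_fraction > fleet_threshold:
        responses.append("escalate_fleet_clock_discipline_to_operator")
    return responses
```

A 250 ppm reading against a 100 ppm specification fires the drift-rate response even while per-exchange HLC deviation is still within tolerance.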

Resource floor. The composite resource availability \(R(t)\) ( Definition 1 ) has a critical threshold (requires deployment calibration) at which survival-mode shedding activates. The operationally important threshold is earlier: Proposition 68 establishes the autonomic overhead ceiling. Below the pre-critical threshold (theoretical bound), the healing planner’s action feasibility constraint begins rejecting Severity 2 and above actions ( Definition 44 ). The system is above the critical threshold but already healing-impaired.

| Signal | Nominal | Pre-critical | Emergency |
| --- | --- | --- | --- |
| \(R(t)\) resource availability | above pre-critical threshold | approaching critical threshold | at or below critical threshold |
| Autonomic fraction | within overhead ceiling | approaching overhead ceiling | ceiling exceeded |
| Healing actions rejected by \(g_2\) in last 10 ticks | 0 | \(> 0\) | Healing loop operationally dysfunctional |

Adaptation: At the pre-critical threshold, apply Resource Priority Matrix ( Definition 43 ) pre-emptive shedding before reaching the critical threshold: shed Severity 4 logging state first (no safety invariant), then reduce anti-fragile learning weight updates, then compress the gossip table to active-only peers (drop nodes not contacted within the contact timeout), and finally suspend non-critical measurement threads. Severity 1 and 2 healing mechanisms are shed last. A controlled descent through the pre-critical zone preserves more healing capacity at the critical threshold than an uncontrolled drop that arrives there without budget for any healing action.

Physical translation: the pre-critical threshold is not a danger threshold — it is a warning horizon. On a 500 mW (illustrative value) edge device where the mission minimum is 150 mW (illustrative value), the autonomic ceiling is 150 mW (illustrative value). At the pre-critical threshold, total available power is 175 mW (illustrative value) — only 25 mW (illustrative value) above the mission minimum, leaving zero margin for any autonomic function above L0. At this point, any healing action requiring data transmission to coordinate with a peer will fail its feasibility check. Pre-emptive shedding starting at the pre-critical threshold preserves the Healing Deadline guarantee of Proposition 21 for the Severity 1 actions that remain within budget. Waiting for the critical threshold means those Severity 1 actions compete with mission survival for the same depleted resource pool.
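The controlled-descent shedding order can be sketched as follows (a minimal sketch; component names and power figures are illustrative assumptions):

```python
# Shedding order from the adaptation paragraph: no-safety-invariant
# state first, Severity 1-2 healing mechanisms last (never listed here).
SHED_ORDER = [
    "severity4_logging_state",
    "antifragile_learning_updates",
    "gossip_table_inactive_peers",
    "noncritical_measurement_threads",
]

def shed_until_within_budget(power_draw_mw, budget_mw):
    """Walk the shedding order until total draw fits the budget.

    Mutates power_draw_mw (a dict of component -> mW) and returns the
    list of components shed, in order.
    """
    shed = []
    for component in SHED_ORDER:
        if sum(power_draw_mw.values()) <= budget_mw:
            break
        if component in power_draw_mw:
            shed.append(component)
            del power_draw_mw[component]
    return shed
```

Because the walk stops as soon as the budget is met, a mild descent sheds only logging state while a deep one progressively strips the autonomic stack down to its Severity 1 core.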

Stability under mode-switching. Proposition 22 establishes the stable gain ceiling for the MAPE-K loop. The ceiling is mode-dependent: lower capability levels increase the tick interval to conserve compute. As the system sheds load, the stable gain ceiling tightens — a gain \(K\) calibrated at L3 may be above the ceiling at L1. Simultaneously, degraded connectivity raises effective feedback delay \(\tau\) through the stochastic delay distribution of Definition 37 : in the Contested regime, the P99 delay can exceed twenty-five times (theoretical bound) the nominal value ( Proposition 23 ), collapsing the safe gain envelope to near zero for remote Severity 2+ actions.

| Signal | Nominal | Alert | Response |
| --- | --- | --- | --- |
| Gain-delay product | \(\leq 0.7\) (illustrative value) | \(\geq 0.85\) (illustrative value) | Reduce \(K\) by 30%; restrict to Severity \(\leq 1\) healing |
| Mode transition rate relative to dwell bound | within dwell condition | dwell condition violated | SMJLS dwell condition violated; halt capability transitions |
| Healing actions increasing net error after execution | 0 /hour | \(> 0\) /hour | Lyapunov condition failing; reduce \(K\) immediately |

Adaptation: When the gain-delay alert fires, reduce \(K\) by 30% (illustrative value) and restrict the healing planner to Severity \(\leq 1\) actions — at high delay variance, higher-severity interventions exceed the Proposition 23 robust gain bound and are more likely to overshoot than converge. If mode transitions are violating the dwell condition, extend the minimum dwell time by \(2\times\) (illustrative value) to restore the mean-square stability condition of the SMJLS proof before authorizing the next transition.

Physical translation: the gain-delay product is the dimensionless load on the Proposition 22 stability margin — how much of the safe gain envelope the current parameters are consuming. At 0.7, there is 30% margin remaining, enough buffer for delay spikes in the Degraded regime. At 0.85, a single P99 delay event in the Contested regime brings the effective gain product above 1.0 and the loop becomes transiently unstable. The mode-transition alert is the second line: the SMJLS stability proof assumes the system stays in one mode long enough for the error signal to decay before the next switch. Four transitions per dwell interval means each switch’s error compounds rather than decays. A system that is technically above the resource floor and within clock tolerance can have completely lost its healing convergence guarantee through the stability margin alone — these two signals detect that state before the failure manifests.

Composite early-warning. Define the structural signal count \(\Gamma(t)\) as the number of the three structural early-warning signals — clock drift, resource floor, stability margin — currently in their alert state.

\(\Gamma(t)\): 0 = nominal; 1 = one structural constraint approaching its validity boundary; 2 = two constraints converging simultaneously — halt anti-fragile learning updates, notify operator; 3 = all three approaching simultaneously — enter survival mode, freeze all policy updates, escalate immediately.

Physical translation: \(\Gamma = 2\) is not a failure alert — it is a precondition alert. The formal guarantees of Propositions 14, 9, and 21 each require their respective structural constraint to hold. At \(\Gamma = 2\), two are simultaneously outside their valid range. The system may still appear functional by observation, but the theoretical foundation that guarantees convergence is no longer intact on two dimensions at once. Anti-fragile learning in this state updates policies from data the framework’s assumptions no longer vouch for — halting updates at \(\Gamma = 2\) is not conservatism, it is maintaining the epistemic validity of the learning loop. For a RAVEN swarm reading \(\Gamma = 2\) after 30 minutes (illustrative value) of contested partition — drifting clocks and a tightening gain ceiling simultaneously — the two constraints are not independent: clock drift degrades the staleness estimates the healing planner uses to assess whether observations are fresh enough to act on, and acting on stale observations under a constrained gain bound compounds both violations.
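The composite count and its mandated responses reduce to a few lines (a minimal sketch; the response strings paraphrase the escalation ladder above):

```python
def gamma(clock_alert: bool, resource_alert: bool,
          stability_alert: bool):
    """Composite structural signal count and the mandated response."""
    g = sum([clock_alert, resource_alert, stability_alert])
    response = {
        0: "nominal",
        1: "monitor_approach_to_validity_boundary",
        2: "halt_antifragile_learning; notify_operator",
        3: "enter_survival_mode; freeze_policy_updates; escalate",
    }[g]
    return g, response
```

The count deliberately discards which constraints fired: the escalation policy depends only on how many formal guarantees are simultaneously outside their valid range.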

Three failure modes no signal can prevent. Even with all three monitors instrumented, certain degradations lie permanently outside the autonomic loop’s reach:

Partial flash write on power failure is detectable after the fact by a gap in the hash chain ( Definition 67 , Semantic Commit Order) without corresponding anomaly events — the chain diverges without cause. Recovery requires human-initiated state rebuild. No autonomic mechanism can reconstruct a pre-write state from a partial record: this is not a framework limitation, it is a consequence of information theory.

Simultaneous all-partition quorum deadlock occurs when every node in the fleet reaches the critical resource threshold simultaneously and no node has sufficient resources to initiate recovery coordination. Prevention requires a pre-provisioned lightweight coordinator node whose L0 footprint survives below that threshold. This is a provisioning-time architectural decision — \(\Gamma\) cannot trigger it at runtime because runtime resources have run out.

Framework assumption violations in novel deployment environments arise because all three signals assume the framework’s structural model is correct for the deployment context. In a genuinely novel environment — a hardware platform, RF signature, or adversarial capability not represented in any prior deployment — all three signals may read nominal while the system degrades along an unmeasured dimension. Proposition 57 (Stress-Information Duality) applies directly: the first field deployment in a novel environment carries maximum information precisely because the models are most wrong. Instrument exhaustively. Validate every formal bound against field data before trusting any guarantee. The anti-fragility coefficient measured across the first month of operational deployment is the empirical validity test for every proposition in this series.

All three are human-action problems. The \(\Gamma\) score gives the autonomic loop everything it can act on. The three failure modes mark where it stops — precisely where the judgment horizon ( Definition 91 ) begins.

Cognitive Map: The limits section closes the loop on the constraint sequence framework — and immediately turns its own limits into instruments. Four boundary conditions mark where the framework fails: cyclic prerequisites, adversarial graph evolution faster than development, undefinable validation criteria, and resource-infeasible sequencing. Engineering judgment fills those gaps; the anti-fragile insight converts each framework failure into an extension opportunity. The three structural signals then translate the framework’s mathematical assumptions directly into production-observable metrics: HLC deviation for the clock-drift constraint, composite \(R(t)\) for the resource floor, and gain-delay product for the stability envelope. The composite \(\Gamma(t)\) integrates all three into a single early-warning number — at \(\Gamma = 2\), two formal guarantees are simultaneously outside their validity range and anti-fragile learning must pause; at \(\Gamma = 3\), survival mode is the only valid response. The three unfixable failure modes mark the absolute boundary where \(\Gamma\) ends and human authority begins: the judgment horizon as a production engineering observable.
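The composite early-warning score described above can be sketched as a simple counting function. The threshold constants and signal names here are illustrative placeholders, not the series' formal definitions:

```python
# Hypothetical sketch of the composite structural-signal score Gamma(t):
# count how many of the three formal-guarantee validity checks currently fail.
# Thresholds (HLC_TOL, R_FLOOR, GD_MAX) are illustrative placeholders.

HLC_TOL = 50e-3   # max tolerated HLC deviation, seconds (illustrative)
R_FLOOR = 0.2     # minimum composite resource level R(t) (illustrative)
GD_MAX = 1.0      # stability envelope: gain * delay must stay below this

def gamma(hlc_deviation: float, resource_level: float,
          gain: float, delay: float) -> int:
    """Number of structural signals outside their validity range (0..3)."""
    violations = 0
    if hlc_deviation > HLC_TOL:      # clock-drift constraint violated
        violations += 1
    if resource_level < R_FLOOR:     # resource floor violated
        violations += 1
    if gain * delay > GD_MAX:        # gain-delay stability envelope violated
        violations += 1
    return violations

def policy(g: int) -> str:
    """Map Gamma to the response described in the text."""
    if g >= 3:
        return "survival-mode"
    if g == 2:
        return "pause-antifragile-learning"
    return "nominal"
```

The value of the counting form is that each increment names which formal guarantee has left its validity range, so the score is diagnosable, not just alarming.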


Composed Failure: Power + Partition + Clock Drift

The hardest failure scenario in this framework — simultaneous power degradation, network partition, and clock drift — requires coordinating mechanisms from across all preceding definitions. The following protocol integrates them into a single executable sequence.

Preconditions. A node enters composed failure when all three conditions hold simultaneously: battery \(B(t) < B_{\text{crit}}\) (below the critical threshold from Definition 122 ); partition active with \(C(t) = \text{DENIED}\) ( Definition 1 ); and clock drift exceeding the HLC tolerance ( Definition 61 ).

Protocol (execute in order):

The first step is the L0 gate: evaluate Proposition 37 (Weibull Circuit Breaker) and, if \(T_{\text{acc}} \geq P_{95}\) partition duration, activate the L0 physical interlock ( Definition 108 ) — distinct from the software-level terminal safety state (see Self-Healing Without Connectivity) — and halt. This step has absolute priority.

The second step is the MVS floor: apply Definition 121 (Minimum Viable System) to compute the reduced service set \(S_{\text{MVS}}\) and suspend all non-MVS healing actions.

The third step is clock isolation: freeze HLC advancement ( Definition 61 ), accept no causal updates, and buffer incoming state deltas ( Definition 109 ) with a timestamp of \(t_{\text{freeze}}\) until drift resolves.

The fourth step is resource triage: apply Definition 43 (Resource Priority Matrix) under the constraint \(B(t) < B_{\text{crit}}\), reduce the autonomic overhead budget ( Definition 122 ) to its minimum viable allocation, and suspend EXP3-IX arm evaluation ( Definition 81 ) if remaining CPU budget falls below \(C_{\text{min}}\).

The fifth step is the safe action default: while the second through fourth steps are active, all decisions default to the Safe Action Filter ( Definition 89 ) and no exploratory arms are evaluated.

The sixth step is the exit condition (all three must clear):

Clock synchronization guard: the exit condition is evaluated only after clock synchronization has been re-established and the step 3 freeze has resolved. A node still in step 3 (clock isolation) cannot evaluate or satisfy the step 6 exit criterion; the criterion is therefore gated on sync recovery and cannot be satisfied while HLC advancement remains frozen.

If any exit criterion cannot be evaluated within \(W_{\text{max}}\) due to resource exhaustion (step 4), the system remains in composed-failure mode and re-evaluates at next available tick. This guarantees no stuck state under the assumption that \(B(t)\) eventually recovers or the L0 physical interlock activates in step 1.

In practice, this means: when a node simultaneously runs low on power, loses the network, and loses clock synchronization, the framework does not attempt to resolve all three at once. It resolves them in a fixed priority order — hardware safety first, then software survival floor, then clock isolation, then resource triage — and applies a conservative safe-action default until all three exit conditions clear.

Cascade depth limit: to prevent the protocol from consuming the entire compute budget in a single MAPE-K tick, the maximum cascade depth within one evaluation window is \(d_{\max}^{\text{cascade}} = 3\) tiers. If a fourth tier requires evaluation, it is deferred to the next tick and flagged as a pending escalation. This circuit breaker ensures the autonomic overhead paradox ( Proposition 35 ) cannot manifest as a compute-exhaustion stuck state.
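The fixed priority order above can be summarized in a short sketch, under stated assumptions: the `Node` fields, thresholds, and returned action strings are hypothetical stand-ins for the formal constructs the steps cite (Definitions 121/122, 61, 43, 89).

```python
# Illustrative sketch of the composed-failure protocol's fixed priority order.
# All names and thresholds are hypothetical stand-ins for the cited formal
# constructs; this is a sketch of the ordering, not the formal protocol.
from dataclasses import dataclass, field

@dataclass
class Node:
    battery: float            # B(t)
    b_crit: float             # critical battery threshold
    partitioned: bool         # C(t) == DENIED
    clock_drift: float        # HLC deviation, seconds
    hlc_tol: float            # HLC drift tolerance
    t_acc: float = 0.0        # accumulated partition time
    p95_partition: float = 3600.0
    hlc_frozen: bool = False
    delta_buffer: list = field(default_factory=list)

def composed_failure_active(n: Node) -> bool:
    return (n.battery < n.b_crit and n.partitioned
            and n.clock_drift > n.hlc_tol)

def composed_failure_step(n: Node) -> str:
    """One MAPE-K tick of the protocol; returns the action taken."""
    if not composed_failure_active(n):
        return "nominal"
    # Step 1: L0 gate -- the hardware interlock has absolute priority.
    if n.t_acc >= n.p95_partition:
        return "L0-interlock-halt"
    # Step 2: collapse to the minimum viable service set.
    # Step 3: freeze HLC; buffer incoming deltas instead of applying them.
    n.hlc_frozen = True
    # Steps 4-5: resource triage, then the safe-action default.
    return "mvs+clock-freeze+triage+safe-default"
```

The point of the single-return structure is that the L0 gate is checked before any software-level action, mirroring the absolute priority of the first step.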


Closing: The Autonomic Edge

Six posts have built the formal foundations for autonomic edge architecture. The question is what they establish as a unified system — and whether the RAVEN swarm that emerged from this series is actually different from the one that would have been built without it. The series answers the six foundational questions in the order they must be answered: what the system becomes under partition, what it knows about itself when isolated, what it does with that knowledge, how isolated peers stay coherent, how it improves from disconnection, and in what order all of this must be built. The RAVEN swarm that answers all six is architecturally different from one that answers two. The depth of formal grounding comes at a cost: each capability requires more upfront investment than the expedient alternative, and a system with hard-coded healing rules and manual reconciliation can be built faster. The constraint sequence argument is that the expedient system will fail in the field in ways that are expensive and slow to diagnose, while the formally grounded system fails in ways that are detectable, bounded, and recoverable.

We return to where we began: the assertion that edge is not cloud minus bandwidth.

This series has developed what that difference means in practice:

Contested connectivity established the fundamental inversion: disconnection is the default; connectivity is the opportunity. The connectivity probability model \(C(t)\) quantifies this inversion. The capability hierarchy (L0-L4) shows how systems must degrade gracefully across connectivity states.

Self-measurement showed how to measure health without central observability. The observability constraint sequence (P0-P4) prioritizes what to measure first. Gossip-based health propagation maintains awareness across the fleet. Staleness bounds quantify confidence decay.

Self-healing showed how to heal without human escalation [3] . MAPE-K adapted for edge autonomy. Recovery ordering prevents cascade failures. Healing severity matches detection confidence.

Fleet coherence showed how to maintain coherence under partition. CRDTs and merge functions for state reconciliation. Hierarchical decision authority for autonomous decisions. Conflict resolution for irreconcilable differences.

Anti-fragility showed how to improve from stress rather than merely survive it. Anti-fragility metrics quantify improvement. Stress as information source. The judgment horizon separates automated from human decisions.

The constraint sequence integrates these capabilities into a buildable sequence. The prerequisite graph. Constraint migration. The meta-constraint of optimization overhead. The formal validation framework for systematic verification.

The Goal

The goal is not perfection. Perfection is unachievable in contested environments. The goal is anti-fragility: systems that improve from stress [16] .

An anti-fragile edge system detects when its models fail, learns from operational experience, improves its predictions with each stress event, knows when to defer to human judgment, and emerges from each challenge better calibrated for the next.

The Final Insight

The best edge systems are designed for the world as it is, not as we wish it were.

Connectivity is contested. Partition is normal. Autonomy is mandatory. Resources are constrained. Adversaries adapt.

These are not problems to be solved—they are constraints to be designed around. The edge architect who accepts these constraints, rather than wishing them away, builds systems that thrive in their environment.

The RAVEN swarm that loses connectivity doesn’t panic. It was designed for this. Each drone measures itself. Clusters coordinate locally. The swarm maintains mission capability at L2 while partitioned. When connectivity returns, state reconciles automatically. And through the stress of partition, the swarm learns—emerging better calibrated for the next disconnection.

This is autonomic edge architecture.


The constraint sequence and handover boundary framework developed in this series builds on four bodies of literature, adapting each to the contested edge context where disconnection is the default operating condition.

Autonomic computing and self-adaptive systems. The MAPE-K autonomic control loop — Monitor, Analyze, Plan, Execute over a shared Knowledge base — was formalized by Kephart and Chess [1] and elaborated in IBM’s architectural blueprint [2] . Huebscher and McCann [17] surveyed the degrees and models of autonomic behavior, while Salehie and Tahvildari [3] characterized the landscape of self-adaptive software and its research challenges. This series extends that work to the edge context, adding three elements absent from cloud-centric formulations: connectivity-state-dependent phase gates, constraint migration under adversarial conditions, and resource-ceiling bounds for autonomic overhead. The constraint sequence ( Definition 92 ) and prerequisite graph ( Definition 93 ) provide a formal sequencing substrate that the original autonomic computing vision left implicit.

Cyber-physical systems and safety-critical control. Lee [4] identified the design challenges that arise when computation must be embedded in and interact with physical processes — timing, reliability, and the impossibility of treating physical dynamics as independent of software correctness. The L0 Physical Safety Interlock ( Definition 108 ) and Control Barrier Function gain scheduler ( Definition 40 ) are direct instantiations of this requirement: hardware-enforced safety boundaries that the software stack cannot override, and mathematically certified stability margins [14] that bound the MAPE-K loop’s actuation authority. The Avizienis et al. dependability taxonomy [7] provides the fault-failure-error classification that underlies the healing severity ordering ( Definition 44 ) and the Minimum Viable System predicate ( Definition 50 ).

Edge computing orchestration and handover. Satyanarayanan [5] characterized the emergence of edge computing as a paradigm distinct from the cloud, and Shi et al. surveyed its vision and challenges; the ETSI MEC reference architecture [6] standardized the framework for multi-access edge computing. The constraint sequence framework adapts these orchestration models to contested environments where connectivity is not a service-level assumption but a Weibull-distributed stochastic variable. The predictive handover criterion ( Proposition 74 ), the causal barrier ( Definition 106 ), and the state-delta briefing protocol ( Definition 109 ) address the human-machine boundary problem that standard MEC orchestration leaves to deployment policy. The CAP theorem context [10] motivates the CRDT-based state reconciliation ( Definition 58 , Phase 3 gate) as the only approach compatible with partition-tolerant consistency.

Resilience engineering and chaos testing. Taleb’s anti-fragility concept [16] — that some systems gain from disorder rather than merely tolerating it — is operationalized here as a testable engineering property ( Definition 79 ) with a measurable coefficient . The Field Autonomic Certification checklist ( Definition 104 ) applies chaos engineering principles [11] — systematic fault injection, partition cycling, and adversarial stress — as the formal validation methodology for the phase gate framework. The Byzantine fault model [12] underpins the peer-validation layer ( Definition 64 ) and the logical quorum ( Definition 66 ) — extending classical Byzantine tolerance to edge systems where adversarial presence is a continuous variable rather than a binary condition.


Optimal Sequencing

The constraint sequence corresponds to a topological sort of the prerequisite graph. Valid sequences satisfy \(\pi(u) < \pi(v)\) for every prerequisite edge \((u, v)\) — prerequisites before dependents. Optimal sequences minimize the weighted position \(\sum_i w_i \, \pi(i)\), placing high-priority capabilities early.
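The construction can be sketched as a priority-aware topological sort. The capability names and weights below are illustrative, and the greedy high-priority-first choice is a heuristic: it approximates, but does not in general minimize, the weighted-position objective.

```python
# A minimal sketch: generate a valid constraint sequence as a topological
# order of the prerequisite graph, greedily placing high-priority (high-
# weight) capabilities as early as the prerequisites allow. The graph and
# weights below are illustrative, not the series' actual prerequisite graph.
import heapq

def priority_topo_sort(edges, weights):
    """edges: list of (prereq, dependent); weights: {node: priority}."""
    nodes = set(weights)
    indeg = {n: 0 for n in nodes}
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    # max-heap on weight: among ready nodes, high priority goes first
    ready = [(-weights[n], n) for n in nodes if indeg[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)
        order.append(n)
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                heapq.heappush(ready, (-weights[m], m))
    if len(order) != len(nodes):
        raise ValueError("cycle: no valid constraint sequence exists")
    return order

edges = [("measure", "heal"), ("heal", "cohere"), ("cohere", "learn")]
weights = {"measure": 4, "heal": 3, "cohere": 2, "learn": 1}
```

The cycle check at the end is exactly the framework's first boundary condition: a cyclic prerequisite graph admits no valid constraint sequence at all.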

Resource allocation at the optimum equalizes marginal values across functions:

\(\frac{\partial V}{\partial r_{\text{mission}}} = \frac{\partial V}{\partial r_{\text{measure}}} = \frac{\partial V}{\partial r_{\text{heal}}} = \frac{\partial V}{\partial r_{\text{cohere}}} = \lambda\)

This Lagrangian condition ensures no reallocation can improve total value. The optimal allocation is interior — neither pure mission (all resources to mission functions) nor pure autonomy (all resources to the autonomic stack). Both contribute positive marginal mission value: measurement enables better decisions; healing reduces capability loss. Because the optimum equalizes marginal returns, a practical online approximation is to reallocate toward whichever function shows the higher marginal improvement per unit resource.
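The online approximation can be sketched as a greedy reallocation loop. The value functions, weights, and step size below are illustrative placeholders, chosen concave so that a unique interior optimum exists (allocations settle proportional to squared weights).

```python
# Hedged sketch of the online approximation: repeatedly shift a small
# resource increment toward whichever function currently shows the higher
# marginal value. The concave value functions and weights are illustrative.
import math

def reallocate_step(alloc, marginal_value, step=0.01):
    """Move `step` resource from the lowest- to the highest-marginal function."""
    mv = {f: marginal_value(f, r) for f, r in alloc.items()}
    lo = min(mv, key=mv.get)
    hi = max(mv, key=mv.get)
    new = dict(alloc)
    moved = min(step, new[lo])   # never drive an allocation negative
    new[lo] -= moved
    new[hi] += moved
    return new

# Placeholder values V_f(r) = w_f * sqrt(r), so marginal value decreases
# with r and the equal-marginal optimum is interior: r_f proportional to w_f^2.
WEIGHTS = {"mission": 4.0, "measure": 2.0, "heal": 2.0, "cohere": 1.0}

def marginal(f, r):
    return WEIGHTS[f] / (2 * math.sqrt(r + 1e-9))

alloc = {f: 0.25 for f in WEIGHTS}
for _ in range(500):
    alloc = reallocate_step(alloc, marginal)
# alloc now hovers near the interior optimum (mission ~0.64, cohere ~0.04)
```

Note that the loop never collapses to pure mission: as any function's allocation shrinks, its marginal value rises, so the greedy step eventually feeds it again. That is the interior-optimum property in operational form.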

Cognitive Map: The closing section synthesizes the six-post series into six answerable questions and the mathematical dependencies between them. The six-question structure is not rhetorical — each question maps directly to one post’s formal contribution, and the order of the questions is the constraint sequence itself. The optimal sequencing result closes with the Lagrangian condition: at the resource optimum, marginal value is equalized across mission, measurement, healing, and coherence. This is the formal justification for maintaining all four functions rather than collapsing to pure mission: the optimum is always interior.


Series Synthesis

The six posts in this series collectively address six structural weaknesses in naive edge deployments. The table below maps each weakness to its formal solution and the post where the solution is introduced.

Weakness | Solution | Post
Stochastic partition duration | Weibull Partition Duration Model (Definition 13) + Weibull Circuit Breaker (Proposition 37) | P1, P3
Autonomic resource overhead | Proposition 68 (Autonomic Overhead Bound) | P6
Clock drift under partition | Hybrid Logical Clock (Definition 61) + Drift-Quarantine Re-sync (Definition 63) | P4
Byzantine health corruption | Peer-Validation Layer (Definition 64) + Logical Quorum (Definition 66) | P4
Multi-failure cascade dead-ends | Terminal Safety State (Definition 53) + Hardware Veto (Proposition 32) | P3
Handover boundary complexity | Constraint Sequence (Definition 92) + Phase Gate Function (Definition 103) | P6

SCALEFAST scenario: A cloud-to-edge migration project using this framework applies the constraint sequence in reverse: capabilities built for the cloud-native environment must be re-validated against the edge prerequisite graph. The Weibull circuit breaker ( Proposition 37 ) fires during SCALEFAST migration testing when the new edge nodes encounter partition durations that the cloud-native codebase was never designed for — the FAC checklist ( Definition 104 , items C8–C10) identifies these gaps before production deployment.


Series Conclusion

A six-post series has covered hundreds of formal definitions, propositions, and mechanisms. The risk is that the formal machinery obscures the answer to the engineer’s actual question: what do I build differently? Six questions, answerable without a network connection, distinguish an autonomic edge system from a cloud system that tolerates occasional disconnection. Most edge architectures answer zero or one of them. The series answers all six, in the order the mathematics requires. The depth of the formal foundations requires upfront investment that a less rigorous approach skips — the constraint sequence argument is that the skipped investment becomes field debt, expensive to diagnose and often irrecoverable.

At some point, every engineer who has deployed a distributed system into a contested or remote environment has gotten the call: the system is unreachable, the operator cannot intervene, and the system was never designed to operate without the operator. The fix is manual. The outage is measured in hours.

That call is a design problem, not an operations problem. The system failed not because of a bug but because the architecture assumed connectivity and had no answer for its absence — no operating mode, no healing logic, no coherence mechanism, no way to get better from the experience. When connectivity left, so did the system.

This series builds the answer, formally, in the sequence the mathematics requires.

Six Questions No One Is Asking

There are six questions an autonomic edge system must be able to answer without a network connection. Most edge architectures answer zero or one. This series answers all six, in order, because the order is not optional.

1. What does the system become when the link drops? Not “what does it do” — what is it? The capability level hierarchy (L0–L4) answers this. The connectivity state \(C(t)\) is a Markov process across four connectivity regimes; Denied is a legitimate steady state, not a failure code. Proposition 2 establishes the inversion threshold \(\tau^*\): below it, distributed autonomy strictly dominates cloud control on every operational metric. For contested and industrial deployments, \(C(t) < \tau^*\) is the routine condition. The design target is partition, not connection.

2. What does the system know about itself when isolated? A node cut off from central telemetry must self-measure or it is blind. Local anomaly detection runs at \(O(1)\) per observation — no uplink, no central service. Gossip protocols converge fleet health state in \(O(\ln n / \lambda)\) rounds across any partial mesh — roughly the same for 500 nodes as for 50. The staleness bound tells the system when observations are too old to act on. Byzantine-tolerant aggregation handles adversarial nodes without assuming honesty. A fleet of hundreds maintains accurate situational awareness indefinitely.
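The logarithmic convergence claim can be checked empirically with a toy push-gossip simulation. The single-push-per-round model, fleet sizes, and seed are arbitrary; the point is that rounds grow with \(\ln n\), not \(n\).

```python
# Quick empirical check of the O(ln n) gossip-convergence claim: simulate
# naive push gossip (each informed node pushes to one uniformly random peer
# per round) and count rounds until the whole fleet is informed.
import random

def gossip_rounds(n: int, rng: random.Random) -> int:
    informed = {0}           # node 0 starts with the update
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _ in list(informed):
            informed.add(rng.randrange(n))   # push to one random peer
    return rounds

rng = random.Random(7)
r50 = sum(gossip_rounds(50, rng) for _ in range(20)) / 20
r500 = sum(gossip_rounds(500, rng) for _ in range(20)) / 20
# Ten times the fleet costs only a handful of extra rounds, not 10x.
```

A 10x larger fleet adds only a few rounds, which is why the text can claim awareness at 500 nodes costs roughly what it costs at 50.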

3. What does the system do with what it knows? Detection without action is an alarm system. The MAPE-K autonomic loop closes the detect-decide-act cycle. Healing severity ordering ensures the smallest effective intervention is tried first. The minimum viable system defines the floor that recovery must defend. Proposition 22 proves the loop converges — it does not oscillate, it does not cascade, it stabilizes.

4. How do isolated peers stay coherent? Partition events are not consistency violations. They are information events. CRDTs — data structures with commutative, associative, idempotent merge semantics — mean that when partitioned clusters reconnect, states merge deterministically: no coordinator, no consensus round, no lost writes. Vector clocks distinguish causality from coincidence when no global clock exists. State divergence is bounded and measurable. The authority tier hierarchy escalates what local logic cannot resolve.

5. How does the system get better from being disconnected? A system that merely recovers returns to baseline. Anti-fragility — \(d^2P/d\sigma^2 > 0\) — is a testable engineering property: the performance-stress curve is convex. UCB bandit algorithms update operational parameters from each partition event; stress events calibrate the system’s model of its own environment. The judgment horizon bounds what is automated: decisions irreversible at fleet scale, legally consequential, or outside the training distribution route to human authority. That boundary is not timidity — it is what makes the automation deployable in environments where wrong decisions have consequences.
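The convexity criterion \(d^2P/d\sigma^2 > 0\) is directly testable on a measured performance-stress curve via second differences. The sample curves below are synthetic, purely to illustrate the check.

```python
# Sketch of anti-fragility as a testable property: estimate d^2P/dsigma^2
# by second divided differences over a measured performance-stress curve.
# The sample curves below are synthetic illustrations, not field data.

def is_antifragile(stress, perf) -> bool:
    """True if the performance-stress curve is convex at every interior point."""
    assert len(stress) == len(perf) >= 3
    for i in range(1, len(stress) - 1):
        h1 = stress[i] - stress[i - 1]
        h2 = stress[i + 1] - stress[i]
        # second divided difference; positive => locally convex
        d2 = 2 * ((perf[i + 1] - perf[i]) / h2
                  - (perf[i] - perf[i - 1]) / h1) / (h1 + h2)
        if d2 <= 0:
            return False
    return True

sigma = [0.0, 0.5, 1.0, 1.5, 2.0]
convex_perf = [1.0, 1.05, 1.2, 1.45, 1.8]   # gains accelerate with stress
fragile_perf = [1.0, 0.9, 0.7, 0.4, 0.0]    # losses accelerate with stress
```

On real telemetry the pointwise test would be replaced by a fitted curvature estimate with noise tolerance, but the pass/fail structure is the same: convexity is a property you measure, not a property you assert.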

6. In what order must this be built? The five answers above form a strict dependency chain that cannot be reordered. Self-measurement precedes self-healing — you cannot repair what you cannot observe. Self-healing precedes fleet coherence — unreliable nodes cannot sustain distributed consensus. Fleet coherence precedes anti-fragile learning — you cannot learn from partition events that corrupt your state. The prerequisite graph encodes this formally; the constraint sequence is any topological ordering of that graph. The constraint migration result adds that the binding constraint shifts with \(C(t)\) — what limits the system at \(C(t) = 0.8\) differs from what limits it at \(C(t) = 0.1\). Phase gates enforce formal validation at each transition. Skipping a layer is not a schedule decision. It is a correctness error.

What Changes in the Next Design

Three practices change when an engineer internalizes this framework:

Design the disconnected system first. Before the connected architecture, sketch what the system does when fully isolated. The isolated case, if not in the design from day one, cannot be retrofitted without rebuilding the foundation. The connected case is easier to add to a system designed for partition than the reverse.

Choose data structures by their merge semantics. Before selecting a store or cache, ask one question: when two partitioned instances of this data reconnect, what is the merge rule? If the answer is “we figure it out at reconciliation time,” there is no coherence design yet — only a hope. CRDTs with commutative, associative, idempotent merge functions make reconciliation an algebraic property of the data structure, not an operational emergency.
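A grow-only counter is the smallest example of a data structure chosen by its merge semantics. This sketch shows why an element-wise-max merge makes reconciliation order-independent:

```python
# A minimal CRDT example of the "merge rule" question: a grow-only counter
# (G-Counter) whose merge is an element-wise max over per-node counts.
# Commutativity, associativity, and idempotence make the merge outcome
# independent of reconciliation order.

def merge(a: dict, b: dict) -> dict:
    """Element-wise max over per-node counts: the G-Counter join."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def value(counter: dict) -> int:
    return sum(counter.values())

# Two partitioned replicas increment independently...
replica_a = {"node1": 3, "node2": 1}
replica_b = {"node1": 2, "node2": 4, "node3": 5}

# ...and reconciliation is deterministic regardless of merge order.
assert merge(replica_a, replica_b) == merge(replica_b, replica_a)  # commutative
assert merge(replica_a, merge(replica_a, replica_b)) == \
       merge(replica_a, replica_b)                                 # idempotent
```

The merge rule is decided when the data structure is chosen, which is exactly the design discipline the paragraph above argues for: reconciliation as an algebraic property, not an operational emergency.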

Define the judgment horizon before the automation boundary. Which decisions can the system make autonomously? Which must escalate regardless of capability? This is an architectural decision, not an operational policy. Systems that leave it undefined will draw the line under stress, in production, with no time to deliberate. Systems that define it explicitly are the ones that get deployed into consequential environments.
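One way to make the judgment horizon an explicit architectural artifact is a single routing predicate over the three escalation criteria named above. The field names and the threshold are illustrative, not the series' formal definitions:

```python
# Hedged sketch of an explicit judgment-horizon predicate: the three
# escalation criteria (fleet-scale irreversibility, legal consequence,
# out-of-distribution inputs) encoded as one routing function.
from dataclasses import dataclass

@dataclass
class Decision:
    fleet_scale_irreversible: bool   # cannot be undone across the fleet
    legally_consequential: bool      # carries legal/regulatory consequence
    ood_score: float                 # distance from the training distribution

OOD_LIMIT = 0.8  # illustrative out-of-distribution threshold

def route(d: Decision) -> str:
    """'human' if the decision lies beyond the judgment horizon, else 'auto'."""
    beyond_horizon = (d.fleet_scale_irreversible
                      or d.legally_consequential
                      or d.ood_score > OOD_LIMIT)
    return "human" if beyond_horizon else "auto"
```

Writing the predicate down forces the line to be drawn at design time, which is the whole argument of the paragraph: the boundary is reviewed in code review, not discovered under stress in production.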

The Swarm Was Never Waiting for the Network

Why Edge Is Not Cloud Minus Bandwidth opens with forty-seven RAVEN drones losing backhaul without warning. They do not wait. They do not retry. Each drone runs local anomaly detection. Sub-clusters propagate health via gossip. The MAPE-K loop executes recovery. CRDT merge handles reconciliation when connectivity returns. Bandit algorithms update from partition data. Decisions above the judgment horizon route to the operator. The capability level descends and ascends without human intervention.

Six parts later, there is a formal proof for every step of that sequence.

The swarm does not survive partition because it is fault-tolerant. Fault tolerance is reactive — it recovers from conditions it was not designed for. The swarm survives because partition was the design target. There is a difference between a system that handles disconnection and a system that was built for it. The first surprises you at 3am. The second does not surprise anyone, because it was never surprised itself.

The engineer who built the second system is asleep. The system is handling it.


References

[1] Kephart, J.O., Chess, D.M. (2003). “The Vision of Autonomic Computing.” IEEE Computer, 36(1), 41–50. [doi]

[2] IBM Research (2006). “An Architectural Blueprint for Autonomic Computing.” IBM White Paper, 4th Ed.

[3] Salehie, M., Tahvildari, L. (2009). “Self-Adaptive Software: Landscape and Research Challenges.” ACM Trans. Autonomous and Adaptive Systems, 4(2), Article 14. [doi]

[4] Lee, E.A. (2008). “Cyber Physical Systems: Design Challenges.” Proc. ISORC, 363–369. IEEE. [doi]

[5] Satyanarayanan, M. (2017). “The Emergence of Edge Computing.” IEEE Computer, 50(1), 30–39. [doi]

[6] ETSI GS MEC 003 V2.1.1 (2019). “Multi-access Edge Computing (MEC): Framework and Reference Architecture.” ETSI. [pdf]

[7] Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C. (2004). “Basic Concepts and Taxonomy of Dependable and Secure Computing.” IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. [doi]

[8] Bass, L., Clements, P., Kazman, R. (2012). Software Architecture in Practice, 3rd ed. Addison-Wesley.

[9] Shapiro, M., Preguiça, N., Baquero, C., Zawirski, M. (2011). “Conflict-Free Replicated Data Types.” Proc. SSS, LNCS 6976, 386–400. Springer. [doi]

[10] Brewer, E.A. (2000). “Towards Robust Distributed Systems.” Proc. PODC. ACM. [acm]

[11] Basiri, A., Behnam, N., de Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., Rosenthal, C. (2016). “Chaos Engineering.” IEEE Software, 33(3), 35–41. [doi]

[12] Lamport, L., Shostak, R., Pease, M. (1982). “The Byzantine Generals Problem.” ACM Trans. Programming Languages and Systems, 4(3), 382–401. [doi]

[13] Kulkarni, S.S., Demirbas, M., Madeppa, D., Avva, B., Leone, M. (2014). “Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases.” Proc. OPODIS, LNCS 8878, 17–32. [doi]

[14] Ames, A.D., Xu, X., Grizzle, J.W., Tabuada, P. (2017). “Control Barrier Function Based Quadratic Programs for Safety Critical Systems.” IEEE Transactions on Automatic Control, 62(8), 3861–3876. [doi]

[15] Miller, G.A. (1956). “The Magical Number Seven, Plus or Minus Two.” Psychological Review, 63(2), 81–97. [doi]

[16] Taleb, N.N. (2012). Antifragile: Things That Gain From Disorder. Random House.

[17] Huebscher, M.C., McCann, J.A. (2008). “A Survey of Autonomic Computing — Degrees, Models, and Applications.” ACM Computing Surveys, 40(3), Article 7. [doi]

