Self-Healing Without Connectivity
Prerequisites
Two prior results converge here, each answering half of the question this article completes.
From Why Edge Is Not Cloud Minus Bandwidth: the connectivity regimes define when the system must heal without human oversight. During Intermittent and Denied regimes, there is no operator to call. The capability hierarchy defines what healing must preserve — at minimum, the survival capability must be maintained through any failure sequence. An edge system that loses basic function during self-repair has failed at its primary design goal.
From Self-Measurement Without Central Observability: anomaly detection produces a confidence estimate \(c \in [0,1]\) for every observed deviation from nominal behavior. The optimal detection threshold \(\theta^*\) calibrates the trade-off between false positives (acting on noise) and missed detections (ignoring real failures). The observability constraint sequence established which health signals remain available as resources shrink.
The logical connection is direct. The self-measurement article answered: what is the system’s current state, and how confident are we? This article answers: what should the system do about it?
The confidence threshold that gates healing actions — act when \(c > \theta^*(a)\) — depends on the cost asymmetry between wrong action types. High-severity actions (restarting a fusion node, isolating a cluster) require high confidence before execution. Low-severity actions (increasing gossip rate, clearing a cache) can proceed at lower confidence because reverting them is cheap. This article derives those thresholds formally. It also establishes the stability conditions under which closed-loop healing converges rather than oscillates.
Overview
Self-healing enables autonomous systems to recover from failures without human intervention. Each concept integrates theory with design consequence:
| Concept | Formal Contribution | Design Consequence |
|---|---|---|
| MAPE-K Control | Stability bound \(K_{\text{ctrl}} < 1/(1+\tau/T_{\text{tick}})\) | Reduce controller gain when feedback delayed; LTI approximation — valid for the linear control path only. The SMJLS tightening factor of 0.82 (empirically calibrated; formal LMI verification of the switched-stability conditions pending; see Stability Under Mode Transitions) applies to switched-mode and nonlinear-CBF operation. |
| Healing Triggers | Confidence gate \(c > \theta^*(a)\) | Match threshold to action severity |
| Recovery Ordering | Topological sort on dependency DAG | Heal foundations before dependents |
| Cascade Prevention | Resource quota | Reserve capacity for mission function |
| MVS | Greedy \(O(\ln n)\)-approximation | Prioritize minimum viable components |
This extends autonomic computing [1] and control theory (Åström & Murray, 2008) for contested edge deployments [4].
Opening Narrative: RAVEN Drone Down
The RAVEN swarm of 47 drones is executing surveillance 15km from base, 40% coverage complete. Drone 23 broadcasts: battery critical (3.21V (illustrative value) vs 3.40V (illustrative value) threshold), 8 minutes (illustrative value) flight time, confidence 0.94 (illustrative value). The self-measurement system detected the anomaly correctly—lithium cell imbalance from high-current maneuvers.
Operations center unreachable. Connectivity \(C(t) < 0.1\) (illustrative value) for 23 minutes (illustrative value). The swarm cannot request guidance and has 8 minutes to decide and execute. Options considered:
Continuing mission would fly Drone 23 until exhaustion; a crash in contested terrain risks data compromise at 92% (illustrative value) mission completion. Returning to base sends Drone 23 home while neighbors expand sectors, reducing eastern coverage but reaching 97% (illustrative value) mission completion. Compressing the formation tightens all drones inward so Drone 23 flies a shorter path home, at the cost of reduced total coverage area and 89% (illustrative value) mission completion.
The MAPE-K loop must analyze these options, select a healing action, and coordinate execution—all without human intervention. Self-healing means repairing, reconfiguring, and adapting in response to failures without waiting for someone to tell you what to do.
The Autonomic Control Loop
Detecting a failure with 94% confidence does not tell you what to do about it. In a connected system you call an operator. In a denied environment with 8 minutes (illustrative value) until battery exhaustion, the system must select, plan, and execute a recovery action autonomously — without causing new failures in the process. That gap between sensing and centralized decision authority is edge computing’s defining constraint [4] .
The MAPE-K loop — Monitor, Analyze, Plan, Execute — provides the closed-loop structure. Each phase has a defined output type that feeds the next. The Knowledge Base gives every phase access to current system model, historical effectiveness, and policy constraints.
Closed-loop control corrects errors but requires feedback latency \(\tau\). As feedback slows (connectivity degrades), the stable gain ceiling falls. An aggressive healer in a slow-feedback environment oscillates between over-correction and under-correction. The entire design challenge is matching healing aggressiveness to feedback speed.
The MAPE-K Model
Definition 36 (Autonomic Control Loop). The concept of a closed-loop self-managing system was introduced by Kephart and Chess [1] and given architectural form in the IBM blueprint [2] . An autonomic control loop is a tuple \((M, A, P, E, K)\) where:
- \(M\) is the monitor function mapping observations to state estimates
- \(A\) is the analyzer mapping state estimates to diagnoses
- \(P\) is the planner selecting healing actions
- \(E\) is the executor applying actions and returning observations
- \(K\) is the knowledge base encoding system model and healing policies
Analogy: A thermostat with memory — it monitors temperature, analyzes the trend (not just the current reading), plans a schedule, executes it, and remembers what worked last time. The Knowledge Base is the memory that makes each next cycle smarter than the one before.
Logic: Each phase maps to a typed function: \(M \to A \to P \to E\) forms a closed feedback cycle where \(K\) provides shared context. Stability requires \(K_{\text{ctrl}} < 1/(1 + \tau/T_{\text{tick}})\) — a bound that tightens as feedback slows.
flowchart TD
M[Monitor
Collect sensor readings] --> A[Analyze
Compare to baseline and patterns]
A --> P[Plan
Policy table lookup]
P --> E[Execute
Apply action to actuators]
E --> M
A -->|no anomaly| M
P -->|no action needed| M
E --> K[(Knowledge Base
Update patterns and outcomes)]
K --> A
In other words, MAPE-K is a four-phase closed loop (Monitor, Analyze, Plan, Execute) drawing from and updating a central Knowledge base \(K\). Each phase has a defined input and output type: Monitor consumes raw sensor observations and produces state estimates; Analyzer consumes state estimates and produces diagnoses; Planner consumes diagnoses plus knowledge and produces healing actions; Executor consumes actions and produces new observations that feed back into Monitor. The Knowledge base \(K\) is not a sequential fifth phase — it is a shared store that feeds into all four phases simultaneously via dashed connections, as shown in the diagram below.
IBM’s autonomic computing initiative formalized this structure as MAPE-K : four phases (Monitor, Analyze, Plan, Execute) with shared Knowledge [2] . The broader research context for self-adaptive software — which MAPE-K instantiates — is surveyed in [3] .
The diagram below shows how the four phases form a closed feedback cycle, with the Knowledge base feeding into every stage rather than sitting at a single point.
graph TD
subgraph Control_Loop["MAPE-K Control Loop"]
M["Monitor
(sensors, metrics)"] --> A["Analyze
(diagnose state)"]
A --> P["Plan
(select healing)"]
P --> E["Execute
(apply action)"]
E -->|"Feedback"| M
end
K["Knowledge Base
(policies, models, history)"]
K -.-> M
K -.-> A
K -.-> P
K -.-> E
style K fill:#fff9c4,stroke:#f9a825
style M fill:#c8e6c9
style A fill:#bbdefb
style P fill:#e1bee7
style E fill:#ffab91
Read the diagram: The four phases (green Monitor \(\to\) blue Analyze \(\to\) purple Plan \(\to\) orange Execute) form a clockwise feedback loop. The yellow Knowledge Base feeds into all four phases via dashed lines — it provides the policies, historical models, and current state that each phase consults. The Execute\(\to\)Monitor feedback arrow is the critical path: without it, the loop cannot know whether its own healing actions worked.
Monitor observes via sensors and health metrics (self-measurement infrastructure). Analyze transforms raw metrics into diagnoses — “Battery 3.21V (illustrative value)” becomes “Drone 23 fails in 8 min (illustrative value), probability 0.94 (illustrative value).” Plan generates options and selects the best expected outcome. Execute applies remediation, coordinates with affected components, and verifies success.
Healing action durability contract: Execute-phase actions that modify shared replicated state are tagged with a causal identifier (HLC timestamp + vector clock epoch; see Fleet Coherence Under Partition). A healing action is provisional until confirmed by the next successful delta-sync or quorum check. Conflicting provisional actions from partitioned clusters are resolved by Semantic Commit Order (Fleet Coherence Under Partition); the physical execution of a losing action constitutes an anomaly event that triggers a fresh MAPE-K observation cycle.
Watch out for: the provisional/confirmed distinction assumes delta-sync or quorum-check events complete within the MAPE-K tick cycle; during a prolonged partition where no quorum is reachable, a healing action remains provisional indefinitely — it is never promoted to confirmed or retracted, and the anomaly re-trigger from a losing action’s physical execution cannot fire because no subsequent observation cycle has access to the ground truth needed to detect the conflict.
Knowledge holds distributed state — topology, policies, historical effectiveness, health estimates — and must be eventually consistent and partition-tolerant.
Implementation note: the Knowledge Base is a replicated state store supporting concurrent writes; its merge semantics are established in Fleet Coherence Under Partition — each monitored variable is a CRDT register. “Successful Knowledge Base synchronization” means all registers have received at least one gossip update from a quorum of reachable nodes within the gossip convergence bound of Proposition 14 ( Self-Measurement Without Central Observability).
The control loop executes continuously:
The cycle time—how fast the loop iterates—determines system responsiveness. A 10-second cycle means problems are detected and addressed within 10-30 seconds. A 1-second cycle enables faster response but consumes more resources.
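The continuously executing loop can be sketched as four typed phase functions sharing a knowledge store. This is a minimal illustration, not the article's reference implementation: the class and field names (`Knowledge`, `baseline_v`, the `reduce_power` policy) and the 0.92 voltage-ratio threshold are all assumptions.

```python
# Minimal MAPE-K tick sketch. Each phase is a function whose output type
# feeds the next; `Knowledge` is the shared store every phase reads and
# Execute updates. All names and thresholds are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Knowledge:
    baseline_v: float = 3.70                      # nominal battery voltage (illustrative)
    policies: dict = field(default_factory=lambda: {"battery_critical": "reduce_power"})
    history: list = field(default_factory=list)   # outcomes make the next cycle smarter

def monitor(sensors: dict) -> dict:
    """Raw observations -> state estimate."""
    return {"voltage": sensors["voltage"]}

def analyze(state: dict, k: Knowledge) -> Optional[dict]:
    """State estimate -> diagnosis, or None if nominal."""
    if state["voltage"] < 0.92 * k.baseline_v:    # illustrative anomaly threshold
        return {"fault": "battery_critical", "confidence": 0.94}
    return None

def plan(diagnosis: dict, k: Knowledge) -> str:
    """Diagnosis + policy lookup -> healing action."""
    return k.policies[diagnosis["fault"]]

def execute(action: str, k: Knowledge) -> dict:
    """Apply the action and record the outcome in the Knowledge store."""
    k.history.append(action)
    return {"applied": action}

def tick(sensors: dict, k: Knowledge) -> Optional[dict]:
    state = monitor(sensors)
    diagnosis = analyze(state, k)
    if diagnosis is None:
        return None                               # no anomaly: loop back to Monitor
    return execute(plan(diagnosis, k), k)
```

A nominal reading passes through Monitor and Analyze and returns to Monitor; an anomalous reading traverses all four phases and leaves a record in the Knowledge store.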
Compute Profile: CPU: per tick — Monitor reads sensor channels, Analyze compares state against diagnostic patterns, Plan performs a policy table lookup, Execute applies one action. Memory: — sliding observation window of depth plus the diagnostic pattern table. The critical path is Analyze-phase anomaly scoring, which scales with the number of concurrent sensor streams.
Closed-Loop vs Open-Loop Healing
Two control approaches apply to healing:
Proportional feedback observes the outcome, compares it to the target, and adjusts. It corrects errors but must wait out the feedback delay \(\tau\).
The closed-loop control action \(U_t\) is proportional to the error between the desired and observed state, scaled by controller gain \(K_{\text{ctrl}}\):

\[U_t = K_{\text{ctrl}}\,(x_{\text{target}} - x_t)\]
Physical translation: The healing action is proportional to the gap between where the system should be and where it currently is. A small gap produces a gentle nudge; a large gap produces a stronger intervention. The controller gain \(K_{\text{ctrl}}\) sets how aggressively the controller chases the target — too high and it overshoots, too low and it never arrives.
Where \(U_t\) is control action, \(K_{\text{ctrl}}\) is the controller gain, and the difference is the error signal.
Open-loop control uses a predetermined response without verification: execute the action based on input and assume it works.
The open-loop action is a fixed function of the current observation only, with no error correction:

\[U_t = f(x_t)\]
The action depends only on observed state, not on the outcome of previous actions.
The following table compares the two approaches across four engineering properties that matter most for edge healing.
| Property | Closed-Loop | Open-Loop |
|---|---|---|
| Robustness | High (adapts to errors) | Low (no correction) |
| Speed | Slow (wait for feedback) | Fast (act immediately) |
| Stability | Can oscillate if poorly tuned | Stable but may miss target |
| Information need | Requires outcome observation | Only requires input |
Edge healing uses a hybrid approach. When a critical failure is detected, the open-loop phase applies a predetermined emergency response immediately without waiting for feedback. After stabilization, the closed-loop phase observes outcomes and adjusts: escalating if the initial response was insufficient, scaling back if it was excessive.
Drone 23’s battery failure illustrates this hybrid: power consumption is reduced immediately by stopping non-essential sensors (open-loop), then voltage response is monitored, the flight profile adjusted, and the return trajectory decided based on observed endurance (closed-loop).
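Drone 23's hybrid response can be sketched as an immediate open-loop load shed followed by proportional closed-loop trimming toward a target draw. The 4 A shed, the gain of 0.3, and the current values are illustrative assumptions, not measured RAVEN parameters.

```python
# Hybrid healing sketch: act open-loop first, then correct closed-loop.
# All numbers are illustrative assumptions.

def open_loop_shed(draw_amps: float) -> float:
    """Predetermined emergency response: drop non-essential loads at once
    (e.g. HD camera + ML inference), without waiting for feedback."""
    return draw_amps - 4.0

def closed_loop_adjust(draw_amps: float, target_amps: float,
                       k_ctrl: float = 0.3) -> float:
    """Proportional correction U_t = K_ctrl * (target - observed)."""
    return draw_amps + k_ctrl * (target_amps - draw_amps)

def hybrid_heal(draw_amps: float, target_amps: float, steps: int = 5) -> float:
    draw = open_loop_shed(draw_amps)   # phase 1: open-loop stabilization
    for _ in range(steps):             # phase 2: observe outcome, trim gently
        draw = closed_loop_adjust(draw, target_amps)
    return draw
```

Starting from an elevated 12.3 A draw with an 8.0 A target, the open-loop phase gets most of the way immediately and the closed-loop phase converges the remainder over a few ticks.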
Healing Latency Budget
Just as the contested connectivity framework decomposes latency for mission operations, self-healing requires its own latency budget.
The total healing time is the sum of five sequential phase durations: detection, analysis, planning, coordination, and physical execution.

\[T_{\text{heal}} = T_{\text{detect}} + T_{\text{analyze}} + T_{\text{plan}} + T_{\text{coord}} + T_{\text{exec}}\]
Physical translation: Every minute between failure onset and completed healing is a minute the system operates in a degraded or dangerous state. This formula forces you to account for every phase — detection is often the surprise: gossip convergence alone takes 10–69 seconds (theoretical bound), and most systems budget 5 seconds (illustrative value) for it in their SLA calculations.
The formula decomposes total healing latency into explicit sub-budgets for detection, analysis, planning, coordination, and execution; the table below gives the per-phase budgets for RAVEN and CONVOY , with the total constrained by mission tempo. Detection is the surprise budget-breaker — it must be measured independently in isolation, because execution absorbs all visible slack while detection silently overruns its allocated window.
The table below breaks down realistic time budgets for each phase across the two primary scenarios, and identifies the bottleneck that sets the floor for each value.
| Phase | RAVEN Budget | CONVOY Budget | Limiting Factor |
|---|---|---|---|
| Detection | 5-10s | 10-30s | Gossip convergence |
| Analysis | 1-2s | 2-5s | Diagnostic complexity |
| Planning | 2-5s | 5-15s | Option evaluation |
| Coordination | 5-15s | 15-60s | Fleet size, connectivity |
| Execution | 10-60s | 30-300s | Physical action time |
| Total | 23-92s | 62-410s | Mission tempo |
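Taking the RAVEN column above as given, a worst-case budget check against the Proposition 21 deadline can be sketched as follows; the function names are illustrative.

```python
# Latency budget sketch using the worst-case end of each phase range.
# Ranges are the RAVEN values from the table (illustrative).

RAVEN_BUDGET_S = {                     # (min, max) seconds per phase
    "detection":    (5, 10),
    "analysis":     (1, 2),
    "planning":     (2, 5),
    "coordination": (5, 15),
    "execution":    (10, 60),
}

def total_range(budget: dict) -> tuple:
    """Best- and worst-case total healing time across all phases."""
    lo = sum(p[0] for p in budget.values())
    hi = sum(p[1] for p in budget.values())
    return lo, hi

def fits_deadline(budget: dict, t_crit_s: float, t_margin_s: float) -> bool:
    """Worst-case healing must beat T_crit - T_margin (Proposition 21)."""
    _, worst = total_range(budget)
    return worst <= t_crit_s - t_margin_s
```

For Drone 23 (480 s to criticality, 60 s margin) the 92 s worst case fits comfortably; halve the time-to-criticality a couple of times and the check fails, which is the escalation trigger.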
Healing Sequence Timeline:
Complete healing sequence for RAVEN Drone 23’s battery failure—each MAPE-K phase with timing, state transitions, and decision points:
sequenceDiagram
autonumber
participant D as Drone 23
participant M as Monitor
participant A as Analyzer
participant P as Planner
participant E as Executor
participant F as Fleet
Note over D,F: t=0: Anomaly Detected
rect rgb(200, 230, 201)
Note right of D: MONITOR PHASE (5-10s)
D->>M: Battery: 3.21V, dropping 0.02V/min
D->>M: Current draw: 12.3A (elevated)
M->>M: Compare to baseline (3.7V nominal)
M->>A: Anomaly score: 0.94
end
rect rgb(187, 222, 251)
Note right of A: ANALYZE PHASE (1-2s)
A->>A: Classify: power subsystem failure
A->>A: Project: 8 minutes to critical (3.0V)
A->>A: Impact: loss of drone, mission degradation
A->>P: Diagnosis: battery_critical, TTL=480s
end
rect rgb(225, 190, 231)
Note right of P: PLAN PHASE (2-5s)
P->>P: Option 1: RTB (safest, 6 min)
P->>P: Option 2: Nearest landing (3 min)
P->>P: Option 3: Power reduction (extend 4 min)
P->>P: Select: Option 3 -> Option 2
P->>E: Plan: reduce_power, then land_nearest
end
Note over D,F: t=15s: Coordination
rect rgb(255, 243, 224)
Note right of E: EXECUTE PHASE (10-60s)
E->>D: Disable: HD camera, ML inference
D-->>E: Power reduced to 8.1A
E->>F: Broadcast: Drone 23 emergency landing
F-->>E: Ack: Coverage reassigned to Drones 21, 25
E->>D: Navigate to landing zone Delta
D-->>E: ETA: 2 min 40s
end
Note over D,F: t=45s: Healing Complete
rect rgb(232, 245, 233)
Note right of M: VERIFY PHASE
M->>M: Battery stable at 3.18V
M->>M: Landing confirmed at t=180s
M->>A: Healing outcome: SUCCESS
end
Read the diagram: Time flows top to bottom. Each colored box is a MAPE-K phase with its real-world timing. The Monitor phase (green, 5–10 s (illustrative value)) ingests battery voltage and current and emits the anomaly score. Analyze (blue, 1–2 s (illustrative value)) classifies and projects time-to-failure. Plan (purple, 2–5 s (illustrative value)) evaluates three options and selects a staged response. Execute (orange, 10–60 s (illustrative value)) sheds load, notifies the fleet, and navigates to a landing zone. Verify (light green) confirms the outcome and closes the loop. Total elapsed: 45 seconds (illustrative value) of autonomous decision-making with no operator.
State Transition During Healing:
The diagram below traces Drone 23’s operational state from mission start through healing to either a safe landing or asset loss; read it left-to-right, where each arrow is a triggering event and each note box gives the operational detail for that state.
stateDiagram-v2
direction LR
[*] --> Nominal: Mission Start
Nominal --> Degraded: Anomaly Detected
note right of Degraded
Power subsystem failing
8 min to critical
end note
Degraded --> Stabilizing: Healing Initiated
note right of Stabilizing
Non-essential load shed
Planning emergency landing
end note
Stabilizing --> Recovering: Plan Executing
note right of Recovering
Navigating to landing zone
Fleet coverage reassigned
end note
Recovering --> Safe: Landing Complete
note right of Safe
Drone preserved
Awaiting retrieval
end note
Recovering --> Failed: Healing Failed
note right of Failed
Battery exhausted
Uncontrolled descent
end note
Safe --> [*]: Mission Continue (without D23)
Failed --> [*]: Asset Lost
Read the diagram: States flow left to right. Nominal \(\to\) Degraded on anomaly detection. The Stabilizing state represents the critical window where load-shedding has begun but landing is not yet committed. Recovering is the active landing approach — fleet coverage has already been reassigned. Safe and Failed are the two terminal outcomes. This state machine runs inside each drone; the edges are triggered by sensor thresholds, not operator commands.
Healing Action Selection: Formal Optimization
The planner selects the optimal action \(a^*\) from the action space \(\mathcal{A}\) by maximizing expected utility given current system state \(\Sigma\) and failure severity \(\delta_\text{sev}\):

\[a^* = \arg\max_{a \in \mathcal{A}}\; U(a \mid \Sigma, \delta_{\text{sev}})\]
(disambiguation: \(\delta_\text{sev}\) = healing action severity scalar; \(\delta_\text{stale}\) = staleness decay exponent (constant); \(\phi_\text{stale}\) = staleness decay function derived from \(\delta_\text{stale}\) — the time-varying form used in the gain formula; \(\delta_\text{inst}\) = per-cycle instability probability ( Proposition 23 ); \(\delta_\theta\) = threshold hysteresis band. Bare \(\delta\) is not used in this article to avoid ambiguity.)
The utility \(U\) decomposes into three terms — the value of recovery weighted by confidence \(c\), the resource cost of the action, and the disruption cost weighted by the probability the diagnosis is wrong:

\[U(a) = c \cdot V_{\text{recover}}(a) \;-\; C_{\text{resource}}(a) \;-\; (1-c)\, C_{\text{disrupt}}(a)\]
Physical translation: The utility of a healing action has three parts. The first term is the expected benefit — recovery value scaled by how confident you are the diagnosis is right. The second term is the resource cost of executing the action regardless of outcome. The third is the disruption cost if the diagnosis was wrong (probability \(1-c\)) — you spent resources and caused disruption for nothing. High-confidence diagnoses make the disruption term small; low confidence makes it dominate.
with confidence \(c\) from the diagnosis. The action must also satisfy three hard constraints — healing must finish before the failure becomes critical (\(g_1\)), the action must fit within available resources (\(g_2\)), and the action’s severity must not exceed the delegated authority of the local node (\(g_3\)):

\[g_1:\ T_{\text{heal}}(a) \le T_{\text{crit}} - T_{\text{margin}}, \qquad g_2:\ R_{\text{req}}(a) \le R_{\text{avail}}, \qquad g_3:\ \delta_{\text{sev}}(a) \le \delta_{\text{auth}}\]
The state transition model captures what happens after action \(a\) is executed: the system moves to healthy with probability proportional to both success rate and diagnosis confidence, remains degraded if the action fails, or stays unchanged if the diagnosis was wrong (probability \(1-c\)):

\[P(\Sigma' \mid \Sigma, a) = \begin{cases} c\, p_{\text{succ}}(a) & \Sigma' = \text{healthy} \\ c\,\bigl(1 - p_{\text{succ}}(a)\bigr) & \Sigma' = \text{degraded} \\ 1 - c & \Sigma' = \Sigma \end{cases}\]
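The constrained argmax can be sketched directly. The candidate-action fields (`v_recover`, `r_req`, `severity`, and so on) are hypothetical names standing in for the symbols in the utility and constraint definitions above.

```python
# Planner sketch: maximize three-term utility subject to g1-g3.
# All field names and candidate actions are illustrative assumptions.
from typing import Optional

def utility(a: dict, c: float) -> float:
    """c * V_recover - C_resource - (1 - c) * C_disrupt."""
    return c * a["v_recover"] - a["c_resource"] - (1 - c) * a["c_disrupt"]

def feasible(a: dict, t_window_s: float, r_avail: float, sev_auth: float) -> bool:
    return (a["t_heal_s"] <= t_window_s      # g1: beat the healing deadline
            and a["r_req"] <= r_avail        # g2: fit available resources
            and a["severity"] <= sev_auth)   # g3: within delegated authority

def select_action(actions: list, c: float, t_window_s: float,
                  r_avail: float, sev_auth: float) -> Optional[dict]:
    """Return the feasible action with maximum utility, or None."""
    candidates = [a for a in actions
                  if feasible(a, t_window_s, r_avail, sev_auth)]
    return max(candidates, key=lambda a: utility(a, c)) if candidates else None
```

Note how high confidence \(c\) shrinks the disruption term, so a cheap low-severity action can outscore a safer but costlier one; and how a tight authority bound (`sev_auth`) can leave no feasible action at all, which is the planner's signal to escalate.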
The decision tree below encodes the planner’s logic for Drone 23: starting from the battery-critical anomaly, each diamond is a yes/no check that gates which action path is taken, and the green Monitor node at the bottom marks the verification step that closes the loop.
flowchart TD
START["Anomaly: Battery Critical
TTL = 8 minutes"]
START --> Q1{"Time to safe
landing < TTL?"}
Q1 -->|"Yes (6 min < 8 min)"| Q2{"Mission impact
of landing?"}
Q1 -->|"No"| EMERGENCY["EMERGENCY
Immediate autorotation"]
Q2 -->|"Low"| LAND["Plan: Land at nearest
safe zone"]
Q2 -->|"High"| Q3{"Can extend TTL?"}
Q3 -->|"Yes"| EXTEND["Plan: Reduce power
+ delayed landing"]
Q3 -->|"No"| LAND
LAND --> EXEC1["Execute: Navigate
+ Notify fleet"]
EXTEND --> Q4{"Extended TTL
sufficient?"}
Q4 -->|"Yes"| EXEC2["Execute: Power reduction
+ Navigate"]
Q4 -->|"No"| LAND
EXEC1 --> MONITOR["Monitor: Verify
healing success"]
EXEC2 --> MONITOR
EMERGENCY --> MONITOR
style START fill:#ffcdd2,stroke:#c62828
style EMERGENCY fill:#ffcdd2,stroke:#c62828
style MONITOR fill:#c8e6c9,stroke:#388e3c
style LAND fill:#fff9c4,stroke:#f9a825
style EXTEND fill:#e3f2fd,stroke:#1976d2
Read the diagram: The decision tree starts at the critical battery anomaly (red). Each diamond is a yes/no gate. The leftmost path — “time to landing < TTL?” \(\to\) “Yes” \(\to\) “mission impact low?” \(\to\) “Land” — is the safest, fastest resolution. The rightmost path tries to extend flight time with power reduction before falling back to landing. The EMERGENCY node (red) fires when even the fastest landing cannot beat battery exhaustion. Every path ends at Monitor (green) — the loop always verifies outcome.
Proposition 21 (Healing Deadline). For a failure with time-to-criticality \(T_{\text{crit}}\), healing must complete within margin:

\[T_{\text{heal}} \le T_{\text{crit}} - T_{\text{margin}}\]
If healing takes longer than the failure window minus a safety buffer, the system cannot recover in time.
Physical translation: You must complete healing before the failure becomes irreversible, with a safety buffer remaining. If Drone 23 has 8 minutes (illustrative value) (\(T_{\text{crit}}\)) and landing requires 2 minutes (illustrative value) (\(T_{\text{margin}}\)), the healing sequence must finish within 6 minutes (illustrative value). If no available action fits in that window, escalate to a faster but more disruptive response.
where \(T_{\text{margin}}\) accounts for execution variance and verification time. If this inequality cannot be satisfied, the healing action must be escalated to a faster (but possibly more costly) intervention.
In other words, healing must finish early enough to leave a safety buffer before the failure becomes irreversible; if no action fits within that window, the system must escalate to a more disruptive but faster intervention.
For Drone 23, with 8 minutes to battery exhaustion and a 60-second (illustrative value) landing margin, the healing window comfortably exceeds the ~45-second (illustrative value) healing sequence.
Empirical status: The 8-minute time-to-criticality and 60-second margin are RAVEN -specific values derived from Li-Ion discharge curves and landing kinematics; actual margins depend on battery chemistry, payload, and terrain, and should be measured per platform.
When the healing deadline cannot be met, the system must execute partial healing to stabilize without full recovery, skip to emergency protocols that bypass normal MAPE-K , or accept a degraded capability state.
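That escalation logic, from least to most disruptive, can be sketched as a ladder over a fixed action list; the action names and their worst-case healing times are illustrative assumptions.

```python
# Escalation ladder sketch: try actions from least to most disruptive,
# take the first whose worst-case healing time fits the window.
# Actions and timings are illustrative assumptions.

ACTIONS = [  # (name, worst-case healing time in seconds, disruption level)
    ("reduce_power_and_land", 360, "low"),
    ("land_nearest",          180, "medium"),
    ("immediate_autorotation",  20, "high"),
]

def choose(t_crit_s: float, t_margin_s: float):
    """First action that satisfies T_heal <= T_crit - T_margin."""
    window = t_crit_s - t_margin_s
    for name, t_heal_s, disruption in ACTIONS:
        if t_heal_s <= window:
            return name, disruption
    # Nothing fits: accept a degraded capability state as the last resort.
    return "accept_degraded_state", "severe"
```

With a generous window the gentle option wins; as the window shrinks, the ladder skips straight to the disruptive emergency action, and past that to accepting degradation.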
Watch out for: the bound assumes \(T_{\text{crit}}\) is known precisely at the moment healing begins; in practice \(T_{\text{crit}}\) is estimated from state observed at detection time \(t_{\text{detect}}\), not at fault onset, so the effective healing window is \(T_{\text{crit}} - T_{\text{detect}} - T_{\text{margin}}\) — the detection latency \(T_{\text{detect}}\) silently consumes part of the margin, and for RAVEN ’s gossip-based detection, this can reduce the available window by over 8% before any healing action has fired.
Proposition 22 (Closed-Loop Healing Stability). The MAPE-K loop is a discrete-time system executing on a fixed timer with period \(T_{\text{tick}}\). Modeling the proportional controller with controller gain \(K_{\text{ctrl}}\) acting on a \(d\)-sample-delayed error state:

\[x[t+1] = x[t] - K_{\text{ctrl}}\, x[t-d], \qquad d = \tau / T_{\text{tick}}\]
A RAVEN healing loop that reacts too aggressively under radio delay will oscillate, triggering actions that undo each other rather than converging to a stable state.
The closed-loop system is stable if the controller gain satisfies:

\[K_{\text{ctrl}} < \frac{1}{1 + d} = \frac{1}{1 + \tau/T_{\text{tick}}}\]
Scope — LTI control path only: The gain bound below is formally valid for the Linear-Time-Invariant control path. The nonlinear dCBF controller path uses an empirically calibrated gain of \(0.82 \times K_\text{LTI}\) pending formal LMI verification of the switched-stability conditions (a formal piecewise-linear LMI analysis (pending); see Stability Under Mode Transitions, below). Do not treat the nonlinear path as certified for autonomous deployment without that offline verification step.
Physical translation: The safe gain ceiling shrinks as feedback slows. At zero delay, controller gain can approach 1 (full-authority correction). At a 10-second feedback delay with a 5-second tick, the gain ceiling falls to \(1/(1+2) = 0.33\). At a 100-second delay the ceiling is 0.048 — the controller must be extremely gentle, accepting slow convergence to avoid oscillation.
For \(d = 0\) this reduces to \(K_{\text{ctrl}} < 1\). For \(d \geq 1\) the stable gain decreases proportionally with the delay-to-sample ratio.
Proof sketch (discrete-time Lyapunov): Let \(V(x) = x^2\). The one-step difference under \(d\)-step delay is:

\[\Delta V = x[t+1]^2 - x[t]^2 = \bigl(x[t] - K_{\text{ctrl}}\, x[t-d]\bigr)^2 - x[t]^2\]
For the worst-case alignment \(x[t] = x[t-d]\) (current and delayed states coincide, representing maximum regenerative feedback):

\[\Delta V = \bigl((1 - K_{\text{ctrl}})^2 - 1\bigr)\, x[t]^2 = K_{\text{ctrl}}\,(K_{\text{ctrl}} - 2)\, x[t]^2\]
Requiring \(\Delta V < 0\) at worst case yields the necessary condition \(0 < K_{\text{ctrl}} < 2\).
For the sufficient condition \(K_{\text{ctrl}} < 1/(1+d)\), the closed-loop characteristic polynomial is \(z^{d+1} - z^{d} + K_{\text{ctrl}} = 0\). For \(d=1\), Jury’s criterion applied directly yields the necessary and sufficient condition \(K_{\text{ctrl}} < 1 = 1/(1+1)\).
For \(d \geq 2\), the sufficient condition is derived by bounding the cumulative influence of the \(d\)-step delay chain. Each additional delay sample tightens the stable-gain envelope by one additive \(K_{\text{ctrl}}\) term, so the \((d+1)\)-term influence sum gives the sufficient bound \(K_{\text{ctrl}} < 1/(1+d)\).
This bound is conservative relative to the exact Schur–Cohn stability boundary — for \(d=2\) the exact boundary lies above the sufficient bound \(K_{\text{ctrl}} < 1/3\). The conservative margin is appropriate for a gain scheduler operating under stochastic delay: using the exact boundary would risk instability on delay-distribution tails. Expressed in continuous time via \(d = \tau/T_{\text{tick}}\):

\[K_{\text{ctrl}} < \frac{1}{1 + \tau/T_{\text{tick}}}\]
\(\square\)
Watch out for: the bound assumes fixed delay \(\tau\) under linear time-invariant dynamics — if the CBF Gain Scheduler ( Definition 40 ) updates \(K_{\text{ctrl}}\) at a regime transition while the feedback delay simultaneously shifts from \(\tau_{\text{degraded}}\) to \(\tau_{\text{denied}}\), the gain can violate the sufficient condition for the new regime before the next tick applies the corrected value, producing transient oscillation precisely during mode switches when healing reliability matters most.
Scope: This result holds under linear time-invariant (LTI) loop dynamics. For time-varying gain schedules (as in Definition 40 , CBF Gain Scheduler), the switched-Markov jump linear systems (SMJLS) analysis in the proof tightens the bound to account for regime transitions.
What this means in practice: Keep your controller gain \(K_{\text{ctrl}}\) below this bound, or the healing loop will oscillate — triggering recovery actions that then need their own recovery. For most edge hardware with 100ms feedback latency and 1s tick rate, this means \(K_{\text{ctrl}} < 0.91\). For a time-varying scheduler that increases gain during Connected regimes, add a 15% safety margin.
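A minimal sketch of the gain ceiling and the 15% scheduler safety margin, assuming the LTI bound above; the function names are illustrative.

```python
# Gain ceiling sketch for the LTI bound K_ctrl < 1/(1 + tau/T_tick)
# (Proposition 22), plus the 15% time-varying-scheduler safety margin.

def gain_ceiling(tau_s: float, t_tick_s: float) -> float:
    """Maximum stable proportional gain for feedback delay tau."""
    return 1.0 / (1.0 + tau_s / t_tick_s)

def scheduled_gain(tau_s: float, t_tick_s: float, margin: float = 0.15) -> float:
    """Operating gain for a time-varying scheduler: ceiling minus margin."""
    return (1.0 - margin) * gain_ceiling(tau_s, t_tick_s)
```

This reproduces the worked numbers in the text: 100 ms delay with a 1 s tick gives a ceiling of about 0.91; a 100 s delay with a 5 s tick gives about 0.048.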
P99 design rule: Size for P99 feedback delay, not mean delay. At P99 latency (typically \(3\text{–}5\times\) the mean for wireless links), the stability bound tightens significantly. A gain that’s stable at mean latency may oscillate at P99.
Quantile-aware tightening (tail-instability correction). The bound above uses a fixed delay \(\tau\) — typically the mean or a measured round-trip time. For Weibull-distributed partition durations ( Definition 13 ), the mean underestimates the tail: when \(k_N < 1\) (the common case for denied-connectivity episodes), the P99 duration is roughly \(\lambda\, (\ln 100)^{1/k_N}\), several times the mean. Calibrating the gain formula against \(\mathrm{E}[\tau]\) produces a loop that is stable for typical partitions but permits oscillation in the worst 1% of events — precisely when healing matters most.
The quantile-aware bound substitutes the \(\alpha\)-quantile of the Weibull partition duration for \(\tau\):

\[\tau_{P\alpha, i} = \lambda_i \left(\ln \frac{1}{1-\alpha}\right)^{1/k_N}\]
where \(\lambda_i\) is the Weibull scale parameter for node \(i\), \(k_N\) is the adaptive shape parameter ( Definition 14 ), and \(\alpha = 0.99\) is the recommended operating point. The stability condition becomes:

\[K_{\text{ctrl}} < \frac{1}{1 + \tau_{P99}/T_{\text{tick}}}\]
RAVEN calibration (illustrative values): \(k_N \approx 0.6\); the Weibull scale \(\lambda_i\) and tick period are measured per deployment, and with \(k_N < 1\) the resulting P99 quantile is several times the mean partition duration.
This gives a quantile-aware ceiling well below the \(K_{\text{ctrl}} < 0.028\) (illustrative value) implied by the \(\mathrm{E}[\tau]\)-based bound. For Intermittent and Denied regimes, always use \(\tau_{P99}\) from Proposition 23 rather than the mean delay. (See Definitions 13–14 in Why Edge Is Not Cloud Minus Bandwidth for calibration from partition logs.)
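The quantile-aware calibration can be sketched using the standard Weibull quantile \(\tau_{P\alpha} = \lambda\,(\ln \frac{1}{1-\alpha})^{1/k}\); the parameters used in the assertions below are illustrative, not RAVEN's measured values.

```python
# Quantile-aware ceiling sketch: substitute the Weibull P99 partition
# duration for the mean delay. Parameters are illustrative assumptions.
import math

def weibull_quantile(lam: float, k: float, alpha: float = 0.99) -> float:
    """alpha-quantile of Weibull(scale=lam, shape=k)."""
    return lam * (-math.log(1.0 - alpha)) ** (1.0 / k)

def quantile_aware_ceiling(lam: float, k: float, t_tick_s: float,
                           alpha: float = 0.99) -> float:
    """Stability ceiling K_ctrl < 1/(1 + tau_Palpha / T_tick)."""
    tau_q = weibull_quantile(lam, k, alpha)
    return 1.0 / (1.0 + tau_q / t_tick_s)
```

For \(k = 0.6\) the P99 quantile is roughly \(12.7\lambda\), far above the mean, so the P99-calibrated ceiling is substantially tighter than a mean-calibrated one, which is the point of the rule.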
Warning: Using mean feedback delay instead of the P99 quantile produces a gain that is stable for typical partitions but oscillates in the worst 1% of events — precisely when the healing loop is needed most.
The proposition computes the maximum safe controller gain for a given feedback delay \(\tau\), establishing the ceiling above which fault/heal flapping emerges from overcorrection; the ceiling falls in proportion to the delay-to-tick ratio. The formula gives the stability ceiling, not a recommended operating point.
Empirical status: The sufficient stability bound \(K_{\text{ctrl}} < 1/(1+d)\) is analytically derived for a linear time-invariant model; real MAPE-K loops exhibit nonlinear actuator saturation and time-varying delays, so the actual stability boundary may differ and should be validated by discrete-event simulation at the deployment’s P99 feedback delay.
Concurrent loops: For \(n\) simultaneous healing loops sharing one CPU at \(u\%\) utilization, two quantities grow: the effective feedback delay, inflated by contention, and the effective aggregate gain \(\sum_i K_{\text{ctrl},i}\), because concurrent corrections superpose. The stability condition on the aggregate is:

\[\sum_{i=1}^{n} K_{\text{ctrl},i} < \frac{1}{1 + \tau_{\text{eff}}/T_{\text{tick}}}\]
Verify concurrent-failure stability through discrete-event simulation before deploying multi-target healing (e.g., simultaneous motor compensation + sensor fallback + communication rerouting on RAVEN Drone 23).
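One way to sketch the concurrent-loop check before simulation: the delay-inflation model \(\tau/(1-u)\) used here is an assumption standing in for the deployment's measured contention curve, not the article's formula.

```python
# Concurrent-loop stability sketch: aggregate gain of all active healing
# loops must stay under the ceiling at the contention-inflated delay.
# The 1/(1 - u) inflation model is an illustrative assumption.

def effective_delay(tau_s: float, cpu_util: float) -> float:
    """Queueing-style delay inflation at CPU utilization u in [0, 1)."""
    return tau_s / (1.0 - cpu_util)

def concurrent_stable(gains: list, tau_s: float, t_tick_s: float,
                      cpu_util: float) -> bool:
    """Check sum of gains against the ceiling at the inflated delay."""
    d_eff = effective_delay(tau_s, cpu_util) / t_tick_s
    return sum(gains) < 1.0 / (1.0 + d_eff)
```

Three gentle loops can be jointly stable where two moderately aggressive ones are not; this is a pre-filter, not a substitute for the discrete-event simulation the text calls for.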
In other words, the slower the feedback (larger \(\tau\)), the more gently the controller must react (smaller \(K_{\text{ctrl}}\)); aggressive corrections in a slow-feedback environment cause the system to oscillate rather than converge.
Corollary 9.1. Increased feedback delay (larger \(\tau\)) requires more conservative controller gains, trading response speed for stability.
Watch out for: the stability trade-off comes at the cost of healing speed — in Denied regime (\(\tau \to \infty\)), the gain ceiling approaches zero, meaning each MAPE-K tick corrects only a negligible fraction of the error state, and recovery may not complete within the healing deadline from Proposition 21 before the failure becomes irreversible.
Staleness correction: When the Knowledge Base has not been synchronized for elapsed time \(T_{\text{acc}}\), the error signal \(x[t-d]\) may reflect state that has since evolved. The staleness-adjusted gain is
\(K_{\text{eff}} = K_{\text{ctrl}} \cdot \phi_{\text{stale}}(T_{\text{acc}})\),
where \(\phi_{\text{stale}}\) is the staleness decay function. Here \(\delta_{\text{stale}}\) is the decay exponent from the disambiguation table in the Notation section, and \(T_{\text{acc}}\) is the maximum elapsed time since last Knowledge Base sync (formally defined in Definition 45 below).
Staleness Decay Function (formally defined later in this article): maps partition duration \(T_\text{acc}\) to a scalar \(\in [0,1]\) representing the remaining confidence in local observations. A value of 1.0 means fully fresh; 0.0 means the observation is too stale to trust for autonomous decision-making.
The adjustment reduces effective gain in proportion to the staleness decay function \(\phi_{\text{stale}}\) ( Definition 45 ). Since \(\phi_{\text{stale}}(T_{\text{acc}}) \in [0,1]\), any \(K_{\text{ctrl}}\) satisfying Proposition 22 ’s stability condition (LTI bound; SMJLS analysis tightens this under time-varying gain — see proof) continues to satisfy it with \(K_{\text{eff}}\) substituted — staleness correction provides additional stability margin when acting on uncertain state, at the cost of reduced healing responsiveness.
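A minimal sketch of the correction, assuming an exponential form \(e^{-\delta_{\text{stale}} T}\) for the decay curve (the exact shape is given by Definition 45 later in the article):

```python
import math

def phi_stale(t_acc: float, delta_stale: float) -> float:
    """Staleness decay: 1.0 = fully fresh; approaches 0.0 as the KB ages (assumed exponential)."""
    return math.exp(-delta_stale * t_acc)

def staleness_adjusted_gain(k_ctrl: float, t_acc: float, delta_stale: float) -> float:
    """K_eff = K_ctrl * phi_stale(T_acc): never larger than K_ctrl, so the LTI bound still holds."""
    return k_ctrl * phi_stale(t_acc, delta_stale)
```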
Stochastic extension: when \(\tau\) is not constant
Proposition 22 assumes a fixed delay \(\tau\). In tactical environments \(\tau\) is a stochastic process; its distribution governs whether any finite controller gain \(K_{\text{ctrl}}\) can maintain stability.
Definition 37 (Stochastic Transport Delay Model). Let \(\tau(t)\) denote the one-way transport delay at time \(t\), distributed conditionally on connectivity regime \(C\).
Connected (\(C = 1.0\)): \(\tau(t) \sim \mathrm{LogNormal}(\mu_c, \sigma_c)\) with \(\sigma_c\) much smaller than \(\mu_c\) (coefficient of variation approximately 10%). Propagation and queuing delays compose multiplicatively across many independent hops, producing a log-normal tail.
Degraded (\(C = 0.5\)): \(\tau(t) \sim \mathrm{LogNormal}(\mu_d, \sigma_d)\) with \(\sigma_d \approx \mu_d\) (coefficient of variation approximately 100%). Retransmission bursts and partial-outage rerouting drive variance to the same order as the mean.
Contested (\(C = 0.0\)): \(\tau(t) \sim \mathrm{Pareto}(\tau_{\min}, \alpha)\) with \(\alpha \in (1, 2)\):
The definition models transport delay as a heavy-tail Pareto distribution fitted from MAPE-K logs, enabling robust gain selection via Proposition 23 and guarding against gain under-design when mean-delay assumptions underestimate P99 delay by \(3{-}10\times\) for tail index \(\alpha < 2\). The tail index \(\alpha\) (\(\alpha \leq 2\) means infinite variance) and the hardware-limited delay floor \(\tau_{\min}\) are both fitted from log data. Log-log plots of delay data distinguish distributions: a straight line confirms Pareto; curvature signals Weibull — and each requires a different gain formula.
p-th percentile: \(\tau_p = \tau_{\min}\,(1-p)^{-1/\alpha}\). Mean: \(\mathrm{E}[\tau] = \alpha\,\tau_{\min}/(\alpha-1)\) for \(\alpha > 1\). Variance: undefined (infinite) for \(\alpha \leq 2\).
The Pareto model is natural under adversarial conditions: an adversary who controls jamming duration selects from a strategic distribution, producing power-law delay tails. Shape parameter \(\alpha\) encodes adversarial capability — RAVEN contested-link measurements yield \(\alpha \approx 1.6\), giving a finite mean but unbounded variance.
Critical consequence: With \(\alpha < 2\) in the Contested regime, the estimation error of any EWMA or Kalman filter tracking \(\tau\) also has infinite variance, regardless of filter design. Mean-plus-\(k\)-sigma stability margins are meaningless; all quantitative bounds must use percentiles.
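The Pareto quantile and mean from Definition 37 can be checked numerically; the parameter values below are the article's illustrative RAVEN calibration:

```python
def pareto_quantile(p: float, tau_min: float, alpha: float) -> float:
    """p-th percentile of a Pareto(tau_min, alpha) delay: tau_min * (1-p)^(-1/alpha)."""
    return tau_min * (1.0 - p) ** (-1.0 / alpha)

def pareto_mean(tau_min: float, alpha: float) -> float:
    """Mean alpha*tau_min/(alpha-1); finite only for alpha > 1."""
    assert alpha > 1.0
    return alpha * tau_min / (alpha - 1.0)

tau_min, alpha = 0.2, 1.6              # illustrative RAVEN contested-link fit
p99 = pareto_quantile(0.99, tau_min, alpha)
mean = pareto_mean(tau_min, alpha)
# p99 exceeds the mean severalfold at this tail index: mean-based gain under-designs the loop.
```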
Watch out for: the gain bound is derived for a single healing loop with a fixed delay \(\tau\); concurrent healing loops sharing one CPU raise the effective aggregate gain to \(\sum_i K_{\text{ctrl},i}\), which can exceed the stability ceiling even when each individual loop satisfies the per-loop bound — the multi-loop stability condition must be checked separately and is not a consequence of per-loop compliance alone.
Proposition 23 (Robust Gain Scheduling under Stochastic Delay). Let \(\delta_\text{inst}\) be the acceptable per-cycle instability probability (distinct from the staleness decay \(\delta_\text{stale}\) and the hysteresis band \(\delta_\theta\)). The regime-dependent robust gain bound is \(K_{\text{robust}} = 1/(1 + \tau_{1-\delta_\text{inst}})\).
Using mean delay to set gain is unsafe in contested regimes — RAVEN ’s P99 delay can be \(13\times\) (illustrative value) the mean, collapsing the safe gain ceiling to near zero.
The proposition derives a conservative loop gain using the delay quantile instead of the mean, guaranteeing stability for a fraction \(1-\delta_\text{inst}\) of all delay realizations and eliminating intermittent instability from rare long-tail events that a mean-based gain cannot handle. The instability tolerance \(\delta_\text{inst} = 0.01\) (illustrative value) targets 99% stability; the quantile \(\tau_{1-\delta_\text{inst}}\) follows from the Pareto or Weibull quantile formula. Even so, in a 5-second MAPE-K loop the bound can still permit approximately 3.6 instability events per day (illustrative value) — safety-critical deployments require a tighter \(\delta_\text{inst}\) (illustrative value).
Empirical status: The Pareto tail parameters \(\alpha = 1.6\) and \(\tau_{\min} = 0.2\) s are calibrated from RAVEN contested-link measurements; different RF environments, jamming profiles, and terrain will yield different tail indices, requiring per-deployment calibration from delay logs.
Physical translation: In denied regime, the healing loop must run at 82% (illustrative value) of the gain it would use with a reliable connection — clock uncertainty forces conservative control. Degraded regime permits 90% (illustrative value), and connected permits the full design gain. The table directly tells a field operator how aggressively the RAVEN healing loop may respond under each radio condition, without needing to understand the underlying Lyapunov mathematics.
where \(\tau_{1-\delta_\text{inst}}\) is the \((1-\delta_\text{inst})\)-th percentile of \(\tau(t)\):
| Regime | Distribution | Permissible actions |
|---|---|---|
| Connected (C=1.0) | LogNormal | All severities |
| Degraded (C=0.5) | LogNormal | Severity 1 and 2 only |
| Contested (C=0.0) | Pareto | Severity 1 local only |
For RAVEN (\(\tau_{\min} = 0.2\) s (illustrative value), \(\alpha = 1.6\) (illustrative value), \(\delta_\text{inst} = 0.01\) (illustrative value)):
This gives \(\tau_{0.99} \approx 3.16\) s (illustrative value). For a reference tick period of 5 s (illustrative value), the resulting bound is \(K_{\text{robust}} = 1/(1+3.16) \approx 0.240\) (theoretical bound).
The bound tightens sharply as the instability tolerance is reduced. Since \(\tau_{1-\delta_\text{inst}}\) scales as \(\delta_\text{inst}^{-1/\alpha}\), driving \(\delta_\text{inst}\) toward \(10^{-3}\) pushes \(\tau_{1-\delta_\text{inst}}\) beyond 10 s and \(K_{\text{robust}}\) below 0.1 (illustrative values).
As \(\delta_\text{inst} \to 0\), \(\tau_{1-\delta_\text{inst}} \to \infty\) and \(K_{\text{robust}} \to 0\): for any operationally meaningful \(\delta_\text{inst}\), no positive gain satisfies the stability condition for remote actions in Contested conditions — all Severity 2 and above actions must be suppressed.
Proof: From Proposition 22 , stability requires \(K_{\text{ctrl}} < 1/(1+\tau)\), equivalently \(\tau < 1/K_{\text{ctrl}} - 1\). Under stochastic \(\tau\), the probability of stability is \(P\!\left(\tau \leq 1/K_{\text{ctrl}} - 1\right)\). Setting this equal to \(1-\delta_\text{inst}\) and inverting gives \(K_{\text{robust}} = 1/(1+\tau_{1-\delta_\text{inst}})\). For \(\alpha \in (1,2)\), the Pareto quantile grows without bound as \(\delta_\text{inst} \to 0\), so \(K_{\text{robust}} \to 0\) and no positive controller gain achieves arbitrary confidence in the Contested regime. \(\square\)
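The inversion in the proof can be sketched directly; with the illustrative RAVEN contested parameters the result is on the order of the single-node ceiling quoted below:

```python
def pareto_quantile(p: float, tau_min: float, alpha: float) -> float:
    """p-th percentile of a Pareto(tau_min, alpha) delay."""
    return tau_min * (1.0 - p) ** (-1.0 / alpha)

def robust_gain(tau_min: float, alpha: float, delta_inst: float) -> float:
    """K_robust = 1/(1 + tau_{1-delta}) from inverting P(tau <= 1/K - 1) = 1 - delta_inst."""
    tau_q = pareto_quantile(1.0 - delta_inst, tau_min, alpha)
    return 1.0 / (1.0 + tau_q)

k = robust_gain(0.2, 1.6, 0.01)        # illustrative RAVEN contested parameters
```

Tightening the tolerance shrinks the permissible gain, matching the limit argument in the proof.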
Watch out for: the quantile formula selects from either the Pareto or Weibull family based on log-log delay plots, but misidentifying the family produces a systematically wrong \(\tau_{1-\delta_\text{inst}}\) — if contested-regime delay is Weibull but treated as Pareto (or vice versa), the gain bound can be too generous by a factor of \(3\text{–}10\times\) at the tail, leaving the loop nominally “stable” at a gain that causes oscillation in the actual distribution’s worst percentile.
Corollary 78.2 (Fleet Stability Bound). Proposition 23 bounds instability probability for a single node. For a fleet of \(N\) nodes operating under the same connectivity regime \(C\), with target fleet-wide instability probability \(\delta_{\text{fleet}}\), set the per-node instability tolerance to:
\(\delta_{\text{node}} = \delta_{\text{fleet}} / N\)
and derive \(K_{\text{robust}}\) from Proposition 23 using \(\delta_{\text{node}}\). By the Bonferroni union bound, this guarantees \(P(\text{any node unstable}) \leq \delta_{\text{fleet}}\).
Under positive inter-node delay correlation \(\rho_C > 0\) — nodes share a connectivity regime and experience correlated jamming events — the Bonferroni bound remains valid but is conservative: correlated failure reduces effective diversity, so \(\delta_{\text{node}} = \delta_{\text{fleet}}/N\) remains a valid (if conservative) per-node target at all correlation levels. In the limit \(\rho_C \to 1\) (a single shared partition event drops the whole fleet simultaneously), treating the fleet as one entity and setting \(\delta_{\text{node}} = \delta_{\text{fleet}}\) is appropriate — the fleet either fails together or not at all.
For RAVEN (\(N = 47\) (illustrative value), \(\delta_{\text{fleet}} = 0.01\) (illustrative value), \(\alpha = 1.6\) (illustrative value), \(\tau_{\min} = 0.2\) s (illustrative value)): \(\delta_{\text{node}} = 0.01/47 \approx 2.1 \times 10^{-4}\) (theoretical bound), giving:
\(K_{\text{robust}} \approx 0.024\)
This is a \(10\times\) (illustrative value) tighter gain ceiling than the single-node bound of 0.240 (theoretical bound), reflecting the actual safety requirement for a 47-node mission.
Gossip coupling amplifier: A drone that hits the \(\delta_{\text{node}}\) tail and begins oscillating injects jitter into its neighbors’ gossip-based state estimates ( Definition 24 ), raising their effective \(\hat{\tau}\) and pulling their gain schedulers toward instability. This positive feedback between per-node oscillation and fleet-wide estimation noise means fleet stability is not a consequence of per-node stability alone. The fleet-level \(\delta_{\text{node}}\) bound provides the correct single-node target for independent failures; correlated cascade failures — requiring inter-node action coordination — are blocked by the Severity 2 suppression rule in Proposition 23 .
Watch out for: the Bonferroni union bound guarantees \(P(\text{any node unstable}) \leq \delta_{\text{fleet}}\) but is derived under the assumption of independent per-node failures; when inter-node delay correlation \(\rho_C \to 1\) (a single partition event drops the entire fleet simultaneously), the per-node target \(\delta_{\text{node}} = \delta_{\text{fleet}}/N\) is unnecessarily tight — the fleet either fails together or not at all, and the correct per-node target is \(\delta_{\text{node}} = \delta_{\text{fleet}}\), avoiding an \(N\times\) over-conservative gain reduction in the fully correlated limit.
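Corollary 78.2's allocation is a one-line division followed by Proposition 23's bound at the tighter per-node quantile (function names are illustrative):

```python
def fleet_robust_gain(delta_fleet: float, n_nodes: int, tau_min: float, alpha: float) -> float:
    """Per-node tolerance delta_fleet/N (Bonferroni), then K_robust at the (1 - delta_node) Pareto quantile."""
    delta_node = delta_fleet / n_nodes            # union-bound allocation across the fleet
    tau_q = tau_min * delta_node ** (-1.0 / alpha)
    return 1.0 / (1.0 + tau_q)

k_fleet = fleet_robust_gain(0.01, 47, 0.2, 1.6)   # illustrative RAVEN mission
k_single = fleet_robust_gain(0.01, 1, 0.2, 1.6)   # N = 1 recovers the single-node bound
```

The fleet ceiling is several times tighter than the single-node ceiling, as the corollary's discussion notes.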
Corollary 78.3 (Confidence-Interval Adjusted Stability Bound). Proposition 23 requires estimating the tail index \(\alpha\) — and through it \(\tau_{1-\delta_\text{inst}}\) — from observations. In Contested regime, observations are sparse by definition: a Pareto tail with \(\tau_{0.99} \approx 3.16\) s yields at most \(n_{\text{obs}} \approx 19\) samples in a 60-second estimation window — well below the \(n \geq 30\) minimum for reliable Hill estimation.
The Hill estimator for Pareto shape \(\alpha\) from \(k\) tail-exceedance observations has standard error:
\(\mathrm{SE}(\hat{\alpha}) = \hat{\alpha}/\sqrt{k}\)
where \(k\) is the number of observations used in the tail fit. A lighter estimated tail (\(\hat{\alpha}\) overestimated) causes \(\tau_{1-\delta_\text{inst}}\) to be underestimated, producing a gain that appears safe but is not. The confidence-adjusted quantile substitutes the lower \(\beta\)-confidence bound on \(\hat{\alpha}\) into the Pareto quantile formula:
\(\alpha^- = \hat{\alpha}\left(1 - z_{1-\beta}/\sqrt{k}\right), \qquad \tau^+ = \tau_{\min}\,\delta_\text{inst}^{-1/\alpha^-}\)
The confidence-adjusted robust gain is then:
\(K_{\mathrm{robust,CI}} = 1/(1 + \tau^+)\)
When \(k < k_{\min}\) (the tail fit is unreliable), \(\alpha^-\) may fall below 1, at which point \(\tau^+ \to \infty\) and \(K_{\mathrm{robust,CI}} \to 0\): the gain degrades gracefully to zero, reverting to the Severity suppression floor that Proposition 23 already requires as \(\delta_\text{inst} \to 0\). This is the correct behavior under estimation collapse.
The corollary replaces \(\tau_{1-\delta_\text{inst}}\) with \(\tau^+\) in Contested-regime gain computations, reverting to Proposition 23 ’s point estimate once enough observations (\(n_{\text{obs}} \geq 30\)) make the Hill fit reliable. The minimum reliable tail-observation count is \(k_{\min} = 15\) (theoretical bound) (Hill SE below 26%); the confidence interval uses \(\beta = 0.10\) (illustrative value) (90% CI). For RAVEN in the Contested window, \(n_{\text{obs}} \approx 19\) and \(k \approx 4\) (illustrative value) — below \(k_{\min}\) — placing the estimator below its reliable threshold; Severity suppression is accordingly the correct operating mode, not a conservative simplification. As \(\mathrm{SE}(\hat{\alpha})\) grows with sparse observations, the Execute dead-band from Definition 38 widens in proportion, ensuring the Execute phase becomes more conservative under estimation uncertainty.
Watch out for: the confidence adjustment relies on the Hill estimator’s Gaussian approximation \(\hat{\alpha} \approx \mathcal{N}(\alpha, \alpha^2/k)\), which holds only when \(k\) is large enough for the Central Limit Theorem to apply; for RAVEN in Contested regime with \(k \approx 4\) tail observations, this approximation is unreliable — the confidence-adjusted quantile \(\tau^+\) is not a statistically rigorous bound but a conservative heuristic, and the correct response when \(k < k_{\min}\) is Severity suppression, not reliance on the CI formula.
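A sketch of the Hill point estimate and the CI adjustment; the one-sided z-value for \(\beta = 0.10\) (\(z \approx 1.2816\)) is an assumption, and per the warning above this is a heuristic, not a rigorous bound, at small \(k\):

```python
import math

def hill_alpha(tail_samples, tau_min: float) -> float:
    """Hill point estimate: alpha_hat = k / sum(ln(x_i / tau_min)) over k tail exceedances."""
    k = len(tail_samples)
    return k / sum(math.log(x / tau_min) for x in tail_samples)

def ci_adjusted_quantile(alpha_hat: float, k: int, tau_min: float,
                         delta_inst: float, z_beta: float = 1.2816) -> float:
    """tau^+ from the lower CI bound alpha^- = alpha_hat*(1 - z/sqrt(k)); inf on collapse."""
    alpha_lo = alpha_hat * (1.0 - z_beta / math.sqrt(k))
    if alpha_lo <= 1.0:
        return float("inf")        # estimation collapse: K_robust,CI -> 0 (Severity suppression)
    return tau_min * delta_inst ** (-1.0 / alpha_lo)
```

With the RAVEN-like values \(\hat{\alpha} = 1.6\) and \(k = 4\), the lower bound falls below 1 and the quantile collapses to infinity, reproducing the graceful degradation described above.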
Definition 38 ( MAPE-K Predictive Dead-Band). Let \(A\) be a healing action recommended at the Plan phase. The Execute phase suppresses \(A\) if any of the following hold at execution time:
(a) Delay invalidity — estimated transport delay exceeds the Stale Data Threshold ( Proposition 24 ): \(\hat{\tau}(t) > T_{\text{stale}}\)
(b) Self-correction — probability the target remains in failure state has fallen below a threshold \(p_{\min}\):
\(e^{-\mu_h \Delta t} < p_{\min}\)
equivalently \(\Delta t > \ln(1/p_{\min})/\mu_h\), where \(\mu_h\) is the autonomous self-healing rate of the target component.
(c) Gain violation — current delay estimate violates the stability condition from Proposition 23 : \(\hat{\tau}(t) > 1/K_{\text{ctrl}} - 1\)
All three conditions suppress action independently. Condition (b) is the MAPE-K analog of the Smith Predictor’s inner model path: it estimates whether the system will have self-corrected before \(A\) arrives, suppressing \(A\) if so.
In the Contested regime, the prediction error carries the same Pareto tail as \(\tau(t)\) regardless of predictor design. The Smith Predictor reduces the effective delay in the characteristic equation from \(\tau(t)\) to \(\varepsilon(t)\), but both are unbounded in variance. Condition (a) therefore remains the primary suppressor.
Condition (b) also prevents the anti-windup oscillation that Proposition 30 bounds: acting on a stale recommendation after the target has already self-healed is precisely the over-correction scenario Definition 46 (Healing Dead-Band and Refractory State) blocks.
Proposition 24 (Stale Data Threshold). Let \(\lambda_{\text{total}}\) be the total state-change rate (healing, failure, and coordination events); \(p_{\text{stale}}\) the maximum acceptable probability that state has changed since \(t_{\text{sense}}\); \(T_{\text{heal}}\) the healing deadline from Proposition 21 ; and \(k \geq 1\) a deadline safety factor. The Stale Data Threshold is:
\(T_{\text{stale}} = \min\!\left(\dfrac{T_{\text{heal}}}{k},\; \dfrac{-\ln(1 - p_{\text{stale}})}{\lambda_{\text{total}}}\right)\)
A RAVEN health report older than 10 seconds is more likely to mislead the Analyzer than inform it — acting on it is worse than waiting for a fresh reading.
The threshold sets the maximum safe age for health reports beyond which acting on stale data risks triggering a wrong healing decision from the ghost state of a fault that has already self-resolved. For RAVEN , \(\lambda_{\text{total}} \approx \mu_f = 0.02\)/s, \(p_{\text{stale}} = 0.20\), \(T_{\text{heal}} = 30\) s, and \(k = 3\) (illustrative values) yield a maximum safe data age of \(T_{\text{stale}} = 10\) s (illustrative value). Stale detection is structurally inapplicable without monotonic timestamps on every health message — the threshold formula presupposes time-ordered records.
Empirical status: The RAVEN values \(\mu_f = 0.02\)/s and \(p_{\text{stale}} = 0.20\) are calibrated from flight-test fault logs; failure rates vary by component type, environmental stress, and battery level, and should be measured per platform class.
Physical translation: Beyond , acting on the reading is worse than ignoring it — the fault may have already self-resolved, or escalated to a different state. The two-term minimum takes the tighter of two independent constraints: the healing deadline divided by a safety factor (time-to-act), and the Poisson-derived staleness bound (probability the state has changed). Whichever limit is tighter governs. For RAVEN’s 30-second healing window with \(k = 3\), the staleness budget is 10 seconds — if data is older than that, re-sense before acting.
A node must re-run Sense and Analyze before executing if the data age exceeds the threshold: \(t_{\text{now}} - t_{\text{sense}} > T_{\text{stale}}\).
Proof: State transitions form a Poisson process at rate \(\lambda_{\text{total}}\). The probability of at least one transition in an interval \(\Delta t\) is \(1 - e^{-\lambda_{\text{total}}\,\Delta t}\). Setting this equal to \(p_{\text{stale}}\) and solving for \(\Delta t\) gives the second term. The constraint \(T_{\text{stale}} \leq T_{\text{heal}}/k\) ensures timely execution within the healing deadline if re-sensing is infeasible. \(\square\)
Contested regime — feasibility window: In \(C = 0\), coordination is absent (\(\mu_c \approx 0\)), so \(\lambda_{\text{total}} \approx \mu_f\). Simultaneously, Proposition 23 requires the action to arrive within the data-age window with probability \(1 - \delta_\text{inst}\), imposing a lower bound from the Pareto quantile:
\(T_{\text{stale}} \geq \tau_{1-\delta_\text{inst}} = \tau_{\min}\,\delta_\text{inst}^{-1/\alpha}\)
For RAVEN (\(\tau_{\min} = 0.2\) s, \(\alpha = 1.6\), \(\delta_\text{inst} = 0.01\), \(\mu_f = 0.02\)/s, \(p_{\text{stale}} = 0.20\)):
The upper bound (state-change rate): \(-\ln(0.8)/0.02 \approx 11.2\) s; the lower bound (transport delay): \(\tau_{0.99} \approx 3.16\) s.
The feasibility window for remote healing actions in Contested RAVEN is \([3.16, 11.2]\) s. Below 3.16 s the action arrives too late with probability above 1%; after 11.2 s the system state has changed with probability above 20%. Outside this window all Severity 2 and above actions are suppressed; only Severity 1 local actions remain valid. In the Degraded regime (\(\mu_d = 0.8\) s, \(\sigma_d = 0.8\) s), the P99 delay during high-failure-rate episodes can already exceed the staleness upper bound — the feasibility window collapses, signaling that remote actions are inadvisable even at Degraded connectivity.
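Proposition 24's two-term minimum and the Contested feasibility window can be sketched together; all parameter values are the article's illustrative RAVEN calibration:

```python
import math

def stale_threshold(t_heal: float, k: float, p_stale: float, lam_total: float) -> float:
    """T_stale = min(T_heal/k, -ln(1 - p_stale)/lambda_total): the two-term minimum."""
    return min(t_heal / k, -math.log(1.0 - p_stale) / lam_total)

def feasibility_window(tau_min, alpha, delta_inst, p_stale, lam_total):
    """[lower, upper]: (1 - delta_inst) transport-delay quantile up to the Poisson staleness bound."""
    lower = tau_min * delta_inst ** (-1.0 / alpha)
    upper = -math.log(1.0 - p_stale) / lam_total
    return lower, upper

t_stale = stale_threshold(30.0, 3.0, 0.20, 0.02)   # deadline term (10 s) is the tighter one here
lo, hi = feasibility_window(0.2, 1.6, 0.01, 0.20, 0.02)
```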
Watch out for: the staleness threshold is derived from a Poisson state-change model at rate \(\lambda_{\text{total}}\), but coordinated failures (simultaneous jammer-induced multi-node faults) violate the Poisson independence assumption — real cascades are bursty, so the probability that state has changed since \(t_{\text{sense}}\) is higher than the formula predicts, and the safe data age is shorter than \(T_{\text{stale}}\) suggests during correlated multi-node events.
Stability Under Mode Transitions: Piecewise Lyapunov Analysis
The gain conditions in Proposition 22 and Proposition 23 guarantee stability within a single capability mode. They make no claim about stability across capability-level transitions. The error state \(x(t)\) at the moment a mode switch fires may lie outside the new mode’s Stability Region ( Definition 4 , defined in Why Edge Is Not Cloud Minus Bandwidth), causing divergence even when both the pre- and post-transition gains individually satisfy their per-mode LTI conditions.
Design-time prerequisites: Theorem PWL presents three LMI conditions. C1 (per-mode Lyapunov decay) is runtime-verifiable from the system’s current gain and delay measurements. C2 (jump bound at mode transitions) and C3 (minimum dwell-time lower bound) are design-time requirements that must be verified offline using an LMI solver before any deployment. The theorem’s stability conclusion holds only if C2 and C3 were satisfied during system design. Treat the C2/C3 conditions as a certification prerequisite, not an operational check.
Theorem PWL (Piecewise Lyapunov Stability). For the Hybrid Capability Automaton ( Definition 3 , defined in Why Edge Is Not Cloud Minus Bandwidth), let \(V_q(x) = x^\top P_q x\) with \(P_q \succ 0\) for each \(q \in Q\). The system is uniformly exponentially stable at the origin if there exist scalars \(\lambda_q > 0\) and \(\mu \geq 1\) satisfying three LMI conditions:
Condition (C1) is the within-mode decay LMI — \(A_q^\top P_q A_q - P_q \preceq -\lambda_q P_q\), one per capability level, solvable via MATLAB dlyap or Python cvxpy.
Condition (C2) bounds the Lyapunov value jump at each mode transition: \(P_{q'} \preceq \mu\, P_q\) for every allowed transition \(q \to q'\).
Condition (C3) sets the minimum dwell time \(\tau_{\text{dwell}} > \ln\mu / \lambda_{\min}\): the system must remain in each mode long enough for (C1)’s geometric decay to overcome (C2)’s jump multiplier before the next transition.
Proof sketch: Between transitions, \(V_q\) decays geometrically by (C1). At each transition \(q \to q'\), \(V_{q'}(x) \leq \mu\, V_q(x)\) by (C2). After \(N\) transitions over horizon \(T\): \(V(T) \leq \mu^N e^{-\lambda_{\min} T}\, V(0) \to 0\) as \(T \to \infty\) when (C3) holds. \(\square\)
Physical translation: The piecewise Lyapunov certificate is an engineering contract: before deploying each capability mode, an offline solver must verify three conditions. If all three pass, you have a mathematical guarantee that healing actions in that mode will converge rather than oscillate. If C2 or C3 were skipped during design, you have an optimistic assumption, not a guarantee — and a healing loop that passes C1 alone can still oscillate at mode transitions.
Implementation note: the \(P_q\) matrices are computed offline — once per firmware build using MATLAB dlyap or Python cvxpy — and stored as read-only constants in MCU flash. No LMI is solved at runtime. At each MAPE-K tick the only computation is one quadratic form \(V_q(x) = x^\top P_q x\) for state dimension \(n \leq 6\), costing at most 36 multiply-accumulate instructions on a Cortex-M4. The SMJLS contraction factor below is likewise precomputed from calibrated Weibull shape parameters and stored as a scalar constant; it is updated between missions on recalibration, not per tick. The \(50\,\mu\text{s}\) runtime budget cited in the NSG diagram below refers entirely to these quadratic-form evaluations — no online eigenvalue computation or LMI solve occurs.
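The offline/online split can be sketched in NumPy: \(P_q\) is solved once (here by the convergent series for a Schur-stable \(A_q\), standing in for dlyap/cvxpy), and the per-tick work is the single quadratic form. The dynamics matrix is illustrative:

```python
import numpy as np

def solve_dlyap(A: np.ndarray, Q: np.ndarray, iters: int = 500) -> np.ndarray:
    """Offline: P = sum_k (A^T)^k Q A^k solves A^T P A - P = -Q for Schur-stable A."""
    P, Ak = Q.copy(), A.copy()
    for _ in range(iters):
        P += Ak.T @ Q @ Ak
        Ak = Ak @ A
    return P

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # illustrative stable mode dynamics
P = solve_dlyap(A, np.eye(2))            # stored in flash; never recomputed per tick

def V(x: np.ndarray) -> float:
    """The only per-tick computation: one quadratic form x' P x."""
    return float(x @ P @ x)
```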
**SMJLS tightening.** Under the Weibull partition model ( Definition 13 ), mode durations are heavy-tailed and switching is semi-Markovian. The mean-square stable gain is strictly tighter than the per-mode LTI bound. For RAVEN (\(k_N = 0.62\) (illustrative value)): \(K_{\text{SMJLS}} \approx 0.82\, K_{\text{LTI}}\) (theoretical bound) — an empirically calibrated 18% (illustrative value) reduction that propagates directly into the gain scheduler below; formal LMI verification of the switched-stability conditions is pending.
Field note — conservative bound for CBF-constrained controllers: The 0.82 tightening factor is established for linear LTI systems. Its applicability to CBF-constrained nonlinear controllers ( Definition 39 ) has not been analytically verified. Use the conservative bound for safety-critical deployments; treat the 0.82 factor as an empirical target to validate during field certification ( Definition 104 ).
Derivation of the 0.82 factor. The factor is not arbitrary — it is the numerical solution of Condition (C3) for the RAVEN parameter set, with formal LMI verification still pending. Solving the LMI system (C1)–(C3) for RAVEN yields: mode-decay rate \(\lambda^* \approx 0.048\) (from the L3 delay-chain companion LMI) and Lyapunov jump multiplier \(\mu^* \approx 1.22\) (from the L2\(\to\)L3 transition, the tightest adjacent-mode pair). The SMJLS mean-square stability condition then requires the gain-scaled LMI to remain feasible under the Weibull-distributed dwell-time distribution — specifically, the expected Lyapunov growth per mode-switch must stay bounded. Numerically, this contracts the feasible \(K_{\text{ctrl}}\) set to 82% of its LTI extent. The 0.82 scaling is parameter-specific: for exponential dwell times (\(k=1\), classical MJLS), the contraction is \(\approx 5\%\); for RAVEN’s heavy tail (\(k_N = 0.62\)), it reaches 18% because heavy tails produce short-dwell excursions that increase the effective transition frequency and compound the Lyapunov jump accumulation.
Watch out for: the stability proof holds when the LMI system (C1)–(C3) has a feasible solution for all mode pairs in \(Q\); there is no guarantee that a solution exists for an arbitrary deployment’s \(A_q\) matrices — if the mode dynamics are too dissimilar (large Lyapunov jump multiplier \(\mu^*\)) or mode dwell times are too short to satisfy C3, the LMI is infeasible and no piecewise Lyapunov certificate exists for that configuration, meaning the deployment cannot be certified as uniformly exponentially stable under mode transitions regardless of how the gains are tuned.
Definition 39 (Discrete Control Barrier Function). Control Barrier Functions provide formal safety guarantees for continuous-time systems [7] ; the discrete-time formulation used here adapts those guarantees to the MAPE-K tick structure [8] . A function \(h_q : \mathbb{R}^n \to \mathbb{R}\) is a Discrete Control Barrier Function (dCBF) for mode \(q\) if the safe set \(\mathcal{C}_q = \{x : h_q(x) \geq 0\}\) is nonempty, compact, contains \(x^* = 0\) in its interior, and there exists \(\gamma_{\text{cbf}} \in (0, 1]\) such that for all \(x \in \mathcal{C}_q\):
\(h_q(x[t+1]) \geq (1 - \gamma_{\text{cbf}})\, h_q(x[t])\)
The canonical choice for the MAPE-K healing loop is \(h_q(x) = 1 - V_q(x)/c_q\), so that \(\mathcal{C}_q = \{x : V_q(x) \leq c_q\} = \mathcal{R}_q\) (the Stability Region). The dCBF condition then becomes: the one-tick-ahead Lyapunov value under the proposed control input must not grow faster than the \((1-\gamma_{\text{cbf}})\) contraction rate.
Here \(\rho_q\) is the CBF stability margin — the normalized distance from the current state to the mode-\(q\) stability region boundary. This is distinct from the energy ratio used in Why Edge Is Not Cloud Minus Bandwidth and from the retry multiplier in the refractory backoff formula.
The dCBF condition is evaluated before every Execute phase; if violated, \(K_{\text{ctrl}}\) is reduced via the CBF-QP closed form until the condition holds. The contraction rate governs the tightness of the constraint on admissible \(K_{\text{ctrl}}\) — smaller \(\gamma_{\text{cbf}}\) means tighter contraction; (illustrative value) is a safe default for 5 s MAPE-K ticks, with runtime cost bounded by one quadratic form (36 multiplications, \(<20\,\mu\)s on Cortex-M4 at L1 throttle (illustrative value)). The evaluation cost equals that of computing \(\rho_q(t)\) — the safety filter adds negligible overhead alongside the stability margin computation already performed at each tick.
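A per-tick sketch of the check with the canonical barrier \(h_q(x) = 1 - V_q(x)/c_q\); the \(P\), \(c\), and \(\gamma\) values are illustrative:

```python
import numpy as np

def dcbf_ok(x_now: np.ndarray, x_next: np.ndarray, P: np.ndarray,
            c: float, gamma: float) -> bool:
    """Check h(x[t+1]) >= (1 - gamma) * h(x[t]) with h(x) = 1 - x'Px/c."""
    h_now = 1.0 - float(x_now @ P @ x_now) / c
    h_next = 1.0 - float(x_next @ P @ x_next) / c
    return h_next >= (1.0 - gamma) * h_now

P = np.eye(2)   # illustrative Lyapunov matrix; c = 1 defines the safe sublevel set
ok = dcbf_ok(np.array([0.5, 0.0]), np.array([0.4, 0.0]), P, 1.0, 0.1)    # margin grows: passes
bad = dcbf_ok(np.array([0.5, 0.0]), np.array([0.99, 0.0]), P, 1.0, 0.1)  # margin collapses: fails
```

When the check fails, the gain is reduced (the CBF-QP closed form mentioned above) and the candidate next state re-evaluated.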
The Sampling Frequency Trap. The \(20\,\mu\text{s}\) dCBF evaluation is only the logic cost — the cost of checking once \(x\) is known. The hidden cost is state estimation: obtaining a fresh \(x(t)\) requires an IMU sample, sensor fusion, and state prediction.
Under fault conditions — rotor asymmetry, rapid attitude change, impending crash — the safety guarantee requires \(x(t)\) to be fresher than the failure propagation time (~100–300 ms for RAVEN; see Proposition 25 ). This creates a regime-dependent sensing cost that must be pre-budgeted separately from the logic cost:
| Regime | Required sample rate | IMU power (RAVEN) | State estimation cost |
|---|---|---|---|
| Nominal (MAPE-K cycle) | 0.2 Hz | ~0.05 mW | Included in Zero-Tax baseline |
| Degraded (fault detected) | 10–50 Hz | ~2–36 mW | NOT in Zero-Tax baseline |
| Emergency (CBF active) | 100+ Hz | ~15–28 mW | Requires dedicated power reservation |
The Zero-Tax autonomic tier budgets for logic computation at baseline sensing rates. When the CBF safety filter activates (fault detected), the system must automatically escalate its sensing rate — and the power budget for that escalation must be pre-reserved. For RAVEN, this means reserving ~20 mW of the “emergency power margin” exclusively for high-rate IMU sampling during CBF-active intervals. Failure to reserve this margin causes the CBF to operate on stale \(x(t)\), rendering the safety certificate void.
(Notation: \(\gamma_{\text{cbf}}\) is the CBF contraction rate used throughout this section, governing how quickly the safety margin \(h_q\) is permitted to contract per tick.)
Budget at least 20 mW (illustrative value) emergency margin in the RAVEN power profile (~11–60% (illustrative value) of the 150–180 mW (illustrative value) nominal platform budget, achievable by throttling background processing). Require fault-confidence > 0.9 (two consecutive MAPE-K ticks with \(\varepsilon_\text{model}\) exceeded) before escalating to high-rate IMU sampling. Treating every anomaly as a potential crash will drain the emergency power margin in under 30 seconds (illustrative value) at 100 Hz.
Warning: The dCBF logic cost (\(<20\,\mu\text{s}\)) does not include state estimation. Pre-reserve power for high-rate IMU sampling or the safety certificate becomes void when it is needed most.
The model-validity condition requires that \(A_q\) accurately represents the current plant dynamics. Under physical damage (motor degradation shifts poles) or sustained RF interference (actuator desaturation changes \(B_q\)), the true one-step map deviates from the nominal \(A_q x + B_q u\), so a dCBF check that passes against the nominal model may fail against the true model. The guarantee of Proposition 25 is valid only within the accuracy envelope of \(A_q\); see Proposition 31 for the recovery-time implication.
When state estimate \(x(t)\) is stale (partition age \(\Delta t\) from Definition 26 ), substitute \(h_q(x)\,e^{-\lambda_{\text{decay}}\,\Delta t}\) in place of \(h_q(x(t))\) in the safety check, where \(\lambda_{\text{decay}}\) is the staleness decay coefficient from Definition 45 . This staleness correction makes the check conservative: a stale state estimate shows a smaller safety margin, deferring actions rather than falsely approving them.
Notation. The staleness decay coefficient \(\lambda_\text{decay}\) is the initial slope of the exponential decay curve from Definition 45 : \(\lambda_{\text{decay}} = -\phi_{\text{stale}}'(0)\). Geometrically, it is the rate at which the safety margin shrinks per second of stale data. For the OUTPOST temperature sensor example, the calibrated value is (illustrative value).
Sensitivity analysis: \(\varepsilon_\text{model}\) calibration and false-lockout risk. The model-error tolerance \(\varepsilon_\text{model}\) is a tunable threshold with a direct false-lockout/false-clearance trade-off:
| \(\varepsilon_\text{model}\) | False-lockout risk | False-clearance risk | Recommended use case |
|---|---|---|---|
| 0.01 | High — benign IMU noise triggers lockout | Very low | Safety-critical systems with high-quality sensors and well-identified plant models |
| 0.05 | Moderate — calibrated for RAVEN with \(\pm 2\%\) gyro noise | Low | General-purpose; well-calibrated edge nodes |
| 0.10 | Low | Moderate — mild model drift passes undetected | Harsh environments; accept higher model uncertainty |
| 0.20 | Very low | High — significant model errors pass | Only if model accuracy is structurally unachievable; pair with frequent re-identification |
Calibration procedure. Derive \(\varepsilon_\text{model}\) from field data in three steps: (1) Run the identified model \((A_q, B_q)\) against recorded nominal trajectories; measure the 95th-percentile prediction residual \(e_{\text{P95}}\). (2) Set \(\varepsilon_\text{model} = 1.5\, e_{\text{P95}}\) — a 50% safety margin above typical prediction error. For RAVEN: \(e_{\text{P95}} \approx 0.033\), giving \(\varepsilon_\text{model} \approx 0.05\). (3) Validate with fault-injection: inject known model-invalidating faults (locked rotor, 20% calibration drift) and verify detection within \(n_\text{detect} \leq 3\) MAPE-K ticks.
False-lockout scenario. A rapid altitude drop or downdraft shifts aerodynamic coefficients by 8–58% for a RAVEN drone — temporarily exceeding \(\varepsilon_\text{model}\) for a non-critical environmental transient. Require the model error to persist above \(\varepsilon_\text{model}\) for \(n_\text{persist} \geq 3\) consecutive MAPE-K ticks before triggering lockout. This filters single-sample glitches while preserving detection of sustained model failures.
The persistence filter \(n_\text{persist} \geq 3\) ticks (15 s at \(T_\text{tick} = 5\,\text{s}\)) is a mandatory implementation parameter — not an optional optimization. Without it, a single out-of-bounds sensor reading during a downdraft or vibration event triggers a healing lockout. This matches the Schmitt-trigger hysteresis window in Definition 47 and the Adaptive Refractory Backoff minimum period in Definition 48 .
Warning: A too-tight \(\varepsilon_\text{model}\) causes false lockouts on vibration; a too-loose one lets real model drift pass undetected. Calibrate from field P95 residuals and always require \(n_\text{persist} \geq 3\) ticks of sustained exceedance before acting.
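The persistence filter reduces to a run-length counter; the default threshold reflects the RAVEN calibration above, and the function name is illustrative:

```python
def lockout_trigger(residuals, eps_model: float = 0.05, n_persist: int = 3) -> bool:
    """True iff model error exceeds eps_model for n_persist consecutive MAPE-K ticks."""
    run = 0
    for r in residuals:
        run = run + 1 if r > eps_model else 0   # reset the run on any in-bounds tick
        if run >= n_persist:
            return True
    return False

# A single downdraft glitch must not trigger; three sustained exceedances must.
```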
Definition 40 (CBF Gain Scheduler). Given current state \(x\), mode \(q\), and stability margin \(\rho_q(t)\), the mode-and-state-indexed safe gain is:
\(K_{\text{gs}}(x, q) = \eta\, \Phi(\rho_q(t))\, K_{\text{robust}}(q)\)
where \(\eta = 0.85\) provides a 15% model-error margin and \(\Phi\) is the piecewise scheduling function:
\(\Phi(\rho) = \begin{cases} 1 & \rho \geq 0.5 \\ 2\rho & 0 \leq \rho < 0.5 \\ 0 & \rho < 0 \end{cases}\)
The scheduler looks up \(\Phi(\rho_q)\) from a precomputed 100-entry \(\rho_q\)-indexed table at each MAPE-K tick, replacing the static \(K_{\text{ctrl}}\) in the healing actuator; the table fits in 400 bytes (illustrative value) of MCU flash with no runtime LMI required. Full gain (\(\Phi = 1\)) applies when \(\rho > 0.5\) — no derate in the safe interior; the gain derates linearly as \(\rho\) falls below 0.5; healing is suspended (zero gain) when \(\rho < 0\). A DEFER rate above 5% per hour (illustrative value) (\(\rho_q < 0\) triggering suspension) indicates either that \(K_{\text{robust}}\) is too aggressive for the current mode or that the \(P_q\) matrices are miscalibrated from stale field data.
| \(\rho_q(t)\) | \(\Phi(\rho)\) | \(\eta\,\Phi(\rho)\) | Operational meaning |
|---|---|---|---|
| \(> 0.8\) | 1.0 | 0.85 | Safe interior — full corrective authority |
| \([0.5,\, 0.8]\) | 1.0 | 0.85 | Approaching boundary — no derate yet |
| \([0.2,\, 0.5)\) | \(2\rho\) | 0.17–0.85 | Near boundary — proportional derate active |
| \([0,\, 0.2)\) | \(2\rho\) | \(<0.17\) | Safety-critical margin — minimal gain; extend refractory |
| \(< 0\) | 0 | 0 | Outside \(\mathcal{R}_q\) — healing suspended |
Physical translation: The CBF gain scheduler is the lookup table the healing loop uses to decide how aggressively to respond, based on two inputs: the current stability margin \(\rho_q\) (how close the system is to its stability boundary) and the current operating mode. A wide margin in normal mode gets a fast response; a critically low battery in survival mode gets a slow, conservative response to preserve the last of the power budget.
Analogy: Cruise control with a collision-avoidance override — the CBF is the hard “don’t get closer than X meters” rule that overrides the speed controller regardless of driver input. No matter how urgently the driver wants to accelerate, the safety envelope takes precedence.
Logic: The dCBF condition enforces a per-tick safety margin. The gain scheduler then linearly scales healing authority down as \(\rho_q\) approaches zero.
Compute Profile: CPU: \(O(n^2)\) per tick — one quadratic form of size \(n\) (the state dimension), one table lookup, one scalar multiply. Memory: \(O(|Q|\, n^2)\) — one precomputed matrix \(P_q\) per regime \(q\); no runtime LMI solve required.
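The scheduler's piecewise derate can be sketched as below, assuming the effective gain is \(\eta\,\Phi(\rho)\,K_{\text{ctrl}}\) as described above; names are illustrative.

```python
# Sketch of the CBF gain scheduler from Definition 40: full gain above
# rho = 0.5, linear derate (2*rho) below it, suspension below zero.

ETA = 0.85  # 15% model-error margin

def phi(rho: float) -> float:
    """Piecewise scheduling function Phi(rho)."""
    if rho < 0.0:
        return 0.0          # outside safe set: healing suspended (DEFER)
    if rho < 0.5:
        return 2.0 * rho    # proportional derate near the boundary
    return 1.0              # safe interior: no derate

def scheduled_gain(k_ctrl: float, rho: float) -> float:
    """Safe gain applied by the healing actuator in place of static K_ctrl."""
    return ETA * phi(rho) * k_ctrl

assert phi(0.8) == 1.0
assert phi(0.25) == 0.5
assert phi(-0.1) == 0.0
assert abs(scheduled_gain(0.2, 0.9) - 0.85 * 0.2) < 1e-12
```

In a deployment the same mapping would be baked into the 100-entry flash table rather than computed per tick; the function form makes the derate policy auditable.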
Nonlinear Safety Guardrail (NSG). The dCBF and gain scheduler combine into a unified per-tick safety filter that wraps the standard MAPE-K Execute phase. The four phases execute in order each tick:
flowchart TD
M["[M] MONITOR
Compute stability margin ρ_q in mode q"]
M --> Mcheck{"ρ_q < 0?"}
Mcheck -->|"Yes — outside safe set"| Mdefer["Set K_gs = 0
Log OUTSIDE_SAFE_SET"]
Mdefer --> Mend(["Skip to next MAPE-K tick"])
Mcheck -->|"No"| A{"[A] ANALYZE
Mode transition proposed?"}
A -->|"No — same mode q"| P
A -->|"Yes — q to q_new"| Acomp["Compute ρ_q_new
for target mode q_new"]
Acomp --> Acheck{"ρ_q_new < 0?"}
Acheck -->|"Yes — target unsafe"| Atdefer["Defer transition
Hold mode q
Log TRANSITION_UNSAFE"]
Atdefer --> P
Acheck -->|"No"| Adwell{"Dwell time satisfied?
t_since_switch ≥ τ_dwell_min?"}
Adwell -->|"No"| Await["Hold mode q
Wait for dwell_min"]
Await --> P
Adwell -->|"Yes"| P["[P] PLAN
Compute K_gs from stability table
Ramp gain if K_nominal > K_gs"]
P --> E["[E] EXECUTE
Apply K_gs
Commit q = q_new
Log ρ_q to telemetry"]
Read the diagram: This flowchart is the safety filter that wraps every MAPE-K tick. Monitor computes the stability margin \(\rho_q\); if negative (outside the safe set) the tick is skipped entirely. Analyze checks whether a mode transition is safe and whether the dwell time in the current mode has been satisfied. Plan looks up the derated gain from the stability table. Execute applies the gain, commits the mode transition if approved, and logs stability margin to telemetry. All checks are O(1) arithmetic — under \(50\,\mu\text{s}\) on a Cortex-M4.
Runtime: two quadratic forms plus one table lookup — under \(50\,\mu\text{s}\) at L1 throttle on a Cortex-M4.
Proposition 25 (Nonlinear Safety Invariant). If \(x(t_0) \in \mathcal{R}_{q(t_0)}\) and the Nonlinear Safety Guardrail is active at every MAPE-K tick, then \(x(t_k) \in \mathcal{R}_{q(t_k)}\) for all \(k \geq 0\).
If the RAVEN loop starts in a safe state and the guardrail checks run every tick, the system provably never leaves its stability region across all mode transitions.
The proposition formally certifies that no healing action fires while the system is outside its Stability Region in any capability mode — an invariant required as evidence for Level 3+ Field Autonomic Certification. The precondition is verified at boot (Phase 0 of FAC); a bounded number of DEFER ticks (illustrative value) guarantees re-entry for all RAVEN / CONVOY / OUTPOST configurations. A stability margin trending from 0.85 to 0.40 (illustrative value) over 90 minutes under sustained L1 throttle is the early-warning signal that pure LTI analysis cannot surface: the per-tick LTI check appears healthy at every tick until the loop suddenly destabilizes.
Inter-tick safety margin (physical validity condition). The discrete-time safety certificate \(h_q(x(t_k)) \geq 0\) is only physically meaningful if failure modes propagate slower than the sampling period \(T_{\text{tick}}\). Formally:
\[ h_q(x(t_k)) \;\geq\; L_h \, |f_q|_{\max} \, T_{\text{tick}} \]

must hold at every tick, where \(L_h\) is the Lipschitz constant of \(h_q\) and \(|f_q|_{\max}\) is the maximum rate of change of the system state.
For RAVEN, rotor-failure propagation takes approximately 200 ms (illustrative value) while \(T_{\text{tick}} = 5\,\text{s}\) (illustrative value), so this condition is violated by \(25\times\) (illustrative value). The certificate guarantees the swarm was safe at the last check, not that it remains safe until the next.
Systems with failure propagation times shorter than \(T_{\text{tick}}\) require either interrupt-driven sensing or a physical L0 interlock ( Definition 108 ) as the true safety backstop.
L0 Physical Safety Interlock ( Definition 108 , preview). Definition 108 is formally introduced in The Constraint Sequence and the Handover Boundary. In brief: a hardware-wired circuit that arrests all actuators regardless of software state; non-resettable without physical human action. It is the true safety backstop for failure modes that propagate faster than the MAPE-K sampling interval \(T_\text{tick}\) — precisely the inter-tick gap identified above.
Calibrating \(L_h\). The Lipschitz constant \(L_h\) bounds how fast the safety function \(h_q(x)\) can change as the system state evolves. Empirical estimate: linearize \(h_q\) around the equilibrium and compute \(L_h = \max_x |\nabla h_q(x)|\) over a representative state trajectory from field data. For RAVEN’s rotor-health barrier function, \(L_h \approx 1\). For nonlinear barriers (e.g., a CBF based on kinetic energy), \(L_h\) must be estimated numerically. A conservative upper bound is sufficient for the safety certificate; a tighter \(L_h\) makes the certificate non-vacuous more often.
Warning: The discrete-time safety certificate is only valid when failure propagation is slower than \(T_{\text{tick}}\). For RAVEN, rotor failures propagate in ~200 ms (illustrative value) — 25× (illustrative value) faster than the 5 s (illustrative value) tick. Hardware interlocks, not software certificates, are the safety backstop for fast-propagating faults.
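The inter-tick validity condition above can be checked mechanically. A minimal sketch, assuming the inequality form \(h \geq L_h\,|f|_{\max}\,T_{\text{tick}}\); all numeric values are illustrative.

```python
# Sketch of the inter-tick validity check: the discrete certificate is
# physically meaningful only if the current safety margin exceeds the
# worst-case inter-tick drift L_h * f_max * T_tick.

def certificate_valid(h: float, L_h: float, f_max: float, t_tick: float) -> bool:
    """True if h(x) cannot be driven below zero before the next tick."""
    return h >= L_h * f_max * t_tick

# Slow fault: drift per 5 s tick is 0.5 -- a margin of 1.0 covers it.
assert certificate_valid(h=1.0, L_h=1.0, f_max=0.1, t_tick=5.0)

# Fast fault (rotor-failure-like rates): drift per tick is 5.0 -- no
# realistic margin covers it, hence the L0 hardware interlock backstop.
assert not certificate_valid(h=1.0, L_h=1.0, f_max=1.0, t_tick=5.0)
```

When the check fails for a given failure mode, the remedy is architectural (interrupt-driven sensing or the L0 interlock), not a larger margin.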
Proof: By strong induction on tick \(t_k\). Base: \(x(t_0) \in \mathcal{R}_{q(t_0)}\) by precondition. Inductive step: assume \(x(t_k) \in \mathcal{R}_{q(t_k)}\). (i) Within-mode tick: \(K_{\text{gs}}\) is selected to satisfy the dCBF decrease condition ( Definition 39 ), giving \(h_q(x(t_{k+1})) \geq (1 - \gamma_{\text{cbf}})\, h_q(x(t_k)) \geq 0\), so \(x(t_{k+1}) \in \mathcal{R}_q\). (ii) Mode transition \(q \to q'\): the ANALYZE phase checks \(\rho_{q'}(t_k) \geq 0\) — equivalently \(x(t_k) \in \mathcal{R}_{q'}\) — before allowing the transition. If the check passes, the inductive hypothesis holds in \(q'\) and within-mode stability applies for \(q'\). If it fails, the transition is deferred and the within-mode argument applies to \(q(t)\). (iii) DEFER with \(\rho_q < 0\): \(K_{\text{gs}} = 0\); the open-loop delay chain \(A_q^0\) (gain removed) has all eigenvalues at zero (nilpotent shift) for the class of plants where \(A_q\) is nilpotent, so \(V_q(x)\) decreases monotonically — re-entry into \(\mathcal{R}_q\) within finitely many ticks. For general Schur-stable plants with spectral radius \(\rho(A_q) < 1\), the dCBF contraction condition guarantees \(V_q(x)\) decreases monotonically at rate \((1-\gamma_{\text{cbf}})\) per tick, achieving re-entry within \(N \leq d_{\max}\) ticks. \(\square\)
Watch out for: the invariant holds only when the precondition is verified at boot and the dCBF check runs without interruption at every tick; for RAVEN , rotor failure propagates in ~200 ms while \(T_\text{tick} = 5\,\text{s}\) — a rotor fault that occurs mid-tick renders the safety certificate vacuously true at the last check but physically violated at the next, meaning the certificate bounds the state at check times, not between them.
Adaptive Gain Scheduling
The stability condition \(K_{\text{ctrl}} \cdot \tau < 1\) suggests a key insight: as feedback delay \(\tau\) varies with connectivity regime, the controller gain \(K_{\text{ctrl}}\) should adapt accordingly. Gain scheduling as a technique for handling operating-point variation is surveyed in [9] .
Gain scheduling by connectivity regime:

Define regime-specific gains that maintain stability margins across all operating conditions:

\[ K_{\text{ctrl}}(R) = \frac{0.75}{\tau(R)} \]
Notation — \(\alpha\) symbols in this article: Multiple scalars named \(\alpha\) appear with distinct roles; subscripts are the only disambiguator.
- The EMA smoothing coefficient (Adaptive Gain Scheduling section): each tick moves 10% toward the target gain, completing transitions in ~10 ticks.
- The gain stability margin (Adaptive Gain Scheduling section): the gain is set to 75% of the theoretical ceiling to maintain robustness against delay estimation errors.
- The resource allocation fraction for healing (Cascade Prevention section): caps total healing resource draw at 20% of available budget.
- \(\alpha(R) \in (0,1]\), the MAPE-K throttle coefficient ( Proposition 36 ): scales MAPE-K frequency by available resource margin.
- The priority aging cap (Resource Priority Matrix section): limits how much a waiting action’s priority can drift upward.
- The class-K function parameter in the discrete Control Barrier Function ( Definition 39 ): governs how tightly the CBF contraction bound is enforced; written \(\gamma_{\text{cbf}}\) in this article’s notation.
- The Lyapunov decay rate per mode \(q\) (Theorem PWL, C1 condition): appears as \(\lambda_q\) in the LMI; listed here because some references write the per-mode decay as \(\alpha_q\).
- The confidence threshold scaling factor in the staleness-aware threshold adjustment ( Definition 45 section): multiplies the confidence floor when the Knowledge Base is partially stale.
- The Weibull shape constant role: in Definitions 13–14 the Weibull shape parameter is \(k_N\); some literature writes this as \(\alpha\), but in this article it is always \(k_N\).
- Bare \(\alpha\) without subscript in Definition 37 : the Pareto tail index for contested-regime delay distribution — a statistical fitting parameter, not a design knob.
- Bare \(\alpha\) in the LinUCB formula: the exploration bonus scale in the contextual bandit gain-selection formula (Cascade Prevention section).
Outside the two exceptions noted above (the Definition 37 Pareto tail index and the LinUCB exploration scale), bare \(\alpha\) refers to the EMA smoothing coefficient. Full series notation registry: Notation Registry.
Notation — \(K\) symbols in this article: Two distinct quantities use the letter \(K\) and must not be confused. \(K_{\text{ctrl}}\) is the controller gain (stability bound): the scalar parameter in the proportional healing actuator; the stability condition bounds how aggressively the healer may respond to avoid oscillation. \(K\) in MAPE-K is the knowledge base: the fifth element of the tuple \((M, A, P, E, K)\) in Definition 36 ; a replicated state store that feeds all four phases. \(K\) in this role is never a scalar and is never constrained by a stability inequality.
Notation — \(\gamma\) symbols in this article: Four distinct quantities use \(\gamma\) and are disambiguated by subscript. \(\gamma_{\text{cbf}}\) is the CBF contraction rate: the class-K function parameter in Definition 39 (Discrete Control Barrier Function); governs the per-tick contraction of the safety margin \(h_q\); default \(\gamma_{\text{cbf}} = 0.05\) for 5 s MAPE-K ticks. \(\gamma_{\text{step}}\) is the threshold step-size: the increment by which the anomaly detection threshold \(\theta(t)\) steps toward the optimal target per adaptation cycle, bounded by \(|\Delta\theta|\). \(\gamma_{\text{damp}}\) is the derivative dampener rate: the confidence-derivative threshold in the actuation hold condition. The fourth is the RL discount factor: the temporal discount applied to future rewards in reinforcement-learning and MDP formulations in the action-selection sections.
where the stability margin factor (0.75, illustrative value) retains 75% of the theoretical gain limit \(1/\tau\), providing a robust safety margin against delay estimation error.
The table below translates this formula into concrete gain values for each connectivity regime, with the Healing Response column describing the behavioral consequence of operating at that gain.
| Regime | Typical \(\tau\) | Controller Gain \(K_{\text{ctrl}}\) | Healing Response |
|---|---|---|---|
| \(Full\) | 2-5s | 0.15-0.40 | Aggressive corrections; fast convergence to target state |
| \(Degraded\) | 10-30s | 0.025-0.08 | Moderate corrections; stable but slower to converge |
| \(Intermittent^+\) | 30-120s | 0.007-0.025 | Conservative corrections; accepts slow convergence to avoid oscillation |
| \(Denied^+\) | \(\infty\) (timeout) | 0.005 | Minimal corrections; reverts to open-loop predetermined responses |
\(^+\) For Intermittent and Denied regimes where transport delay follows a heavy-tailed (Pareto) distribution ( Definition 37 ), the “typical \(\tau\)” used in the gain formula is the P95 percentile from Proposition 23 ’s stochastic model — the mean delay is either very large or undefined under these distributions. Use Proposition 23 directly for these regimes; Proposition 22 ’s deterministic formula with mean \(\tau\) is valid only for Connected and Degraded regimes.
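Under the assumption — consistent with the table values above — that the scheduled gain keeps 75% of a \(1/\tau\) theoretical ceiling (\(K_{\text{ctrl}}\,\tau < 1\)), the regime-indexed gain rule can be sketched as:

```python
# Sketch of the regime-indexed gain rule. tau is the feedback delay in
# seconds: typical delay for Full/Degraded, P95 delay for the
# heavy-tailed Intermittent/Denied regimes (Proposition 23).

ALPHA_MARGIN = 0.75  # stability margin factor (illustrative value)

def regime_gain(tau: float) -> float:
    """Controller gain retaining 75% of the 1/tau stability ceiling."""
    return ALPHA_MARGIN / tau

# Reproduces the table endpoints:
assert abs(regime_gain(5.0) - 0.15) < 1e-9     # Full regime, slow end
assert abs(regime_gain(2.0) - 0.375) < 1e-9    # Full regime, fast end (~0.40)
assert abs(regime_gain(30.0) - 0.025) < 1e-9   # Degraded regime, slow end
```

The rule degrades gracefully: as \(\tau\) grows without bound the gain approaches zero, matching the table's open-loop fallback for the Denied regime.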
Smooth gain transitions:
Abrupt gain changes can destabilize the control loop. The exponential smoothing formula below blends the new target gain with the previous gain using mixing coefficient \(\alpha\), so that each timestep moves only a small fraction of the way toward the target:

\[ K_{\text{ctrl}}(t+1) = (1 - \alpha)\, K_{\text{ctrl}}(t) + \alpha\, K_{\text{target}} \]

where \(\alpha = 0.1\) (each step moves 10% toward the target) prevents oscillation during regime transitions.
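The smoothing step is a one-liner; a minimal sketch with an illustrative \(\alpha = 0.1\):

```python
# Sketch of exponential gain smoothing: each tick moves a fraction alpha
# toward the target gain, so regime transitions complete gradually
# (~10 ticks at alpha = 0.1) instead of as a step change.

def smooth_gain(k_prev: float, k_target: float, alpha: float = 0.1) -> float:
    """One EMA step: K(t+1) = (1 - alpha) * K(t) + alpha * K_target."""
    return (1.0 - alpha) * k_prev + alpha * k_target

k = 0.40                        # leaving the Full regime at its fast end
for _ in range(10):             # ~10 ticks toward a Degraded-regime target
    k = smooth_gain(k, 0.08)
assert 0.08 < k < 0.25          # converging smoothly, not jumping
```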
Bumpless transfer protocol:
When switching between regime-specific gains, maintain controller output continuity by computing the new gain for the target regime, calculating the output difference \(\Delta U\) induced by the gain change, and spreading \(\Delta U\) over a transition window to avoid step discontinuities.
Proactive gain adjustment:
Rather than waiting for a regime transition to trigger a gain change, the controller linearly extrapolates the current feedback delay trend to predict the delay at lookahead time \(\Delta\) and pre-adjusts the gain before the delay actually increases.
If predicted delay exceeds current regime threshold, preemptively reduce gain before connectivity degrades.
CONVOY example: During mountain transit, connectivity degradation is predictable from terrain maps. The healing controller reduces gain 30 seconds before entering known degraded zones, preventing oscillatory healing behavior when feedback delays suddenly increase.
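The proactive adjustment above can be sketched as linear extrapolation plus pre-derate. The \(0.75/\tau\) gain rule and all values are illustrative assumptions, not calibrated parameters.

```python
# Sketch of proactive gain adjustment: linearly extrapolate the observed
# delay trend over the lookahead window and derate the gain against the
# worse of the current and predicted delays.

def predicted_delay(tau_now: float, dtau_dt: float, lookahead: float) -> float:
    """Linear extrapolation of feedback delay at time now + lookahead."""
    return tau_now + dtau_dt * lookahead

def proactive_gain(tau_now: float, dtau_dt: float, lookahead: float,
                   margin: float = 0.75) -> float:
    tau_worst = max(tau_now, predicted_delay(tau_now, dtau_dt, lookahead))
    return margin / tau_worst

# CONVOY entering a mountain pass: delay trending up 1 s per second with
# a 30 s lookahead -> the gain drops before connectivity degrades.
assert proactive_gain(5.0, 1.0, 30.0) < proactive_gain(5.0, 0.0, 30.0)
```

Using the worse of the two delays makes the adjustment one-sided: the controller never pre-emptively raises gain on an improving trend, which keeps the pre-adjustment itself from causing oscillation.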
Cognitive Map: The MAPE-K loop is a proportional feedback controller whose stable controller-gain ceiling falls as feedback delay grows — the ceiling scales as \(1/\tau\), so \(K_{\text{ctrl}}\,\tau\) must stay below a fixed margin. Three levels of protection enforce this: the per-mode LTI gain bound ( Proposition 22 ; LTI bound — SMJLS analysis tightens this under time-varying gain, see proof), the robust percentile-based gain for heavy-tailed contested delays ( Proposition 23 ), and the runtime Nonlinear Safety Guardrail that checks the stability margin before every Execute phase. Healing actions must also finish before the failure becomes irreversible ( Proposition 21 ) and use data fresh enough that state has not changed since sensing ( Proposition 24 ). The result: a healing loop that is provably stable, provably timely, and provably operating on current information.
The Watchdog Protocol: Layer-0 Hardware Safety
Self-healing software that crashes has no way to heal itself. A MAPE-K loop that deadlocks during the Analyze phase cannot Monitor its own deadlock or Plan a recovery — the healer has become the patient. Watchdog timers are a foundational technique in the fault taxonomy of dependable systems [10] .
The engineering response is to wrap every autonomic loop in a three-layer hardware watchdog. The innermost layer is a hardware timer that fires a reset interrupt if the software loop misses its heartbeat, bypassing the OS entirely. Each outer layer monitors the layer inside it and is strictly simpler than what it monitors.
The watchdog adds an unprotected window of up to \(T_0\) seconds after any MAPE-K hang before the bypass fires. Tighter watchdog periods (smaller \(T_0\)) reduce unprotected exposure but require the loop to complete its heartbeat more reliably — making the watchdog less tolerant of occasional slow cycles under load.
Proposition 22 (LTI bound; SMJLS analysis tightens this under time-varying gain — see proof) guarantees closed-loop stability when the MAPE-K software loop executes correctly. It provides no guarantee when the loop itself fails: the Monitor thread deadlocks waiting for a gossip response, the Planner enters an infinite loop over a cyclic dependency graph, or the Executor hangs mid-action after a kernel panic. In these cases, the autonomic software that is supposed to heal the system has itself become the patient — with no higher-level authority to call.
The MAPE-K loop “pets” the watchdog at the end of each successful Execute phase. If the loop hangs, the counter expires, the interrupt fires, and control transfers to a pre-certified bypass program that operates entirely without MAPE-K software involvement.
Definition 41 (Software Watchdog Timer). A watchdog protocol is a tuple \((T_0,\, T_1,\, k,\, \mathbf{B},\, R)\) with three concentric monitoring layers:
- Layer 0 (hardware WDT): fires bypass action \(B_0\) if the MAPE-K thread does not write a heartbeat within \(T_0\) seconds. \(T_0\) must be on the order of the minimum MAPE-K cycle time to detect hangs within one loop iteration.
- Layer 1 (software watchdog): a dedicated watchdog thread checks MAPE-K liveness every \(T_1\) seconds and triggers restart \(B_1\) after \(k\) consecutive missed heartbeats.
- Layer 2 (meta-loop): a minimal monitoring process checks that Layer 1 itself is alive; escalates to \(B_0\) if Layer 1 fails.
\(\mathbf{B} = (B_0, B_1)\) is the ordered bypass action pair (\(B_1\) attempted first; \(B_0\) on escalation). \(R\) is the restoration predicate — the conditions under which the MAPE-K loop may resume control after bypass activation.
Heartbeat priority guarantee: the watchdog pet operation must be assigned the highest execution priority in the MAPE-K scheduler — above all healing action execution. Under concurrent healing load (\(N_\text{concurrent}\) simultaneous loops), the execution queue delay \(\tau/(1-u)\) applies to all other operations but not to the watchdog heartbeat. If the implementation cannot guarantee watchdog priority, the watchdog timeout must be set to \(T_0 / (1 - u_\text{max})\), where \(u_\text{max}\) is the maximum expected queue utilization.
Layer separation principle: \(B_0\) must be implementable with no component at hardware-interrupt level or above — no OS calls, no shared memory locks, no MAPE-K module dependencies. Each layer must be strictly simpler than the one it monitors.
Physical translation: Three concentric alarms. The innermost watches MAPE-K every \(T_1\) seconds — if MAPE-K stops heartbeating, the software watchdog restarts it. If the software watchdog itself stops, the hardware watchdog fires after \(T_0\) seconds and resets the processor. The hardware layer has zero dependencies on the software it is watching: it monitors an electrical signal, not a function call. A processor frozen in a bad memory state will still trigger the hardware watchdog.
graph TD
MAPEK["MAPE-K Software Loop
(heartbeat each cycle)"]
L1["Layer 1: Software Watchdog
monitors MAPE-K liveness
every T1 seconds"]
L0["Layer 0: Hardware WDT
fires if no heartbeat
within T0 seconds"]
B1["Bypass B1
Restart MAPE-K thread
preserve state snapshot"]
B0["Bypass B0
Execute safe-state action
no OS involvement"]
RESTORE{"Restoration check R
resume MAPE-K?"}
MAPEK -->|"heartbeat"| L0
MAPEK -->|"heartbeat"| L1
L1 -->|"k misses"| B1
L0 -->|"T0 expired"| B0
B1 -->|"restart ok"| MAPEK
B1 -->|"restart fails"| B0
B0 --> RESTORE
RESTORE -->|"R satisfied"| MAPEK
RESTORE -->|"not satisfied"| B0
style B0 fill:#ffcdd2,stroke:#c62828
style B1 fill:#fff3e0,stroke:#f57c00
style MAPEK fill:#c8e6c9,stroke:#388e3c
style RESTORE fill:#e3f2fd,stroke:#1976d2
Read the diagram: The green MAPE-K box sends heartbeats to both Layer 1 (software watchdog thread, orange) and Layer 0 (hardware WDT, red). Layer 1 acts first — after \(k\) consecutive missed heartbeats it triggers Bypass B1 (restart the MAPE-K thread). If B1 fails, or if Layer 1 itself stops, Layer 0 fires Bypass B0: a certified safe-state program that runs with no OS calls, no shared memory, no MAPE-K module involvement. The blue Restoration diamond checks three conditions before allowing MAPE-K back in control; if any condition fails, the system stays in B0.
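The Layer-1 software watchdog can be sketched as a counter-based miss check. This models only the middle layer — Layer 0 is hardware by design and cannot be expressed in software. The clock is injected so the logic is testable; names are illustrative.

```python
# Sketch of the Layer-1 software watchdog from Definition 41: after k
# consecutive missed heartbeats, trigger bypass B1 (restart MAPE-K).

class SoftwareWatchdog:
    def __init__(self, t1: float, k: int):
        self.t1 = t1            # liveness check period, seconds
        self.k = k              # consecutive misses before bypass B1
        self.misses = 0
        self.last_beat = 0.0

    def pet(self, now: float) -> None:
        """Called by MAPE-K at the end of each successful Execute phase."""
        self.last_beat = now
        self.misses = 0

    def check(self, now: float) -> bool:
        """Run every t1 seconds; True means trigger bypass B1."""
        if now - self.last_beat > self.t1:
            self.misses += 1    # heartbeat stale: count a miss
        else:
            self.misses = 0     # fresh heartbeat: reset
        return self.misses >= self.k

wd = SoftwareWatchdog(t1=1.0, k=3)
wd.pet(0.0)
assert wd.check(0.5) is False   # fresh heartbeat
assert wd.check(1.5) is False   # miss 1
assert wd.check(2.5) is False   # miss 2
assert wd.check(3.5) is True    # miss 3 -> bypass B1 fires
```

Note the layer-separation principle: this class may use ordinary OS facilities because it is Layer 1; the Layer-0 bypass \(B_0\) it escalates to must not.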
Proposition 26 (Watchdog Coverage Condition). Let \(\lambda\) be the MAPE-K loop failure rate (events per unit time). With Layer-0 hardware WDT period \(T_0\), the expected unprotected exposure time per failure event is bounded:

\[ \mathbb{E}[\,T_{\text{unprotected}}\,] \leq T_0 \]
A 100 ms RAVEN hardware watchdog catches MAPE-K hangs \(3000\times\) faster than waiting for a human operator to notice.
Without a watchdog, expected unprotected time is the human detection time \(T_{\text{human}}\). The watchdog improvement factor is:

\[ G_{\text{MTTU}} = \frac{T_{\text{human}}}{T_0} \]
Proof: A hang at time \(t\) is detected by the next WDT expiry at time \(t + T_0\) at the latest. The unprotected window \([t,\, t + T_0]\) is bounded by \(T_0\). \(\square\)
Physical translation: the watchdog replaces human detection time with a hardware timer period. For RAVEN, a human operator might take 5 minutes (illustrative value) to notice a hung loop; the hardware WDT fires in 100 ms (illustrative value). The gain is \(300\,\text{s} / 0.1\,\text{s} = 3000\times\) (illustrative value). Smaller \(T_0\) means faster detection but tighter timing requirements on the MAPE-K loop’s heartbeat — the loop must reliably complete and write its heartbeat within every \(T_0\) window even under peak load.
Empirical status: The 5-minute human detection time is a planning assumption for unattended RAVEN operations; attended deployments with active monitoring will have shorter human detection times, reducing the MTTU gain but not changing the architectural argument for hardware watchdog protection.
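The MTTU arithmetic is simple enough to state as code; values are the illustrative calibrations used above.

```python
# Worked MTTU arithmetic for Proposition 26: the improvement factor is
# the detection time being replaced divided by the WDT period.

def mttu_gain(t_detect_without: float, t0: float) -> float:
    """Ratio of unprotected exposure without vs. with the watchdog."""
    return t_detect_without / t0

# RAVEN: 5 min human detection vs. 100 ms hardware WDT -> ~3000x
assert abs(mttu_gain(300.0, 0.1) - 3000.0) < 1e-6
# HYPERSCALE: same human baseline vs. 30 s liveness probe -> 10x
assert abs(mttu_gain(300.0, 30.0) - 10.0) < 1e-9
```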
Restoration condition : The bypass state is not permanent. The MAPE-K loop may resume when: (1) all capabilities are independently verified stable, (2) the condition causing the hang is no longer present, and (3) a dry-run MAPE-K cycle completes successfully with no actions executed. The dry-run prevents re-entry into a loop that will immediately hang again.
Critical design constraint: All healing actions must be idempotent and resumable. If the WDT fires mid-action, re-executing the action from scratch must produce the same outcome as completing the interrupted execution. Non-idempotent actions (e.g., “append to counter”) require transaction semantics before they can be managed by a watchdog-protected loop.
Watchdog-refractory interaction: if the Software Watchdog fires while the Healing Dead-Band is in a refractory state, the process restart triggered by the watchdog resets the refractory counter to zero. To prevent the resulting loss of backoff context from causing oscillatory restarts, the node must persist its current refractory cycle count \(n\) to non-volatile storage on every refractory increment. After a watchdog-triggered restart, the persisted \(n\) is restored and backoff resumes from \(\tau_\text{ref}(n)\) rather than \(\tau_\text{ref}(0)\).
RAVEN calibration: \(T_0 = 0.1\,\text{s}\) (illustrative value) (maximum tolerable period before attitude control degrades), \(T_1 = 1\,\text{s}\) (illustrative value), \(k = 3\) (illustrative value). Bypass \(B_0\): maintain current heading, throttle, and altitude in attitude-hold mode. MTTU gain: human detection \(\approx 300\,\text{s}\) (illustrative value) vs. \(T_0 = 0.1\,\text{s}\) yields \(3000\times\) (illustrative value) improvement.
HYPERSCALE calibration: \(T_0 = 30\,\text{s}\) (Kubernetes liveness probe as Layer-0), \(T_1 = 5\,\text{s}\), \(k = 2\). Bypass \(B_0\): stop accepting new requests, drain in-flight transactions, hold persistence layer state steady. The bypass action must never forcibly terminate the database layer regardless of MAPE-K state — data integrity takes precedence over healing speed.
Compute Profile: CPU: \(O(1)\) per tick — one counter decrement and one threshold comparison (software layer); hardware WDT register write carries no CPU overhead. Memory: \(O(1)\) — single heartbeat counter and timeout threshold. The binding scheduling constraint is priority assignment: the heartbeat write must run at highest scheduler priority to prevent priority inversion.
Watch out for: the MTTU bound holds only when the MAPE-K loop’s heartbeat write runs at the highest scheduler priority; if the heartbeat can be preempted by a resource-intensive healing action, the effective exposure window extends beyond \(T_0\) — precisely during high-load conditions such as simultaneous healing loops and cascade failures, when fast watchdog response is most critical.
Commercial Application: HYPERSCALE Data Center Self-Healing
HYPERSCALE operates edge data centers serving low-latency requirements. When central orchestration becomes unreachable—partition, DDoS, or maintenance—each site must heal autonomously. Sites contain compute nodes, storage, network infrastructure, and microservices with complex dependency graphs.
The MAPE-K implementation for HYPERSCALE edge sites:
The diagram expands the abstract MAPE-K loop into concrete HYPERSCALE components, showing three parallel monitor sources feeding a three-stage analysis pipeline before reaching execution — note how the Knowledge base feeds into Analyze and Plan but not Execute directly.
graph TD
subgraph "Monitor Layer"
M1["Metrics Collector
Node health, latency
Every 5s"]
M2["Log Aggregator
Error patterns
Streaming"]
M3["Synthetic Probes
End-to-end health
Every 15s"]
end
subgraph "Analyze Layer"
A1["Anomaly Detector
Statistical analysis"]
A2["Dependency Mapper
Runtime discovery"]
A3["Impact Assessor
Blast radius calc"]
end
subgraph "Plan Layer"
P1["Action Generator
Candidate healing ops"]
P2["Risk Evaluator
Side effect analysis"]
P3["Coordinator
Multi-action sequencing"]
end
subgraph "Execute Layer"
E1["Orchestrator
Container/VM control"]
E2["Network Controller
Route, firewall"]
E3["Load Balancer
Traffic steering"]
end
subgraph "Knowledge Base"
K1["Service Catalog
Dependencies, SLOs"]
K2["Healing History
What worked before"]
K3["Current State
Cluster snapshot"]
end
M1 --> A1
M2 --> A1
M3 --> A1
A1 --> A2
A2 --> A3
A3 --> P1
P1 --> P2
P2 --> P3
P3 --> E1
P3 --> E2
P3 --> E3
E1 -->|"feedback"| M1
K1 -.-> A2
K2 -.-> P1
K3 -.-> A1
style K1 fill:#fff9c4,stroke:#f9a825
style K2 fill:#fff9c4,stroke:#f9a825
style K3 fill:#fff9c4,stroke:#f9a825
Read the diagram: Three parallel monitor sources (metrics every 5s, streaming logs, synthetic probes every 15s) all feed the Anomaly Detector. Analysis flows sequentially left-to-right: anomaly detection \(\to\) dependency mapping \(\to\) impact assessment. Planning is likewise sequential: generate candidates \(\to\) evaluate risk \(\to\) sequence multi-action plans. The three Execute controllers (container, network, load balancer) each receive their actions from the Coordinator. The Knowledge Base (yellow) feeds Analyze and Plan with dotted arrows but does not feed Execute directly — keeping the execution path simple, policy-driven, and auditable.
Healing latency budget for HYPERSCALE :
| Phase | Budget | Limiting Factor |
|---|---|---|
| Detection | 15-30s | Metrics collection interval + anomaly threshold |
| Analysis | 5-10s | Dependency graph traversal, impact calculation |
| Planning | 2-5s | Action enumeration, risk scoring |
| Coordination | 10-30s | Multi-service sequencing, pre-flight checks |
| Execution | 30-180s | Container restart, health check convergence |
| Total | 62-255s | SLO: 95% of incidents resolved in <5 minutes |
Dependency-aware restart sequence: When the payment microservice fails, HYPERSCALE ’s analyzer discovers the dependency chain. The diagram below shows the runtime dependencies read left-to-right: arrows point from caller to dependency, the red node is the failed service, and the orange nodes are downstream services affected by the failure.
graph LR
LB["Load Balancer"] --> API["API Gateway"]
API --> AUTH["Auth Service"]
API --> PAY["Payment Service
(FAILED)"]
PAY --> DB["Payment DB"]
PAY --> QUEUE["Message Queue"]
PAY --> FRAUD["Fraud Check"]
FRAUD --> ML["ML Scoring"]
style PAY fill:#ffcdd2,stroke:#c62828
style FRAUD fill:#fff3e0,stroke:#f57c00
style ML fill:#fff3e0,stroke:#f57c00
Read the diagram: Arrows point from caller to dependency (left to right). The red Payment Service is the failed node; orange nodes (Fraud Check, ML Scoring) are downstream services whose behavior degrades with the failure. The Load Balancer and API Gateway (upstream) are unaffected. The healing sequence must verify all dependencies of the failed service — Payment DB, Message Queue — before restarting the service itself. A restart that fails immediately due to an unhealthy dependency wastes the healing budget and risks cascading restarts.
The healing sequence respects dependencies: first verifying that Payment DB is healthy (no restart needed if healthy), then confirming Message Queue is accepting connections, restarting Payment Service with fresh state, waiting for the health check (HTTP 200 on /healthz), re-enabling traffic via Load Balancer, and finally verifying end-to-end transaction success via synthetic probe.
Cascade prevention in practice: During a storage node failure, HYPERSCALE caps the number of simultaneously restarting nodes at \(\max(1, \lfloor N_\text{healthy}/3 \rfloor)\) — one-third of the currently healthy nodes — ensuring at least two-thirds of the cluster remains serving traffic at any moment while healing proceeds.
Physical translation: At most one-third of currently healthy nodes restart simultaneously, guaranteeing at least two-thirds always serve traffic. The \(\max(1, \cdot)\) floor ensures at least one node can always restart even in a tiny cluster. This is a capacity reservation: the cluster reserves two-thirds of its healthy nodes as a service buffer while the remaining third cycles through healing.
With 28 storage nodes and 1 failed, maximum concurrent restarts = 9. This ensures at least 18 nodes remain serving traffic during any healing operation.
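The cap described above reduces to one line of integer arithmetic:

```python
# Sketch of the cascade-prevention cap: at most one-third of currently
# healthy nodes restart at once, with a max(1, .) floor so a tiny
# cluster can still heal one node at a time.

def max_concurrent_restarts(healthy_nodes: int) -> int:
    return max(1, healthy_nodes // 3)

assert max_concurrent_restarts(27) == 9   # 28 nodes, 1 failed (example above)
assert max_concurrent_restarts(2) == 1    # floor keeps healing possible
assert max_concurrent_restarts(5) == 1    # conservative rounding down
```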
Game-Theoretic Extension: Healing Resource Congestion
When multiple MAPE-K loops coexist — one per monitored subsystem — each loop solves its healing action optimization independently. Their resource claims compete for shared capacity (CPU, bandwidth, power), forming a congestion game.
Congestion game: Each MAPE-K loop \(i\) selects a healing action requiring resource vector \(\mathbf{r}(a_i)\). The cost of action \(a_i\) increases with the number of loops simultaneously using the same resources (congestion level \(n_r\) on resource \(r\)).
By Rosenthal’s theorem (1973), congestion games always admit a pure Nash equilibrium, which minimizes the potential function \(\Phi(\mathbf{a})\): the sum over all resources \(r\) of the cumulative marginal costs incurred as each successive loop claims that resource, where \(n_r(\mathbf{a})\) is the number of loops simultaneously using resource \(r\) under action profile \(\mathbf{a}\).
\[ \Phi(\mathbf{a}) = \sum_{r} \sum_{k=1}^{n_r(\mathbf{a})} c_r(k) \]

where \(c_r(k)\) is the marginal cost of resource \(r\) at congestion level \(k\).
Coordination protocol: Each MAPE-K loop selects healing actions to minimize \(\Phi\) (best-response descent) respecting the aggregate resource constraint — total claimed resources must stay within available capacity. The healing coordination game admits a pure Nash equilibrium (Rosenthal 1973). Best-response dynamics converge in potential games, but MAPE-K healing uses gradient-based updates rather than pure best-response; convergence to Nash should be verified empirically for each deployment. In practice, this means a shared resource declaration table: loops register resource requirements and receive grants only when the current allocation remains feasible.
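Rosenthal's potential is directly computable from an action profile. A minimal sketch with illustrative loop names and a linear marginal-cost model:

```python
from collections import Counter

# Sketch of Rosenthal's potential for the healing congestion game:
# Phi(a) = sum over resources r of the marginal costs c_r(1..n_r(a))
# as successive loops claim r. profile maps each loop to its resources.

def potential(profile: dict, cost: dict) -> float:
    """Phi(a) = sum_r sum_{k=1}^{n_r(a)} c_r(k)."""
    n = Counter(r for resources in profile.values() for r in resources)
    return sum(sum(cost[r](k) for k in range(1, n_r + 1))
               for r, n_r in n.items())

linear = lambda k: float(k)            # marginal cost grows with congestion
cost = {"cpu": linear, "net": linear}
profile = {"loopA": ["cpu"], "loopB": ["cpu", "net"]}
# cpu congestion 2 -> costs 1 + 2 = 3; net congestion 1 -> 1; Phi = 4
assert potential(profile, cost) == 4.0
```

A coordination layer would evaluate \(\Phi\) for each candidate action before granting resources: any unilateral deviation that lowers a loop's own cost also lowers \(\Phi\), which is what guarantees best-response descent terminates.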
Practical implication: Replace the heuristic “max concurrent restarts = \(\max(1, \lfloor N_\text{healthy}/3 \rfloor)\)” with a congestion game coordination layer. When multiple failures occur simultaneously (e.g., RAVEN jamming causes multi-component failures), loops negotiate resource grants through potential-function minimization rather than competing independently. This generalizes to heterogeneous resource requirements without per-scenario tuning.
Stability boundary. The single-loop stability proof ( Proposition 22 , Theorem PWL) does not extend directly to the multi-loop case. The congestion game establishes Nash equilibrium existence under Rosenthal’s theorem, but it does not bound the number of coordination rounds or prevent inter-loop oscillation: loop A fixes subsystem X, loop B’s action incidentally reverts X, loop A fires again. Two conditions are sufficient to prevent infinite livelock: (1) every healing action consumes a positive, non-recoverable amount of a finite resource (time, refractory credits, or energy budget) so no loop can fire indefinitely without exhausting its allocation; and (2) all loops share the same priority matrix (monotone descent on the common potential function \(\Phi\)). Under these conditions, the multi-loop system inherits finite convergence from the potential-game structure. Condition (1) is enforced by the refractory period ( Definition 46 and Proposition 30 ); condition (2) is enforced by requiring all loops to reference the same priority matrix instance ( Definition 43 ) — a single shared table, not per-loop copies.
The qualitative conditions above are necessary but not sufficient against the inter-node oscillation failure mode specific to fleet healing: Node A sheds load to Node B; Node B independently detects an anomaly and sheds it back; the fleet enters a chaotic exchange rather than a steady state. This differs from the single-node chatter addressed by Definitions 46–49 — it involves no single loop firing twice, so the refractory period cannot prevent it. Three mechanisms can enforce convergence at the fleet level:
| Mechanism | Guarantee | Failure mode |
|---|---|---|
| Probabilistic Backoff — jitter delay before Execute (CSMA/CD analog) | Breaks synchrony; at most one node fires per gossip period | Does not prevent oscillation — nodes may defer forever or fire in rotation |
| Resource Tokens — virtual token budget (\(T_i\) transfers/gossip period) | Bounds max oscillation frequency | Token exhaustion blocks healing even when critical; deadlock possible |
| Global Energy Function — HAC gate: action admitted iff \(V\) strictly decreases | Formal Lyapunov certificate; no oscillation by construction | None — the only approach with a convergence proof |
Recommended approach: Global Energy Function ( Definition 42 ) as the primary gate, with Probabilistic Backoff as a synchrony-breaking supplement. The following definition and proposition make this precise.
Definition 42 (Fleet Stress Function and Healing Admission Condition). Let the fleet be the node set \(\{1, \dots, N\}\). Each node \(i\) maintains resource state \(x_i = (\ell_i, d_i, q_i)\), where \(\ell_i\) is normalized load, \(d_i = 1 - b_i\) is battery deficit (\(b_i\) = state of charge), and \(q_i\) is queue depth fraction. The Fleet Stress Function is:
\[ V(S) = \sum_{i=1}^{N} \big[ \varphi(\ell_i) + \varphi(d_i) + \varphi(q_i) \big] \]
where \(\varepsilon > 0\) softens the barrier near \(x = 1\); \(\varphi\) is strictly convex with \(\varphi(0) = 0\) and \(\varphi(x) \to \infty\) as \(x \to 1\).
Authority gate (prerequisite): Before evaluating HAC, verify \(Q_{\text{effective}} \geq Q_{\text{required}}(a)\), where authority tier \(Q_{\text{effective}}\) reflects the node’s current escalation level in the four-tier hierarchy (Tier 0 = heartbeat-only, Tier 1 = degraded-local, Tier 2 = full-autonomic, Tier 3 = cloud-delegated); \(Q_{\text{required}}(a)\) is the minimum tier needed to authorize healing action \(a\). (Formally defined as Definition 68 in Fleet Coherence Under Partition.) If the executing node lacks the required authority tier, reject action \(a\) immediately — HAC is not evaluated. This gate fires first in the execution pipeline: Authority, then Hardware Veto ( Proposition 32 ), then HAC, then Actuate.
A healing action \(a_{i \to j}\) (transferring resource \(r\) from node \(i\) to node \(j\) by amount \(\Delta r\)) satisfies the Healing Admission Condition (HAC) if and only if:
\[ V(S') \leq V(S) - \eta_{\min} \]
where \(S'\) is the post-transfer state and \(\eta_{\min} > 0\) is the minimum required improvement. An action failing HAC is rejected by the Execute phase before any command is transmitted. Each node additionally adds a random jitter delay before evaluating HAC (Probabilistic Backoff).
The condition requires \(\varepsilon = 0.01\) (illustrative value), \(\eta_{\min} = 0.005\) (illustrative value; 0.1% of initial fleet stress reduction per action), and a cap on the maximum single-transfer fraction (illustrative value). \(V(S)\) is computed from the gossip health vector; peer data is bounded-stale by \(\tau^{\text{stale}}_{\max}\) ( Proposition 14 ); the HAC check is \(O(N)\) in gossip vector size, constant time for a fixed fleet. A non-decreasing \(V\) trace is the diagnostic signature of either a HAC implementation bug or a fault not addressable by load redistribution — both warrant escalation to severity S3 ( Definition 44 ).
Physical translation: \(V(S)\) is the mathematical analog of a stress elevation above sea level. Every healing action is a downhill step — the HAC check confirms the step goes down before it is taken. Node A shedding to Node B lowers the hill; shedding back would go uphill. HAC rejects it. The fleet can only descend.
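The downhill-only gate can be sketched directly from Definition 42. The barrier form below (`x**2 / (1 - x + eps)`) is an illustrative choice satisfying the stated properties — the source does not fix a specific \(\varphi\) — and the constants reuse the illustrative values \(\varepsilon = 0.01\), \(\eta_{\min} = 0.005\).

```python
EPS = 0.01        # barrier softening near x = 1 (illustrative value)
ETA_MIN = 0.005   # minimum required stress reduction per action (illustrative)

def phi(x, eps=EPS):
    """Strictly convex barrier: phi(0) = 0, phi grows without bound near x = 1.
    This specific functional form is an assumption, not from the source."""
    return x * x / (1.0 - x + eps)

def fleet_stress(states):
    """V(S): sum of barriers over each node's (load, battery deficit, queue)."""
    return sum(phi(l) + phi(d) + phi(q) for (l, d, q) in states)

def hac_admits(states_before, states_after, eta_min=ETA_MIN):
    """Healing Admission Condition: admit only if V strictly decreases
    by at least eta_min -- every admitted action is a downhill step."""
    return fleet_stress(states_after) <= fleet_stress(states_before) - eta_min
```

Shedding load from an overloaded node to an idle one lowers \(V\) and is admitted; the reverse transfer would climb back uphill and is rejected, which is exactly the anti-oscillation property of Proposition 27.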
Proposition 27 (Fleet Healing Convergence — Lyapunov Certificate). Let the fleet execute HAC-gated healing under Definition 42. Let \(S^*\) be any state from which no further HAC-admissible action exists, with bounded \(\ell_i, d_i, q_i\) for all \(i\). Then:
When every RAVEN healing action must reduce fleet stress, the swarm cannot enter the oscillation cycles that required manual intervention in uncontrolled simulations.
(i) Positive definiteness: \(V(S) \geq 0\) for all \(S\); \(V(S) = 0\) iff \(\ell_i = d_i = q_i = 0\) for all \(i\).
(ii) Monotone decrease: Every admitted healing action satisfies \(V(S(t+1)) \leq V(S(t)) - \eta_{\min}\).
(iii) No inter-node oscillation: If transfer \(a_{i \to j}\) is admitted at step \(t\), then the reverse transfer \(a_{j \to i}\) is rejected at step \(t+1\). More generally, no healing action admitted at step \(t+1\) can return \(V\) to \(V(S(t))\).
(iv) Finite convergence: Starting from \(V(S(0)) < \infty\), the fleet reaches \(S^*\) in at most \(\lceil V(S(0)) / \eta_{\min} \rceil\) healing steps.
Proof sketch: (i) follows from \(\varphi(x) \geq 0\) and \(\varphi(0) = 0\). (ii) is immediate from the HAC gate definition.
(iii): Suppose transfer \(a_{i \to j}\) was admitted at step \(t\), so \(V(S(t+1)) \leq V(S(t)) - \eta_{\min}\). The return action \(a_{j \to i}\) at step \(t+1\) restores \(\ell_i, \ell_j\) to their step-\(t\) values, so \(V(S(t+2)) = V(S(t)) > V(S(t+1))\), violating HAC — \(a_{j \to i}\) is rejected. The argument extends inductively to any cycle \(i \to j \to \cdots \to i\): each hop decreases \(V\) by at least \(\eta_{\min}\), so the final return hop would need to increase \(V\) by the accumulated decrease — HAC rejects it.
(iv): \(V \geq 0\) and decreases by at least \(\eta_{\min}\) per step, so at most \(\lceil V(S(0)) / \eta_{\min} \rceil\) steps can occur. \(\square\)
Assumption: battery deficit \(d_i, d_j\) and queue depth \(q_i, q_j\) remain constant during the healing step interval \([t, t+1]\). Under time-varying traffic, the monotone-decrease property holds up to load-induced perturbations — the Lyapunov certificate bounds finite convergence to within one traffic-fluctuation window, not to exact \(V = 0\).
Physical translation: The convergence bound is concrete. For RAVEN (\(N = 47\), typical \(V(S(0)) \approx 5.0\) (illustrative value) under 12-drone battery anomaly, (illustrative value)): at most 1000 (illustrative value) healing steps. At one step per gossip period (5 s), full fleet recovery is guaranteed within 83 minutes (illustrative value) from any initial fault state — or immediately if faults are addressable in fewer steps. Without HAC, the same scenario produced a 6-node oscillation cycle lasting 22 minutes (illustrative value) before manual intervention (100-run simulation).
Note: the convergence bound assumes stable load during each healing step. Systems with bursty traffic converge within one load-fluctuation window rather than to exact zero stress — the certificate guarantees forward progress, not instantaneous optimality.
Empirical status: The convergence bound of 1000 steps (83 minutes) and the 43-step average from 100 simulation runs are specific to the RAVEN parameter set (\(N=47\), \(\eta_{\min}=0.005\), \(V(S(0)) \approx 5.0\)); different fleet sizes, failure rates, or resource distributions will produce different convergence times.
RAVEN calibration: During simulated 12-drone simultaneous battery anomaly (\(N = 47\)): Without HAC, healing entered a 6-node A-B-C exchange pattern lasting 22 minutes (illustrative value) before manual intervention. With HAC plus probabilistic backoff ( jitter window), oscillation was eliminated in all 100 simulation runs and fleet stress \(V(S)\) decreased monotonically to \(< 0.01 \cdot V(S(0))\) within 43 steps (illustrative value) (3.6 minutes (illustrative value) average).
Relationship to existing results: The HAC gate addresses a failure mode orthogonal to those in Definitions 46–49. The refractory period ( Definition 46 , Proposition 30 ) prevents a single node’s loop from firing too frequently; the Schmitt trigger ( Definition 47 ) prevents threshold chatter on a single sensor; the derivative dampener ( Definition 49 ) suppresses transient spikes on a single signal. HAC is the first mechanism that constrains inter-node healing transfers at the fleet level. The conditions are complementary: a system should enforce all of them in the Execute phase.
Authority prerequisite: HAC applies only to actions for which the executing node holds the required authority tier ( Definition 68 ). (Authority tiers: L0 = node-scope actions only; L1 = cluster-scope; L2 = fleet-scope; L3 = command-scope. Formally defined in Definition 68 , Fleet Coherence Under Partition.) A node operating at \(Q_{\text{effective}} < Q_{\text{required}}(a)\) rejects the action at the authority gate before reaching HAC — HAC is not evaluated. This ordering ensures that a partitioned node with temporarily elevated effective tier cannot bypass the Lyapunov energy gate.
Watch out for: the finite convergence bound of \(\lceil V(S(0)) / \eta_{\min} \rceil\) steps assumes the load state \((\ell_i, d_i, q_i)\) remains constant during each healing step interval; under bursty traffic or simultaneous external faults, \(V(S)\) may increase between admitted steps even when HAC was satisfied at firing time — the certificate guarantees forward progress on average, not monotone decrease under all traffic patterns, and the convergence step count can exceed the bound when external perturbations consistently reverse a fraction of each healing step’s improvement.
Resource Priority Matrix: Deterministic Conflict Resolution
The congestion game converges to Nash equilibrium via iterative best-response dynamics — but convergence takes multiple coordination rounds. This is too slow when two actions claim the same CPU simultaneously and combined demand exceeds supply. A deterministic preemption layer sits above the congestion game: when resource claims conflict, the priority matrix resolves the contest in \(O(1)\) time without coordination overhead.
Definition 43 (Resource Priority Matrix). Given resource set \(\mathcal{R} = \{r_1, \dots, r_m\}\) and healing action set \(\mathcal{A} = \{a_1, \dots, a_n\}\), the Resource Priority Matrix assigns priority weight \(P(a_i, r_j)\) to action \(a_i\)’s claim on resource \(r_j\). When actions \(a_i\) and \(a_k\) both claim resource \(r_j\) with demands \(d_i, d_k\) such that \(d_i + d_k > Q_j\) (available capacity), the claim with the higher priority weight is granted and the lower-priority claim is preempted.
Priority weights derive from the lexicographic objective hierarchy — Survival \(\succ\) Autonomy \(\succ\) Coherence \(\succ\) Anti-fragility [11] — with the healing action’s protected capability tier determining its row weight:
Thermal vs. throughput conflict (the motivating case): thermal emergency cooling protects hardware survival (tier \(\mathcal{L}_0\), \(P = 1.0\)); throughput optimization serves mission coherence (tier \(\mathcal{L}_3\), \(P = 0.5\)). When both demand the same CPU cores, cooling preempts throughput instantly and deterministically — no negotiation round required.
RAVEN CPU priority matrix (representative subset):
| Healing action | Protected tier | CPU priority | Preempts |
|---|---|---|---|
| Battery emergency land | \(\mathcal{L}_0\) | 1.0 | All lower tiers |
| Thermal throttle | \(\mathcal{L}_0\) | 1.0 | All lower tiers |
| Formation rebalance | \(\mathcal{L}_2\) | 0.8 | Coherence, anti-fragility |
| Gossip rate increase | \(\mathcal{L}_3\) | 0.5 | Anti-fragility only |
| Model weight update | \(\mathcal{L}_4\) | 0.3 | None (yields to all) |
Proposition 28 (Priority Preemption Deadline Bound). Under strict priority preemption with the Resource Priority Matrix, action \(a_i\) misses its healing deadline only if the total CPU time consumed by strictly higher-priority actions during \(a_i\)’s execution window exceeds its available slack:
An L0-tier thermal emergency always meets its deadline because nothing can preempt it; a throughput optimization may be starved if thermal events last longer than its slack.
For \(\mathcal{L}_0\)-tier actions (\(P = 1.0\)): nothing preempts them, so the deadline is met under any resource contention. For throughput optimization: miss probability is bounded by the probability that thermal events last longer than the throughput slack.
Proof: Under strict preemption, a tier- action holds the resource continuously once granted and is interrupted only by a strictly higher-priority preemptor. Worst-case blocking equals the sum of all higher-priority execution times within the same window. Deadline miss requires this sum to exceed available slack. \(\square\)
Watch out for: the \(\mathcal{L}_0\)-tier deadline guarantee holds only when at most one action occupies the priority level at any given time; when multiple components simultaneously declare \(\mathcal{L}_0\)-tier emergencies (simultaneous thermal and structural faults), they compete as equal-priority actions and the absolute deadline guarantee degrades to the probabilistic bound — the zero-miss guarantee is absolute only for a single top-tier action, not for a concurrent set of them.
Anti-starvation aging: Low-tier actions (\(P < 1.0\)) could be indefinitely starved if higher-priority actions arrive continuously. Priority is elevated linearly with queue age to bound maximum wait time:
\[ P_{\text{eff}}(a, t) = P(a) + \min\!\left( \Delta P_{\max},\; \Delta P_{\max} \cdot \frac{\text{age}(a, t)}{W_{\max}} \right) \]
where \(\Delta P_{\max}\) caps maximum elevation from aging, and \(W_{\max}\) is the maximum acceptable wait time for any tier.
Connection to congestion game: The Resource Priority Matrix is the initializer for best-response dynamics. Rather than starting from equal resource weights and iterating toward Nash, the matrix provides an initial allocation already aligned with the lexicographic objective. The congestion game then fine-tunes within-tier resource sharing.
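The \(O(1)\) preemption rule and the anti-starvation aging above can be sketched together. This is a minimal illustration under stated assumptions: the parameter values (`max_wait_s`, `max_boost`) and all function names are illustrative, not from the source.

```python
def effective_priority(base, age_s, max_wait_s=60.0, max_boost=0.4):
    """Anti-starvation aging: priority rises linearly with queue age,
    capped by max_boost. Both parameters are illustrative values."""
    return base + min(max_boost, max_boost * age_s / max_wait_s)

def resolve_conflict(claim_a, claim_b, capacity):
    """O(1) resolution when combined demand exceeds capacity: the claim with
    the higher effective priority wins; the other is preempted.
    Each claim is a tuple (name, base_priority, demand, queue_age_s)."""
    (na, pa, da, ta), (nb, pb, db, tb) = claim_a, claim_b
    if da + db <= capacity:
        return (na, nb)            # both claims fit: no preemption needed
    ea = effective_priority(pa, ta)
    eb = effective_priority(pb, tb)
    return (na,) if ea >= eb else (nb,)
```

With the table's weights, a thermal throttle (\(P = 1.0\)) always preempts a model weight update (\(P = 0.3\)) on a contended CPU, with no coordination round.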
UCB-based healing action selection (formally developed in Anti-Fragile Decision-Making at the Edge; used here as a preview): HYPERSCALE tracks success rates for each healing action by failure category. The table below shows accumulated attempt and success counts alongside the UCB score that the exploration-exploitation formula assigns, which blends estimated success rate with an exploration bonus that grows when an action has been tried infrequently.
| Failure Type | Action | Attempts | Successes | UCB Score |
|---|---|---|---|---|
| Pod crash loop | Restart pod | 847 | 712 | 0.89 |
| Pod crash loop | Delete + recreate | 234 | 198 | 0.91 |
| Pod crash loop | Scale to 0, then up | 89 | 81 | 0.95 |
| Memory pressure | Evict low-priority | 412 | 389 | 0.96 |
| Memory pressure | Add node | 67 | 51 | 0.84 |
For crash loops, “scale to 0, then up” has highest UCB despite fewer attempts — the exploration bonus rewards trying this promising action more often.
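A standard UCB1 scoring rule is sketched below. The exploration constant `c` and the exact scaling the table's scores use are unspecified in the source, so this sketch reproduces the ranking behavior (infrequently tried promising actions score highest), not the table's exact numbers.

```python
import math

def ucb_score(successes, attempts, total_attempts, c=2.0):
    """UCB1: empirical success rate plus an exploration bonus that shrinks
    as an action accumulates attempts. The constant c is illustrative."""
    if attempts == 0:
        return float("inf")          # untried actions are explored first
    mean = successes / attempts
    bonus = math.sqrt(c * math.log(total_attempts) / attempts)
    return mean + bonus

def select_action(stats):
    """stats: {action_name: (attempts, successes)} for one failure type.
    Returns the action with the highest UCB score."""
    total = sum(a for a, _ in stats.values())
    return max(stats, key=lambda k: ucb_score(stats[k][1], stats[k][0], total))
```

On the crash-loop counts from the table, the rarely tried but high-success “scale to 0, then up” wins, matching the narrative above.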
Control plane partition handling: When an edge site loses connectivity to the central control plane:
At T+0s detection fires when the central API becomes unreachable for 3 consecutive health checks (illustrative value). At T+15s (illustrative value) the site enters “autonomous mode” with elevated local authority. At T+20s (illustrative value) the current configuration is snapshotted for later reconciliation. At T+25s (illustrative value) healing thresholds are tightened by 15% (illustrative value) to be more conservative without central backup. From that point onward, all healing actions are logged with causality metadata.
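The autonomous-mode timeline can be sketched as a simple schedule lookup; the offsets (15 s, 20 s, 25 s) follow the illustrative values above, and the function and step names are illustrative.

```python
def partition_actions(elapsed_s):
    """Returns the autonomous-mode steps that are due by `elapsed_s` seconds
    after partition detection. Offsets are illustrative values from the
    timeline above; step names are hypothetical labels."""
    steps = [
        (15.0, "enter_autonomous_mode"),      # elevated local authority
        (20.0, "snapshot_config"),            # for post-reconnect reconciliation
        (25.0, "tighten_thresholds_15pct"),   # conservative without central backup
    ]
    return [name for due_s, name in steps if elapsed_s >= due_s]
```

From 25 s onward all three steps are active, and every subsequent healing action would additionally be logged with causality metadata.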
Upon reconnection, the site uploads its healing log. Central platform reconciles any conflicts (e.g., site promoted a replica to primary that central also promoted elsewhere) using causal ordering with HLC timestamps ( Proposition 24 ) with site-local decisions taking semantic priority ( Proposition 49 ). Wall-clock LWW is unreliable during partition due to clock drift; the NTP-Free Semantic Commit Order of Proposition 49 provides the correct causal resolution.
Utility analysis:
The MTTR improvement is \(\Delta\text{MTTR} = T_{\text{manual}} - T_{\text{auto}}\): the manual resolution time minus the automated detection-and-healing time, where \(T_{\text{manual}}\) includes paging delay, context acquisition, and decision time.
Escalation rate bound: For healing actions with success probability \(p_s\) and \(k\) retry attempts:
\[ P_{\text{escalate}} = (1 - p_s)^k \]
With \(p_s \geq 0.9\) and \(k = 3\): \(P_{\text{escalate}} \leq 0.1^3 = 0.001\). Adding unknown failure modes (\(\approx 5\%\) of incidents): \(P_{\text{escalate}} \approx 0.051\).
Utility improvement: \(\operatorname{sign}(\Delta U) > 0\) when the MTTR saving from automated healing exceeds the expected cost of the residual escalations.
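The escalation-rate arithmetic can be checked directly; the function name is illustrative.

```python
def escalation_rate(p_success, k_retries, unknown_mode_rate=0.0):
    """Probability that all k retry attempts fail, plus the fraction of
    incidents outside the known failure taxonomy (which always escalate)."""
    return (1.0 - p_success) ** k_retries + unknown_mode_rate
```

With \(p_s = 0.9\) and \(k = 3\), the retry path alone escalates 0.1% of incidents; adding the \(\approx 5\%\) of unknown failure modes gives the \(\approx 5.1\%\) total stated above.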
Cognitive Map: The watchdog protocol enforces strict layer separation: MAPE-K software is monitored by a software watchdog thread, which is monitored by a hardware WDT — each layer strictly simpler than the one it monitors. The MTTU gain formula quantifies what is bought: replacing multi-minute human detection with a sub-second hardware interrupt. HYPERSCALE instantiates these principles at data center scale: three parallel monitor sources, dependency-aware restart sequencing, and a Resource Priority Matrix that resolves resource conflicts in \(O(1)\) without coordination rounds. The lexicographic hierarchy (Survival \(\succ\) Autonomy \(\succ\) Coherence \(\succ\) Anti-fragility) determines priority weights deterministically — thermal emergencies always preempt throughput optimizations regardless of which loop happens to act first. Next: healing actions must also contend with genuine uncertainty about root cause — the following section addresses acting effectively on symptoms alone.
Healing Under Uncertainty
A connected system with expert operators can diagnose failures systematically — gather logs, trace root cause, apply targeted fix. A disconnected edge system during partition has none of that: no historical context, no external expertise, and no time to wait for analysis before the failure worsens.
The alternative is to act on observable symptoms using a cost-calibrated confidence threshold. You don’t need to know why a service is failing to restart it productively. You need to know whether acting’s expected value exceeds waiting’s expected value. That judgment requires only the confidence level and the relative costs of false positives versus false negatives.
Symptom-based healing can temporarily suppress a worsening root cause. Escalation controls — attempt limits, re-trigger windows, treatment cooldowns — bound this risk without requiring root cause knowledge.
Acting Without Root Cause
Root cause analysis is the gold standard for remediation: understand why the problem occurred, address the underlying cause, prevent recurrence. In well-instrumented cloud environments with centralized logging and expert operators, it is achievable.
At the edge, the requirements for root cause analysis may not be met: logging capacity is limited with no access to historical comparisons; the failure demands immediate response while analysis takes time; and no human expert is available during partition.
Symptom-based remediation addresses this gap. Instead of “if we understand cause C, apply solution S,” we use “if we observe symptoms Y, try treatment T.”
The table below gives four representative symptom-treatment pairings together with the rationale explaining why the treatment addresses multiple possible root causes.
| Symptom | Treatment | Rationale |
|---|---|---|
| High latency | Restart service | Many causes manifest as latency; restart clears transient state |
| Memory growing | Trigger garbage collection | Memory leaks and bloat both respond to GC |
| Packet loss | Switch frequency | Interference or jamming both improved by frequency change |
| Sensor drift | Recalibrate | Hardware aging and environmental factors both helped by recal |
The risk of symptom-based remediation: treating symptoms while cause worsens. If the root cause is hardware failure, restarting the service provides temporary relief but doesn’t prevent eventual complete failure.
Three mitigations bound this risk: if treatment T fails after N attempts, escalate to more aggressive treatment; if symptoms return within a time window, assume treatment was insufficient and escalate; and do not re-apply the same treatment too quickly — allow time for observation before repeating.
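The three mitigations can be sketched as a small governor. This is a minimal illustration under stated assumptions: the class name, method names, and all parameter values (3 attempts, 300 s re-trigger window, 60 s cooldown) are illustrative, not from the source.

```python
import time

class TreatmentGovernor:
    """Bounds symptom-based healing with attempt limits, re-trigger windows,
    and treatment cooldowns. All thresholds are illustrative values."""

    def __init__(self, max_attempts=3, retrigger_window_s=300.0,
                 cooldown_s=60.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.retrigger_window_s = retrigger_window_s
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.history = {}          # treatment -> list of firing times

    def decide(self, treatment):
        """Returns 'apply', 'cooldown', or 'escalate' for a symptom re-trigger."""
        now = self.clock()
        fires = [t for t in self.history.get(treatment, [])
                 if now - t < self.retrigger_window_s]
        if len(fires) >= self.max_attempts:
            return "escalate"      # treatment insufficient: try something stronger
        if fires and now - fires[-1] < self.cooldown_s:
            return "cooldown"      # too soon: observe before repeating
        self.history.setdefault(treatment, []).append(now)
        return "apply"
```

The injectable `clock` makes the policy testable: repeated firings inside the re-trigger window eventually return `"escalate"` instead of silently re-treating the symptom.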
Confidence Thresholds for Healing Actions
From self-measurement, health estimates come with confidence intervals. The act/wait decision is formalized as a constrained optimization.
Definition 44 (Healing Action Severity). The severity \(\varsigma(a) \in [0, 1]\) of healing action \(a\) is determined by its reversibility \(R(a) \in [0,1]\) and impact scope \(I(a) \in [0,1]\): \(\varsigma(a) = I(a)\,\big(1 - R(a)\big)\). Actions with \(\varsigma(a) > 0.8\) are classified as high-severity.
In other words, severity is high when an action is both hard to undo and affects many components simultaneously; a cache flush scores near zero (fully reversible, narrow scope) while isolating a node from the fleet scores near one (irreversible, wide impact).
Act/Wait Decision Problem:
Given a confidence estimate \(c\) from the anomaly detector and a candidate healing action \(a\), the system must decide whether to act now or wait for more evidence. The objective selects the binary decision \(d^*\) that maximizes expected utility, where acting incurs a false-positive cost when the diagnosis is wrong and waiting incurs a false-negative cost when the failure is real.
where \(d = 1\) indicates “act” and \(d = 0\) indicates “wait”, with costs \(C_{\text{FP}}\), \(C_{\text{FN}}\), and \(V_{\text{heal}}\) as defined in Proposition 29 .
Optimal Decision Rule:
Act when \(\mathbb{E}[U(d=1)] > \mathbb{E}[U(d=0)]\), which yields:
\[ c > \theta^*(a) = \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}} + V_{\text{heal}}} \]
Physical translation: \(\theta^*(a)\) is the break-even confidence — the point where acting and waiting have equal expected cost. If false-positive cost is 10% of the total, act at 10% confidence. When the failure costs \(100\times\) more than the unnecessary restart (\(C_{\text{FN}} + V_{\text{heal}} = 100\, C_{\text{FP}}\)), the break-even drops near zero: act on almost any signal. For a drone reboot (high disruption if wrong, catastrophic if missed), the denominator is large and \(\theta^*\) is high — confirmation required. For a gossip-rate increase (trivial if wrong, valuable if right), \(\theta^*\) is low — act freely.
The formula computes the minimum confidence at which triggering a healing action has positive expected utility given its FP/FN cost ratio, with the threshold following directly from the ratio of disruption cost to total cost. The parameters are: \(C_{\text{FP}}\) — disruption cost of unnecessary healing; \(C_{\text{FN}}\) — operational damage while the fault persists (mission degradation, reduced capacity — not asset loss); \(V_{\text{heal}}\) — incremental operational gain from successful recovery above fault avoidance (mission re-enabled, capability restored beyond minimum viable — not the same asset value already counted in \(C_{\text{FN}}\)); these three terms must be economically disjoint. Thresholds differ by action type: a drone reboot demands high confidence (cf. restart node at 0.80 in the table below); a gossip-rate change requires only low confidence.
This is the full form stated in Proposition 29 . When \(V_{\text{heal}}\) is folded into the effective false-negative cost (i.e., \(C_{\text{FN}}^{\text{eff}} = C_{\text{FN}} + V_{\text{heal}}\)), this reduces to the simplified form of Corollary 84.1.
Three constraints bound the threshold regardless of what the cost-ratio formula produces: a minimum floor so the system is never trigger-happy at near-zero confidence, a maximum ceiling so critical failures are never silently ignored, and a hard floor specifically for high-severity actions.
The table below applies Proposition 29 ’s formula to six representative healing actions: as severity rises and reversibility falls, the Required Confidence column rises correspondingly, demanding stronger evidence before the system acts.
| Action | Severity | Reversibility | Required Confidence |
|---|---|---|---|
| Restart service | Low | Full | 0.60 |
| Reduce workload | Low | Full | 0.55 |
| Isolate component | Medium | Partial | 0.75 |
| Restart node | Medium | Delayed | 0.80 |
| Isolate node from fleet | High | Complex | 0.90 |
| Destroy/abandon | Extreme | None | 0.99 |
For Drone 23:
- Detection confidence: 0.94
- Action: Return to base (medium severity, reversible if wrong)
- Required confidence: 0.80
- Decision: 0.94 > 0.80, proceed with return
Proposition 29 (Optimal Confidence Threshold). The optimal confidence threshold \(\theta^*(a)\) for healing action \(a\) satisfies:
\[ \theta^*(a) = \frac{C_{\text{FP}}(a)}{C_{\text{FP}}(a) + C_{\text{FN}}(a) + V_{\text{heal}}(a)} \]
When a drone reboot costs ten times less than a missed failure, the system should act at 9% (illustrative value) confidence — not the intuitive 90% (illustrative value).
where is the cost of unnecessary healing, is the operational damage from the fault continuing (mission degradation, reduced capacity), and is the incremental operational value of successful recovery above the avoided fault loss — mission objective re-enabled, full capability restored beyond minimum viable system. These three components must be economically disjoint.
In other words, set the confidence bar at the fraction of total expected cost attributable to false positives: if unnecessary healing is nine times cheaper than the combined cost of missing a real failure plus the value of recovery, act as soon as confidence exceeds 10%.
Non-overlap requirement: \(C_{\text{FN}}\) and \(V_{\text{heal}}\) must measure distinct economic events. \(C_{\text{FN}}\) captures operational damage while the fault persists — sensor degraded, route suboptimal, mission efficiency reduced. \(V_{\text{heal}}\) captures the incremental gain from recovery that exceeds mere fault avoidance — mission objective re-enabled, full fleet capacity restored. Double-counting trap: if both are set to the same asset value (e.g., \(C_{\text{FN}} = V_{\text{heal}} = \$50\text{K}\), “drone worth $50K”), the denominator inflates to \(C_{\text{FP}} + \$100\text{K}\) and \(\theta^*\) is spuriously halved — the system becomes trigger-happy, executing hard reboots on low-confidence noise because the math says “nothing to lose.” When asset preservation is the only concern, set \(V_{\text{heal}} = 0\); the formula then collapses to the standard Bayesian threshold \(\theta^* = C_{\text{FP}} / (C_{\text{FP}} + C_{\text{FN}})\). RAVEN Drone 23: \(C_{\text{FN}}\) = 15% mission efficiency loss from degraded navigation (operational degradation while fault persists); \(V_{\text{heal}}\) = restored to full efficiency and able to cover the relay sector lost during the fault (incremental mission value, distinct from mere efficiency recovery). These are separate economic events — their sum correctly reflects the full incentive to act promptly.
Empirical status: The cost values \(C_{\text{FP}}\), \(C_{\text{FN}}\), and \(V_{\text{heal}}\) must be measured or estimated per action type and deployment context; thresholds derived from incorrectly specified costs will be miscalibrated, and the threshold table (restart: 0.60, isolate node: 0.90) reflects RAVEN -specific cost assumptions that may not transfer to other scenarios.
Watch out for: the threshold is optimal only when \(C_{\text{FP}}\), \(C_{\text{FN}}\), and \(V_{\text{heal}}\) are economically disjoint and correctly measured — setting both \(C_{\text{FN}}\) and \(V_{\text{heal}}\) to the same asset value double-counts the denominator, spuriously halving \(\theta^*\) and making the system trigger-happy on low-confidence evidence; the non-overlap requirement is a structural constraint that cannot be verified from data alone and must be checked by examining whether the two terms measure distinct economic events.
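Proposition 29's threshold and the double-counting trap can be checked numerically; the function names are illustrative.

```python
def optimal_threshold(c_fp, c_fn, v_heal=0.0):
    """Proposition 29: break-even confidence is the false-positive share
    of total expected cost."""
    return c_fp / (c_fp + c_fn + v_heal)

def should_act(confidence, c_fp, c_fn, v_heal=0.0):
    """Act/wait decision: act when confidence exceeds the break-even threshold."""
    return confidence > optimal_threshold(c_fp, c_fn, v_heal)
```

With a missed failure costing ten times the unnecessary restart, the break-even is \(1/11 \approx 0.09\), the "act at 9% confidence" case. Setting \(V_{\text{heal}}\) to the same asset value as \(C_{\text{FN}}\) demonstrably lowers the bar below the correctly specified threshold, which is the trigger-happy failure mode described above.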
Corollary 84.1. When \(V_{\text{heal}}\) is absorbed into the effective false-negative cost \(C_{\text{FN}}^{\text{eff}} = C_{\text{FN}} + V_{\text{heal}}\), the threshold simplifies to:
\[ \theta^*(a) = \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}^{\text{eff}}} \]
Proof: When \(c \in [0,1]\) is the posterior probability \(P(\text{fault} \mid \text{observations})\), the expected costs of acting and waiting are:
\[ \mathbb{E}[\text{cost} \mid \text{act}] = (1 - c)\, C_{\text{FP}}, \qquad \mathbb{E}[\text{cost} \mid \text{wait}] = c\, C_{\text{FN}}^{\text{eff}} \]
Acting is preferred when \(\mathbb{E}[\text{cost} \mid \text{act}] < \mathbb{E}[\text{cost} \mid \text{wait}]\):
\[ (1 - c)\, C_{\text{FP}} < c\, C_{\text{FN}}^{\text{eff}} \iff c > \frac{C_{\text{FP}}}{C_{\text{FP}} + C_{\text{FN}}^{\text{eff}}} \quad \square \]
The threshold structure implies: asymmetric costs (\(C_{\text{FN}}^{\text{eff}} \gg C_{\text{FP}}\)) yield lower thresholds, accepting more false positives to avoid missed failures.
Watch out for: absorbing \(V_{\text{heal}}\) into \(C_{\text{FN}}^{\text{eff}}\) is valid only when the two quantities measure distinct economic events with no overlap; if the same asset value is used for both terms, the combined \(C_{\text{FN}}^{\text{eff}}\) overstates the true opportunity cost and the simplified threshold fires more aggressively than the original three-term formula warrants.
Game-Theoretic Extension: Adversarial Threshold Manipulation
Proposition 29 ’s optimal threshold \(\theta^*(a)\) is derived against a non-strategic failure process. The dynamic threshold adaptation mechanism — which modulates \(\theta^*\) through the context-dependent effective costs \(C^{\text{eff}}_{\text{FP}}(\Sigma_t)\) and \(C^{\text{eff}}_{\text{FN}}(\Sigma_t)\) — is itself manipulable if the adversary can influence the context variables.
Attack pattern: An adversary who can cause spurious cascade events inflates the effective false-positive cost \(C^{\text{eff}}_{\text{FP}}\), which raises \(\theta^*(t)\), which then suppresses detection of the real attack. The threshold-raising event sequence is itself an anomaly signature.
Maximin threshold: The adversarially robust threshold chooses the threshold that keeps detection probability as high as possible even when the adversary selects the attack signal \(a_A\) from their action space that most suppresses detection.
Second-order defense: Monitor the pattern of threshold-raising events. A cluster of false positives that raises \(\theta^*\) immediately before a partition event is itself an anomaly warranting elevated alertness — the dynamic threshold adaptation should include an adversarial-signature monitor that temporarily freezes \(\theta^*\) when manipulation signatures are detected.
Practical implication: For CONVOY and RAVEN operating in adversarial environments, bound the maximum rate at which \(\theta^*\) can increase per unit time (a rate limiter on threshold escalation). Sudden large threshold increases — whether from genuine context changes or adversarial manipulation — should trigger a brief period of heightened sensitivity at the prior (lower) threshold before committing to the new one.
Dynamic Threshold Adaptation
Static thresholds assume fixed cost ratios. In practice, the relative cost of acting versus waiting shifts with mission phase, resource availability, and connectivity — so the threshold must update continuously. The context-dependent optimization selects \(\theta^*(t)\) at each timestep by minimizing expected total cost under current system state \(\Sigma_t\), where the state captures resource level, mission phase, connectivity, and the number of healing actions already in progress.
The expected cost at threshold \(\theta\) given current system state \(\Sigma_t\) is the sum of two terms: the effective false-positive cost \(C^{\text{eff}}_{\text{FP}}(\Sigma_t)\) scaled by the false-positive rate \(\text{FPR}(\theta)\), plus the effective false-negative cost \(C^{\text{eff}}_{\text{FN}}(\Sigma_t)\) scaled by the miss rate \(\text{FNR}(\theta)\).
The effective costs are functions of system state \(\Sigma_t\):
The effective false-positive cost grows when resources are scarce or many healings are already in progress, while the effective false-negative cost grows during critical mission phases and when connectivity is denied (because no external help is available to handle a missed failure).
Modulation functions:
- Resource level: FP cost triples when resources are depleted
- Concurrent healings: each concurrent healing in progress increases FP cost by 50%
- Mission phase: critical phases multiply FN cost up to \(5\times\)
- Connectivity: full connectivity halves FN cost; denied doubles it
Applying Proposition 29 ’s ratio formula to the effective costs gives the time-varying threshold
\[ \theta^*(t) = \frac{C^{\text{eff}}_{\text{FP}}(\Sigma_t)}{C^{\text{eff}}_{\text{FP}}(\Sigma_t) + C^{\text{eff}}_{\text{FN}}(\Sigma_t)} \]
— at each timestep, \(\theta^*(t)\) is simply the fraction of total effective cost attributable to false positives.
During critical mission phases (FN cost multiplied up to \(5\times\)) with good connectivity, the denominator grows large relative to the numerator, driving \(\theta^*(t)\) well below 0.1 — the system heals at very low confidence, accepting many false positives to avoid any missed failures.
Threshold bounds:
Unconstrained adaptation can lead to pathological behavior. The hard bounds below enforce a safety interval \([\theta_{\min}, \theta_{\max}]\) for \(\theta^*(t)\): \(\theta_{\min}\) ensures the system always requires at least some confidence, and \(\theta_{\max}\) ensures it never completely ignores a detected problem.
Hysteresis for threshold changes:
Rapidly fluctuating thresholds cause inconsistent behavior. The hysteresis rule below holds the current threshold fixed if the change demanded by \(\theta^*(t)\) is smaller than the dead-band, preventing threshold jitter from triggering spurious mode changes:
\[ \theta(t+1) = \theta(t) \quad \text{if } \big| \theta^*(t) - \theta(t) \big| < \delta_\theta \]
where \(\delta_\theta\) (illustrative value) prevents threshold jitter.
State Transition Model: The complete threshold state at time \(t+1\) is the triple \(\big(\theta(t+1),\; C^{\text{eff}}_{\text{FP}}(\Sigma_{t+1}),\; C^{\text{eff}}_{\text{FN}}(\Sigma_{t+1})\big)\): the updated threshold value and the two effective costs that drive it at that timestep.
The threshold itself steps toward the target \(\theta^*(t+1)\) by step-size \(\gamma_{\text{step}}\) only when the gap exceeds the hysteresis band \(\delta_\theta\), and is hard-clipped to the safety interval \([\theta_{\min}, \theta_{\max}]\):
\[ \theta(t+1) = \operatorname{clip}\!\Big( \theta(t) + \gamma_{\text{step}}\,\big[\theta^*(t+1) - \theta(t)\big],\; \theta_{\min},\; \theta_{\max} \Big) \quad \text{if } \big|\theta^*(t+1) - \theta(t)\big| \geq \delta_\theta \]
where \(\gamma_{\text{step}} \in (0, 1]\) is the threshold adaptation step-size.
Staleness-Aware Healing Threshold
Definition 45 (Staleness Decay Time Constant (\(\tau^{\text{stale}}_{\max}\))). Let \(\Delta t\) denote elapsed time since the last successful Knowledge Base synchronization. The staleness decay function is:
\[ \delta_{\text{stale}}(\Delta t) = 1 - e^{-\Delta t / \tau^{\text{stale}}_{\max}} \]
Notation. This constant is written \(\tau^{\text{stale}}_{\max}\) throughout this article to distinguish it from the HLC trust-window latency bound \(\tau_\text{max}\) in Fleet Coherence Under Partition, which is an entirely different quantity (one-way message delivery time, measured in milliseconds, not hours).
(\(\delta_{\text{stale}}\) = staleness decay function; \(\delta_\text{sev}\) = failure severity scalar in the utility function; \(\delta_\text{inst}\) = per-cycle instability probability in Proposition 23 . Bare \(\delta\) is not used in this article to avoid ambiguity.)
where \(\tau^{\text{stale}}_{\max}\) is the staleness threshold from Proposition 14 , with \(\Delta h\) the acceptable health drift and \(\sigma\) measurement noise. At \(\Delta t = 0\): \(\delta_\text{stale} = 0\) (fully current). At \(\Delta t = \tau^{\text{stale}}_{\max}\): \(\delta_\text{stale} = 1 - e^{-1} \approx 0.63\) (theoretical bound). As \(\Delta t \to \infty\): \(\delta_\text{stale} \to 1\) (fully stale).
Staleness-aware threshold: Let \(s(a) \in [0,1]\) be the severity of action \(a\), derived from Proposition 29 ’s optimal threshold. High \(s(a)\) means missing the failure is expensive (low \(\theta^*\), large miss cost). The staleness-augmented threshold floor \(\theta_{\text{stale}}(a, t)\) raises as the Knowledge Base ages:

\[
\theta_{\text{stale}}(a, t) = \theta^*(a) + \big(1 - s(a)\big)\,\delta_{\text{stale}}(t)
\]
Critical failures (\(s(a) \to 1\)) are immune: \(\theta_{\text{stale}} = \theta^*\) regardless of \(\delta_\text{stale}\). Low-severity actions (\(s(a) \to 0\)) are suppressed as \(\delta_\text{stale}\) grows; when \(\theta_{\text{stale}} > 1\) the threshold is above any achievable confidence score, effectively disabling that action class until re-sync.
Confidence horizon: The time \(T_{\text{conf}}\) at which non-critical healing (\(s(a) = 0\)) is suppressed to the maximum threshold \(\theta_{\max}\):

\[
T_{\text{conf}} = -\tau^{\max}_{\text{stale}} \,\ln\!\big(1 - (\theta_{\max} - \theta^*)\big)
\]
Valid when \(\theta_{\max} - \theta^* < 1\). Beyond \(T_{\text{conf}}\), the system enters minimal-healing mode: only actions with severity \(s(a)\) high enough to keep \(\theta_{\text{stale}}(a, t) \leq \theta_{\max}\) remain actionable.
graph LR
subgraph S0["t = 0"]
A["Fresh KB, delta = 0"] --> B["theta_stale = theta_opt
Full healing active"]
end
subgraph S1["t = tau_stale_max"]
C["Stale KB, delta = 0.63"] --> D["theta_stale rises
Low-severity suppressed"]
end
subgraph S2["t > T_conf"]
E["Very stale, delta → 1"] --> F["theta_stale > theta_max
Only critical failures"]
end
A -.->|time| C -.->|time| E
style B fill:#c8e6c9,stroke:#388e3c
style D fill:#fff9c4,stroke:#f9a825
style F fill:#ffcdd2,stroke:#c62828
Read the diagram: Three time-snapshots shown left to right. At \(t = 0\) (green): Knowledge Base is fresh, \(\delta_\text{stale} = 0\), staleness-adjusted threshold equals the optimal threshold — full healing active. At \(t = \tau^{\max}_{\text{stale}}\) (yellow): Knowledge Base has aged to its calibrated limit; \(\delta_\text{stale} = 0.63\) and the threshold rises above \(\theta^*\) for low-severity actions, progressively suppressing them. At \(t > T_{\text{conf}}\) (red): the threshold exceeds 1.0 for non-critical actions — they are effectively disabled. Critical failures (\(s(a) \to 1\)) remain actionable throughout all three states regardless of staleness.
\(\tau^{\max}_{\text{stale}}\) from Proposition 14 simultaneously calibrates the Brownian staleness model (maximum observation age before health estimates are unreliable) and the exponential time constant of healing suppression. A tightly-calibrated deployment with small \(\Delta h\) has a short \(\tau^{\max}_{\text{stale}}\) and fast-acting suppression; a loosely-calibrated one tolerates longer Knowledge Base age before healing hesitance sets in.
The staleness threshold is calibrated from the Brownian diffusion model ( Proposition 14 ):

\[
\tau^{\max}_{\text{stale}} = \left(\frac{\Delta h}{z_{1-\alpha}\,\sigma}\right)^2
\]

where \(\Delta h\) is the decision-relevant drift threshold, \(z_{1-\alpha}\) is the normal quantile at confidence \(1-\alpha\), and \(\sigma\) is the observation noise standard deviation. Both the Brownian staleness model and the staleness-aware healing threshold are governed by this calibrated constant.
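The staleness machinery above can be sketched compactly. This is a reconstruction under stated assumptions: the exponential decay form \(1 - e^{-\Delta t/\tau}\) is inferred from the \(\delta_\text{stale} = 0.63\) value at \(\Delta t = \tau^{\max}_{\text{stale}}\), and the additive threshold floor \(\theta^* + (1-s)\,\delta_\text{stale}\) is one plausible form consistent with the stated limiting behavior; all numeric arguments are illustrative.

```python
import math

def delta_stale(dt, tau_stale_max):
    """Staleness decay: 0 when fresh, 1 - e^-1 (about 0.63) at tau_stale_max, -> 1 as dt -> inf."""
    return 1.0 - math.exp(-dt / tau_stale_max)

def theta_staleness_aware(theta_opt, severity, dt, tau_stale_max):
    """Raise the threshold floor for low-severity actions as the Knowledge Base ages.
    Critical actions (severity -> 1) are unaffected; severity-0 thresholds can exceed 1.0,
    disabling that action class until re-sync."""
    return theta_opt + (1.0 - severity) * delta_stale(dt, tau_stale_max)

def confidence_horizon(theta_opt, theta_max, tau_stale_max):
    """Time at which severity-0 actions reach theta_max (valid when theta_max - theta_opt < 1)."""
    return -tau_stale_max * math.log(1.0 - (theta_max - theta_opt))
```

By construction, evaluating the staleness-aware threshold at the confidence horizon returns exactly \(\theta_{\max}\) for severity-0 actions.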
The Harm of Wrong Healing
Healing actions can make things worse:
False positive healing: Restarting a healthy component because of anomaly detector error. The restart itself causes momentary unavailability. In RAVEN , restarting a drone’s flight controller mid-maneuver could destabilize formation.
Resource consumption: MAPE-K consumes CPU, memory, and bandwidth. If healing is triggered too frequently, the healing overhead starves the mission. The system spends its energy on healing rather than on its primary function.
Cascading effects: Healing component A affects component B. In CONVOY , restarting vehicle 4’s communication system breaks the mesh path to vehicles 5-8. The healing of one component triggers failures in others.
Healing loops: A heals B (restart), B heals A (because A restarted affected B), A heals B again, infinitely. The system oscillates between healing states, never stabilizing.
Detection and prevention mechanisms:
Healing attempt tracking: Log each healing action with timestamp and outcome. If the same action triggers repeatedly in short time, something is wrong with the healing strategy, not just the target. The healing rate metric below quantifies this: it counts attempts in a sliding window of length \(T\) and divides by \(T\) to yield an instantaneous rate.
If healing rate exceeds threshold, reduce healing aggressiveness or pause healing entirely.
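The sliding-window rate metric and pause rule can be sketched as a small tracker. The window length and rate threshold are illustrative; the article does not fix their values here.

```python
from collections import deque

class HealingRateTracker:
    """Sliding-window healing-attempt rate: attempts in the last window_s seconds / window_s."""
    def __init__(self, window_s, max_rate):
        self.window_s = window_s
        self.max_rate = max_rate      # attempts/second above which healing is throttled
        self.attempts = deque()       # timestamps of logged healing actions

    def log_attempt(self, t):
        self.attempts.append(t)

    def rate(self, now):
        # Drop attempts that fell out of the sliding window.
        while self.attempts and self.attempts[0] < now - self.window_s:
            self.attempts.popleft()
        return len(self.attempts) / self.window_s

    def should_pause(self, now):
        return self.rate(now) > self.max_rate
```

Four attempts in a 60 s window against a 3-per-minute budget trips the pause; once the window slides past them, healing resumes.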
Cooldown periods: After healing action A, impose minimum time before A can trigger again. This prevents oscillation and allows time to observe outcomes. The cooldown constraint below ensures action \(A\) cannot fire again until at least \(\tau_{\text{cooldown}}\) seconds have elapsed since its last execution.
Dependency tracking: Before healing A, check if healing A will affect critical components B. If so, either heal B first, or delay healing A until B is stable.
Control-Theoretic Stability: Damping, Anti-Windup, and Refractory Periods
Proposition 22 ’s stability condition governs the proportional behavior of the MAPE-K controller. But two failure modes remain outside its scope: high-frequency chatter (the loop triggers healing faster than the system can respond, oscillating between degraded and over-corrected states) and integral windup (healing demand accumulates while resources are blocked and discharges as a burst of simultaneous actions when resources free). In classical PID terms, the proportional term is bounded by Proposition 22 , but the derivative and integral behaviors need their own treatment.
Definition 46 (Healing Dead-Band and Refractory State). The healing actuator for action \(a\) is governed by three parameters and occupies one of three states:
- \(\varepsilon_{\text{db}}\) (dead-band threshold): healing is suppressed unless the anomaly score \(z_t^K\) exceeds \(\varepsilon_{\text{db}}\) for \(\tau_{\text{confirm}}\) consecutive samples — the “Wait-and-See” confirmation window. Single-sample noise spikes are ignored.
- \(\tau_{\text{ref}}\) (refractory period): after executing action \(a\), the healing gate for \(a\) closes for \(\tau_{\text{ref}}\) seconds. This is the mandatory observation window during which the system watches the action take effect before issuing another.
- \(Q_{\text{aw}}\) (anti-windup cap): accumulated healing demand \(Q_d(t)\) is capped at \(Q_{\text{aw}}\). Demand arriving when \(Q_d(t) = Q_{\text{aw}}\) is discarded, preventing burst discharge after a resource-blocked period.
- CBF suspension re-entry: when the CBF gain scheduler ( Definition 40 ) sets \(K_{\mathrm{gs}} = 0\) because \(\rho_q < 0\), the healing gate enters a hard-suspended state. The gate reopens when \(\rho_q \geq 0\) is observed for at least one full refractory period \(\tau_{\mathrm{ref}}\), or unconditionally after \(d_{\max} \leq 5\) ticks of zero-gain operation, provided the CUSUM sentinel confirms nominal plant dynamics (\(g^+(t) < h^+\)) — when CUSUM is indeterminate or unavailable, the criterion from Proposition 31 takes precedence. The \(d_{\max}\) bound follows from the Proposition 25 proof: with \(K_{\mathrm{gs}} = 0\) the open-loop system is guaranteed to return to \(\rho_q \geq 0\) within \(d_{\max}\) ticks under nominal dynamics for the class of plants where \(A_q\) is nilpotent; for general Schur-stable plants the dCBF contraction condition governs re-entry instead.
Analogy: A doctor’s dosing schedule — you wait the full interval between doses even if the fever returns, because acting again too soon makes things worse, not better. The dead-band is the minimum symptom level that justifies a dose; the refractory period is the mandatory wait before another dose is allowed.
Logic: The dead-band threshold \(\varepsilon_{\text{db}}\) blocks action on noise; the refractory period \(\tau_{\text{ref}} \geq 2\tau_{\text{fb}}\) (from Proposition 30 ) prevents oscillation by ensuring the system observes the effect of one action before issuing another.
stateDiagram-v2
direction LR
[*] --> READY
READY --> REFRACTORY: action executed
REFRACTORY --> READY: tau_ref elapsed
REFRACTORY --> ANTI_WINDUP: Q_d(t) >= Q_aw
ANTI_WINDUP --> REFRACTORY: Q_d(t) < Q_aw / 2
ANTI_WINDUP --> READY: tau_ref elapsed, D = 0
note right of READY
suppressed while z_t < epsilon_db
for tau_confirm consecutive samples
end note
Read the diagram: Three states. READY: healing gate is open — but actuation is still suppressed while \(z_t^K > \varepsilon_{\text{db}}\) has held for fewer than \(\tau_{\text{confirm}}\) consecutive samples (the confirmation window). REFRACTORY: gate closes after execution; reopens after \(\tau_{\text{ref}}\) elapses. ANTI-WINDUP: entered when accumulated demand \(Q_d(t)\) saturates the cap \(Q_{\text{aw}}\); drains back to REFRACTORY only when \(Q_d\) falls below \(Q_{\text{aw}}/2\) — a hysteresis that prevents burst discharge from the accumulated queue.
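The three-state gate of Definition 46 can be sketched as a per-action object. This is a simplified sketch: the exact demand-dispatch policy and the ANTI_WINDUP exit conditions are approximations of the state diagram, and the parameter values are illustrative.

```python
from enum import Enum

class GateState(Enum):
    READY = "READY"
    REFRACTORY = "REFRACTORY"
    ANTI_WINDUP = "ANTI_WINDUP"

class HealingGate:
    """Dead-band confirmation + refractory timer + anti-windup cap for one healing action."""
    def __init__(self, eps_db, tau_confirm, tau_ref, q_aw):
        self.eps_db, self.tau_confirm = eps_db, tau_confirm  # dead-band, confirmation samples
        self.tau_ref, self.q_aw = tau_ref, q_aw              # refractory ticks, demand cap
        self.state = GateState.READY
        self.exceed_run = 0      # consecutive samples above the dead-band
        self.ref_timer = 0       # ticks remaining in the refractory window
        self.q_d = 0             # accumulated healing demand Q_d

    def tick(self, z):
        """Advance one sample of the anomaly score z; return True iff an action fires."""
        self.exceed_run = self.exceed_run + 1 if z > self.eps_db else 0
        if self.exceed_run >= self.tau_confirm and self.q_d < self.q_aw:
            self.q_d += 1        # demand above the cap Q_aw is silently discarded
        if self.state in (GateState.REFRACTORY, GateState.ANTI_WINDUP):
            self.ref_timer -= 1
            if self.q_d >= self.q_aw:
                self.state = GateState.ANTI_WINDUP
            elif self.state == GateState.ANTI_WINDUP and self.q_d < self.q_aw / 2:
                self.state = GateState.REFRACTORY       # hysteresis drain
            if self.ref_timer <= 0 and self.q_d < self.q_aw:
                self.state = GateState.READY
            return False
        if self.q_d > 0:         # READY with confirmed pending demand: fire one action
            self.q_d -= 1
            self.state = GateState.REFRACTORY
            self.ref_timer = self.tau_ref
            return True
        return False
```

Driving the gate with a sustained anomaly shows the intended cadence: the first action waits out the confirmation window, and subsequent actions are spaced by the refractory period.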
Design parameters by severity tier:
| Severity tier | \(\varepsilon_{\text{db}}\) | \(\tau_{\text{confirm}}\) | \(Q_{\text{aw}}\) | \(\tau_{\text{ref}}\) |
|---|---|---|---|---|
| Low (\(\varsigma \leq 0.3\)) | \(1\sigma\) | 3 samples | 10 | |
| Medium (\(0.3 < \varsigma \leq 0.7\)) | \(2\sigma\) | 5 samples | 5 | |
| High (\(\varsigma > 0.7\)) | \(3\sigma\) | 10 samples | 2 | |
where \(\tau_{\text{fb}}\) is the current feedback delay from Proposition 22 .
Proposition 30 (Anti-Windup Oscillation Bound). For the proportional healing controller with gain \(K_{\text{ctrl}}\) and feedback delay \(\tau_{\text{fb}}\) satisfying the stability condition of Proposition 22, healing oscillation is suppressed if the refractory period satisfies:

\[
\tau_{\text{ref}} \;\geq\; 2\,\tau_{\text{fb}}
\]
A RAVEN jamming event triggering 47 concurrent healing cycles will only chatter if each drone re-fires before observing its neighbors’ outcomes — the refractory floor prevents exactly this.
The proposition sets the minimum dead-band window after each healing action from the round-trip feedback delay \(\tau_{\text{fb}}\), establishing the floor below which a second action fires before the first effect is observable. For RAVEN (\(\tau_{\text{fb}} = 5\) s, illustrative value), the minimum refractory period is \(2\tau_{\text{fb}} = 10\) s (illustrative value). The \(2\times\) floor is derived for linear delay chains; healing actions with nonlinear side effects such as gossip storms may require \(3\text{–}4\times\tau_{\text{fb}}\) (illustrative value) — the \(2\times\) bound is the theoretical minimum, not a general operating recommendation.
Empirical status: The \(2\tau_{\text{fb}}\) floor is derived for a first-order linear delay chain; healing actions with nonlinear side effects (gossip storms, cascaded restarts) may require \(3\text{–}4\times\tau_{\text{fb}}\) in practice, and the RAVEN value of \(\tau_{\text{ref}} = 10\) s should be validated by fault-injection testing at maximum concurrent failures.
Proof: In the discrete-time system with delay \(d\) samples, the minimum period of any sustained oscillation is \(2d\) samples: two full delay-lengths are required for one complete feedback cycle (action propagates forward through \(d\) steps, effect propagates back through \(d\) steps). The healing controller with refractory period \(\tau_{\text{ref}}\) cannot fire at intervals shorter than \(\tau_{\text{ref}}\). Setting \(\tau_{\text{ref}} \geq 2\tau_{\text{fb}}\) prevents the controller from completing more than one correction per minimum oscillation period, suppressing sustained oscillation. \(\square\)
Anti-windup accumulator update:

\[
Q_d(t+1) = \min\!\Big(Q_{\text{aw}},\; Q_d(t) + \mathbb{1}\big[z_t^K > \varepsilon_{\text{db}}\big]\Big)
\]
Physical translation: A leaky bucket counting how many “act now” signals have arrived above the dead-band threshold. The cap \(Q_{\text{aw}}\) means demand arriving when the queue is full is silently discarded. When the refractory timer finally opens, at most \(Q_{\text{aw}}\) actions discharge — not the unbounded backlog that would otherwise accumulate during a long-duration fault or resource-blocked period.
The accumulator counts pending healing requests above the dead-band threshold, capped at \(Q_{\text{aw}}\), and dispatches only when \(Q_d(t) > 0\) and the refractory timer has expired — preventing burst discharge from releasing a suppressed queue all at once. The anti-windup cap \(Q_{\text{aw}}\) is set between 5 and 10 actions (illustrative value), with the constraint that total discharge time stays within the healing deadline. \(Q_d(t) = Q_{\text{aw}}\) sustained for 3+ ticks (illustrative value) is the diagnostic signature of a persistent fault that load redistribution alone cannot resolve, warranting escalation to a higher severity level.
When \(Q_d(t)\) reaches \(Q_{\text{aw}}\), the system enters ANTI_WINDUP state and discards new demand until \(Q_d(t)\) drains below \(Q_{\text{aw}}/2\). This prevents “burst discharge” — where minutes of suppressed healing demand fires simultaneously the moment connectivity or resources recover.
Relationship to existing results: The dead-band threshold formalizes the minimum-confidence floor from Proposition 29 (constraint \(g_1\)): both prevent trigger-happy behavior at near-zero evidence.
The refractory period formalizes the informal cooldown constraint from the section above. Proposition 30 gives the first derived lower bound on that cooldown: rather than choosing it heuristically, set \(\tau_{\text{ref}} \geq 2\tau_{\text{fb}}\) and oscillation-freedom follows from Proposition 22 ’s stability condition.
RAVEN calibration: Feedback delay \(\tau_{\text{fb}} = 5\) s (gossip convergence, 47 nodes), regime controller gain \(K_{\text{ctrl}} = 0.3\). Minimum refractory period: \(\tau_{\text{ref}} = 2\tau_{\text{fb}} = 10\) s. Dead-band \(\varepsilon_{\text{db}} = 2\sigma\) for medium-severity battery actions. Without this bound, a jamming event that degrades all 47 drones simultaneously triggers 47 concurrent healing cycles — each drone restarting its communication stack causes momentary radio silence, which registers as a new anomaly to neighbors, triggering another round. This is exactly the healing loop failure mode described above, now quantified.
Watch out for: the \(2\tau_{\text{fb}}\) floor is derived for a first-order linear delay chain; healing actions that trigger nonlinear side effects — gossip storms amplifying neighbor state estimates, cascaded restarts altering load on shared resources — can sustain oscillation at periods longer than \(2\tau_{\text{fb}}\), requiring \(3\text{–}4\times\tau_{\text{fb}}\) in practice; the theoretical minimum should not be used as the operating point without fault-injection validation at maximum concurrent failures.
Proposition 31 (CBF-Derived Refractory Bound). The Proposition 30 floor is necessary but not sufficient under mode-switching dynamics. Under the Stability Region framework ( Definition 4 ), the refractory period must also allow \(\rho_q\) to recover above \(\rho_{\min}\) before the next action. The CBF-derived refractory bound for mode \(q\) is:

\[
\tau_{\text{ref}}^{\text{cbf}}(q) = \tau_{\text{tick}} \left\lceil \frac{\ln\!\big(\rho_{\min} \,/\, \max(\rho_{\text{post}}, \rho_\varepsilon)\big)}{-\ln(1 - \gamma_{\text{cbf}})} \right\rceil
\]
In mode-switching systems, the simple feedback-delay floor is not enough — the stability margin must also recover before a second action is allowed.
where \(\tau_{\text{tick}}\) is the MAPE-K tick period, \(\rho_\varepsilon\) is a regularization floor, and \(\rho_{\text{post}}\) is the stability margin immediately after the first healing action fires. The effective refractory period is:

\[
\tau_{\text{ref}}^{\text{eff}} = \max\!\big(2\,\tau_{\text{fb}},\; \tau_{\text{ref}}^{\text{cbf}}(q)\big)
\]
Singularity prevention (\(\rho_\varepsilon\) floor). Without the floor, if \(\rho_{\text{post}} \to 0\) (node approaching complete unreliability), the argument of \(\ln\) diverges and \(\tau_{\text{ref}}^{\text{cbf}} \to \infty\). The \(\rho_\varepsilon = 10^{-3}\) floor prevents this: it caps the computed refractory period at \(\tau_{\text{tick}}\lceil\ln(\rho_{\min}/\rho_\varepsilon)/(-\ln(1-\gamma_{\text{cbf}}))\rceil\), which for RAVEN (\(\gamma_{\text{cbf}} = 0.05\), \(\rho_{\min} = 0.2\)) evaluates to roughly 520 s. The companion upper clamp (set operationally, e.g., 600 s for RAVEN) provides a hard ceiling so that a single catastrophically degraded node does not permanently block healing attempts on a system that is in fact recoverable. A node clamped at the ceiling is flagged for manual review after one full cycle.
The proposition replaces the fixed \(2\tau_{\text{fb}}\) floor with a state-dependent lower bound that ensures the stability margin recovers above \(\rho_{\min}\) before the next healing action; larger healing actions that consume more stability margin automatically produce longer refractory periods. The minimum safe margin is \(\rho_{\min} = 0.2\) (illustrative value); the singularity floor is \(\rho_\varepsilon = 10^{-3}\) (illustrative value); \(\gamma_{\text{cbf}}\) follows from Definition 39 . For RAVEN L3 with \(\gamma_{\text{cbf}} = 0.05\) (illustrative value) and a large action dropping \(\rho\) to 0.1 (illustrative value), the CBF-derived floor is 70 s (illustrative value) versus the Proposition 30 floor of 10 s (illustrative value). The gap between \(\tau_{\text{ref}}^{\text{cbf}}\) and \(2\tau_{\text{fb}}\) quantifies how much stability margin the action consumed and is the primary diagnostic for oversized healing gains.
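The bound can be sketched numerically. The formula's shape is a reconstruction from the worked numbers in this section (14 ticks / 70 s at \(|\lambda| = 0.95\), 35 ticks / 175 s at \(|\lambda| = 0.98\)); the default arguments are the illustrative RAVEN values.

```python
import math

def cbf_refractory_s(rho_post, rho_min=0.2, gamma_cbf=0.05,
                     rho_eps=1e-3, tau_tick_s=5.0, tau_ref_max_s=600.0):
    """Ticks for the margin to recover from rho_post to rho_min at contraction
    rate (1 - gamma_cbf) per tick, with the rho_eps singularity floor and the
    operational ceiling clamp."""
    start = max(rho_post, rho_eps)                 # singularity prevention
    if start >= rho_min:
        return 0.0                                 # margin already safe
    ticks = math.ceil(math.log(rho_min / start) / -math.log(1.0 - gamma_cbf))
    return min(ticks * tau_tick_s, tau_ref_max_s)  # hard operational ceiling

def effective_refractory_s(rho_post, tau_fb_s=5.0, **kw):
    """Effective refractory: the larger of the Proposition 30 floor (2 * tau_fb)
    and the CBF-derived bound."""
    return max(2.0 * tau_fb_s, cbf_refractory_s(rho_post, **kw))
```

With the RAVEN values, an action dropping \(\rho\) to 0.1 yields the 70 s floor; a fully collapsed margin hits the \(\rho_\varepsilon\)-capped value rather than diverging.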
Model-validity scope of \(\tau_{\text{ref}}^{\text{cbf}}\): The formula derives from the nominal contraction rate \((1-\gamma_{\text{cbf}})\) per tick — the rate at which \(\rho_q\) recovers under the pre-flight \(A_q\) model. If the true plant dynamics have drifted from \(A_q\), this rate is wrong and the formula produces either a dangerously short or a uselessly long refractory period.
Hyper-aggressive failure (under-refractory): physical damage slows \(\rho_q\) recovery below the nominal rate. Example — RAVEN drone motor at 60% (illustrative value) thrust efficiency shifts the dominant eigenvalue from \(|\lambda| = 0.95\) (illustrative value) to \(|\lambda| = 0.98\) (illustrative value); the ticks required for \(\rho: 0.10 \to 0.20\) extend from 14 ticks (70 s; theoretical bound) to 35 ticks (175 s; theoretical bound). The formula fires the next healing action at tick 14 when true \(\rho \approx 0.15\) (illustrative value) — still below \(\rho_{\min} = 0.20\). A second actuation on an under-margined plant can collapse the voltage rail.
Hyper-conservative failure (over-refractory): RF jamming injects noise into the state estimate \(x\), depressing the measured \(\rho_q\) below its true value. The formula computes \(\tau_{\text{ref}}^{\text{cbf}}\) from an artificially low starting point, producing a refractory period far longer than the true dynamics require. The system remains locked in L0 long after recovery is physically complete.
Warning: Model drift invalidates the refractory formula in opposite ways — physical damage makes it fire too early, sensor noise makes it wait too long. The CUSUM sentinel below detects both failure modes before they accumulate.
CUSUM Model-Drift Sentinel
Vibration noise is zero-mean and short-lived — random errors cancel over a few ticks. Genuine actuator degradation is persistent: the nominal model A_q consistently over-predicts performance. The sentinel accumulates prediction errors over time; noise cancels itself, drift does not.
The one-step prediction error is \(e_t = \rho_q^{\text{pred}}(t) - \rho_q(t)\). Under a healthy plant this is zero-mean Gaussian with rolling standard deviation \(\hat{\sigma}\); under motor degradation it becomes persistently positive. \(g^+\) counts evidence the model is too optimistic; \(g^-\) counts evidence it is too pessimistic. The slack \(k\) drains either accumulator after clean ticks, so a single noise spike never triggers an alarm.
Why \(k = 1.5\hat{\sigma}\)? Standard CUSUM reference-value formula \(k = \delta_\text{shift}/2\) for detecting a \(3\hat{\sigma}\) sustained shift (\(\delta_\text{shift}\) = expected shift magnitude in standard CUSUM notation). Random noise alone never pushes \(g^+\) above \(h\) before draining; a \(3\hat{\sigma}\) sustained drift accumulates to \(h\) in five ticks. Why a 20-tick rolling window? Drone blade-pass vibration produces correlated noise bursts at the MAPE-K tick rate; 20 ticks (100 s at 5 s/tick) spans roughly five vibration cycles, ensuring \(\hat{\sigma}\) reflects the true noise envelope.
| Scenario | \(\Delta\rho_{\text{pred}}\)/tick | \(g^+\) outcome | Verdict |
|---|---|---|---|
| Single vibration spike (1 tick, 0.08) | 0.06 above slack | Peaks at 0.06; drains in 3 ticks | No alarm |
| Correlated burst (3 ticks, 0.06 each) | \(\hat{\sigma}\) rises, \(k\) and \(h\) auto-adjust | Threshold rises faster than accumulation | Suppressed |
| Sustained motor degradation (0.05/tick) | Grows 0.03/tick | Alarm at tick 4 (20 s) | Correct detection |
| Sub-threshold creep (0.03/tick) | Grows 0.01/tick | Alarm at tick 10 (50 s) | Caught — 3-tick test would never fire |
A fixed 3-consecutive-tick threshold is fragile under correlated noise: in high-vibration environments (RAVEN drones), state estimate errors are correlated at the vibration resonance frequency, making three consecutive exceedances far more likely than \(p^3\) implies. The replacement is a Page-CUSUM statistic [12] — the same structure as the Adversarial Non-Stationarity Detector ( Definition 84 ):

\[
g^+(t) = \max\!\big(0,\; g^+(t-1) + e_t - k\big), \qquad
g^-(t) = \max\!\big(0,\; g^-(t-1) - e_t - k\big)
\]
where \(e_t\) is the one-step prediction error. The slack parameter is \(k = 1.5\hat{\sigma}\), where \(\hat{\sigma}\) is the rolling 20-tick standard deviation of \(e_t\) under nominal conditions. Alarm thresholds \(h^+ = h^- = 5k\) give \(\mathrm{ARL}_0 \approx 500\) ticks (~one false alarm per 42 minutes at 5 s/tick). The four scenario outcomes are summarised in the table above.
RAVEN calibration: \(\hat{\sigma} \approx 0.0133\), so \(k \approx 0.020\) and \(h^+ = h^- \approx 0.10\). The relative form \(h = 5k\) generalises to any platform; 0.10 is its RAVEN instantiation. Sensitivity: \(h = 7k\) raises \(\mathrm{ARL}_0\) to ~1200 (<1.2 false alarms per 2-hour mission) if false-alarm cost dominates; \(h = 3k\) lowers it to ~150 (~9.6 false alarms) if detection speed is the priority. Calibrate from the false-alarm/missed-detection cost ratio (see Proposition 9 ).
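The two-sided sentinel can be sketched with the calibrated slack held fixed, as in the RAVEN calibration above (in deployment \(\hat{\sigma}\) is a rolling estimate taken under nominal conditions; passing it as a frozen input is a simplification).

```python
class CusumSentinel:
    """Two-sided Page-CUSUM on one-step prediction errors e_t.
    sigma_hat: error std-dev under nominal conditions (rolling in deployment)."""
    def __init__(self, sigma_hat):
        self.k = 1.5 * sigma_hat     # reference value for a 3-sigma sustained shift
        self.h = 5.0 * self.k        # alarm threshold: ARL0 ~ 500 ticks
        self.g_pos = 0.0             # evidence the model is too optimistic
        self.g_neg = 0.0             # evidence the model is too pessimistic

    def update(self, e):
        self.g_pos = max(0.0, self.g_pos + e - self.k)
        self.g_neg = max(0.0, self.g_neg - e - self.k)
        if self.g_pos > self.h:
            self.g_pos = 0.0         # reset on alarm, per the Response policy
            return "hyper-aggressive"    # extend refractory budget
        if self.g_neg > self.h:
            self.g_neg = 0.0
            return "hyper-conservative"  # release refractory early
        return None
```

With \(\hat{\sigma} = 0.0133\), a sustained 0.05/tick drift accumulates \(0.05 - k \approx 0.03\) per tick and alarms at tick 4, matching the scenario table; a single 0.08 spike stays below \(h\) and drains.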
Gossip suspension and frozen baseline: When gossip is suspended (Observation Regime O4/O5), the rolling \(\hat{\sigma}\) window freezes at its last valid value. The CUSUM thresholds \(k\) and \(h = 5k\) are held constant during gossip suspension. If gossip remains suspended for more than 100 ticks (five rolling-window lengths), the CUSUM sentinel reverts to a fixed conservative baseline until gossip resumes.
Response: When \(g^+(t) > h^+\) (hyper-aggressive), extend the refractory budget by one additional period, reset \(g^+\), and re-evaluate; if \(g^+\) exceeds \(h^+\) again before the extension expires, hold L0 and flag for human review.
When \(g^-(t) > h^-\) (hyper-conservative), release the refractory window early once \(\rho_q \geq \rho_{\min}\) is observed; reset \(g^-\) on release. Note: the \(\eta = 0.85\) gain margin in Definition 40 guards against \(K\) mismatch, not \(A_q\) pole migration — the two corrections are orthogonal.
Required relationship — confirmation window vs. hardware response time: The confirmation window must satisfy \(\tau_{\text{confirm}} \geq \tau_{\text{settle}}\), where \(\tau_{\text{settle}}\) is the mechanical or electrical settling time of the actuated component. If \(\tau_{\text{confirm}} < \tau_{\text{settle}}\), the MAPE-K loop can issue a second actuation command while the first is still in progress, resulting in compounded commands on an actuator in an undefined intermediate state. Concrete example: a GRIDEDGE protective relay has a mechanical response time of 500 ms. If \(\tau_{\text{confirm}} = 300\) ms (3 samples at 10 Hz), the MAPE-K loop confirms “action taken” before the relay has physically moved; a second fault event can send a second trip command to a relay mid-travel. Minimum safe value: \(\tau_{\text{confirm}} \geq 500\) ms (5 samples at 10 Hz). For RAVEN motor controllers, whose electrical settling time is far shorter, the default confirmation window comfortably satisfies the constraint.
HYPERSCALE anti-windup calibration: During a 10-minute storage-layer hiccup, health checks degrade for dozens of pods simultaneously. Without the anti-windup cap, the demand accumulator fills to dozens of queued healing actions and discharges as simultaneous pod restarts the moment the health layer recovers — a self-inflicted availability incident. With \(Q_{\text{aw}} = 5\), the burst is bounded to 5 concurrent actions regardless of backlog depth.
Four further mechanisms harden the MAPE-K loop against flapping failure modes that the dead-band and anti-windup alone cannot suppress: threshold chattering at a single trip-point, progressive failure escalation under repeated ineffective actions, false actuation on self-resolving transient peaks, and unbounded hardware retry cycles.
Watch out for: the formula derives the refractory period from the nominal contraction rate \((1-\gamma_{\text{cbf}})\) per tick — the rate at which \(\rho_q\) recovers under the pre-flight \(A_q\) model; when actuator degradation shifts the dominant eigenvalue (e.g., a damaged motor rotor increasing \(|\lambda|\) from 0.95 to 0.98), the true recovery is slower and the formula fires the next healing action before \(\rho_q\) has actually reached \(\rho_{\min}\), risking compounded actuation on an under-margined plant.
Definition 47 (Schmitt Trigger Hysteresis). The dead-band threshold of Definition 46 is a single trip-point: the anomaly score \(z_t^K\) can cross it in either direction within the same measurement tick. The Schmitt trigger replaces this with two thresholds \(\theta_H > \theta_L\), where \(\theta_H = \varepsilon_{\text{db}}\) (trigger) and \(\theta_L\) (release) is new:
The NOMINAL \(\to\) TRIGGERED transition fires when \(z_t^K > \theta_H\) for \(\tau_{\text{confirm}}\) consecutive samples. The TRIGGERED \(\to\) NOMINAL transition fires when \(z_t^K < \theta_L\). When \(\theta_L \leq z_t^K \leq \theta_H\) the current state is held — no transition in either direction.
The flapping-free condition guarantees that no spurious oscillation can traverse the full band in one confirmation window:

\[
\Delta\theta = \theta_H - \theta_L > 2\,A_{\text{noise}}
\]

The flapping-free condition computes the minimum hysteresis band \(\Delta\theta\), set to twice the noise amplitude \(A_{\text{noise}}\), guaranteeing no spurious state transition within one confirmation window. A signal oscillating at noise amplitude can traverse at most half the band per confirmation window; making the band wider than the noise amplitude eliminates alert chatter. More than 5 alarm/clear cycles per hour (illustrative value) during testing is the diagnostic signature of a band that is too narrow relative to the signal noise floor.
A signal too rapid to traverse \(\Delta\theta\) within \(\tau_{\text{confirm}}\) seconds is sensor noise — not a genuine anomaly. Relationship to Proposition 29 : \(\theta_L < \theta^*(a) < \theta_H\); the optimal decision threshold sits inside the hysteresis band, so the actuator triggers only when confidence significantly exceeds \(\theta^*(a)\) and releases only when confidence genuinely recovers below it. \(\square\)
Physical translation: The Schmitt trigger prevents oscillation by making state transitions asymmetric. Triggering requires the signal to exceed the high threshold \(\theta_H\); releasing requires it to fall below the lower threshold \(\theta_L\). A signal bouncing in the band \((\theta_L, \theta_H)\) — sensor noise riding the edge of an anomaly threshold — produces zero state transitions. The band width is sized so that only signals evolving faster than noise can traverse it within the confirmation window.
| Severity tier | \(\theta_H\) (trigger) | \(\theta_L\) (release) | \(\Delta\theta\) |
|---|---|---|---|
| Low (\(\varsigma \leq 0.3\)) | \(1\sigma\) | \(0.3\sigma\) | \(0.7\sigma\) |
| Medium (\(0.3 < \varsigma \leq 0.7\)) | \(2\sigma\) | \(0.7\sigma\) | \(1.3\sigma\) |
| High (\(\varsigma > 0.7\)) | \(3\sigma\) | \(1.0\sigma\) | \(2.0\sigma\) |
RAVEN calibration: Battery-voltage anomaly score oscillates between \(1.6\sigma\) and \(2.4\sigma\) under GNSS multipath jitter. A single threshold at \(2\sigma\) produces 4 trips per minute as the score crosses it on every oscillation cycle. Schmitt trigger with \(\theta_H = 2\sigma\), \(\theta_L = 0.7\sigma\) produces zero trips: the score never drops below \(0.7\sigma\) during the jitter episode, so TRIGGERED state holds correctly until the jitter subsides and voltage genuinely recovers.
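Definition 47's two-threshold latch with the Definition 46 confirmation window can be sketched directly (thresholds in the same units as the anomaly score; values below are illustrative):

```python
class SchmittTrigger:
    """Two-threshold trigger with a confirmation window on the trigger side."""
    def __init__(self, theta_h, theta_l, tau_confirm):
        assert theta_h > theta_l
        self.theta_h, self.theta_l = theta_h, theta_l
        self.tau_confirm = tau_confirm
        self.triggered = False
        self.run = 0              # consecutive samples above theta_h

    def update(self, z):
        self.run = self.run + 1 if z > self.theta_h else 0
        if not self.triggered and self.run >= self.tau_confirm:
            self.triggered = True     # NOMINAL -> TRIGGERED
        elif self.triggered and z < self.theta_l:
            self.triggered = False    # TRIGGERED -> NOMINAL
        # theta_l <= z <= theta_h: hold the current state (hysteresis)
        return self.triggered
```

A score alternating across \(\theta_H\) every sample never completes the confirmation run and produces zero trips; a sustained exceedance latches, then holds through the dead band and releases only below \(\theta_L\).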
Definition 48 (Adaptive Refractory Backoff). The fixed refractory period of Definition 46 cannot distinguish an action that is succeeding (condition clears after refractory) from one that is failing (condition persists at every check). Under repeated failure, the same fixed window re-exposes the system to an unresolved fault at a constant rate. Adaptive backoff doubles the refractory window after each consecutive recovery failure:

\[
\tau_{\text{ref}}(n) = \min\!\big(2^n\,\tau_{\text{ref}}(0),\; \tau_{\text{ceil}}\big)
\]
The definition doubles the refractory window after each consecutive recovery failure, capped at \(\tau_{\text{ceil}}\), preventing rapid healing storms that exhaust the action budget on a persistent fault. The initial refractory period is \(\tau_{\text{ref}}(0) = 2\tau_{\text{fb}}\) ( Proposition 30 floor); doubling factor 2 (illustrative value); the ceiling is \(\tau_{\text{ceil}} = 10\,\tau_{\text{ref}}(0)\) (illustrative value); the counter resets on genuine recovery. A backoff counter exceeding 3 (illustrative value) is the diagnostic threshold for human escalation — it indicates a fault that the autonomic loop cannot resolve autonomously.
where \(n\) is the consecutive failure count (refractory expired; condition still present: \(z_t^K > \theta_L\)), \(\tau_{\text{ref}}(0) = 2\tau_{\text{fb}}\) ( Proposition 30 floor), and \(\tau_{\text{ceil}}\) caps indefinite lockout (default: \(10\,\tau_{\text{ref}}(0)\)). Reset: \(n \to 0\) when \(z_t^K \leq \theta_L\) ( Definition 47 Schmitt release — genuine recovery confirmed). Failure count \(n\) is maintained per-action per-component and is not shared between actions.
Physical translation: Each consecutive recovery failure is evidence that the fault is structural, not transient — doubling the refractory window gives the system exponentially more observation time before the next attempt. This prevents healing storms: under a persistent fault, fixed-window refractory fires at constant rate indefinitely; adaptive backoff reaches the ceiling after \(\log_2(10) \approx 3.3\) failures and stays there, reducing retry rate by \(10\times\) and protecting thermal budget and actuator wear.
OUTPOST calibration: Sensor firmware crash loop; \(\tau_{\text{ref}}(0) = 10\) s, \(\tau_{\text{ceil}} = 720\) s. Consecutive restart failures (\(n = 0, 1, 2, 3\)) produce refractory windows of 10 s, 20 s, 40 s, 80 s — the attempt rate halves after each failure, giving the node exponentially more observation time. Settled at \(\tau_{\text{ceil}} = 720\) s: no more than 5 attempts per hour versus 36 per hour under a fixed 100 s window. At 5 attempts per hour, accumulated heating from firmware crash-cycles remains below the thermal throttle threshold — the backoff curve is the thermal safety curve.
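The backoff schedule is one line of code; the OUTPOST-style defaults below (10 s initial, 720 s ceiling) follow the worked numbers above and are illustrative.

```python
def refractory_backoff_s(n, tau_ref0_s=10.0, tau_ceil_s=720.0):
    """Refractory window after n consecutive recovery failures:
    doubles each failure, hard-capped at the ceiling (Definition 48 sketch)."""
    return min(tau_ref0_s * (2 ** n), tau_ceil_s)
```

The first four failures reproduce the 10/20/40/80 s sequence; deep failure counts pin to the ceiling instead of locking the component out indefinitely.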
Definition 49 (Derivative Confidence Dampener). The Analysis phase computes a confidence score \(\theta(t) \in [0,1]\) (Proposition 29). High confidence at a single sample does not distinguish a stable genuine fault from a transient spike peaking above \(\theta_H\) and falling naturally. The derivative dampener adds a trend check in the Analysis phase before escalating to Execute. The sliding-window first-order estimate is:

\[
\dot{\theta}(t) \approx \frac{\theta(t) - \theta(t - w\,\Delta t)}{w\,\Delta t}
\]
The sliding-window derivative estimates whether the confidence score is rising (\(\dot{\theta} > 0\)) or falling (\(\dot{\theta} < 0\)), distinguishing worsening faults from self-recovering transients in the Analyze phase before the actuation hold condition is checked. The window \(w = 5\) samples (illustrative value) is the empirically calibrated optimum for the case-study scenarios: a smaller window is too noisy under vibration, and a larger one is too slow to track fast-moving faults — larger \(w\) reduces noise but increases response lag in a direct trade-off.
Actuation hold condition: suppress Execute even when \(\theta(t) > \theta_H\) if:

\[
\dot{\theta}(t) < -\gamma_{\text{damp}}
\]
The actuation hold condition suppresses Execute when \(\theta\) falls faster than \(\gamma_{\text{damp}}\), meaning confidence is recovering fast enough to self-resolve before \(\tau_{\text{confirm}}\) elapses, eliminating false actuation on transient spikes that briefly cross \(\theta_H\) but are already recovering. For CONVOY , \(\Delta\theta = 0.20\), \(w = 5\), and a 1 s sample period give \(\gamma_{\text{damp}} = \Delta\theta/(2w\,\Delta t) = 0.02\ \mathrm{s}^{-1}\) (illustrative value); in CONVOY testing this hold suppressed 67% of reroute commands (illustrative value) that would have been false positives under pure threshold triggering.
The dampener threshold \(\gamma_{\text{damp}} = \Delta\theta/(2w\,\Delta t)\) is the rate at which confidence would traverse half the hysteresis band in one derivative window — fast enough to cross from \(\theta_H\) to \(\theta_L\) within \(2w\) samples, implying the anomaly will self-resolve before \(\tau_{\text{confirm}}\) elapses.
Resume actuation when \(\theta(t) > \theta_H\) and \(\dot{\theta}(t) \geq -\gamma_{\text{damp}}\) (stabilized genuine fault). Bypass Execute entirely if \(\theta(t) < \theta_L\) — natural recovery is confirmed and the Schmitt trigger returns to NOMINAL without any actuation. Default: \(w = 5\) samples.
Physical translation: A spike in confidence that is already falling when Execute checks it is likely transient noise, not a stable fault. The derivative dampener adds a trend check: if \(\dot{\theta}(t) < -\gamma_{\text{damp}}\), the anomaly is recovering faster than the confirmation window — hold execution. The oscillation-prevention benefit is that a transient spike above \(\theta_H\) that would trigger an immediate healing action is instead suppressed until the trend stabilizes, eliminating the class of false-positive healing on self-recovering conditions that account for the majority of unnecessary interventions in practice.
CONVOY calibration: Link-quality confidence spikes above \(\theta_H\) at \(t = 0\) s, but the derivative estimate shows it falling faster than \(\gamma_{\text{damp}}\) (\(w = 5\), \(\Delta\theta = 0.20\)). Derivative dampener holds. At \(t = 10\) s: \(\theta < \theta_L\) — natural recovery, Schmitt trigger releases to NOMINAL with no action taken. Without dampening: a reroute command fires at \(t = 0\) on a self-recovering link, triggering a full-convoy reroute maneuver that costs 8 minutes of mission time.
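The trend check can be sketched as a small history buffer plus the hold condition. The default \(w\), sample period, and \(\gamma_{\text{damp}}\) are illustrative assumptions consistent with this section.

```python
from collections import deque

class DerivativeDampener:
    """Sliding-window derivative check on the confidence score (Definition 49 sketch)."""
    def __init__(self, w=5, dt_s=1.0, gamma_damp=0.02):
        self.w, self.dt_s = w, dt_s
        self.gamma_damp = gamma_damp          # recovery-rate threshold, units 1/s
        self.history = deque(maxlen=w + 1)    # last w+1 confidence samples

    def theta_dot(self, theta):
        self.history.append(theta)
        if len(self.history) <= self.w:
            return 0.0                        # not enough samples yet: no trend
        # First-order estimate over the window: (theta(t) - theta(t - w*dt)) / (w*dt)
        return (self.history[-1] - self.history[0]) / (self.w * self.dt_s)

    def hold_execute(self, theta, theta_h):
        """True = suppress Execute: confidence exceeds theta_H but is falling
        fast enough to self-resolve before the confirmation window elapses."""
        return theta > theta_h and self.theta_dot(theta) < -self.gamma_damp
```

Feeding a decaying spike (0.95 down to 0.70 against \(\theta_H = 0.6\)) shows the hold engaging once the window fills and the downward trend is established.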
Combined Activation Example
When all three mechanisms are simultaneously active — as occurs during a high-severity RAVEN heating event — their interaction proceeds as follows: (1) The Schmitt Trigger fires when the health score crosses \(\theta_H\), initiating a healing action. (2) The Adaptive Refractory Backoff begins: the system will not fire another healing action for \(\tau_\text{ref}(0)\) seconds. (3) The Derivative Confidence Dampener simultaneously suppresses the derivative signal. If the dampener weight is below 0.5 when the refractory window expires, the release condition is not met even if the health score has nominally recovered below \(\theta_L\) — the system remains in refractory until the dampener clears. Schmitt trigger release takes priority: once the health score is sustainably below \(\theta_L\) AND the dampener weight exceeds its clearance threshold, refractory ends.
Proposition 95 (Healing Algorithm Liveness). [BOUND] The composite flapping-prevention mechanism ( Definitions 47–49 ) terminates within \(n_{\max} = \lceil \log_2(T_{\text{mission}} / \tau_{\text{ref}}(0)) \rceil\) retry cycles. After \(n_{\max}\) failed retries, adaptive refractory backoff ( Definition 48 ) has extended \(\tau_{\text{ref}}(n)\) beyond the remaining mission duration; the system transitions unconditionally to Terminal Safety State ( Definition 53 ) and the hardware veto interlock ( Proposition 32 ) takes effect. This is the global failure-safe exit. This guarantee holds when \(\tau_{\text{ceil}}\) exceeds the remaining mission duration. When the ceiling cap is reached before the backoff window exceeds the remaining mission time, the system may exhaust \(n_{\max}\) retries without the refractory period bounding the mission — in this case, the Minimum Viable System floor ( Definition 50 ) is activated after \(n_{\max}\) failures regardless.
No matter how many times RAVEN ’s healing loop fails and backs off, it will reach the terminal safety state in at most 5 retries rather than retrying indefinitely.
RAVEN calibration: \(\tau_{\text{ref}}(0) = 240\) s, \(T_{\text{mission}} = 7200\) s, giving \(n_{\max} = 5\) retries. For OUTPOST with 72-hour missions, scale \(\tau_{\text{ref}}(0)\) upward to keep \(n_{\max} \leq 10\).
Empirical status: The \(n_{\max} = 5\) value is specific to the RAVEN initial refractory period of 240 s and 2-hour mission duration; different initial refractory periods or mission durations produce different retry budgets, and the terminal-safety fallback guarantee depends on the initial refractory being large enough relative to the expected healing action duration.
Watch out for: the liveness guarantee requires that the backoff window exceed the remaining mission time before the ceiling cap is reached; when the ceiling cap is reached first, the system can exhaust \(n_{\max}\) retries without the refractory period naturally terminating the loop — in this case, transition to the Terminal Safety State is triggered by the MVS floor rather than the backoff bound, and the formal liveness certificate does not apply.
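Under pure doubling backoff, the retry budget follows directly from the calibration numbers. A minimal sketch (function name hypothetical; assumes no ceiling cap, per the guarantee's precondition):

```python
def retry_budget(tau_ref0_s: float, t_mission_s: float) -> int:
    """Smallest n such that the doubled refractory window
    tau_ref0 * 2**n exceeds the mission duration -- the point at which
    Proposition 95's backoff-based exit fires.  Assumes pure doubling
    backoff (Definition 48) with no ceiling cap."""
    n = 0
    while tau_ref0_s * 2 ** n <= t_mission_s:
        n += 1
    return n

# RAVEN: 240 s initial refractory, 2 h mission
print(retry_budget(240.0, 7200.0))  # 5
```

The sequence 240, 480, 960, 1920, 3840, 7680 s crosses the 7200 s mission duration at the fifth doubling, reproducing the \(n_{\max} = 5\) calibration.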
Proposition 32 (Hardware Veto Invariant). The L0 Physical Safety Interlock (Definition 108) exposes a boolean veto signal \(v(t)\) to the MAPE-K Execute phase. When \(v(t) = 1\):
When the hardware thermal fuse trips on an OUTPOST sensor node, no software path — however urgent — can issue another restart command that would worsen the damage.
The Execute phase is bypassed for this tick, so no healing action is issued to component \(c\). The demand accumulator \(Q_d\) ( Definition 46 ) is not incremented, so no silent backlog builds during the veto period. The Knowledge base \(K\) records a VETO_ACTIVE event with the component identifier and tick timestamp.
No retry, no timeout override, no software path to resume execution while \(v(t) = 1\). Veto termination: \(v(t) = 0\) requires physical human action ( Definition 108 : non-resettability from software). Claim: for any component \(c\) and any interval \([t_1, t_2]\) with \(v(t) = 1\) for all \(t \in [t_1, t_2]\), no healing action executes on \(c\) during \([t_1, t_2]\).
The proposition formally states that no healing action fires on component \(c\) while the L0 hardware veto is asserted: software reads \(v(t)\) at each Execute tick and skips the tick if the veto is set, preventing thermal runaway from the loop endlessly retrying commands to a fused actuator. The signal originates from a physical latch circuit and is non-bypassable from any software path by construction. A software-only veto does not achieve this guarantee — firmware bugs or stack corruption can bypass register writes, so the latch must be a dedicated physical circuit external to the processor.
Proof. \(v(t) = 1\) causes Execute to be skipped at every tick. By Definition 108 (non-resettability from software), \(v\) remains \(1\) until physical intervention — no autonomous path exists to set \(v(t) = 0\). Therefore no action executes in \([t_1, t_2]\). \(\square\)
Reset path: The hardware veto can only be cleared by a physical operator action (power cycle or manual interlock reset). Software cannot clear or suppress it — any attempt to write to the veto register while \(v(t) = 1\) is a no-op by hardware design.
Infinite retry impossibility: total healing executions on \(c\) satisfy \(N_{\text{exec}}(c) \leq N_{\text{reset}} \cdot n_{\max}\), where \(N_{\text{reset}}\) is the number of physical human resets (finite by construction) and \(n_{\max}\) is the retry bound of adaptive refractory backoff ( Definition 48 ). \(\square\)
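The Execute-phase veto check reduces to a few lines. This is an illustrative sketch of the software side only (names hypothetical); the guarantee itself rests on \(v\) being a hardware latch, not on this code.

```python
def execute_tick(v: int, pending_action, Q_d: int, log: list):
    """One MAPE-K Execute tick under the hardware veto invariant
    (Proposition 32).  `v` is the latched L0 interlock signal read from
    hardware.  Returns (action_issued, Q_d)."""
    if v == 1:
        # Veto active: skip Execute, freeze the demand accumulator,
        # record the event in the Knowledge base.
        log.append("VETO_ACTIVE")
        return None, Q_d
    # Veto clear: issue the pending action and count the demand.
    return pending_action, Q_d + (1 if pending_action else 0)
```

With \(v = 1\) the pending action is dropped, \(Q_d\) stays frozen, and only the VETO_ACTIVE record is produced — matching the OUTPOST calibration below, where \(Q_d\) remains at 3 through the veto period.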
OUTPOST calibration: Thermal-fuse trip on sensor node after 3 restart attempts (\(n = 0, 1, 2\) per Definition 48 , refractory windows 10 s, 20 s, 40 s). At attempt 4, hardware temperature exceeds the fuse threshold: \(v(t) \to 1\). Execute is skipped; \(Q_d\) is frozen at 3; VETO_ACTIVE is logged. Without the veto invariant: attempts 4, 5, 6… each adding thermal load at 80 s intervals, leading to thermal runaway within 20 minutes. With the veto invariant: the node enters Terminal Safety State ( Definition 53 ) and awaits physical inspection. \(Q_d\) remains at 3 — no burst discharge on veto release.
( Definition 53 is introduced below in the Terminal Safety State section.)
Watch out for: the invariant holds only when \(v(t)\) originates from a dedicated physical latch circuit external to the processor — a software-implemented veto flag (a memory-mapped register or firmware variable) can be inadvertently cleared by stack corruption, heap overflow, or firmware bugs, eliminating the no-retry guarantee precisely in the failure conditions where it is most needed.
Cognitive Map: Healing under uncertainty layers three defenses against wrong action. First, cost-calibrated confidence thresholds ( Proposition 29 ) set the act/wait boundary from measured FP/FN cost ratios rather than intuition — the threshold adapts continuously with mission phase, resource level, and connectivity. Second, staleness-aware suppression ( Definition 45 ) progressively disables low-severity healing actions as the Knowledge Base ages, ensuring that stale data drives fewer autonomous decisions. Third, control-theoretic oscillation prevention ( Definitions 28 and 47–49 and Proposition 30 ) suppresses healing oscillation through five mechanisms: dead-band confirmation, Schmitt trigger hysteresis, anti-windup accumulation, adaptive refractory backoff, and derivative confidence dampening. The hardware veto invariant ( Proposition 32 ) is the hard floor — when the L0 physical interlock fires, no software path can override it. Next: when multiple components need healing simultaneously, restart order matters — the following section addresses dependency-aware sequence planning.
Recovery Ordering
When multiple components fail simultaneously, healing them in the wrong order causes immediate re-failure. An application server restarted before its database reconnects fails to initialize. The healing action completes technically but the component stays broken.
The response is to model dependencies as a directed graph and restart in topological order — each component starts only after all its dependencies are healthy. For circular dependencies and resource-constrained scenarios, stub mode and the Minimum Viable System identify the minimal safe starting set without resolving the cycle.
Topological ordering requires knowing the dependency graph. At the edge, this graph is often partially known. Conservative assumptions — assume a dependency exists when unknown — produce correct but potentially slower restart sequences.
Dependency-Aware Restart Sequences
When multiple components need healing, order matters.
Consider a system with database D, application server A, and load balancer L. The dependencies:
- A depends on D (needs database connection)
- L depends on A (needs application endpoint)
If all three need restart, the correct sequence is: D, then A, then L. Restarting in wrong order (L, then A, then D) means L and A start before their dependencies are available, causing boot failures.
Formally, define dependency graph \(G = (V, E)\) where:
- \(V\) = set of components
- \(E\) = set of dependency edges; \((A, B) \in E\) means A depends on B
The correct restart sequence is a topological sort of \(G\): an ordering where every component appears after all its dependencies.
Physical translation: \(\sigma(B) < \sigma(A)\) means B appears earlier in the restart sequence than A — restart B before A. The constraint states that for every dependency edge \((A, B)\), B must restart first. A topological sort is any ordering that satisfies this constraint for every edge simultaneously. When the graph has no cycles, such an ordering always exists and can be computed in \(O(|V| + |E|)\) time.
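The ordering constraint above can be computed with Kahn's algorithm. A minimal sketch using the D/A/L example (function name hypothetical; the quadratic inner scan keeps the sketch short — a production version would maintain in-degree counts for true \(O(|V| + |E|)\)):

```python
from collections import deque

def restart_order(deps):
    """Topological restart sequence: each component appears only after
    all of its dependencies.  deps[X] is the set of components X depends
    on ((A, B) in E  <=>  B in deps[A]).  Raises on cycles."""
    remaining = {c: set(d) for c, d in deps.items()}
    ready = deque(sorted(c for c, d in remaining.items() if not d))
    order = []
    while ready:
        c = ready.popleft()
        order.append(c)
        # Mark dependency c as satisfied for every waiting component.
        for other, d in remaining.items():
            d.discard(c)
            if not d and other not in order and other not in ready:
                ready.append(other)
    if len(order) != len(deps):
        raise ValueError("dependency cycle: no topological order exists")
    return order

# D <- A <- L example from the text: A depends on D, L depends on A.
print(restart_order({"D": set(), "A": {"D"}, "L": {"A"}}))  # ['D', 'A', 'L']
```

The `ValueError` branch is exactly the circular-dependency case treated in the next subsection: no valid ordering exists, so the cycle must be broken before sequencing.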
Edge challenge: The dependency graph may not be fully known locally. In cloud environments, a centralized registry tracks dependencies. At the edge, each node may have partial knowledge.
Strategies for incomplete dependency knowledge:
Static configuration defines dependencies at design time and distributes them to all nodes. It works for stable systems but doesn’t adapt to runtime changes. Runtime discovery observes which components communicate with which others during normal operation and infers dependencies from those communication patterns, though it is risky if observations are incomplete. Conservative assumptions treat unknown dependencies as existing, which may result in unnecessary delays but avoids incorrect ordering.
Circular Dependency Breaking
Some systems have circular dependencies that prevent topological sorting.
Example: Authentication service A depends on database D for user storage. Database D depends on authentication service A for access control. Neither can start without the other.
The diagram below shows the mutual dependency as a cycle: each arrow indicates a startup requirement, and both nodes are red because neither can satisfy the other’s precondition.
```mermaid
graph LR
    A["Auth Service"] -->|"needs users from"| D["Database"]
    D -->|"needs auth from"| A
    style A fill:#ffcdd2,stroke:#c62828
    style D fill:#ffcdd2,stroke:#c62828
```
Read the diagram: Both nodes are red — neither can satisfy the other’s startup precondition. Auth Service needs users from the Database; Database needs auth from the Auth Service. The cycle means topological sort is undefined: no valid ordering exists. Both components require the other to already be running, creating a deadlock at startup.
Strategies for breaking cycles:
Cold restart all simultaneously: Start all components in the cycle at once and rely on retry loops until they stabilize. This is a race: the approach works for simple cases but is unreliable for complex cycles.
Stub mode: Start A in degraded mode that doesn’t require D (e.g., allow anonymous access temporarily). Start D using A’s degraded mode. Once D is healthy, promote A to full mode requiring D. The three-step startup order is: A in stub mode first, then D, then A promoted to full mode.
Quorum-based: If multiple instances of A and D exist, restart subset while others continue serving. RAVEN example: restart half the drones while others maintain coverage, then swap.
Cycle detection and minimum-cost break: Use DFS to find cycles. For each cycle, identify the edge with lowest “break cost”—the dependency that is easiest to stub or bypass. Break that edge.
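The cycle-detection-and-break strategy can be sketched as follows. The stub costs and function names are hypothetical; the point is the structure: DFS finds a cycle, the cheapest-to-stub edge is removed, and the process repeats until a topological order exists.

```python
def find_cycle(deps):
    """DFS for a dependency cycle; returns the list of edges (X, Y)
    along one cycle (X depends on Y), or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {c: WHITE for c in deps}
    stack = []

    def dfs(c):
        color[c] = GRAY
        stack.append(c)
        for d in deps[c]:
            if color[d] == GRAY:                 # back edge: cycle found
                i = stack.index(d)
                cyc = stack[i:] + [d]
                return list(zip(cyc, cyc[1:]))
            if color[d] == WHITE:
                found = dfs(d)
                if found:
                    return found
        color[c] = BLACK
        stack.pop()
        return None

    for c in deps:
        if color[c] == WHITE:
            found = dfs(c)
            if found:
                return found
    return None

def break_cycle(deps, stub_cost):
    """Remove the cheapest-to-stub edge from each cycle (hypothetical
    stub_cost[(X, Y)] = cost of letting X start without Y, e.g. via
    stub mode), until a topological order exists."""
    deps = {c: set(d) for c, d in deps.items()}
    while (cyc := find_cycle(deps)) is not None:
        x, y = min(cyc, key=lambda e: stub_cost.get(e, float("inf")))
        deps[x].discard(y)
    return deps

# Auth <-> Database cycle from the diagram; stubbing Auth's need for the
# Database (anonymous access) is assumed cheaper than the reverse.
acyclic = break_cycle({"Auth": {"DB"}, "DB": {"Auth"}},
                      {("Auth", "DB"): 1, ("DB", "Auth"): 5})
print(acyclic)  # {'Auth': set(), 'DB': {'Auth'}}
```

Breaking the Auth-to-DB edge corresponds to the stub-mode strategy above: Auth starts in degraded mode without the database, then the restored graph dictates the remaining order.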
Minimum Viable System
Not all components are equally critical. When resources for healing are limited, prioritize the components that matter most.
Definition 50 ( Minimum Viable System ). The minimum viable system is the smallest subset of components \(S \subseteq V\) whose combined capability \(C(S)\) satisfies \(C(S) \geq C_{\min}\), where \(C_{\min}\) is the basic mission capability threshold. Formally: \(\text{MVS} = \arg\min_{S \subseteq V} |S|\ \text{subject to}\ C(S) \geq C_{\min}\).
Physical translation: Minimize the number of components (smallest \(|S|\)) while keeping combined capability \(C(S)\) at or above the mission-critical threshold \(C_{\min}\). The MVS answers: “if I can only heal \(N\) components and want to maximize operational capability, which \(N\) should I prioritize?” — not any \(N\), but the smallest \(N\) that clears the floor. Every component outside the MVS is a candidate to remain offline when healing resources are scarce.
The definition identifies the smallest component subset that preserves all critical functions at capability level L1 or above, providing the priority boundary for healing actions when resources are scarce. The set is solved greedily (\(O(\ln |V|)\) approximation (theoretical bound)), re-evaluated at each 10% resource drop boundary (illustrative value). The MVS list is established at design time rather than computed under resource stress — greedy computation during a crisis can itself consume the remaining power and compute budget that the MVS definition is meant to protect.
In other words, the MVS is the leanest set of components that still keeps the system above the minimum acceptable capability level; every component outside the MVS is a candidate to remain offline when healing resources are scarce.
For RAVEN , the MVS comprises flight controller, collision avoidance, mesh radio, and GPS; non- MVS components — high-resolution camera, target classification ML, and telemetry detail — can remain degraded when healing resources are scarce.
Proposition 33 ( MVS Approximation). Finding the exact MVS is NP-hard (reduction from set cover). However, a greedy algorithm that iteratively adds the component maximizing capability gain achieves approximation ratio \(O(\ln |V|)\).
Under resource scarcity, always heal the component that contributes the most new capability — the greedy choice is guaranteed to find a near-optimal minimum viable set.
Precondition — submodularity: The greedy \(O(\ln |V|)\) approximation guarantee requires the capability function to be submodular (diminishing marginal returns): for all \(S \subseteq T \subseteq V\) and component \(i \notin T\), \(C(S \cup \{i\}) - C(S) \geq C(T \cup \{i\}) - C(T)\). This holds when no two components are mutual prerequisites for a capability. It fails when two components are jointly required (e.g., a crypto module + networking stack jointly unlock secure gossip, but neither alone contributes). In that case: (1) treat the pair as a single compound component in the greedy algorithm; (2) verify submodularity by checking all component pairs before running greedy. Failure to verify submodularity may produce a greedy solution 2–3x larger than the true MVS .
Proof sketch: MVS is a covering problem: find the minimum set of components whose combined capability exceeds the threshold \(C_{\min}\). When the capability function exhibits diminishing marginal returns (submodular), the greedy algorithm achieves \(O(\ln |V|)\) approximation, matching the bound for weighted set cover. For small component sets, enumerate solutions. For larger sets, use the greedy approximation: iteratively add the component that contributes most to capability until \(C_{\min}\) is reached.
In other words, the exact MVS is computationally intractable for large systems, but always-pick-the-most-useful-component-next finds a solution at most \(O(\ln |V|)\) times larger than the true minimum.
Physical translation: The greedy sensor selection algorithm always achieves at least 63% of the coverage the theoretically optimal set would provide. For OUTPOST with 127 sensors, this means the greedy minimum-viable set may miss observability of up to 37% of the threat surface — acceptable for survival mode operation, not for full-capability operation. In practice, the greedy algorithm typically achieves 85–90% of optimal coverage, with the 63% bound being the worst-case guarantee.
Watch out for: the \(O(\ln |V|)\) approximation guarantee requires the capability function to be submodular; when components have joint prerequisites (neither alone contributes to a capability but together they unlock it), submodularity fails and the greedy algorithm may return a set \(2\text{–}3\times\) larger than the true MVS — such pairs must be pre-identified and treated as compound components before running greedy, and failure to verify submodularity produces a silently suboptimal MVS that wastes healing resources on unnecessary inclusions.
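The greedy approximation is a few lines. A minimal sketch with an additive (hence submodular) toy capability function — the component weights below are illustrative, not the RAVEN capability model:

```python
def greedy_mvs(components, capability, c_min):
    """Greedy minimum-viable-system approximation (Proposition 33):
    repeatedly add the component with the largest marginal capability
    gain until the threshold is cleared.  `capability` must be
    submodular for the O(ln |V|) guarantee to hold."""
    S = set()
    while capability(S) < c_min:
        best = max(
            (c for c in components if c not in S),
            key=lambda c: capability(S | {c}) - capability(S),
        )
        if capability(S | {best}) == capability(S):
            raise ValueError("threshold unreachable: no component adds capability")
        S.add(best)
    return S

# Toy RAVEN-style weights (illustrative): additive, hence submodular.
weights = {"flight_ctrl": 0.4, "collision_avoid": 0.3, "mesh_radio": 0.2,
           "gps": 0.1, "hires_camera": 0.05, "target_ml": 0.05}
cap = lambda S: sum(weights[c] for c in S)
print(sorted(greedy_mvs(weights, cap, c_min=0.9)))
```

The submodularity caveat above applies directly: if two components only contribute jointly, their marginal gains are both zero in isolation and the greedy loop never selects them — the compound-component workaround must be applied first.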
Game-Theoretic Extension: Shapley Values for Critical Component Identification
Proposition 33 ’s greedy set-cover approximation identifies a minimum feasible component set. It does not identify which components are most critical to MVS achievability — a question answered by the Shapley value of the cooperative game over component contributions.
MVS cooperative game: Players are the \(n\) nodes (or components). The characteristic function \(v(S)\) is the mission completion probability achievable with the components contributed by coalition \(S\).
The Shapley value of node \(i\) measures its average marginal contribution across all possible coalition orderings: \(\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n - |S| - 1)!}{n!}\,\big[v(S \cup \{i\}) - v(S)\big]\).
Shapley vs. minimum set: A node can appear in many minimum MVS coalitions (high Shapley value) without itself being in any one minimum set. High-Shapley nodes are single points of failure for MVS achievability — they appear in most coalitions that cross the feasibility threshold.
RAVEN application: When drone 23 fails and coverage must be redistributed, the drones needed to fill the gap have high Shapley values in the coverage MVS game. Allocating healing resources (battery reserve, repositioning priority) proportional to Shapley values is efficient (total mission value maximized) and satisfies the fairness axioms of efficiency, symmetry, and marginality.
Practical implication: Pre-compute Shapley values for the MVS game during mission planning. Nodes with Shapley values above a criticality threshold receive higher power reserves, priority positions in healing queues, and stricter health monitoring thresholds (lower \(\theta^*\)).
For RAVEN ’s 47 drones, computing Shapley values over the relevant MVS coalitions (typically 5-10 drones) is tractable per mission phase.
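For coalitions of that size, exact Shapley values can be computed by direct enumeration of orderings. A minimal sketch with a hypothetical threshold game (the characteristic function below is illustrative, not a RAVEN coverage model):

```python
from itertools import permutations
from math import factorial

def shapley(players, v):
    """Exact Shapley values by averaging marginal contributions over all
    orderings -- tractable for the small (5-10 node) MVS coalitions.
    v(frozenset) is the characteristic function."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        S = frozenset()
        for p in order:
            phi[p] += v(S | {p}) - v(S)
            S = S | {p}
    return {p: x / factorial(n) for p, x in phi.items()}

# Toy threshold game: coalition succeeds (value 1) iff it holds node "a"
# plus at least one of "b", "c" -- "a" is the single point of failure.
v = lambda S: 1.0 if "a" in S and ({"b", "c"} & S) else 0.0
vals = shapley(["a", "b", "c"], v)
print(vals)
```

Node "a" receives \(\phi_a = 2/3\) while "b" and "c" each receive \(1/6\): the irreplaceable node dominates even though every minimum coalition has size two, and efficiency holds (\(\sum_i \phi_i = v(N) = 1\)).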
Cognitive Map: Recovery ordering converts the “what to heal” decision (confidence threshold) into the “in what order” decision. Topological sort handles the common case; stub mode breaks circular dependencies; the MVS identifies the minimum healing target when resources are exhausted. Shapley values extend the MVS from a feasibility question (which components must run?) to a criticality question (which components are hardest to replace?) — enabling resource allocation proportional to irreplaceability. Together these form a layered priority structure: heal MVS components first, in topological order, starting from the highest-Shapley node. Next: the healing loop itself is a power consumer — as resources deplete, even the autonomic monitoring must throttle to preserve survival time.
Dynamic Fidelity Scaling
Every gossip round, every Kalman update, every reputation EWMA is energy subtracted from the mission. At full battery this overhead is negligible; near the survival threshold it competes directly with the functions it was designed to protect.
The response is to define five observation regimes keyed to battery level. Each regime suspends a specific set of autonomic tasks, with downgrade boundaries set by measured power draws. The monitoring infrastructure throttles itself before the mission payload does.
Lower autonomic fidelity means slower anomaly detection and coarser health estimates. The system accepts higher false-negative rates to avoid dying from self-monitoring overhead — a deliberate exchange of detection capability for survival time.
Self-measurement is a parasitic load. Every gossip round, every Kalman update, every reputation EWMA is energy subtracted from the mission. At full battery this overhead is negligible. Near the survival threshold it competes directly with the functions it was designed to protect. Dynamic Fidelity Scaling (DFS) formalizes the feedback loop that throttles autonomic overhead as resources deplete — treating monitoring as a luxury earned only by surplus.
Definition 51 (Autonomic Overhead Power Map). Let \(\mathcal{P}_k\) denote the sustained power draw of level \(L_k\) autonomic tasks — monitoring, analysis, learning, and fleet coordination — excluding mission payload (propulsion, weapons sensors, payload compute). Decompose \(\mathcal{P}_k\) as \(\mathcal{P}_k = \lambda_k T_s + \mu_k T_d\):
Physical translation: Total autonomic overhead splits into radio cost (gossip rate \(\lambda_k\) times energy per packet \(T_s\)) and compute cost (decision rate \(\mu_k\) times energy per decision \(T_d\)). Because \(T_s \gg T_d\), radio cost dominates overwhelmingly. Reducing gossip rate from 8 Hz (L4) to 1/60 Hz (L0) cuts autonomic radio overhead by \(480\times\) — this single lever accounts for nearly the entire L0–L4 power ratio of \(420\times\).
The definition gives total autonomic overhead in milliwatts at each capability level from L0 to L4, with available power bounding which tiers are feasible. For RAVEN : L0 \(\approx 0.1\) mW (illustrative value); L1 \(\approx 3\) mW (illustrative value); L2 \(\approx 11\) mW (illustrative value); L3 \(\approx 21\) mW (illustrative value); L4 \(\approx 42\) mW (illustrative value). Simulation systematically underestimates \(\mathcal{P}_k\) by \(2{-}3\times\) (illustrative value) due to radio idle drain that does not appear in logic-level models — empirical measurement on real hardware is required for accurate tier feasibility assessment.
where \(\lambda_k\) is the gossip rate at level \(k\) (packets/second), \(T_s\) is the energy per radio packet, \(\mu_k\) is the decision rate of level-\(k\) algorithms, and \(T_d\) is the energy per local compute decision (both from Def 21). Because \(T_s \gg T_d\), radio cost dominates — gossip rate is the primary autonomic power lever.
For RAVEN (\(T_s = 5\) mJ/packet, \(T_d = 50\,\mu\text{J}\)/decision):
| Level | Primary autonomic tasks | Gossip \(\lambda_k\) | Radio \(\lambda_k T_s\) | Compute \(\mu_k T_d\) | Total \(\mathcal{P}_k\) |
|---|---|---|---|---|---|
| L0 | Heartbeat beacon | 1/60 Hz | ~0.08 mW | ~0 | ~0.1 mW |
| L1 | EWMA anomaly detection | 0.5 Hz | 2.5 mW | ~0.05 mW | ~3 mW |
| L2 | Kalman filter + state sync | 2 Hz | 10 mW | ~0.5 mW | ~11 mW |
| L3 | HLC + BFT peer validation | 4 Hz | 20 mW | ~1 mW | ~21 mW |
| L4 | Quorum + reputation learning | 8 Hz | 40 mW | ~2 mW | ~42 mW |
The L0–L4 ratio \(\mathcal{P}_4 / \mathcal{P}_0 \approx 420\) means full-fidelity autonomic operation consumes 420 times the power of heartbeat-only mode — a factor that dominates survival time in power-limited emergency conditions.
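The table's entries follow from the decomposition in Definition 51. A minimal sketch using the RAVEN per-packet and per-decision energies from the text; the decision rates passed in are back-solved from the table's compute column and should be treated as illustrative:

```python
def autonomic_power_mw(gossip_hz, decisions_hz, T_s_mj=5.0, T_d_uj=50.0):
    """Definition 51 power map: radio cost (lambda_k * T_s) plus compute
    cost (mu_k * T_d), returned in mW.  T_s and T_d are the RAVEN
    per-packet / per-decision energies from the text."""
    radio_mw = gossip_hz * T_s_mj          # mJ/s = mW
    compute_mw = decisions_hz * T_d_uj / 1000.0
    return radio_mw + compute_mw

p_L0 = autonomic_power_mw(1 / 60, 0)   # heartbeat only: ~0.08 mW radio
p_L4 = autonomic_power_mw(8, 40)       # quorum + learning: 42 mW total
print(round(p_L0, 2), round(p_L4, 1))
```

The L4 row reproduces exactly (40 mW radio + 2 mW compute); the L0 radio term comes out at 0.083 mW, consistent with the table's ~0.1 mW total once residual baseline compute is included.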
Definition 52 (Observation Regime Schedule). Let \(R(t) \in [0,1]\) be the normalized resource availability (battery SOC for power-constrained nodes). Define five observation regimes with hysteretic thresholds — an immediate downgrade threshold and an upgrade threshold offset above it by a hysteresis band:
| Regime | \(R(t)\) range (downgrade) | Active level | Suspended tasks |
|---|---|---|---|
| \(O_4\) High Fidelity | \(R \geq 0.90\) | L0–L4 | None |
| \(O_3\) Reduced Learning | \([0.50,\; 0.90)\) | L0–L3 | Bandit/Q-learning updates (Def 33), reputation EWMA (Def 44) |
| \(O_2\) Conservation | | L0–L1 | Kalman (Def 23), HLC tracking (Def 40), BFT validation (Def 43); gossip reduced to 0.5 Hz |
| \(O_1\) Survival | | L0 only | All radio transmissions, all analysis, all learning |
| \(O_0\) Terminal | | None | Trigger Terminal Safety State (Def 124) |
( Definition 53 is introduced below in the Terminal Safety State section.)
Downgrade is immediate on threshold crossing; upgrade requires recovery past the downgrade threshold by the full hysteresis band to prevent oscillation near the boundary.
Physical translation: Five operating modes ordered from richest to most frugal. The hysteresis band (5%, illustrative value) prevents the system from bouncing back to a higher regime until the battery has recovered a full 5% above the downgrade threshold — preventing rapid oscillation near regime boundaries, which would itself consume the power the downgrade was meant to save.
Phase gate prerequisite for L3 tasks (CI-02): Regime \(O_3\) activates HLC tracking ( Definition 61 ) and BFT peer validation ( Definition 64 ) — capabilities belonging to the Phase 2 and Phase 3 certification tiers of the Field Autonomic Certification ( Definition 104 ).
A node that transitions to \(O_3\) based solely on \(R(t) \geq 0.50\) without satisfying the corresponding phase gates runs L3-tier machinery (26 mW, 4 Hz gossip , CRDT causality validation) without verified correctness of the underlying coordination protocol. The correct precondition is \(R(t) \geq 0.50\) AND the Phase 2 gate passed — the gate must have been satisfied during commissioning.
In systems where Phase 2/3 certification has not been completed (e.g., early deployment phases), cap the maximum active level at L1 regardless of battery level: run \(O_2\) thresholds with L0–L1 tasks only, deferring BFT and HLC to post-certification.
Quorum availability gate for O_3 / O_4 (CI-03): L3 and L4 tasks include BFT validation (Def 43) and reputation quorum (Def 45), both requiring a local cluster quorum of \(\lceil 2n/3 \rceil + 1\) reachable peers. When a partition reduces the reachable cluster to \(n' < \lceil 2n/3 \rceil + 1\) nodes, BFT is structurally unavailable regardless of battery level.
Running L3/L4 tasks in this condition wastes energy (20–63 mW) without providing Byzantine guarantees: the plausibility predicate cannot be satisfied with fewer than \(\lceil 2n/3 \rceil + 1\) reachable neighbors.
Operational rule: before entering \(O_3\) or \(O_4\), verify \(n' \geq \lceil 2n/3 \rceil + 1\); if the condition fails, enter \(O_2\) regardless of \(R(t)\). For a CONVOY partition where only 6 of 12 vehicles remain in the cluster (below \(\lceil 2 \cdot 12 / 3 \rceil + 1 = 9\) required), the correct regime is \(O_2\) even at full battery.
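The hysteretic regime selection can be sketched as follows. Only the \(O_4\)/\(O_3\) boundaries (0.90, 0.50) and the 5% band come from the text; the \(O_2\)/\(O_1\) floors in the example are hypothetical placeholders, and the quorum/phase gates above would be checked before any upgrade in a real implementation.

```python
def next_regime(current, R, down, hysteresis=0.05):
    """Hysteretic regime selection (Definition 52).  down[k] is the
    downgrade threshold for regime O_k; upgrades require recovery past
    the next regime's floor by the hysteresis band."""
    k = current
    # Downgrade immediately to the highest regime whose floor R clears.
    while k > 0 and R < down[k]:
        k -= 1
    # Upgrade only with the hysteresis margin above the next floor.
    while k < max(down) and R >= down[k + 1] + hysteresis:
        k += 1
    return k

# O_4/O_3 boundaries from the text; O_2/O_1 floors are hypothetical.
down = {4: 0.90, 3: 0.50, 2: 0.25, 1: 0.10, 0: 0.0}
print(next_regime(4, 0.88, down))  # downgrades to 3
print(next_regime(3, 0.92, down))  # 0.92 < 0.90 + 0.05: stays at 3
print(next_regime(3, 0.96, down))  # upgrades to 4
```

The middle case is the point of the hysteresis band: at \(R = 0.92\) the battery has crossed back above the 0.90 downgrade threshold, but the regime does not upgrade until \(R \geq 0.95\).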
Proposition 34 (Self-Throttling Survival Gain). Let \(Q\) be the mission payload power (propulsion, payload compute; \(Q = 0\) in emergency ground mode). The survival time from current resource level \(R(t)\) to the next critical threshold \(R_{\text{crit}}\) under regime \(O_k\) is \(T_{\text{surv}} = E_{\max}\,(R(t) - R_{\text{crit}}) / (Q + \mathcal{P}_k)\).
Switching a grounded RAVEN drone from full-fidelity to heartbeat-only autonomics extends its survival window from 3.5 hours to 32 hours on the same battery.
Physical translation: Remaining energy divided by total power draw \(Q + \mathcal{P}_k\). Throttling reduces \(\mathcal{P}_k\) without affecting \(Q\) (mission payload); the survival time extends proportionally. For RAVEN near the survival threshold with propulsion off (\(Q = 5\) mW): full-fidelity (\(\mathcal{P}_4 \approx 42\) mW) gives 3.5 hours; survival-mode (\(\mathcal{P}_0 \approx 0.1\) mW) gives 32.7 hours — a \(9.3\times\) extension from a single configuration change.
The marginal survival gain from downgrading from \(O_k\) to \(O_{k-1}\) is \(\Delta T = E_{\text{rem}} \left( \frac{1}{Q + \mathcal{P}_{k-1}} - \frac{1}{Q + \mathcal{P}_k} \right) > 0\), since \(\mathcal{P}_{k-1} < \mathcal{P}_k\) by construction. Throttling always extends survival time; the only cost is reduced observability fidelity. \(\square\)
Empirical status: The \(9.3\times\) survival multiplier uses RAVEN -specific power values (\(\mathcal{P}_4 \approx 42\) mW, \(\mathcal{P}_0 \approx 0.1\) mW, \(E_{\max} = 1110\) mWh); the multiplier is sensitive to radio idle drain, which simulations underestimate by \(2\text{–}3\times\) — measure empirically on the target hardware before relying on this ratio.
Self-throttling trigger: the node transitions the instant \(R(t)\) crosses a downgrade threshold from above, and immediately suspends the tasks listed in Def 123. Regime state is stored in non-volatile memory so that a warm-reboot restores the correct throttle level without re-running \(R(t)\) estimation from scratch.
Watch out for: the survival time formula uses autonomic overhead power values that simulation systematically underestimates by \(2\text{–}3\times\) due to radio idle drain not captured in logic-level models — a node that computes a 32-hour survival window from simulated \(\mathcal{P}_k\) values may in practice have only 10–15 hours because the hardware radio idle current alone exceeds the simulated baseline; empirical measurements on target hardware are required before the throttle thresholds are used operationally.
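The survival-window arithmetic is a one-line formula. In this sketch the power values and \(E_{\max}\) come from the text, but \(R = 0.15\) is back-computed from the quoted 3.5 h figure and should be treated as illustrative; the result reproduces the quoted numbers to within rounding (32.6 h and \(9.2\times\) here vs. 32.7 h and \(9.3\times\) in the text).

```python
def survival_hours(R, E_max_mwh, Q_mw, P_k_mw, R_crit=0.0):
    """Proposition 34 survival window: remaining energy above the next
    critical threshold divided by total draw (Q + P_k)."""
    return E_max_mwh * (R - R_crit) / (Q_mw + P_k_mw)

# RAVEN, propulsion off (Q = 5 mW), E_max = 1110 mWh, R = 0.15 (back-
# computed from the quoted 3.5 h figure -- illustrative).
full = survival_hours(0.15, 1110.0, 5.0, 42.0)   # full-fidelity (L4)
surv = survival_hours(0.15, 1110.0, 5.0, 0.1)    # heartbeat-only (L0)
print(round(full, 1), round(surv, 1), round(surv / full, 1))
```

The ratio is driven almost entirely by \(\mathcal{P}_k\) in the denominator, which is why the idle-drain caveat above matters: inflate \(\mathcal{P}_0\) by a few mW of unmodeled radio idle current and the multiplier collapses.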
Proposition 35 (Autonomic Overhead Paradox). In PLM mode (\(Q = 5\) mW residual sensor power; propulsion off), the full-fidelity and survival-mode survival times from the same starting charge are \(T_{L4} = E_{\text{rem}} / (Q + \mathcal{P}_4) \approx 3.5\) h and \(T_{L0} = E_{\text{rem}} / (Q + \mathcal{P}_0) \approx 32.7\) h.
Near the survival threshold the monitoring stack itself is the largest power consumer — disabling it is worth more than any single healing action.
The L4-to-L0 throttle multiplier is \(9.3\times\) — the difference between a recovery team arriving before battery death and the drone expiring unrecovered.
Empirical status: The \(9.3\times\) multiplier is specific to the RAVEN power budget; the qualitative paradox (monitoring overhead competing with survival at low battery) is general, but the exact crossover threshold depends on platform-specific and \(E_{\max}\) values that must be measured per hardware variant.
Physical translation: The overhead of self-monitoring can exceed the savings it enables — near the survival threshold, the \(42\) mW consumed by full-fidelity autonomic overhead dwarfs the \(5\) mW sensor load. The \(9.3\times\) survival multiplier means the decision to throttle autonomic tasks at the survival threshold is worth more than any single healing action. The paradox: the system must disable its healing infrastructure to survive long enough for healing to matter. The correct trigger is the energy threshold, not a fault detection event.
The autonomic overhead paradox: at the survival threshold, the monitoring infrastructure designed to keep the node alive must be the first thing suspended. A node that refuses to throttle its L4 autonomic tasks in a resource crisis consumes itself — the MAPE-K loop becomes the proximate cause of death rather than its cure. The correct model is lexicographic: survival first, then observability, then fidelity. When resources fall below the survival threshold, the node does not ask “will suspending this task hurt the mission?” — it asks “does this task cost more energy than it saves?”
Interaction with Prop 79 (Stale Data Threshold): In \(O_1\) (Survival), gossip is suspended entirely — no new measurements arrive, so the stale-data threshold expires for all remote state. The node operates on stale world-state for the duration of \(O_1\). This is acceptable: in survival mode the only decision is whether to remain in \(O_1\) or transition to \(O_0\) (terminal), both of which are local decisions requiring no remote data.
Watch out for: the \(9.3\times\) multiplier assumes \(\mathcal{P}_0 \approx 0.1\) mW for survival mode, but this floor is also a simulation estimate — if the radio’s idle-receive drain is mischaracterized at \(\mathcal{P}_0\) just as it is at \(\mathcal{P}_4\), both ends of the ratio are wrong and the true multiplier may be substantially lower than \(9.3\times\); measure \(\mathcal{P}_0\) and \(\mathcal{P}_4\) independently on target hardware before concluding that a single throttle step provides near-order-of-magnitude survival extension.
Proposition 36 (Self-Throttling Law). The MAPE-K execution frequency is a resource-adaptive function of \(R(t)\): \(f_{\text{MAPE}}(R) = \max\!\big(\alpha(R)\, f_{\max},\; f_{\min}\,\mathbb{1}_{\text{crit}}\big)\),
As RAVEN battery falls toward the floor, the healing loop runs less frequently to survive — but never stops completely while an active failure is present.
The proposition reduces MAPE-K monitoring frequency proportionally to the remaining resource fraction \(R(t)\), preventing the autonomic loop from consuming more power than the primary mission at low battery levels. The minimum frequency \(f_{\min}\) applies only during active critical failure; the scaling function is piecewise linear between the resource floor and the full-frequency threshold. A throttled MAPE-K frequency visible in telemetry is the earliest leading indicator of an energy-budget crisis — it is observable before the resource constraint becomes binding.
where the throttle coefficient \(\alpha : [0,1] \to (0,1]\) is piecewise linear: \(\alpha(R) = 1\) at or above the full-frequency threshold, scaling linearly down toward \(\alpha_{\text{floor}}\) as \(R\) falls to \(R_{\text{floor}}\), and holding at \(\alpha_{\text{floor}}\) below it;
and the critical-failure indicator is \(\mathbb{1}_{\text{crit}}\) — active whenever any health component falls below the emergency threshold.
Parameters: \(f_{\max}\) ( Definition 1 ), \(R_{\text{floor}}\) (Point of No Return), \(\alpha_{\text{floor}}\), and \(f_{\min}\), calibrated for RAVEN .
Proof sketch: When \(R\) is at or above the full-frequency threshold, the system operates at full autonomic frequency \(f_{\max}\). Between \(R_{\text{floor}}\) and that threshold, execution frequency scales linearly, preserving CPU and power budget for survival tasks. Below \(R_{\text{floor}}\) (“Point of No Return”), autonomic actions above the survival tier are suspended; the MAPE-K loop drops to \(\alpha_{\text{floor}}\, f_{\max}\) to maintain minimal liveness. The \(\max\) term guarantees that \(f \geq f_{\min}\) whenever \(\mathbb{1}_{\text{crit}} = 1\) — even at \(R \to 0\) — preventing the healing loop from halting during an active emergency.
Liveness Guarantee: \(f_{\min} > 0\) and \(\mathbb{1}_{\text{crit}} = 1\) during any active critical failure by construction, so \(f \geq f_{\min} > 0\) whenever a critical failure is active. The Self-Throttling Law cannot silence the MAPE-K loop while a failure requiring response is present.
RAVEN calibration: \(R_{\text{floor}} = 0.05\). At \(R = 0.10\) (halfway between the floor and the full-frequency threshold): \(\alpha = 0.5\), so the MAPE-K loop runs at half its maximum rate. One avoided healing action at this resource level recovers \(\approx 4\,\text{s}\) of MAPE-K execution budget.
Floor constraint: the self-throttling formula must not reduce MAPE-K execution frequency below \(f_{\min} = 1/T_\text{tick,max}\), where \(T_\text{tick,max}\) is the maximum tolerable gap between autonomic observations for the current capability level. At the minimum viable monitoring frequency, the node can no longer adapt its behavior but retains the ability to detect entry into the Terminal Safety State. A node whose throttling formula would reduce frequency below \(f_\text{min}\) must instead enter the Observation Regime \(O_1\) sleep schedule rather than continuing sub-threshold MAPE-K execution.
Watch out for: the piecewise-linear throttle coefficient \(\alpha(R)\) assumes \(R(t)\) is measured accurately by the battery management IC, but BMS state-of-charge estimation typically carries a \(\pm 2\text{–}3\%\) SOC error — near \(R_{\text{floor}} = 0.05\), a 3% underestimate means the system drops to \(\alpha_{\text{floor}}\) at \(R = 0.08\) rather than \(R = 0.05\), firing the floor constraint 60% early and needlessly suppressing healing frequency while usable capacity remains; BMS calibration at the target operating temperature must be validated before the throttle thresholds are treated as absolute energy boundaries.
Proposition 37 (Weibull Circuit Breaker). Under the Weibull partition duration model ( Definition 13 ), when the partition duration accumulator ( Definition 15 in Why Edge Is Not Cloud Minus Bandwidth) exceeds the 95th-percentile partition duration \(Q_{0.95}\), the node immediately executes the following state transitions:
When a CONVOY partition hits the 95th-percentile duration, the system drops to survival mode and expects 17 more hours of denial — so it stops wasting battery on full-fidelity autonomics.
Transitions (1)–(3) fire immediately when \(T_{\mathrm{acc}} \geq Q_{0.95}\). Transition (4) fires on the subsequent partition-end event, at which point the node re-enters the standard capability ladder from L0.
Proof: By the Weibull CDF, \(\Pr[T > Q_{0.95}] = 0.05\). A circuit breaker at \(Q_{0.95}\) therefore fires on at most 5% of partitions by construction — it is a rare, high-severity gate, not a routine transition.
Transition (1) is energetically justified by Proposition 1 : suspending L1–L4 autonomic overhead frees the overhead power (in mW) quantified in Definition 51 , extending the survival window. The expected remaining partition duration at the circuit-breaker threshold — the mean excess life — is:
\[e(t) = \mathbb{E}[T - t \mid T > t] = \frac{1}{S(t)} \int_t^{\infty} S(u)\,du\]
For the CONVOY calibration (\(k_N = 0.62\), \(\lambda_N = 10\) hr): the mean excess life at \(t = 27.1\) hr is approximately 17.4 hr — the system expects to remain denied for another 17 hours after the circuit breaker fires. Preserving resources for that duration is the correct response. \(\square\)
Physical translation: The circuit breaker fires when the partition has lasted longer than 95% of historically observed partitions for this environment. At that point, waiting for reconnection is statistically unlikely to succeed soon, and the system shifts to a lower-capability posture to conserve resources rather than continuing to hold state for a recovery that statistics say is not imminent.
CONVOY application: At mission hour 28 (27.1 hr into a sustained denied period), the circuit breaker fires on all 12 vehicles simultaneously. Formation-level survival function is maintained — heartbeat exchange, local threat detection, basic obstacle avoidance — while collaborative route planning and distributed sensor fusion are suspended. When connectivity resumes, \(T_{\mathrm{acc}}\) resets and the capability ladder begins recovery from L0 with standard gating.
Empirical status: The \(Q_{0.95} = 27.1\) hr threshold and mean excess life of 17.4 hr are derived from Weibull parameters \(k_N = 0.62\), \(\lambda_N = 10\) hr calibrated to CONVOY mountain terrain; different terrain profiles, jamming intensities, or atmospheric conditions will shift these values and require per-deployment Weibull fitting from partition logs.
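The quantile and mean-excess-life computations can be sketched numerically; this is an illustrative sketch, not the calibrated CONVOY fit — the resulting hours depend entirely on the per-deployment Weibull parameters, as the empirical-status note stresses.

```python
import math

def weibull_quantile(p, k, lam):
    """Q_p = lam * (-ln(1 - p))**(1/k): the duration exceeded by only
    a fraction (1 - p) of partitions under Weibull(k, lam)."""
    return lam * (-math.log(1.0 - p)) ** (1.0 / k)

def weibull_sf(t, k, lam):
    """Survival function S(t) = P[T > t]."""
    return math.exp(-((t / lam) ** k))

def mean_excess_life(t, k, lam, dt=0.05, horizon=1500.0):
    """e(t) = E[T - t | T > t] = (1/S(t)) * integral_t^inf S(u) du,
    integrated numerically by the trapezoid rule."""
    total, u = 0.0, t
    while u < horizon:
        total += 0.5 * (weibull_sf(u, k, lam) + weibull_sf(u + dt, k, lam)) * dt
        u += dt
    return total / weibull_sf(t, k, lam)
```

For a heavy-tailed fit (\(k < 1\)) the mean excess life grows with elapsed duration — the longer the partition has lasted, the longer it is expected to continue — which is precisely why the circuit breaker conserves resources rather than waiting.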
State coordination at reconnection: At reconnection, transition (4) resets \(T_{\mathrm{acc}}\) to zero, marking the start of a fresh partition-duration measurement window. The corresponding trust-window state in the fleet coherence layer (see Fleet Coherence Under Partition) resumes from its current Hybrid Logical Clock value without reset — the HLC accumulates causal history monotonically and is not cleared by partition boundaries. These two state variables therefore evolve on independent clocks: \(T_{\mathrm{acc}}\) is a per-partition odometer that resets, while the HLC trust window is a global causal counter that does not.
Interaction with Proposition 22 (Closed-Loop Stability): Transition (2) reduces the MAPE-K execution frequency, which increases the effective loop delay \(\tau\). By Proposition 22 ’s stability condition, the controller gain \(K_{\text{ctrl}}\) must be reduced in tandem with the frequency. The controller parameters stored in Definition 14 ’s bandit update (which also adjusts the partition-duration prior) jointly account for both the partition model and the control loop — the system self-calibrates under deep-survival conditions.
Physical translation: Four simultaneous state transitions fire when the Weibull circuit breaker trips: capability drops to L0 (survival-only), MAPE-K frequency drops to \(f_{\text{min}}\), the bandit model shifts to a heavier-tailed prior (expecting longer partition durations), and the accumulator resets on recovery. The floor at 0.30 prevents the model from overcorrecting to an infinitely heavy tail — even after a very long partition, the system retains some expectation of eventual recovery.
Chaos Validation: Proposition 37 defines a testable predicate; fault injection as a validation methodology is the basis of chaos engineering [14] . Three injection scenarios exercise it across the Weibull parameter space:
Micro-Burst: Rapid connectivity flapping with light-tailed, sub-minute bursts — simulating terrain edges and brief EW interference. Each partition ends before \(T_{\mathrm{acc}}\) can accumulate toward \(Q_{0.95}\). Pass criterion: circuit breaker never fires; \(T_{\mathrm{acc}}\) resets cleanly after every recovery; the Definition 14 bandit arm does not shift (zero normalized excess observed per partition).
The Long Dark: 72-hour sustained partition simulating complete satellite and mesh loss — terrain masking compounded by active EW. The circuit breaker fires at approximately hour 59. Pass criteria: (1) circuit breaker fires when \(T_{\mathrm{acc}} \geq Q_{0.95}\); (2) survival capability maintained continuously through hour 72; (3) outbound queue depth bounded; (4) on reconnection, \(T_{\mathrm{acc}}\) resets and the capability ladder re-engages from L0.
Asymmetric Link (uplink loss \(\geq 95\%\), downlink intact): Simulates one-way EW jamming — the node receives incoming traffic but cannot transmit telemetry or acknowledgements. No sojourn model applies; this tests regime classification accuracy and queue discipline under directional asymmetry. Pass criterion: regime classified as Intermittent (not Denied) within two gossip periods; \(\theta^*(t)\) begins the partition-aware drift; the unacknowledged outbound queue remains memory-bounded.
Watch out for: the \(Q_{0.95}\) threshold is derived from a single Weibull fit to all observed partition durations, but tactical environments typically produce a bimodal distribution — short terrain-induced interruptions (seconds to minutes) and long EW-induced blackouts (hours) that arise from entirely different physical causes; fitting a single Weibull family to this mixture produces a shape parameter \(k_\mathcal{N}\) between the two modes, yielding a \(Q_{0.95}\) that undershoots the long-blackout quantile by hours and causes the circuit breaker to fire too late on EW-induced partitions, exactly when early resource conservation matters most.
Cognitive Map: Dynamic Fidelity Scaling inverts the usual autonomy priority: the monitoring infrastructure throttles itself first, before the mission payload does. The five observation regimes (O4–O0) are defined by measured power draws from Definition 51 ; the Self-Throttling Survival Gain ( Proposition 34 ) shows that the L4-to-L0 throttle multiplier is \(9.3\times\) — a \(9\times\) difference in survival time from a single configuration decision. The Autonomic Overhead Paradox ( Proposition 35 ) captures the essential tension: near the survival threshold, the MAPE-K loop is the proximate threat to survival, not the failure it was designed to catch. The Weibull Circuit Breaker ( Proposition 37 ) automates this recognition — at the 95th-percentile partition duration, the system drops to L0 and expects another 17 hours of denied connectivity. Next: when the entire autonomic framework fails, a fixed terminal safety state handles the final fallback.
Terminal Safety State
When the MAPE-K loop itself fails — heap exhausted, kernel panic, watchdog chain failure — the autonomic software cannot heal itself, and some response must exist that operates entirely without L1+ software involvement. The answer is a fixed terminal safety state selected by L0 firmware as a function of remaining energy alone — no Analysis, no Planning, no Knowledge Base required. Three states (PLM, BOM, HSS) cover the range from weeks of passive listening to immediate hardware shutdown. The limitation is that the terminal state is static and cannot adapt: a drone in BOM can transmit its position but cannot reason about whether that transmission is tactically safe. The price of zero software dependency is zero software intelligence.
The MVS is the floor the healing algorithm defends. But the healing algorithm can itself fail — the MAPE-K loop may crash, its knowledge base may become corrupted, or its resource quota may be exhausted. Below MVS lies the terminal safety state: what the node does when all autonomy has been lost.
Definition 53 (Terminal Safety State). The terminal safety state is the operating mode the node enters when the entire autonomic framework — including the MAPE-K loop and all its L1+ dependencies — has failed and cannot self-repair. It is selected by L0 firmware as a function of remaining energy \(E\) alone:
\[S_{\text{term}}(E) = \begin{cases} \text{PLM} & E \geq E_{\text{PLM}} \\ \text{BOM} & E_{\text{HSS}} \leq E < E_{\text{PLM}} \\ \text{HSS} & E < E_{\text{HSS}} \end{cases}\]
Physical translation: A three-row lookup table on a single measured value — remaining battery fraction \(E\). Above \(E_{\text{PLM}}\) (20%): passive listening, recoverable. Between thresholds: beacon-only, locatable. Below \(E_{\text{HSS}}\) (5%): full hardware shutdown, tamper-secure. The entire decision logic fits in five lines of C with no function calls and no external dependencies — this is the design constraint that makes it L0-implementable.
The definition maps remaining energy to a deterministic safety state (PLM, BOM, or HSS) at every Execute tick, firing immediately on threshold crossing and pre-empting all other actions to prevent uncontrolled shutdown without state preservation or actuator parking. For RAVEN : PLM above 20% battery (illustrative value); BOM between 5% and 20% (illustrative value); HSS below 5% (illustrative value); each threshold carries a hysteresis band (illustrative value). The hysteresis band is structurally necessary: without it, a node oscillating near the HSS threshold cycles in and out of full shutdown — a state sequence more hazardous than remaining in HSS.
Three concrete states are ordered by endurance. In PLM (Passive Listening Mode), the radio operates in receive-only mode with no transmissions and computation limited to the hardware watchdog and energy monitor; endurance is weeks, and the node can receive a recovery command and re-initialize L1+ if the command arrives and power recovers. In BOM (Beacon-Only Mode), the node transmits a periodic low-power position and status beacon at a fixed interval, with no processing beyond beacon scheduling; endurance is days, enabling recovery teams to locate the node. In HSS (Hardware Safety Shutdown), all software subsystems are powered off and only the tamper-detection circuit and charge controller remain active; endurance is battery lifetime, and this state is appropriate when continued operation risks mission security (e.g., a radio active in a denied zone).
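The lookup-with-hysteresis logic can be sketched as follows (in Python for readability; the production version would be the few lines of branch-only C in L0 firmware that the text describes). The 2-percentage-point hysteresis band is an illustrative assumption.

```python
PLM, BOM, HSS = "PLM", "BOM", "HSS"

def terminal_state(e, prev, e_plm=0.20, e_hss=0.05, h=0.02):
    """Select the terminal safety state from battery fraction e alone.

    prev is the currently held state; the hysteresis band h keeps a
    node that has entered a lower state from flapping back out until
    e has risen clearly past the threshold.  h = 0.02 is illustrative.
    """
    # Holding HSS: require e to clear e_hss + h before climbing out.
    if prev == HSS and e < e_hss + h:
        return HSS
    # Holding BOM (or climbing out of HSS): require e_plm + h for PLM.
    if prev in (HSS, BOM) and e < e_plm + h:
        return HSS if e < e_hss else BOM
    # Plain thresholds on the way down.
    if e < e_hss:
        return HSS
    if e < e_plm:
        return BOM
    return PLM
```

Without the `prev`-dependent branches, a battery reading oscillating around 5% would cycle the node in and out of full shutdown — the hazardous sequence the text warns about.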
Hardware prerequisite and applicability scope: Definition 53 assumes the node has (1) a dedicated battery management IC (BMS IC) that exposes a real-time energy register readable by L0 firmware without L1+ involvement, (2) a hardware-controlled secure flash zeroization circuit triggered by a GPIO line from L0, and (3) a charge controller that can be commanded to cut load power while preserving BMS and tamper-circuit supply. These are standard on modern battery-powered edge nodes (DJI embedded controllers, Raspberry Pi CM4 with UPS HAT, custom tactical compute modules) but absent on most legacy industrial equipment (PLCs, RTUs, SCADA remotes). Applying Definition 53 to legacy hardware without these components results in a terminal state machine that cannot reliably reach HSS — the “energy register” does not exist, and “zeroization” requires L1+ firmware. For legacy brownfield systems, the terminal safety state reduces to a physical-layer action (pulling a relay that cuts main power), which is Tier 3 or Tier 4 of the Legacy Recovery Cascade ( Definition 56 ) rather than an autonomic software action.
Threshold calibration: \(E_{\text{PLM}}\) and \(E_{\text{HSS}}\) are platform-specific measured quantities, not default parameters. The RAVEN scenario (\(E_{\text{PLM}} = 20\%\), \(E_{\text{HSS}} = 5\%\)) is derived as follows:
| Threshold | Requirement | Computation | RAVEN value |
|---|---|---|---|
| \(E_{\text{HSS}}\) | Energy for one secure flash zeroization | 180 mJ measured at 3.7V; 5000 mAh battery: 5% = 925 mJ; \(5\times\) margin | 5% |
| \(E_{\text{PLM}}\) | PLM endurance until recovery team (72h) at 2 mA draw | 72h \(\times\) 2 mA = 144 mAh (~3%) + \(E_{\text{HSS}}\) + \(2\times\) cold-battery margin | 20% |
Calibration procedure for any platform: (1) measure secure shutdown energy at minimum operating temperature (worst case); (2) compute minimum PLM endurance from recovery SLA at maximum PLM draw; (3) add \(2\times\) margin for battery capacity reduction at minimum operating temperature (Li-Ion loses 30–61% of capacity at low temperature); (4) verify that \(E_{\text{PLM}}\) exceeds \(E_{\text{HSS}}\) by at least 10 percentage points to avoid threshold ambiguity near the boundary.
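The four calibration steps can be sketched as a small helper; parameter names and margins are illustrative, and the outputs are minimum thresholds — a platform rounds them up (RAVEN's deployed 5%/20% sit well above the computed minima), which is exactly what the separation check flags.

```python
def calibrate_terminal_thresholds(e_zeroize_j, battery_j, sla_hours,
                                  plm_draw_ma, v_nom,
                                  zeroize_margin=5.0, cold_margin=2.0,
                                  min_gap_pp=10.0):
    """Sketch of the calibration procedure (names illustrative).

    (1) worst-case zeroization energy with margin -> minimum E_HSS;
    (2) PLM endurance over the recovery SLA -> minimum E_PLM above E_HSS;
    (3) cold-battery capacity margin folded into step (2);
    (4) report whether the >= 10 percentage-point separation holds.
    Returns (e_hss, e_plm, gap_ok) as battery fractions.
    """
    e_hss = zeroize_margin * e_zeroize_j / battery_j
    plm_energy_j = sla_hours * 3600.0 * (plm_draw_ma / 1000.0) * v_nom
    e_plm = e_hss + cold_margin * plm_energy_j / battery_j
    gap_ok = (e_plm - e_hss) * 100.0 >= min_gap_pp
    return e_hss, e_plm, gap_ok
```

With RAVEN-like inputs (180 mJ zeroization, 5000 mAh at 3.7 V, 72 h SLA at 2 mA) the computed minima are tiny and fail the separation check — telling the integrator to widen the deployed thresholds, as RAVEN does.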
Critically, terminal-state selection must be implemented entirely within L0 firmware — the transition logic must satisfy the dependency isolation requirement ( Definition 18 ): zero imports from L1+ code.
Proposition 38 (Safety State Reachability). For any system state \(S\) — including states where all L1–L4 layers have crashed — the terminal safety state is reachable via L0 hardware operations alone.
Even when the entire MAPE-K stack has crashed, the hardware watchdog and L0 firmware can still select and enter the terminal safety state using only a battery level reading.
Proof: By Definition 18 , L0 has no dependencies on L1+; therefore L0 remains operational when all L1+ layers have failed. The software watchdog timer ( Definition 41 ) is implemented in dedicated hardware: it fires when the L1+ software stack stops issuing heartbeats, without requiring any L1+ cooperation. Upon watchdog fire, L0 reads the energy register \(E\) and enters the terminal state selected by Definition 53. The entire path — watchdog trigger, energy read, state entry — uses only hardware registers and L0 firmware. \(\square\)
Multi-failure convergence: When power degradation, connectivity partition, and sensor drift coincide simultaneously, the healing loop does not attempt to resolve all three in parallel. The priority ordering from Definition 43 (Resource Priority Matrix) applies: L0 hardware veto fires first (freezing actuators), MAPE-K shifts to diagnostic-only mode, and drift-compensation is suspended until power recovers above the L1 threshold ( Proposition 28 , Priority Preemption Deadline Bound). The terminal safety state is reached within the Proposition 28 deadline bound regardless of the failure combination order.
RAVEN scenario: Drone 23’s MAPE-K process crashes mid-healing (heap exhausted by a runaway recovery action). The L1+ watchdog daemon also fails (same heap). The hardware watchdog fires after 500ms — the heartbeat window. L0 reads \(E = 12\%\) (above \(E_{\text{HSS}} = 5\%\), below \(E_{\text{PLM}} = 20\%\)) and enters BOM. The drone begins transmitting its position beacon at 30-second intervals on the recovery frequency. The swarm’s gossip health protocol ( Definition 24 ) marks Drone 23 as RECOVERY-BEACON and routes a cluster lead to attempt L1+ re-initialization via the BOM command channel. This is exactly the guarantee Proposition 38 provides: the terminal state is reached from any state, regardless of which layers have failed.
Watch out for: the reachability proof assumes the energy register is accessible to L0 firmware via a dedicated hardware bus (e.g., a separate SPI channel wired directly to the BMS IC), so that L0 does not depend on any L1+ bus driver to read \(E\); if the BMS IC shares an I2C bus with L1+ peripherals and the L1+ driver holds a bus lock at crash time, L0 cannot read the energy register and must enter a fixed conservative state (HSS) without knowing whether PLM or BOM would be more appropriate — the L0-readability of the energy register must be verified as a hardware design requirement, not assumed.
Cognitive Map: The terminal safety state is the non-negotiable floor below the MVS. Selected entirely by L0 firmware from battery level alone — no L1+ code path exists — it satisfies the Dependency Isolation Requirement ( Definition 18 ) by construction. Proposition 38 guarantees reachability: from any system state, including one where every higher layer has crashed, L0 hardware operations can reach the terminal safety state. The three-level structure (PLM \(\to\) BOM \(\to\) HSS) grades the response to remaining energy, preserving recovery potential as long as battery allows. Next: legacy hardware that predates autonomic APIs requires an Autonomic Gateway to participate in the MAPE-K loop at all.
Autonomic Gateway
Most engineering analysis in this series assumes the managed hardware presents an observable health telemetry API — a process that responds to queries, emits structured health metrics, and accepts configuration commands. That assumption fails for legacy industrial equipment, embedded controllers, and tactical hardware designed before autonomic systems existed.
A 1990s diesel generator does not report its internal temperature. A legacy motor controller does not export a health vector. A cold-war-era radio does not accept remote restart commands. Yet these devices must participate in the MAPE-K healing loop — the system cannot simply exclude them because they lack a modern interface.
The Autonomic Gateway is a software adapter that presents legacy hardware to the MAPE-K loop as if it were a fully observable, API-driven system: it synthesizes health metrics from proxy signals, maps healing actions onto physical actuation primitives, and enforces cooldown and pre-condition constraints that the underlying hardware cannot enforce itself.
Definition 54 (Autonomic Gateway). An Autonomic Gateway for a legacy hardware device \(D\) is a tuple \((H, O, \varphi, \mathcal{A}, \Gamma)\) where:
- \(H\) is the set of target health metrics that the MAPE-K Monitor phase expects (e.g., temperature, fuel level, operational state)
- \(O\) is the set of observable proxy signals physically accessible from the gateway controller (e.g., current draw, ambient temperature, vibration amplitude, exhaust flow)
- \(\varphi\) is the inference function mapping observable proxies to health metric estimates; for each \(h_i \in H\), \(\varphi_i(o)\) yields a point estimate \(\hat{h}_i\) and uncertainty interval \(\sigma_i\)
- \(\mathcal{A}\) is the set of physical actuation primitives the gateway can execute on \(D\) (e.g., Modbus register write, GPIO signal, relay close, power cycle) (calligraphic \(\mathcal{A}\) distinguishes the actuation set from scalar state variables)
- \(\Gamma\) is the actuation mapping from MAPE-K healing commands to ordered sequences of physical primitives, including pre-conditions, post-conditions, and cooldown requirements
The gateway presents \((H, \Gamma(\cdot))\) to the MAPE-K loop and hides \((O, \varphi, \mathcal{A})\) as implementation details.
Authority tier assignment: an Autonomic Gateway node holds cluster-scope authority by default, unless explicitly provisioned to fleet-scope during Phase-0 commissioning. Healing actions issued through a gateway carry the gateway’s tier ceiling — a gateway cannot authorize actions that would require authority above that ceiling, even if the underlying legacy hardware is capable of executing them.
OUTPOST generator example: The generator has no telemetry port. The gateway observes current draw, ambient temperature, exhaust temperature, and vibration. The MAPE-K loop sees structured health reports and issues restart/shutdown commands; the gateway translates those commands into Modbus register writes and GPIO relay signals.
Physical translation: The gateway is a translator: legacy hardware speaks voltages and Modbus registers; the MAPE-K loop speaks health vectors and healing commands. The gateway converts in both directions. Its inferred health metrics ( Definition 55 ) are estimates with uncertainty bounds — the MAPE-K Analyze phase must treat them as uncertainty-bounded estimates, not as ground truth, or it will over-diagnose faults in legacy equipment that has no native health telemetry.
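A skeleton of Definition 54's tuple in code: the gateway exposes only \((H, \Gamma)\) and hides \((O, \varphi, \mathcal{A})\). All names, the command vocabulary, and the toy inference function are illustrative, not a real gateway API.

```python
class AutonomicGateway:
    """Presents (H, Gamma) to the MAPE-K loop; hides (O, phi, A)."""

    def __init__(self, inference, actuation_map, tier="cluster"):
        self._phi = inference        # phi: proxy readings -> {metric: (h_hat, sigma)}
        self._gamma = actuation_map  # Gamma: command -> ordered physical primitives
        self.tier = tier             # authority ceiling (cluster-scope by default)

    def health(self, proxy_readings):
        """Monitor-phase view: synthetic metrics with uncertainty."""
        return self._phi(proxy_readings)

    def execute(self, command):
        """Execute-phase view: translate a healing command into its
        ordered actuation sequence, or refuse if unmapped."""
        if command not in self._gamma:
            raise KeyError("no actuation mapping for %r" % command)
        return list(self._gamma[command])
```

A toy instantiation: a linear current-to-temperature proxy as \(\varphi\), and a restart command mapped to a Modbus write followed by a relay pulse. Refusing unmapped commands is the code-level expression of the tier ceiling: the gateway only ever executes sequences it was provisioned with.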
Definition 55 (Synthetic Health Metric). A synthetic health metric is an inferred measurement of a device-internal quantity that the hardware does not report directly, derived from a physical model relating observable proxy signals to the target quantity.
For the OUTPOST diesel generator, the gateway infers engine thermal state from an RC thermal circuit model. Let \(P_w(t) = (1-\eta)\,V_{\text{nom}}\,I(t)\) be the waste-heat power at time \(t\), where \(I(t)\) is measured current draw, \(V_{\text{nom}}\) is nominal supply voltage, and \(\eta\) is mechanical efficiency. Engine temperature evolves as:
\[\hat{T}(t) = T_{\text{amb}} + R_{\text{th}}\,P_w(t)\left(1 - e^{-s(t)/\tau_{\text{th}}}\right)\]
where \(R_{\text{th}}\) is thermal resistance, \(\tau_{\text{th}}\) is the thermal time constant, and \(s(t)\) is elapsed run time since the last cold start. Both \(R_{\text{th}}\) and \(\tau_{\text{th}}\) are calibrated once at commissioning by running the generator to thermal steady state while logging current draw and exhaust temperature.
Model uncertainty: \(\varphi_i\) carries irreducible estimation error \(\sigma_i^2 = \sigma_{\text{model}}^2 + \sigma_{\text{sensor}}^2\), where \(\sigma_{\text{model}}^2\) is the model residual variance and \(\sigma_{\text{sensor}}^2\) is proxy sensor measurement noise. The MAPE-K Analyze phase must treat \(\hat{h}_i \pm \sigma_i\) as the health estimate, not \(\hat{h}_i\) as a point truth.
Physical translation: The diesel generator’s internal temperature is not wired to any sensor the MAPE-K loop can read. This formula estimates it from current draw and run time. The estimate is unreliable during the first 30 seconds after cold start (model uncertainty exceeds the \(5^\circ\text{C}\) decision threshold) — the gateway signals “thermal state uncertain” and the MAPE-K loop withholds temperature-dependent decisions until the model warm-up period completes.
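The RC warm-up model is a few lines of code. The constants here (\(V_{\text{nom}} = 48\) V, \(\eta = 0.35\), \(R_{\text{th}} = 0.08\) K/W, \(\tau_{\text{th}} = 600\) s) are illustrative stand-ins for the commissioning-calibrated values, not measurements from the text.

```python
import math

def waste_heat_w(i_amps, v_nom=48.0, eta=0.35):
    """P_w = (1 - eta) * V_nom * I(t): electrical input not converted
    to mechanical work.  v_nom and eta are illustrative assumptions."""
    return (1.0 - eta) * v_nom * i_amps

def engine_temp_estimate(i_amps, t_amb_c, runtime_s,
                         r_th=0.08, tau_th=600.0):
    """First-order RC warm-up: T_hat = T_amb + R_th * P_w * (1 - e^{-s/tau}).
    r_th [K/W] and tau_th [s] stand in for the calibrated constants."""
    return t_amb_c + r_th * waste_heat_w(i_amps) * (
        1.0 - math.exp(-runtime_s / tau_th))
```

At \(s = 0\) the estimate collapses to ambient temperature regardless of load, which is why the cold-start window is exactly where the model is least trustworthy.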
Proposition 39 (Gateway Signal Coverage Condition). A gateway provides valid synthetic observability to the MAPE-K loop if and only if the following three conditions hold for every health metric \(h_i \in H\): (Coverage) the estimation bias satisfies \(|\mathbb{E}[\hat{h}_i] - h_i| \leq \delta_i\); (Timeliness) the inference time of \(\varphi_i\) is at most the monitor period \(T_M\); (Uncertainty) \(\sigma_i \leq \sigma_{\max,i}\).
The OUTPOST diesel gateway cannot claim valid temperature observability unless its model bias is below the decision threshold, its inference completes within one monitoring window, and its uncertainty is within the anomaly detector’s false-alarm budget.
where \(\delta_i\) is the acceptable bias for metric \(h_i\), \(T_M\) is the MAPE-K monitor period, and \(\sigma_{\max,i}\) is the maximum uncertainty the anomaly detector can tolerate while maintaining its false-alarm guarantee (Prop 3).
Proof sketch: If Coverage holds, the Analyze phase operates on a \(\delta_i\)-biased estimate, expanding the anomaly detection threshold by \(k\delta_i\). If Timeliness holds, the Monitor phase is not stale — inference completes within one monitoring window. If Uncertainty holds, the false-alarm rate of Prop 3 is preserved: substituting \(\hat{h}_i \pm k\sigma_i\) into the threshold criterion expands the effective threshold by at most \(k\sigma_i\), which stays within the design margin when \(\sigma_i \leq \sigma_{\max,i}\). If any condition fails, monitoring quality for that metric degrades to at most Heartbeat-Only (L0) level. \(\square\)
OUTPOST calibration: At commissioning, the thermal model achieves mean absolute error \(3.2^\circ\text{C}\) (illustrative value) — below the acceptable bias \(\delta_i\). Inference runs in 2ms (illustrative value) on the gateway ARM processor — well below the monitor period \(T_M\). Cold-start uncertainty (the first 30 seconds (illustrative value) before the estimate stabilizes) produces \(\sigma_i\) exceeding \(\sigma_{\max,i}\): the gateway signals “thermal state uncertain” and the MAPE-K loop withholds temperature-dependent healing decisions until \(s(t) > 30\text{s}\) (illustrative value).
Empirical status: The OUTPOST calibration values (thermal-model error, inference latency, 30 s cold-start window) are specific to this generator model and commissioning environment; different hardware, fuel composition, or ambient temperature range will produce different thermal time constants and uncertainty profiles requiring re-calibration.
Watch out for: the Coverage condition is verified at commissioning against a freshly serviced generator, but the thermal model coefficients \(R_\text{th}\) and \(\tau_\text{th}\) drift as the generator ages (deposits on heat exchangers, worn seals, fuel composition variation) — a gateway that passed Coverage at commissioning can silently violate it a year later, generating systematically biased temperature estimates that look valid to the MAPE-K loop while the true engine temperature exceeds the safe restart threshold by several degrees; absent an independent calibration check or a direct temperature sensor for periodic model re-validation, the Coverage condition should be treated as a commissioning guarantee, not a runtime invariant.
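The three-condition gate can be sketched as a per-metric check; a failed condition is what demotes that metric to heartbeat-only observability. The numbers in the usage example are illustrative, mirroring the OUTPOST calibration sketch above.

```python
def coverage_check(bias, delta, t_infer_s, t_monitor_s, sigma, sigma_max):
    """Proposition 39's three conditions for one health metric.

    Returns (valid, per_condition): the metric provides valid synthetic
    observability only if all three hold; otherwise it degrades to
    heartbeat-only (L0) observability.
    """
    per_condition = {
        "coverage":    abs(bias) <= delta,        # |E[h_hat] - h| <= delta_i
        "timeliness":  t_infer_s <= t_monitor_s,  # inference within one window
        "uncertainty": sigma <= sigma_max,        # sigma_i within FA budget
    }
    return all(per_condition.values()), per_condition
```

Returning the per-condition breakdown (rather than a bare boolean) matters operationally: a Timeliness failure calls for a faster gateway processor, while an Uncertainty failure calls for withholding decisions, as in the cold-start case.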
The autonomic loop invokes the recovery cascade under these ordered conditions: (1) MAPE-K attempt count for the current fault is below \(N_\text{retry}\) — retry at the next severity tier; (2) MAPE-K attempt count has reached \(N_\text{retry}\) and the action severity is \(\leq\) SEV_3 — invoke the cascade; (3) action severity is SEV_4 and the emergency-override condition holds — invoke the cascade with the override flag set, bypassing the attempt-count requirement.
Definition 56 (Legacy Recovery Cascade). A Legacy Recovery Cascade for hardware \(D\) is an ordered sequence of recovery tiers \(T_1, \ldots, T_K\), where each tier \(T_k\) is a tuple \((P_k, A_k, Q_k, W_k, C_k)\):
- \(P_k\): pre-condition predicate that must hold before \(T_k\) may execute
- \(A_k\): the physical actuation sequence (ordered primitives from \(\mathcal{A}\))
- \(Q_k\): post-condition predicate verifying that the tier had effect
- \(W_k\): recovery observation window [s] — time to wait before evaluating
- \(C_k\): cooldown period [s] — minimum time between successive invocations of tier \(k\)
The cascade executes tiers in order, advancing to tier \(T_{k+1}\) only if the post-condition evaluates false after \(W_k\) seconds.
OUTPOST generator cascade:
| Tier | Action | Pre-condition | Post-condition | Window | Cooldown |
|---|---|---|---|---|---|
| \(T_1\) | Modbus soft reset | Link up; engine below \(90^\circ\text{C}\) | Op state = RUNNING within 30s | 30s | 60s |
| \(T_2\) | Controlled stop then start (GPIO) | Engine below \(70^\circ\text{C}\); at least 60s since \(T_1\) | Current resumes baseline \(\pm 10\) A | 45s | 300s |
| \(T_3\) | Full power cycle via relay | At least 5 min since \(T_2\); fuel above 20% | Op state = RUNNING and current above 0 | 60s | 900s |
| \(T_4\) | Human escalation | All prior tiers failed; mission at or below BOM threshold | Operator acknowledged | — | — |
Physical translation: Try the softest fix first. If it fails after \(W_1\) seconds, wait out cooldown \(C_1\) and escalate. The cascade halts when something works or when it reaches the human-in-the-loop step. Pre-conditions exist because a hot restart can permanently damage certain hardware — the cascade respects the generator’s physics, not the MAPE-K loop’s impatience. Skipping straight to the aggressive fix is not faster; it risks making the hardware unrecoverable.
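A minimal executor sketch over Definition 56's tier tuples. The tier representation and the injectable `sleep` are illustrative; thermal suspension and the override timer are omitted for brevity.

```python
import time

def run_cascade(tiers, sleep=time.sleep):
    """Execute recovery tiers in order.

    For each tier: if the pre-condition holds, fire the actuation
    sequence, wait the observation window W_k, and check the
    post-condition; on failure, wait out cooldown C_k and escalate.
    Returns the 1-based index of the tier that succeeded, or None if
    every tier failed (human escalation).
    """
    for k, tier in enumerate(tiers, start=1):
        if not tier["pre"]():
            continue                 # tier unsafe to attempt; escalate
        tier["act"]()
        sleep(tier["w"])             # recovery observation window
        if tier["post"]():
            return k
        sleep(tier["c"])             # cooldown before the next tier
    return None
```

Note that a failed pre-condition skips the tier rather than forcing it — the executor respects the hardware's physics (a hot engine blocks \(T_1\) and \(T_2\)) instead of the loop's impatience.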
Proposition 40 (Recovery Cascade Correctness). Let \(D_{\text{heal}}\) be the deadline by which the MAPE-K healing loop must restore \(D\) to an operational state. The Legacy Recovery Cascade satisfies the healing deadline ( Proposition 21 ) if:
\[\sum_{k=1}^{K^*} \left( W_k + C_k + a_k \right) \;\leq\; D_{\text{heal}}\]
An OUTPOST generator cascade through all three tiers takes at most 23 minutes — well within the 90-minute MVS backup-power requirement — even if the generator is hot at failure.
The proposition bounds total cascade duration as the sum of wait, convergence, and actuation times across stages, with \(D_{\text{heal}}\) the healing deadline ( Proposition 21 ) and \(K^*\) equal to the number of cascaded recovery stages in the dependency graph. Each stage requires a hard per-stage timeout — a stage without one that blocks indefinitely on a failed upstream dependency is the single most common cascade failure mode in practice.
where \(K^*\) is the highest tier that must be attempted before declaring the device failed, and \(a_k\) is the actuation duration of tier \(k\). Additionally, the thermal pre-condition must hold at each tier boundary; if violated, the cascade suspends until the thermal model predicts cooling below the threshold. Thermal suspension has a maximum duration of \(T_{\text{thermal,max}}\). If the thermal pre-condition is not satisfied within \(T_{\text{thermal,max}}\), the cascade advances to the next lower tier regardless — bypassing the pre-condition check and logging a thermal-override event. This prevents indefinite suspension at ambient temperatures above the asymptotic cooling limit.
Proof: By induction on tier index. Base: \(T_1\) executes if its pre-condition holds and completes in \(W_1 + a_1\). Inductive step: if \(T_k\) fails (its post-condition evaluates false), the cascade advances to \(T_{k+1}\) after cooldown \(C_k\). Total elapsed time at tier \(K^*\) is \(\sum_{k=1}^{K^*} (W_k + C_k + a_k)\). Deadline satisfaction follows. The thermal suspension is correct: a hot restart above the thermal threshold risks mechanical seizure, converting a recoverable fault into permanent failure. \(\square\)
OUTPOST worst case: \(D_{\text{heal}} = 90\) min (illustrative value; the MVS requirement is backup power within 90 minutes of primary failure). Attempting \(T_1 \to T_2 \to T_3\) in sequence: 12.6 min (illustrative value) — well within the deadline. If the generator is hot at failure (\(92^\circ\text{C}\), illustrative value), the cascade suspends \(T_1\) and \(T_2\) until cooling. Using the thermal model, cooldown from \(92^\circ\text{C}\) (illustrative value) to \(70^\circ\text{C}\) (illustrative value) at ambient \(30^\circ\text{C}\) (illustrative value) takes 23.25 min (illustrative value).
Total cascade time with thermal wait: \(12.6 + 23.25 = 35.85\) min (illustrative value) — still within the 90-minute deadline, but consuming 40% (illustrative value) of the available budget, leaving limited margin if \(T_3\) also fails.
Empirical status: The OUTPOST cascade timing values (\(W_k\), \(C_k\), and the thermal cooldown time constant) are specific to this generator hardware and commissioning measurements; each tier’s window and cooldown must be validated by fault injection on the actual hardware at operating temperature extremes.
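The thermal wait follows from Newton cooling. A cooling time constant near 53 min is back-derived here from the article's illustrative 23.25-min figure (92 → 70 °C at 30 °C ambient) and is itself illustrative, not a commissioning measurement.

```python
import math

def thermal_wait_minutes(t_start, t_target, t_amb, tau_cool_min):
    """Newton-cooling wait from t_start down to t_target at ambient
    t_amb: t = tau * ln((t_start - t_amb) / (t_target - t_amb)).

    Returns infinity when ambient is at or above the target -- the
    asymptotic-cooling limit that the thermal-override timer
    T_thermal,max exists to guard against.
    """
    if t_amb >= t_target:
        return math.inf
    return tau_cool_min * math.log((t_start - t_amb) / (t_target - t_amb))
```

The infinite-wait branch is the code-level reason the thermal override must exist: at a hot enough ambient, passive cooling never reaches the pre-condition threshold, so the cascade would suspend forever without the \(T_{\text{thermal,max}}\) escape hatch.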
(Adaptive gain scheduling under GPS-denied navigation and cascade thermodynamics under sustained high-temperature jamming remain open problems outside the scope of this article.)
Watch out for: the deadline bound assumes each tier’s thermal pre-condition is evaluated from the same synthetic temperature estimate that drove the failure diagnosis; if the thermal model underestimates cooling rate (because \(\tau_\text{th}\) has drifted upward with age), the cascade lifts the thermal suspension before the engine has genuinely cooled, executing \(T_2\) or \(T_3\) on an over-temperature generator — and the thermal-override mechanism, which fires after \(T_{\text{thermal,max}}\), will mask this error by advancing the cascade regardless, converting a recoverable thermal fault into mechanical seizure precisely in the case where the model is most unreliable.
Cognitive Map: The Autonomic Gateway makes legacy hardware MAPE-K-compatible without modifying the hardware. The three-condition Signal Coverage Condition ( Proposition 39 ) bounds when synthetic observability is valid: bias within \(\delta_i\), inference within one monitoring window, uncertainty within the false-alarm budget. When any condition fails, that metric degrades to L0 observability only. The Legacy Recovery Cascade ( Definition 56 ) provides the action side: an ordered tier sequence with pre-conditions, post-conditions, cooldowns, and thermal suspension guards — ensuring the cascade respects the generator’s operating constraints rather than the MAPE-K loop’s impatience. Next: even without legacy hardware, simultaneous healing actions can overwhelm shared resources — cascade prevention addresses this.
Cascade Prevention
Healing consumes the same resources — CPU, bandwidth, power — needed for normal operation, and when multiple healing actions trigger simultaneously, resource contention prevents any from completing, leaving the system worse during healing than before. The solution is to reserve a fixed fraction of resources for healing (a healing quota), prioritize by MVS tier and resource efficiency, and spread simultaneous post-partition restarts using random jitter and staged waves. The trade-off is that a healing resource quota means some healing actions are queued even when the failure is serious — queueing bounds the resource spike but adds latency before the queued healing fires, a deliberate exchange of healing speed for system stability.
Resource Contention During Recovery
Healing consumes the resources needed for normal operation: CPU for MAPE-K analysis, action planning, and coordination; memory for healing state, candidate solutions, and rollback buffers; bandwidth for gossip coordination and status updates; and power for the additional computation and communication.
When multiple healing actions execute simultaneously, resource contention can prevent any from completing. The system becomes worse during healing than before.
Healing resource quotas: Reserve a fixed fraction of resources for healing. Healing cannot exceed this quota even if more problems are detected.
(The healing budget fraction is a distinct parameter from the throttle and threshold coefficients introduced above.)
If healing demands exceed quota, prioritize by severity and queue the remainder.
Prioritized healing queue: When multiple healing actions are needed, order by:
- Impact on MVS (critical components first)
- Expected time to complete
- Resource requirements (prefer low-resource actions)
Formally, the goal is to minimize total weighted completion time across all pending healing actions, where each action \(i\) carries a priority weight \(w_i\) and a completion time \(C_i\).
Classic scheduling algorithms (shortest job first, weighted shortest job first) apply directly.
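The weighted ordering above can be sketched with Smith's rule (weighted-shortest-processing-time), which minimizes \(\sum_i w_i C_i\) on a single resource pool: sort pending actions by weight divided by expected duration, descending. Action names, weights, and durations below are illustrative, not from the source:

```python
def order_healing_queue(actions):
    """Smith's rule: sort by priority weight / expected duration, descending.
    This minimizes total weighted completion time sum(w_i * C_i) when healing
    actions share a single resource pool."""
    return sorted(actions, key=lambda a: a["weight"] / a["duration"], reverse=True)

# Hypothetical pending healing actions (names and numbers are illustrative).
pending = [
    {"name": "restart_fusion_node", "weight": 9.0, "duration": 30.0},  # MVS-critical but slow
    {"name": "clear_cache",         "weight": 2.0, "duration": 1.0},   # low impact, very fast
    {"name": "reconfigure_radio",   "weight": 5.0, "duration": 10.0},
]
queue = order_healing_queue(pending)
```

Note how the cheap, fast action runs first even though its weight is lowest — its weight-to-duration ratio is highest, which is exactly the "prefer low-resource actions" criterion from the list above.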
Thundering Herd from Synchronized Restart
After a partition heals, multiple nodes may attempt simultaneous healing. This thundering herd can overwhelm shared resources.
Scenario: CONVOY of 12 vehicles experiences 30-minute partition. During partition, vehicles 3, 5, and 9 developed issues requiring healing but couldn’t coordinate with convoy lead. When partition heals, all three simultaneously:
- Request lead approval for healing
- Download healing policies
- Execute restart sequences
- Upload health status
The convoy’s limited bandwidth is overwhelmed. Healing takes longer than if coordinated sequentially.
Jittered restarts: Each node draws a random delay uniformly from \([0, T]\) and waits that long after the partition ends before initiating its healing sequence, spreading simultaneous arrivals across the jitter window.
The effect on load is dramatic: without jitter all \(n\) nodes hit at once, producing an instantaneous arrival rate of \(n \cdot \lambda\); with jitter the arrivals spread across the window, reducing the average rate by a factor of \(T\) to \(n\lambda / T\).
Jitter spreads load over time, preventing spike.
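A minimal sketch of jittered restarts for the 12-vehicle convoy scenario; the 30-second jitter window and the fixed seed are illustrative choices, not from the source:

```python
import random

def jittered_delay(window_T, rng):
    """Draw this node's restart delay uniformly from [0, window_T]."""
    return rng.uniform(0.0, window_T)

# 12 convoy vehicles restarting after a partition heals.
rng = random.Random(42)  # fixed seed so the sketch is reproducible
delays = [jittered_delay(30.0, rng) for _ in range(12)]

# Without jitter, all 12 arrivals land at the same instant; with jitter
# they spread across the window, bounding the instantaneous load.
bins = [int(d) for d in delays]          # 1-second buckets
peak = max(bins.count(b) for b in set(bins))
```

The peak per-second arrival count is what shared bandwidth must absorb; jitter turns one spike of 12 simultaneous requests into a handful per second.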
Staged recovery: Define recovery waves. Wave 1 heals highest-priority nodes. Wave 2 waits for Wave 1 to complete.
Formal comparison: with \(k\) waves of \(n/k\) nodes each, staged recovery reduces completion-time variance by a factor of \(k\) relative to a single simultaneous wave. For \(k = 3\) waves, variance reduces by a factor of 3, providing tighter bounds on total recovery time at the cost of \(k-1\) synchronization barriers.
Progressive Healing with Backoff
Start with minimal intervention. Escalate only if insufficient.
The healing escalation ladder progresses through six levels, advancing only when the lower level fails: retry (wait and retry the operation for transient failures), restart (restart the specific component), reconfigure (adjust configuration parameters), isolate (remove the component from active duty), replace (substitute with a backup component), and abandon (remove from the fleet entirely).
Between each escalation level, the system waits an exponentially increasing observation window:

\[ t_{\text{wait}}(k) = t_0 \cdot 2^k \]

where \(k\) is the level and \(t_0\) is the base wait time. The wait doubles with each level so that higher-severity interventions receive more time to demonstrate success before further escalation is triggered.
After the action at level \(k\), wait the full observation window before concluding it failed and escalating to level \(k+1\).
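The exponential observation window can be sketched directly; the base wait of 10 seconds is an illustrative default, not from the source:

```python
def observation_window(level, t0=10.0):
    """Observation window before declaring the level-k healing action failed:
    t0 * 2**level seconds. Higher escalation levels get exponentially more
    time to demonstrate success."""
    return t0 * (2 ** level)

# Level 0 (retry) waits 10 s; level 3 (isolate) waits 80 s before escalating.
waits = [observation_window(k) for k in range(4)]
```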
Multi-armed bandit formulation: Each healing action is an “arm” with unknown success probability. The healing controller must explore (try different actions to learn effectiveness) and exploit (use actions known to work).
The Upper Confidence Bound ( UCB ) [Auer et al., 2002b] algorithm provides an optimal exploration-exploitation tradeoff:

\[ \text{UCB}_a = \hat{p}_a + \sqrt{\frac{2 \ln t}{n_a}} \]

where \(\hat{p}_a\) is the estimated success probability for action \(a\), \(n_a\) is the attempt count for action \(a\), and \(t\) is total attempts across all actions. The exploration bonus grows for under-tried actions, ensuring eventual exploration.
Derivation: The exploration term follows from Hoeffding’s inequality. For a random variable bounded in \([0,1]\), \(\Pr(\hat{p}_a - p_a \geq \epsilon) \leq e^{-2 n_a \epsilon^2}\). Setting \(\epsilon = \sqrt{2 \ln t / n_a}\) yields confidence \(1 - t^{-4}\), which scales appropriately with sample count.
Select the action with highest UCB . This naturally balances known-good actions with under-explored alternatives.
Regret bound: UCB achieves regret \(O(\sqrt{KT \log T})\), where \(K\) is the number of actions and \(T\) is episodes. For RAVEN with \(K = 6\) healing actions over \(T = 100\) episodes, \(\sqrt{KT \ln T} \approx 53\): expected regret is bounded by roughly 53 suboptimal decisions—the system converges to a near-optimal healing policy within the first deployment month.
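A minimal UCB1 selection sketch under the stated assumptions; the healing action names and success counts below are hypothetical:

```python
import math

def ucb_select(stats, t):
    """stats maps action -> (successes, attempts); t is total attempts so far.
    Returns the action maximizing p_hat + sqrt(2 ln t / n_a).
    Untried actions (n_a == 0) are selected first, since their bonus is infinite."""
    best, best_score = None, float("-inf")
    for action, (successes, n) in stats.items():
        if n == 0:
            return action
        score = successes / n + math.sqrt(2 * math.log(t) / n)
        if score > best_score:
            best, best_score = action, score
    return best

# After 20 episodes: retry succeeds often, restart rarely, reconfigure untried.
choice = ucb_select({"retry": (8, 10), "restart": (3, 10), "reconfigure": (0, 0)}, t=20)
```

With the untried arm removed, `retry` wins: \(0.8 + 0.774 > 0.3 + 0.774\) — the exploration bonuses are equal at equal counts, so the empirical mean decides.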
Model Scope and Failure Envelope
Every analytical guarantee in this post rests on assumptions — linear dynamics, constant delay, stationary reward distributions — and real systems violate these. Deploying these mechanisms without understanding their validity domain produces unexpected failures. For each mechanism, the assumptions, the failure mode when each assumption is violated, the observable detection signal, and a concrete mitigation must be enumerated; the validity domain is not a footnote but the primary engineering decision. The cost of that enumeration is measurement infrastructure: knowing when you are outside the validity domain requires instrumentation that itself consumes resources and adds complexity.
Each mechanism has bounded validity. When assumptions fail, so does the mechanism.
MAPE-K Stability Analysis
Validity Domain:
The MAPE-K stability analysis holds only when the system state \(S\) satisfies all three assumptions simultaneously; violations narrow or eliminate the domain within which the analysis guarantees stability. The assumptions are:
- \(A_1\): System dynamics are approximately linear near operating point
- \(A_2\): Feedback delay \(\tau\) is approximately constant
- \(A_3\): No nested feedback loops (healing action does not affect its own sensing)
Stability Criterion: bounding the controller gain — and reducing it as feedback delay grows — ensures stability under discrete-time proportional control.
The following table maps each assumption violation to its observable symptom, how to detect it, and a concrete engineering mitigation.
| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Nonlinear dynamics | Oscillation at large perturbations | Amplitude exceeds linear regime | Gain scheduling; saturation limits |
| Variable delay | Unpredictable oscillation | Delay variance high | Robust controller design |
| Nested feedback | Instability; runaway | Correlation between action and sensor | Decouple sensing from action |
Counter-scenario: A healing action that restores a sensor affects the very metric being monitored (e.g., restarting a process causes temporary CPU spike). The stability analysis assuming independent sensing does not apply. Detection: correlation coefficient between healing actions and subsequent sensor anomalies exceeds 0.5.
UCB Action Selection
Validity Domain:
UCB ’s regret bound holds only when the reward distribution is stable and actions can be safely retried. The preconditions are:
- \(B_1\): Reward distribution is stationary over learning horizon
- \(B_2\): Actions are repeatable (can try same action multiple times)
- \(B_3\): Rewards are bounded in \([0, 1]\)
Regret Bound: the \(O(\sqrt{KT \log T})\) bound holds under the stated assumptions.
The table below describes what goes wrong when each UCB assumption is violated, the observable signal that reveals the violation, and the recommended corrective design.
| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Non-stationary environment | Converges to stale optimum | Performance decline over time | Sliding window; discounted UCB |
| Catastrophic actions | Cannot learn from irreversible failure | Action leads to system loss | Action cost constraints; simulation |
| Sparse rewards | Slow convergence | Samples per action < 10 | Prior from similar contexts |
Uncertainty bound: Practical convergence requires \(T > 10K\) where \(K\) is number of actions. For RAVEN with 5 healing actions, meaningful learning requires 50+ samples. Novel failures with < 10 samples should use conservative defaults.
Staged Recovery
Validity Domain:
Staged recovery reduces completion-time variance only when each stage can be verified independently and reversed if it fails; the domain excludes systems where those conditions do not hold. The preconditions are:
- \(C_1\): Recovery stages are independently verifiable
- \(C_2\): Partial success is detectable (intermediate states observable)
- \(C_3\): Rollback is possible at each stage
The table below shows what breaks when staged recovery’s assumptions do not hold, and the corresponding engineering response.
| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Atomic failures | Cannot decompose; all-or-nothing | Recovery has no checkpoints | Accept atomic recovery |
| Unobservable intermediate | Cannot verify stage completion | Verification timeout | Probabilistic advancement |
| No rollback | Partial recovery may be worse | Rollback fails | Forward-only with safeguards |
Counter-scenario: Database corruption where partial recovery may leave inconsistent state. Staged recovery may be worse than atomic restore from backup. Detection: data integrity checks fail after partial recovery. Response: atomic restore is preferred for integrity-critical systems.
Cascade Prevention
Validity Domain:
Resource quotas and dependency ordering prevent cascade only when the dependency graph is known, resource pools can be isolated, and each healing action consumes a bounded share of those pools. The preconditions are:
- \(D_1\): Failure dependencies are known and acyclic
- \(D_2\): Resource pools are isolable
- \(D_3\): Healing actions have bounded resource cost
The table below identifies the three main ways cascade prevention breaks down, the observable signal in each case, and the mitigation.
| Assumption Violation | Failure Mode | Detection | Mitigation |
|---|---|---|---|
| Hidden dependencies | Cascade propagates unexpectedly | Correlated failures | Dependency discovery; testing |
| Shared resource pools | Healing exhausts shared resources | Resource contention | Resource isolation; budgets |
| Unbounded healing cost | Healing action triggers cascade | Healing resource > available | Cost limits; staged healing |
Summary: Claim-Assumption-Failure Table
The summary table below consolidates all four mechanisms into a single reference, showing the essential claim, the assumptions that support it, and the conditions under which each claim breaks down.
| Claim | Key Assumptions | Valid When | Fails When |
|---|---|---|---|
| MAPE-K converges | Linear dynamics, constant delay | Small perturbations | Large failures; variable delay |
| UCB minimizes regret | Stationary environment, repeatable | Stable system | Non-stationary; catastrophic actions |
| Staged recovery reduces variance | Stages separable, observable | Modular recovery | Atomic failures; unobservable |
| Cascade prevention isolates failures | Dependencies known, resources isolable | Well-understood system | Hidden dependencies; shared resources |
Reinforcement Learning for Adaptive Recovery
UCB treats healing actions as independent arms. In practice, optimal healing depends on context: failure type, system state, resource availability, and environmental conditions. Reinforcement learning (RL) learns context-dependent healing policies.
Contextual Bandits for State-Dependent Healing
Contextual bandits extend UCB by selecting the action that maximizes a linear reward estimate for the current context plus a confidence-weighted exploration bonus that is large when the covariance indicates the action is under-explored in this region of the context space:

\[ a_t = \arg\max_a \left( \theta_a^T x + \alpha \sqrt{x^T A_a^{-1} x} \right) \]

where \(x\) is the context vector (failure features), \(\theta_a\) is the learned parameter for action \(a\), \(A_a\) is the covariance matrix tracking uncertainty, and \(\alpha\) scales the exploration bonus.
Context features for healing decisions:
These six features form the context vector \(x\) that LinUCB conditions on; the Range column indicates what the endpoints represent for each feature.
| Feature | Description | Range |
|---|---|---|
| \(x_1\) | Failure severity (from anomaly score) | [0 = nominal, 1 = critical] |
| \(x_2\) | Time since last healing | \([0, \infty)\) normalized |
| \(x_3\) | Resource availability (power, CPU) | [0 = depleted, 1 = full capacity] |
| \(x_4\) | Connectivity state | {0, 0.33, 0.67, 1} |
| \(x_5\) | Cluster health (avg neighbor status) | [0 = all failed, 1 = all healthy] |
| \(x_6\) | Mission criticality | [0 = routine, 1 = mission-critical] |
LinUCB for RAVEN healing:
The diagram below traces a single decision cycle: context features are extracted, scored against each action’s UCB value, and the highest-scoring action is selected and used to update the model.
graph TD
subgraph "Context Extraction"
F["Failure detected
Anomaly score: 0.85"]
S["State features
x = [0.85, 0.2, 0.6, 0.67, 0.9, 0.7]"]
end
subgraph "LinUCB Policy"
A1["Restart (a1)
UCB: 0.72"]
A2["Reconfigure (a2)
UCB: 0.81"]
A3["Reboot (a3)
UCB: 0.65"]
A4["Failover (a4)
UCB: 0.78"]
end
subgraph "Execution"
E["Execute a2
Observe outcome"]
U["Update theta2, A2"]
end
F --> S
S --> A1
S --> A2
S --> A3
S --> A4
A2 -->|"max UCB"| E
E --> U
style A2 fill:#c8e6c9,stroke:#388e3c,stroke-width:2px
Read the diagram: The context vector (6 features including anomaly score 0.85, connectivity 0.67, cluster health 0.9) flows into the LinUCB policy, which scores each action with its UCB value. Reconfigure (a2) wins with 0.81 — higher than Restart (0.72) because prior successes in this context shifted its \(\theta\) estimate upward. The selected action is executed, and only that action’s \(\theta\) and A matrices update — all other arms are unchanged. This per-arm update is why LinUCB requires only \(O(d^2)\) storage per arm regardless of episode count.
Sample efficiency: LinUCB’s regret bound scales with feature dimension \(d\) rather than action count \(K\), providing better sample efficiency when \(d < K\). Context features enable generalization—a healing action effective for high-severity failures can be immediately applied to new high-severity failures without re-exploration.
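A pure-Python LinUCB sketch with \(d = 2\) context features (the full six-feature vector works identically; two features keep the matrix inverse explicit). The arm names, \(\alpha = 1\), and the reward sequence are illustrative:

```python
import math

def inv2(A):
    """Inverse of a 2x2 matrix -- sufficient for this d = 2 sketch."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matvec(A, x):
    return [A[0][0] * x[0] + A[0][1] * x[1],
            A[1][0] * x[0] + A[1][1] * x[1]]

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

class LinUCBArm:
    """One healing action: ridge estimate theta_a = A_a^{-1} b_a plus the
    confidence bonus alpha * sqrt(x^T A_a^{-1} x). Only the chosen arm's
    A and b are updated each episode."""
    def __init__(self, alpha=1.0):
        self.A = [[1.0, 0.0], [0.0, 1.0]]  # identity regularizer
        self.b = [0.0, 0.0]
        self.alpha = alpha

    def score(self, x):
        A_inv = inv2(self.A)
        theta = matvec(A_inv, self.b)
        return dot(theta, x) + self.alpha * math.sqrt(dot(x, matvec(A_inv, x)))

    def update(self, x, reward):
        for i in range(2):
            for j in range(2):
                self.A[i][j] += x[i] * x[j]
            self.b[i] += reward * x[i]

# Hypothetical two-feature context: [failure severity, resource availability].
arms = {"restart": LinUCBArm(), "reconfigure": LinUCBArm()}
x = [1.0, 0.2]
for _ in range(3):  # reconfigure succeeded three times in this context
    arms["reconfigure"].update(x, reward=1.0)
chosen = max(arms, key=lambda a: arms[a].score(x))
```

After three successes in this context region, `reconfigure`'s learned \(\theta\) lifts its score above `restart`'s pure exploration bonus — the per-arm update behavior the diagram describes.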
Deep Reinforcement Learning for Complex Healing
When the healing problem involves sequential decisions and complex state spaces, deep RL provides more expressive policies.
Policy Network Architecture for Healing:
The diagram below shows how state history passes through an embedding and recurrent layer before splitting into a policy head (action probabilities) and a value head (expected return), the two outputs that drive actor-critic training.
graph LR
subgraph "Input"
S["State
32 features"]
H["History
Last 5 states"]
end
subgraph "Network"
E["Embedding
32 to 16"]
L["LSTM
16 x 5 to 32"]
P["Policy head
32 to K actions"]
V["Value head
32 to 1"]
end
S --> E
H --> E
E --> L
L --> P
L --> V
style P fill:#e3f2fd,stroke:#1976d2
style V fill:#fff3e0,stroke:#f57c00
Read the diagram: State (32 features) and history (last 5 states) both enter the same embedding layer (compresses to 16), which feeds the LSTM (processes the temporal sequence, outputs 32). The LSTM output then branches: the blue policy head produces action probabilities (K outputs — the actor); the orange value head produces a single scalar estimate of expected future reward (the critic). The LSTM is why this architecture can recognize patterns like “this same service has crashed three times in the last 5 minutes” — temporal structure the policy head then exploits.
Actor-Critic for edge deployment:
The policy (actor) selects healing actions; the value function (critic) estimates expected future reward:
PPO maximizes a clipped surrogate objective, taking the minimum of the unclipped and clipped probability-ratio terms, preventing any single update from moving the policy too far from the previous version and thus avoiding destructive overshooting:

\[ L(\theta) = \mathbb{E}_t\left[ \min\left( r_t(\theta)\, A_t,\; \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \right) \right] \]

where \(r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)\) is the probability ratio and \(A_t\) is the advantage estimate.
Model size for edge: state embedding \(32\times16 = 512\) parameters; LSTM \(4\times(16+32)\times32 = 6{,}144\) parameters (weight matrices only); policy head \(32\times6 = 192\) parameters; value head \(32\times1 = 32\) parameters; total 6,880 parameters (illustrative value) ≈ 27 KB (float32).
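The parameter count can be checked mechanically; this sketch assumes the LSTM count omits bias terms, which is how the 6,880 figure is reached:

```python
def lstm_params(d_in, d_hidden):
    """Four gates, each with a (d_in + d_hidden) x d_hidden weight matrix.
    Bias terms are omitted to match the text's count."""
    return 4 * (d_in + d_hidden) * d_hidden

embedding = 32 * 16          # 512  (state features -> 16-dim embedding)
lstm = lstm_params(16, 32)   # 6144 (16-dim input, 32-dim hidden)
policy_head = 32 * 6         # 192  (32 -> K = 6 actions)
value_head = 32 * 1          # 32   (32 -> scalar value)
total = embedding + lstm + policy_head + value_head
bytes_f32 = total * 4        # float32 storage
```

At 27,520 bytes (≈ 27 KB), the policy fits comfortably in the memory budget of a constrained edge node.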
The training approach follows three phases: simulation pretraining (train in a simulated environment with synthetic failures), deployment fine-tuning (continue learning from real failures with reduced learning rate), and policy distillation (compress the large trained policy into an edge-deployable network).
Healing Policy Comparison (theoretical bounds):
Each row reports the asymptotic regret bound, sample complexity to reach \(\epsilon\)-optimality, and the limiting success rate, showing the progression from fixed rules through deep RL.
| Method | Regret Bound | Convergence | Success Rate Bound |
|---|---|---|---|
| Fixed rules | \(\Omega(T)\) (linear) | N/A | — |
| UCB bandit | \(O(\sqrt{KT \log T})\) | \(O(K^2/\epsilon^2)\) | — |
| LinUCB | \(\tilde{O}(d\sqrt{T})\) | \(O(d^2/\epsilon^2)\) | — |
| PPO | — | \(O(1/\epsilon^2)\) | (optimal) |
Utility ordering derivation: Let \(U_i\) be the cumulative reward for method \(i\). The ordering \(U_{\text{PPO}} \geq U_{\text{LinUCB}} \geq U_{\text{UCB}} \geq U_{\text{fixed}}\) follows from tighter regret bounds and context-awareness. PPO’s policy gradient exploits state structure; LinUCB exploits linear reward structure; UCB exploits only action averages; fixed rules have no adaptation.
Model-Based RL for Sample Efficiency
Edge systems have limited failure data. Model-based RL learns a dynamics model and plans using it, enabling policy improvement from synthetic rollouts without requiring many real failures. For OUTPOST , where sensor failures occur roughly once per 30 days, the model is initialized from similar deployments, updated after each real failure, and then used to generate 100+ synthetic rollouts for policy improvement—reducing real-world sample requirements substantially relative to model-free approaches.
Safe Reinforcement Learning with Constraints
Healing actions have constraints: power budget, time limits, safety requirements. Unlike unconstrained RL, Safe RL finds the policy \(\pi\) that maximizes discounted cumulative reward while simultaneously keeping the discounted cumulative cost of each constraint \(i\) below its threshold \(d_i\), so that power, cascade risk, and time violations are penalized structurally rather than through a hand-tuned reward term.
Formally:

\[ \max_\pi \; \mathbb{E}\Big[\sum_t \gamma^t R(s_t, a_t)\Big] \quad \text{s.t.} \quad \mathbb{E}\Big[\sum_t \gamma^t C_i(s_t, a_t)\Big] \leq d_i \;\; \forall i \]

where \(C_i\) are cost functions, \(d_i\) are constraint thresholds, and \(\gamma\) is the RL discount factor.
Constraint types for edge healing:
The table maps each operational constraint to its cost function \(C_i\) and the threshold \(d_i\) that CPO must not exceed.
| Constraint | Cost Function | Threshold |
|---|---|---|
| Power budget | Energy consumed by healing | 10% of battery |
| Cascade risk | P(healing causes secondary failure) | 5% |
| Time bound | Recovery duration | 5 minutes |
| Service level | Capability degradation during healing | \(\mathcal{L}_1\) minimum |
Constrained Policy Optimization (CPO):
Each CPO policy update finds the parameter that maximizes the objective \(L(\theta)\) subject to two constraints: the KL divergence from the old policy must not exceed \(\delta_\text{KL}\) (keeping updates small), and the expected cumulative cost of every constraint \(i\) must remain at or below its threshold \(d_i\).
RAVEN safe healing example:
Drone healing must not deplete battery below safe return threshold. CPO learns to:
- Prefer low-energy healing actions (reconfigure > reboot)
- Delay healing if battery is marginal
- Accept slightly lower success rate to preserve energy margin
The utility loss of using CPO instead of the unconstrained policy is \(\Delta U = \lambda^* \, (C_{\text{unc}} - d)\): the optimal Lagrange multiplier \(\lambda^*\) times the amount by which the unconstrained policy would have exceeded the constraint threshold \(d\), quantifying the cost of the safety guarantee. CPO trades this bounded utility loss for a hard constraint-satisfaction guarantee: it never violates constraints by construction, while the unconstrained policy violates them with probability \(\epsilon_C > 0\). When constraint violation is catastrophic, the value of the guarantee exceeds \(\Delta U\), so total utility favors the constrained policy.
Hierarchical RL for Multi-Level Healing
Healing operates at multiple levels (component, node, cluster, fleet). Hierarchical RL decomposes the problem: each tier learns a simpler policy scoped to its level, enabling temporal abstraction (high-level decides “what,” low-level decides “how”) and modularity (low-level policies reusable across deployments).
graph TD
subgraph "High-Level Policy (Fleet)"
HLP["Fleet healer
Decides: which cluster"]
end
subgraph "Mid-Level Policy (Cluster)"
MLP1["Cluster healer 1
Decides: which node"]
MLP2["Cluster healer 2
Decides: which node"]
end
subgraph "Low-Level Policy (Node)"
LLP1["Node healer
Decides: which action"]
LLP2["Node healer
Decides: which action"]
end
HLP -->|"heal cluster 1"| MLP1
HLP -->|"monitor"| MLP2
MLP1 -->|"heal node 3"| LLP1
MLP1 -->|"monitor"| LLP2
style HLP fill:#e3f2fd,stroke:#1976d2
style MLP1 fill:#fff3e0,stroke:#f57c00
style LLP1 fill:#e8f5e9,stroke:#388e3c
Read the diagram: Three tiers shown top-to-bottom. The blue Fleet Healer makes coarse decisions (“heal cluster 1, monitor cluster 2”). The orange Cluster Healer receives that directive and makes mid-level decisions (“heal node 3”). The green Node Healer receives the node-level assignment and picks the specific healing action. Each tier solves a simpler problem than the full joint optimization would require, and low-level policies can be reused across deployments with the same node-level healing action set.
Transfer Learning Across Scenarios
RAVEN , CONVOY , and OUTPOST share healing patterns. Transfer learning leverages this:
Transfer from RAVEN to CONVOY :
Transfer involves three components: the shared representation (the state embedding layer transfers directly since both scenarios share connectivity, power, and health features); policy adaptation (the policy head is retrained on CONVOY -specific actions); and value fine-tuning (the value function is recalibrated for the CONVOY reward scale).
Transfer efficiency bound:
Learning from scratch to \(\epsilon\)-optimality requires \(O(|S||A|/\epsilon^2)\) samples — proportional to the full state-action space — while transfer learning from a related source policy reduces this to \(O(\Delta_{\text{domain}} \cdot |S||A|/\epsilon^2)\), where the domain difference \(\Delta_{\text{domain}}\) is the \(L_1\) distance between source and target transition dynamics.
The table below shows how domain similarity translates into concrete sample savings: the closer the source and target dynamics, the smaller , and the larger the fraction of training samples that transfer replaces.
| Target | Domain Diff | Complexity Ratio | Sample Reduction |
|---|---|---|---|
| Similar (e.g., drone-to-drone) | \(O(0.1)\) | \(O(0.1)\) | \(\approx 90\%\) |
| Related (e.g., drone-to-vehicle) | \(O(0.3)\) | \(O(0.3)\) | \(\approx 70\%\) |
| Distant (e.g., drone-to-building) | \(O(0.5)\) | \(O(0.5)\) | \(\approx 50\%\) |
Transfer reduces samples whenever the domain difference is below 1 — i.e., when source and target share structure.
Meta-Learning for Rapid Adaptation: MAML trains an initialization \(\theta^*\) across diverse healing scenarios so that the policy can fine-tune to a new scenario in 5-10 episodes rather than 100+. This is essential for novel deployments where collecting large amounts of real healing experience is impractical before the system must operate.
Online vs Offline RL Tradeoffs
The three training regimes differ in where data comes from, whether unsafe exploration is possible during training, and how efficiently each uses the available healing experience.
| Approach | Data Source | Safety | Sample Efficiency |
|---|---|---|---|
| Online RL | Real-time interaction | Risk of bad actions | Lower |
| Offline RL | Historical logs | Safe (no exploration) | Higher |
| Hybrid | Offline pretrain + online fine-tune | Balanced | Best |
The recommended approach for edge healing follows three phases that minimize risk while enabling continuous improvement: offline (train on historical healing logs with no exploration risk), simulation (fine-tune in a simulated environment with controlled risk), and deployment (conservative online updates with safety constraints at managed risk).
Cognitive Map: Cascade prevention is the meta-discipline of the healing system: ensuring that the act of healing does not produce new failures. Three mechanisms work in concert — resource quotas cap total healing load at a fixed fraction of capacity; jittered restarts spread the thundering herd across the jitter window; staged recovery reduces completion-time variance by \(1/k\). UCB and contextual bandits then take over from the deterministic policies: as healing episode counts grow, the system learns which action works in which context, progressively refining the probability estimates that underlie Proposition 29 ’s confidence threshold. Offline pretraining, simulation fine-tuning, and conservative online updates compose the lowest-risk RL deployment path for edge systems with sparse real failure data.
RAVEN Self-Healing Protocol
Return to Drone 23’s battery failure. How does the RAVEN swarm heal?
Healing Decision Analysis
Drone 23’s battery alert propagates via gossip . Within 15 seconds (illustrative value), all swarm members know Drone 23’s status. Each drone’s local analyzer assesses impact: Drone 23 will fail in 8 minutes (illustrative value); if it fails in place, a coverage gap opens on the eastern sector with potential crash in contested area; if it returns, neighbors must expand coverage.
Cluster lead (Drone 1) selects the optimal action by evaluating expected mission value for each alternative:
The trade-off is coverage preservation against asset recovery. Compression maintains formation integrity but sacrifices coverage area. Return to base preserves the drone but requires neighbor expansion. Proactive extraction dominates passive observation when asset value exceeds the coverage loss—get the degraded asset out rather than watching it fail in place.
The cluster lead broadcasts the healing plan. Within one second (illustrative value), neighbors acknowledge sector expansion and Drone 23 acknowledges its return path. Formation adjustment completes in roughly 8 seconds (illustrative value). Drone 23 departs, neighbors restore coverage, and twelve minutes (illustrative value) later Drone 23 reports safe landing at base.
Healing Coordination Under Partition
What if the swarm is partitioned during healing?
Scenario: Seconds into coordination, jamming creates partition. Drones 30-47 (eastern cluster) cannot receive healing plan.
Fallback protocol:
- Eastern cluster detects loss of contact with Drone 1 (cluster lead)
- Drone 30 assumes local lead role for eastern cluster
- Drone 30 independently detects Drone 23’s status from cached gossip
- Eastern cluster executes local healing plan (may differ from western cluster’s plan)
Post-reconnection reconciliation compares healing logs from both clusters, verifies formation consistency, and merges any conflicting state using commutative, associative, idempotent merge operations—ensuring that applying updates in any order produces the same final state.
Edge Cases
What if neighbors also degraded?
If Drones 21, 22, 24, 25 all have elevated failure risk, they cannot safely expand coverage. The healing plan must account for cascading risk.
Before accepting any healing plan, the system checks joint stability: all affected nodes must remain healthy throughout the healing window, so the probability of a stable outcome is the product of each individual node’s health probability across the affected set: \(P_{\text{stable}} = \prod_{i \in \text{affected}} p_i\).
If the joint stability probability falls below the required threshold, reject the healing plan and try an alternative (perhaps Option C compression).
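The joint stability check can be sketched directly; the 0.9 threshold and the per-node health probabilities below are illustrative:

```python
import math

def joint_stability(health_probs, threshold=0.9):
    """P(stable healing) is the product of per-node health probabilities
    over the affected set; the plan is accepted only if that product
    clears the threshold (threshold value is illustrative)."""
    p_stable = math.prod(health_probs)
    return p_stable, p_stable >= threshold

# Healthy neighbors: the plan passes.
p_ok, accept_ok = joint_stability([0.98, 0.97, 0.99, 0.96])

# Degraded neighbors (elevated failure risk): the plan is rejected.
p_bad, accept_bad = joint_stability([0.85, 0.85, 0.85, 0.85])
```

The product form is what makes cascading risk bite: four individually plausible 85%-healthy neighbors yield only a 52% chance of a stable healing window.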
What if path home is contested?
Drone 23’s return route passes through adversarial coverage. Risk of intercept during return.
Solution: Incorporate the threat model into path planning. Choose the return route that minimizes expected intercept risk; accept a longer route if it is safer.
Cognitive Map: RAVEN healing is a live demonstration of every mechanism in this post: gossip propagation delivers Drone 23’s status to all 47 nodes within 15 seconds (illustrative value); the cluster lead applies the expected-utility formula subject to catastrophic-probability constraint; confidence thresholds determine whether to act or wait; the joint stability check filters healing plans that would create secondary failures. Partition during healing is handled gracefully — each cluster acts on its local information, logs causally-ordered actions, and reconciles divergences on reconnection. The eastern cluster’s independent decision is not a failure of the protocol; it is the protocol working as designed.
CONVOY Self-Healing Protocol
Vehicle 4 experiences engine failure during mountain transit. The CONVOY healing protocol differs from RAVEN ’s due to ground vehicle constraints.
Failure Assessment
Vehicle 4 broadcasts a health alert: engine failure in limp mode with reduced power, maximum speed limited to 15 km/h against the convoy’s 45 km/h target, detection confidence 0.91.
The failure is partial—vehicle can move but cannot maintain convoy speed.
Option Analysis
Four options are available: stopping the convoy for field repair (2–4 hours (illustrative value), significant mission delay, but stationary convoy is vulnerable); bypassing — leaving vehicle 4 to await a recovery team with 11 vehicles continuing (minor mission impact, but the isolated vehicle faces security risk in contested terrain); towing vehicle 4 behind vehicle 3 (convoy speed reduced to 20 km/h (illustrative value), moderate delay, increased mechanical stress on vehicle 3); or redistributing critical cargo, securing and destroying vehicle 4, and continuing at full speed (immediate, full-speed continuation, but at the cost of one vehicle).
Decision Framework
Model as Markov Decision Process with state-dependent optimal policy:
State space structure: each state combines the convoy configuration (intact, degraded, towing, or stopped), the distance remaining to the objective, and the threat environment (permissive, contested, or denied).
Action space: the four options above — field repair, bypass, tow, or redistribute-and-abandon.
The transition dynamics \(P(s' | s, a)\) encode operational realities: field repair success rates, secondary failure probabilities from towing stress, and recovery likelihood for bypassed assets.
Example transition matrix for action “tow” from state “degraded”:
| Next State | Probability | Operational Meaning |
|---|---|---|
| towing | 0.75 | Tow successful, convoy proceeds |
| stopped | 0.15 | Tow hookup fails, convoy halts |
| degraded | 0.08 | Vehicle refuses tow, status quo |
| intact | 0.02 | Spontaneous recovery (rare) |
These probabilities are estimated from operational logs and updated via Bayesian learning as the convoy gains experience.
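A single Bellman backup for the "tow" action can be computed from the transition table above; the immediate reward, discount factor, and next-state values are hypothetical placeholders, not from the source:

```python
GAMMA = 0.9  # illustrative discount factor

# Transition probabilities for action "tow" from state "degraded" (from the table).
P_TOW = {"towing": 0.75, "stopped": 0.15, "degraded": 0.08, "intact": 0.02}

def q_value(reward, transitions, V, gamma=GAMMA):
    """One Bellman backup: Q(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) * V(s')."""
    return reward + gamma * sum(p * V[s] for s, p in transitions.items())

# Hypothetical state values (illustrative): towing beats stopping, intact is best.
V = {"towing": 60.0, "stopped": 10.0, "degraded": 20.0, "intact": 100.0}
q_tow = q_value(reward=-5.0, transitions=P_TOW, V=V)
```

Repeating this backup for every action in every state until values converge is value iteration; the optimal policy then reads off the argmax action per state, which is where the phase transitions discussed below emerge.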
The reward function combines four weighted terms — mission completion value, time cost, asset loss cost, and security risk cost:

\[ R(s, a) = w_1 V_{\text{mission}} - w_2 C_{\text{time}} - w_3 C_{\text{asset}} - w_4 C_{\text{risk}} \]

with weights \(w_i\) encoding the mission’s priority ordering among these objectives.
The weights \(w_i\) encode mission priorities—time-critical missions weight \(w_2\) heavily; asset-preservation missions weight \(w_3\); etc.
The optimal value function \(V^*(s)\) satisfies the Bellman equation \(V^*(s) = \max_a \big[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \big]\): the best achievable cumulative reward from state \(s\) equals the immediate reward plus the discounted expected value of the best continuation, where \(\gamma\) is the discount factor weighting future versus immediate outcomes.
The optimal policy shows three phase transitions based on state variables: in the distance-dominated regime (far from the objective), the policy minimizes exposure time and prefers towing; in the time-dominated regime (tight deadline), it prioritizes progress and accepts asset loss; and in the asset-dominated regime (high-value cargo), it preserves assets and accepts delays.
These phase transitions emerge from the MDP structure, not from hand-coded rules. The optimization framework discovers them automatically.
Coordination Challenge
Vehicles 1-3 see the situation one way (closer to vehicle 4). Vehicles 5-12 may have different information (further away, may not have received all updates).
The healing protocol ensures consistency through five sequential steps: vehicle 4 broadcasts the failure to all reachable vehicles; the convoy lead (vehicle 1) makes the healing decision; the decision propagates to all vehicles via gossip ; each vehicle confirms receipt and readiness; and the coordinated maneuver executes on the lead’s signal.
If the lead is unreachable, the nearest cluster lead makes the local decision, reachable vehicles execute the local plan, and unreachable vehicles hold position until contact is restored.
Cognitive Map: CONVOY healing illustrates how MDP structure discovers the non-obvious optimal policy. The three phase-transition regimes (distance-dominated \(\to\) tow, time-dominated \(\to\) abandon, asset-dominated \(\to\) delay) emerge from the Bellman equation without being hand-coded — they are consequences of the multi-objective reward weights and transition probabilities. The coordination protocol (broadcast \(\to\) lead decision \(\to\) gossip propagation \(\to\) confirmation \(\to\) execution) is the MAPE-K sequence instantiated for ground convoy constraints. When the convoy lead is unreachable, the fallback to nearest cluster lead is the same authority-tier degradation structure seen throughout the series.
Composed failure scenario: simultaneous Power + Partition + Drift
CONVOY vehicle ECU-4 experiences all three failure modes concurrently: battery drops below threshold (power failure), the backhaul link is jammed (partition), and sensor calibration drifts 15% beyond threshold (drift).
The resolution sequence is fully determined by the framework:
Drift is detected first (fastest feedback loop): the Schmitt trigger hysteresis (Definition 47) fires when drift exceeds \(\theta_H\), but the derivative confidence dampener (Definition 49) holds the decision while the slope is still falling — preventing a false healing action on a still-worsening signal — and MAPE-K enters read-only mode for the affected sensor.
Power failure triggers regime downgrade (Definition 52, Observation Regime Schedule): the battery drop forces a downgrade to \(O_1\) (alert), which halves measurement frequency; the MAPE-K self-throttling law (Proposition 36) reduces loop rate to conserve CPU margin; and the \(\alpha(R)\) throttle coefficient begins reducing loop aggressiveness.
Partition triggers the circuit breaker (Proposition 37): when the partition is detected, all five breaker transitions fire — capability drops to the survival tier, rates floor at their minimums, the delay model shifts heavier-tail, the breaker resets on partition end, and the gain \(K\) drops — and the HAC authority check ensures ECU-4 does not issue healing actions it lacks authority for.
At stabilization, ECU-4 operates at the survival-only capability tier and \(O_2\) (conservative monitoring), with sensor reads via dead-band only (Definition 38); the hardware watchdog (Definition 41) independently monitors the MAPE-K heartbeat and resets the throttled loop without software cooperation if it stops responding.
Recovery follows a fixed order: partition ends first (the breaker resets and capability ladder re-entry begins); power recovers second (regime upgrades via the Definition 52 hysteresis band — requires 5% battery margin above threshold before upgrade); drift corrects last (requiring calibration convergence confirmed by three consecutive readings within \(\theta_L\)). Each recovery is independent; the system does not require simultaneous recovery of all three to restore normal operation.
Key insight: The three failure modes have different recovery timescales (seconds for partition, minutes for battery, hours for calibration drift) and different recovery mechanisms (protocol, hardware, physical). Designing for independence — not simultaneity — is the architectural property that makes autonomous recovery tractable.
OUTPOST Self-Healing
The OUTPOST mesh has 127 sensor nodes on a remote perimeter with no physical access, ultra-low power budgets, and sustained jamming, so physical intervention is infeasible and healing actions must not consume more energy than the capability they restore. Each failure mode is matched to a low-energy healing action (frequency hop, recalibration, firmware restart), energy-efficient scheduling prioritizes actions by value restored per joule spent, and mesh reconfiguration extends neighbor sensitivity to partially cover gaps from permanent sensor loss. The trade-off: coverage extension through increased neighbor sensitivity raises false-positive rates, and the mesh accepts a higher false-alarm rate in the coverage-gap zone as the cost of maintaining any detection at all.
The OUTPOST sensor mesh faces unique healing challenges: remote locations preclude physical intervention, and ultra-low power budgets constrain healing actions.
Failure Modes and Healing Actions
Each failure mode in the OUTPOST mesh has a characteristic detection signal and a preferred low-energy healing action; the target success rate reflects design goals under nominal environmental conditions.
| Failure Mode | Detection | Healing Action | Target Success Rate* |
|---|---|---|---|
| Sensor drift | Cross-correlation with neighbors | Recalibration routine | 85% |
| Communication loss | Missing heartbeats | Frequency hop, power increase | 70% |
| Power anomaly | Voltage/current deviation | Load shedding, sleep mode | 90% |
| Software hang | Watchdog timeout | Controller restart | 95% |
| Memory corruption | CRC check failure | Reload from backup | 80% |
*Target rates are design goals; actual rates depend on deployment conditions and calibration.
Power-Constrained Healing
OUTPOST healing actions compete with the power budget. Each healing action consumes energy equal to the product of the action’s power draw and its duration, plus a fixed communication overhead for coordinating the action.
The total energy spent on all healing actions \(i\) must not exceed the available reserve: current battery capacity minus the minimum energy required to maintain mission capability.
Healing action scheduling: When multiple healing actions compete for the limited energy budget, a priority score ranks them by expected capability restored per unit of energy spent, ensuring the most energy-efficient healings execute first.
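The budget constraint and the value-per-joule ranking can be sketched as a greedy scheduler. The action names, power figures, and success-probability values below are illustrative assumptions, not calibrated OUTPOST numbers.

```python
# Greedy energy-budgeted healing scheduler: rank candidate actions by
# expected capability restored per joule, then admit them in that order
# while the healing budget (battery minus mission floor) holds.

def healing_energy(power_w, duration_s, comm_overhead_j):
    """E_heal = P * dt + E_comm, per the text's energy decomposition."""
    return power_w * duration_s + comm_overhead_j

def schedule_healing(actions, battery_j, mission_floor_j):
    budget = battery_j - mission_floor_j  # energy available for healing
    ranked = sorted(actions, key=lambda a: a["value"] / a["energy_j"], reverse=True)
    plan = []
    for a in ranked:
        if a["energy_j"] <= budget:
            plan.append(a["name"])
            budget -= a["energy_j"]
    return plan

# Illustrative candidates (values loosely echo the target-success table).
actions = [
    {"name": "recalibrate",   "energy_j": healing_energy(0.5, 30, 2), "value": 0.85},
    {"name": "frequency_hop", "energy_j": healing_energy(0.2, 5, 1),  "value": 0.70},
    {"name": "fw_restart",    "energy_j": healing_energy(1.0, 60, 2), "value": 0.95},
]
plan = schedule_healing(actions, battery_j=100.0, mission_floor_j=70.0)
```

With a 30 J healing budget, the cheap frequency hop (highest value per joule) and the recalibration fit, while the 62 J firmware restart is deferred despite its high raw success rate — exactly the behavior the priority score is meant to produce.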
Mesh Reconfiguration
When a sensor fails beyond repair, the mesh must reconfigure; the diagram shows how neighbors of the failed Sensor 3 extend their sensitivity to partially cover the gap while the dashed arrow marks the coverage zone that remains degraded.
graph TD
subgraph Active_Sensors["Active Sensors"]
S1["Sensor 1
(extending coverage)"]
S2["Sensor 2
(extending coverage)"]
S4[Sensor 4]
S5[Sensor 5]
end
subgraph Failed["Failed Sensor"]
S3["Sensor 3
FAILED"]
end
subgraph Fusion_Nodes["Fusion Layer"]
F1[Fusion A]
F2[Fusion B]
end
S1 --> F1
S2 --> F1
S3 -.->|"no signal"| F1
S4 --> F2
S5 --> F2
F1 <-->|"coordination"| F2
S1 -.->|"increased sensitivity"| Gap["Coverage Gap
(S3 zone)"]
S2 -.->|"increased sensitivity"| Gap
style S3 fill:#ffcdd2,stroke:#c62828
style Gap fill:#fff9c4,stroke:#f9a825
style S1 fill:#c8e6c9
style S2 fill:#c8e6c9
Read the diagram: The red Sensor 3 sends no signal — its F1 input arrow is dashed. Sensors 1 and 2 (green) compensate: dashed arrows point toward the yellow Coverage Gap zone showing their increased sensitivity. Fusion nodes A and B coordinate via the bidirectional coordination edge; B’s cluster is unaffected and continues normally. The yellow gap zone persists — it is smaller than if no extension occurred, but it cannot be eliminated without physically deploying a replacement sensor.
Healing protocol for permanent sensor loss:
The healing protocol for permanent sensor loss proceeds in five steps: neighbor sensors detect missing heartbeats; multiple neighbors confirm to avoid a false positive; the fusion node logs the loss and estimates the coverage gap; neighbors adjust sensitivity to partially cover the gap; and the node is flagged for physical replacement when connectivity allows.
Neighbor coverage extension:
Sensors adjacent to the failed sensor can increase their effective range through:
- Sensitivity increase (higher gain, more false positives)
- Duty cycle increase (more power consumption)
- Orientation adjustment (if mechanically possible)
The net extended coverage is the original field of view plus the marginal gains contributed by each neighbor that extends its sensitivity, minus the overlap between those extended zones to avoid double-counting.
Full coverage is rarely achievable—the goal is minimizing the detection gap.
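As a small worked sketch of that sum (area units and the specific numbers are illustrative assumptions):

```python
def net_extended_coverage(original_area, neighbor_gains, overlap):
    """A_net = A_original + sum(marginal neighbor gains) - overlap.
    The overlap subtraction prevents double-counting zones that two
    neighbors both extend into. Units are arbitrary area units."""
    return original_area + sum(neighbor_gains) - overlap

# Mesh covered 40 units before the failure; two neighbors each extend
# by 3 units, with 1 unit of their extensions overlapping.
net = net_extended_coverage(40.0, [3.0, 3.0], 1.0)
```

If the failed sensor's zone was 10 units, the 5 units recovered here leave a residual gap — consistent with the text's point that full coverage is rarely achievable.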
Fusion Node Failover
If a fusion node fails, its sensor cluster must find an alternative:
The primary path routes through an alternate fusion node when reachable; the secondary path forms a peer-to-peer mesh among sensors with one sensor acting as temporary aggregator; and the tertiary path has each sensor operating independently with local decision authority.
The fusion state at time \(t\) is determined by a priority cascade: use the primary fusion node while reachable, fall back to the alternate fusion node if primary is lost, and revert to fully autonomous per-sensor operation only when both are unreachable.
Each state has different capability levels and power costs. The system tracks time in each state for capacity planning.
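The priority cascade and the per-state time tracking can be sketched together; the state names and tick granularity are illustrative choices, not the article's exact interface.

```python
# Priority cascade for fusion state selection: primary -> alternate ->
# autonomous per-sensor operation, plus a tracker that accumulates time
# in each state for capacity planning.

def fusion_state(primary_up, alternate_up):
    """Return the active fusion mode per the cascade in the text."""
    if primary_up:
        return "primary"
    if alternate_up:
        return "alternate"
    return "autonomous"

class FailoverTracker:
    """Accumulates time spent in each fusion state for capacity planning."""
    def __init__(self):
        self.time_in = {"primary": 0.0, "alternate": 0.0, "autonomous": 0.0}

    def tick(self, primary_up, alternate_up, dt=1.0):
        state = fusion_state(primary_up, alternate_up)
        self.time_in[state] += dt
        return state

tracker = FailoverTracker()
tracker.tick(True, True)    # nominal: primary fusion node
tracker.tick(False, True)   # primary lost: alternate takes over
tracker.tick(False, False)  # both lost: fully autonomous sensors
```

The accumulated `time_in` histogram is what capacity planning would consume: long stretches in "autonomous" indicate the mesh needs more fusion redundancy.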
Cognitive Map: OUTPOST healing is constrained by energy first and bandwidth second — the ordering that inverts the usual cloud priority. Every healing action is priced in joules and scheduled by capability restored per joule. The five-tier healing table (sensor drift \(\to\) recalibration at 85%, communication loss \(\to\) frequency hop at 70%, through software hang \(\to\) watchdog restart at 95%) gives the MAPE-K loop a prioritized action set that stays within the power envelope. Mesh reconfiguration after permanent sensor loss trades false-positive rate for coverage continuity — a trade the design accepts explicitly rather than hiding it behind a confidence threshold.
Commercial Application: SMARTBLDG Building Automation
SMARTBLDG manages building automation for commercial high-rise towers. Systems controlled: HVAC, lighting, access control, and fire safety. When the BMS server fails or loses connectivity, subsystems must heal autonomously while maintaining occupant safety and comfort.
The healing challenge: Building systems have extreme reliability requirements (fire safety must work always) but limited local compute (PLCs with kilobytes of memory). The MAPE-K loop must be distributed across multiple controllers with varying capabilities.
Hierarchical MAPE-K for SMARTBLDG:
The diagram shows four control levels from building-wide BMS down to individual devices, with the key pattern that each level runs its own local MAPE-K loop so that zone controllers remain autonomous when higher tiers are unreachable.
graph TD
subgraph "Building Level"
BMS["BMS Server
Global optimization
Trend analysis"]
end
subgraph "Floor Level (52 floors)"
FC1["Floor Controller 12
Local MAPE-K
Zone coordination"]
FC2["Floor Controller 13
Local MAPE-K
Zone coordination"]
FC3["..."]
end
subgraph "Zone Level (4-8 per floor)"
ZC1["Zone Controller
VAV, lighting
Occupancy response"]
ZC2["Zone Controller
VAV, lighting
Occupancy response"]
end
subgraph "Device Level"
VAV["VAV Box
Damper, reheat"]
LIGHT["Lighting
On/off, dim"]
SENSOR["Sensors
Temp, CO2, motion"]
end
BMS -.->|"Setpoints
Schedules"| FC1
BMS -.->|"Setpoints
Schedules"| FC2
FC1 --> ZC1
FC1 --> ZC2
ZC1 --> VAV
ZC1 --> LIGHT
SENSOR --> ZC1
style BMS fill:#e3f2fd,stroke:#1976d2
style FC1 fill:#fff3e0,stroke:#f57c00
style FC2 fill:#fff3e0,stroke:#f57c00
style ZC1 fill:#e8f5e9,stroke:#388e3c
style ZC2 fill:#e8f5e9,stroke:#388e3c
Read the diagram: Four tiers shown top-to-bottom. The blue BMS sends setpoints and schedules via dotted arrows (guidance, not commands) to floor controllers. Floor controllers issue solid commands to zone controllers; zone controllers command individual VAV boxes, lighting, and sensors. Critically, each orange floor controller box labels itself “Local MAPE-K” — when the BMS dotted-arrow path fails, each floor continues running its own loop independently. Device-level sensors feed only the zone controller directly above them; no sensor data crosses tier boundaries without aggregation.
Failure modes and healing authority by tier:
The table maps each building failure type to its normal and disconnected healing authority, with the Safety Override column showing the inviolable constraint that supersedes all comfort-oriented decisions.
| Failure | Normal Authority | Disconnected Authority | Safety Override |
|---|---|---|---|
| VAV damper stuck | Zone Controller | Zone Controller | Full open if fire alarm |
| AHU fan failure | Floor Controller | Floor Controller | Smoke evacuation priority |
| Chiller fault | BMS | Floor Controllers coordinate | Maintain minimum cooling |
| BACnet network down | BMS diagnoses | Local fallback schedules | Fire systems on dedicated net |
| Floor controller crash | BMS restarts | Neighbor floor assists | Zone controllers autonomous |
MAPE-K at the zone controller level (8KB RAM, 16KB flash):
The zone controller implements a minimal MAPE-K loop:
Monitor (every 30 seconds (illustrative value)): Read temperature, CO2, occupancy sensors. Compute rolling average and deviation. Memory cost: 200 bytes (illustrative value) for 5-minute (illustrative value) history.
Analyze (event-triggered): Compare readings against setpoints and learned patterns. Flag anomalies when temperature deviates by more than \(2^\circ\)F for more than 5 minutes, when CO2 exceeds 1000 ppm (indicating poor ventilation), or when occupancy is detected while HVAC is in unoccupied mode.
Plan (on anomaly): Select from predefined healing actions: adjusting the VAV damper position as the primary response, requesting help from the floor controller if damper adjustment is insufficient, or overriding to failsafe (full cooling) for emergencies.
Execute (immediate): Send BACnet commands to actuators. Log action for later upload.
Knowledge (static + learned): Factory setpoints + learned occupancy patterns + healing action success rates.
Power-constrained healing parallels OUTPOST: Zone controllers operate on 24VAC power derived from HVAC transformers. When analyzing healing options, energy is not the binding constraint—actuator wear is, and the healing cost for action \(a\) is therefore the per-cycle wear cost multiplied by the number of actuator cycles the action requires.
VAV dampers are rated for 100,000 cycles (illustrative value). Excessive hunting (oscillating between positions) accelerates wear. The healing policy limits damper adjustments to once per 5 minutes (illustrative value) except for safety overrides.
Cascade prevention during chiller failure: When a chiller fails on a \(95^\circ\)F (illustrative value) day, 52 (illustrative value) floor controllers simultaneously demand maximum cooling from remaining chillers. Without coordination, this cascades to remaining chiller overload.
SMARTBLDG prevents cascade by allocating the available cooling capacity proportionally: each floor receives a share weighted by its priority factor and current occupancy, normalized across all floors.
Priority factors are 2.0 (illustrative value) for data center floors (equipment damage risk), 1.0 (illustrative value) for occupied offices, 0.3 (illustrative value) for unoccupied floors, and 0.1 (illustrative value) for storage and mechanical spaces.
This weighted allocation ensures critical spaces get cooling while preventing cascade. Floor controllers receive their allocation and independently manage distribution to zones.
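The allocation rule can be sketched directly: each floor's share is its priority-times-occupancy weight divided by the sum of all weights. The priority factors come from the text; the floor names, capacity figure, and occupancy fractions are illustrative assumptions.

```python
# Weighted proportional allocation of remaining cooling capacity:
# share_i = capacity * (w_i * occ_i) / sum_j (w_j * occ_j)

PRIORITY = {"data_center": 2.0, "occupied_office": 1.0,
            "unoccupied": 0.3, "storage": 0.1}  # factors from the text

def allocate_cooling(capacity_tons, floors):
    """floors: list of (name, floor_type, occupancy fraction in [0, 1])."""
    weights = {name: PRIORITY[ftype] * occ for name, ftype, occ in floors}
    total = sum(weights.values())
    return {name: capacity_tons * w / total for name, w in weights.items()}

floors = [("F10", "data_center", 1.0),
          ("F11", "occupied_office", 0.5),
          ("F12", "unoccupied", 0.1)]
shares = allocate_cooling(100.0, floors)
```

Because the weights are normalized, demand can never exceed capacity regardless of how many floors request cooling simultaneously — the cascade is prevented structurally, not by rate-limiting requests.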
BMS server failure healing protocol:
At T+0s, floor controllers detect the BMS heartbeat timeout (30s (illustrative value) threshold). At T+30s (illustrative value), each floor controller activates standalone mode. At T+35s (illustrative value), controllers fall back to the cached weekly schedule (last sync from BMS). At T+60s (illustrative value), floor controllers discover neighbors via BACnet broadcast. At T+90s (illustrative value), they elect a temporary coordinator for inter-floor decisions. At T+120s (illustrative value), temperature deadbands are widened by \(2^\circ\)F (illustrative value) to reduce hunting without BMS optimization.
Building remains comfortable for 8+ hours (illustrative value) in standalone mode. Occupants rarely notice BMS outages because floor controllers maintain local comfort.
Fire safety independence: Critical insight—fire and life safety systems operate on dedicated networks with independent controllers. SMARTBLDG’s MAPE-K for HVAC/lighting never interferes with fire safety. When a fire alarm is active, the HVAC mode switches according to fire condition, overriding any comfort-oriented healing decision in progress.
This safety override supersedes all comfort-oriented healing. The healing hierarchy respects life safety as an inviolable constraint.
Economic benefit: Self-healing reduces comfort complaints during BMS outages, eliminates unnecessary maintenance dispatches, limits energy waste from actuator oscillation, and dramatically reduces time to restoration versus manual intervention.
Cognitive Map: SMARTBLDG demonstrates the hierarchical MAPE-K pattern at commercial scale: each tier runs its own loop, with dotted guidance from above and solid commands below. The key design insight is the safety layer separation — fire and life safety systems live on a dedicated network that the comfort HVAC loop never touches. This is the physical implementation of the capability hierarchy: L0 (life safety) is structurally isolated from L1–L4 (comfort, efficiency, learning). Zone controllers with 8 KB RAM run a complete MAPE-K loop — the same four-phase structure as the 47-drone swarm, just with 200-byte state windows and BACnet commands instead of gossip packets.
The Limits of Self-Healing
Every mechanism in this post has a boundary condition. Physical destruction, healing-loop corruption, adversarial exploitation, and irresolvably ambiguous diagnoses all define situations where autonomous healing must stop and either degrade gracefully or await human judgment. The required response is to recognize the limit explicitly — formalize when to stop trying, how to log state for later analysis, and how to stabilize in the least-risky configuration while waiting for human input. The fundamental tension is that stopping autonomous healing prematurely wastes the system’s self-recovery capacity, while stopping too late allows a failing healer to worsen the damage. The judgment horizon is mission-specific and cannot be derived analytically; it requires explicit design-time specification from mission planners.
Damage Beyond Repair Capacity
Some failures cannot be healed autonomously: physical destruction (RAVEN drone collision), critical component failure without redundancy, and environmental damage (waterlogged OUTPOST sensor).
Self-healing must recognize when to stop trying. The system should abandon autonomous repair and defer to graceful degradation once the expected value recovered by healing falls below the expected cost — resource drain, risk of worsening the failure, and opportunity cost — of attempting it.
At this point, graceful degradation takes over. The component is abandoned, and the system adapts to operate without it.
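The abandon-repair condition can be written as a single expected-value comparison. The decomposition (resource drain, risk-weighted worsening, opportunity cost) follows the text; the function name and the numeric estimates are illustrative assumptions.

```python
# Abandon-repair rule: attempt healing only while the expected value
# recovered exceeds the expected cost of attempting it.

def should_attempt_healing(p_success, value_recovered,
                           resource_cost, p_worsen, worsen_cost,
                           opportunity_cost):
    expected_value = p_success * value_recovered
    expected_cost = resource_cost + p_worsen * worsen_cost + opportunity_cost
    return expected_value > expected_cost

# A likely-unrecoverable failure: low success odds, real risk of
# making things worse -> defer to graceful degradation.
attempt = should_attempt_healing(p_success=0.05, value_recovered=10.0,
                                 resource_cost=1.0, p_worsen=0.4,
                                 worsen_cost=5.0, opportunity_cost=0.5)
```

In practice `p_success` would come from the context-conditional success rates tracked by the anti-fragile learning loop, so the stop condition tightens automatically as evidence of futility accumulates.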
Failures That Corrupt Healing Logic
If the failure affects the MAPE-K components themselves, healing may not be possible: a failed Monitor cannot detect problems; a failed Analyzer cannot interpret observations; a failed Planner cannot generate solutions; a failed Executor cannot apply solutions; and a corrupted Knowledge base drives wrong actions from wrong information.
Defense: Redundant MAPE-K instances. RAVEN maintains simplified healing logic in each drone’s flight controller, independent of the main processing unit. If the main unit fails, the flight controller can still execute basic healing (return to base, emergency land).
Adversary Exploiting Healing Predictability
If healing behavior is predictable, an adversary can exploit it by triggering healing to consume resources (denial of service), timing attacks for when healing is in progress to exploit the vulnerability window, or crafting failures that healing makes worse through adversarial input.
Mitigations include randomizing healing parameters (backoff times, thresholds), rate-limiting healing actions, and detecting unusual healing patterns as a potential attack signature.
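Two of these mitigations have standard shapes: full-jitter exponential backoff (so healing timing is unpredictable) and a token-bucket rate limit on healing actions. The parameter values below are illustrative assumptions.

```python
# Anti-exploitation sketches: randomized backoff + healing rate limit.
import random

def jittered_backoff(attempt, base_s=1.0, cap_s=60.0, rng=random.random):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2^n)].
    Randomization denies an adversary a predictable healing schedule."""
    return rng() * min(cap_s, base_s * (2 ** attempt))

class HealingRateLimiter:
    """Token bucket: at most `burst` healing actions, refilled at `rate`/s.
    Bounds the resources an adversary can drain by provoking healing."""
    def __init__(self, rate=0.1, burst=3):
        self.rate, self.burst = rate, float(burst)
        self.tokens = float(burst)

    def allow(self, elapsed_s):
        self.tokens = min(self.burst, self.tokens + self.rate * elapsed_s)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The third mitigation — detecting unusual healing patterns — would sit on top of the limiter: a burst of denied requests is itself a candidate attack signature.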
The Judgment Horizon
When should the system stop attempting autonomous healing and wait for human intervention?
Human judgment is needed when healing attempts have been exhausted without resolution, when multiple conflicting diagnoses have similar confidence, when potential healing actions would cross ethical or mission boundaries, or when the situation matches no known healing pattern.
At the judgment horizon , the system stabilizes in the safest available configuration, logs complete state for later analysis, awaits human input when connectivity allows, and avoids irreversible actions until guidance is received.
Anti-Fragile Learning
Each healing episode generates data on which failure was detected, which healing action was attempted, whether it succeeded, how long it took, and what side effects it produced.
This data improves future healing. Healing policies adapt based on observed effectiveness. Actions that consistently fail are deprioritized. Actions that work in specific contexts are preferentially selected.
The context-conditional success probability tracks how often action \(a\) has worked in this specific operational context, estimated as a simple empirical frequency.
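A minimal sketch of that estimator: counts of attempts and successes keyed by (action, context), with the empirical frequency as the estimate. The class and context labels are illustrative assumptions.

```python
# Context-conditional empirical success rate:
# p_hat(a | ctx) = successes(a, ctx) / attempts(a, ctx)
from collections import defaultdict

class HealingOutcomeTracker:
    def __init__(self):
        self.attempts = defaultdict(int)
        self.successes = defaultdict(int)

    def record(self, action, context, success):
        key = (action, context)
        self.attempts[key] += 1
        self.successes[key] += int(success)

    def success_rate(self, action, context):
        key = (action, context)
        if self.attempts[key] == 0:
            return None  # no data for this context yet
        return self.successes[key] / self.attempts[key]
```

Returning `None` for unseen (action, context) pairs matters: a never-tried action should be handled by the exploration policy, not silently scored as zero.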
Formal improvement condition: the system’s healing effectiveness improves after a failure episode when the expected success probability at the next timestep exceeds the current baseline. This holds when the failure provides information gain, the policy update incorporates the observation, and the failure mode lies within the learning distribution.
Uncertainty bound: the size of the improvement varies (illustrative value) with the novelty of the failure mode. Novel failures outside the training distribution may not yield improvement.
Cognitive Map: Self-healing has four hard boundaries. Physical destruction is beyond any software response — the terminal safety state handles it by acknowledging the loss and stabilizing. Healing-loop corruption requires the watchdog hierarchy (Section 2) to detect and bypass. Adversarial exploitation requires second-order defenses: rate-limiting, randomized parameters, and pattern detection on the healing pattern itself. The judgment horizon — when to stop trying — is the only limit that cannot be computed from first principles; it is a mission design input. Anti-fragile learning is the positive counterpart: every failure episode inside the learning distribution improves the next response, provided the failure mode is one the policy has seen before.
Irreducible Trade-offs
No design eliminates these tensions. The architect selects a point on each Pareto front.
Trade-off 1: Healing Aggressiveness vs. Stability
Multi-objective formulation:
The objective jointly maximizes recovery utility and stability utility while minimizing overshoot cost, with \(K_{\text{ctrl}}\) as the single parameter that moves the operating point along the Pareto front.
where \(K_{\text{ctrl}}\) is the controller gain.
Stability constraint: the admissible controller gain is bounded by the feedback delay — longer delay forces lower \(K_{\text{ctrl}}\) (the stability condition of Proposition 22).
The table below traces the Pareto front for controller gain \(K_{\text{ctrl}}\): moving down the rows buys faster recovery (lower recovery time) at the cost of a narrower stability margin and higher overshoot risk.
| Gain \(K_{\text{ctrl}}\) | Recovery Time | Stability Margin | Overshoot Risk |
|---|---|---|---|
| 0.2 | 15s | 1.37 | 0.02 |
| 0.5 | 6s | 1.07 | 0.08 |
| 0.8 | 4s | 0.77 | 0.18 |
| 1.0 | 3s | 0.57 | 0.31 |
Higher gain achieves faster recovery but risks oscillation and overshoot. Cannot achieve instant recovery with zero overshoot risk.
Trade-off 2: Local vs. Coordinated Healing
Multi-objective formulation:
The objective selects a healing mode \(m\) that simultaneously maximizes initiation speed, decision optimality, and fleet coordination quality — three objectives that cannot all be maximized under partition.
The Pareto front shows that each step toward better decision quality requires waiting longer, moving the operating point from instantaneous-but-suboptimal local decisions to slow-but-optimal fleet coordination.
| Healing Mode | Initiation Time | Decision Quality | Coordination |
|---|---|---|---|
| Local-only | <1s | Suboptimal | None |
| Cluster consensus | 2-5s | Better | Local |
| Fleet coordination | 10-30s | Optimal | Full |
Cannot achieve fast initiation AND optimal decision AND full coordination. Partition forces choice between speed and optimality.
Trade-off 3: Exploration vs. Exploitation (Action Selection)
Multi-objective formulation:
The exploration parameter \(c\) controls a direct trade-off: larger \(c\) improves long-term optimality by exploring more alternatives, but lowers short-term utility by deferring exploitation of the current best action.
where \(c\) is the UCB exploration parameter.
UCB formula: \(a_t = \arg\max_a \left[\hat{\mu}_a + c\sqrt{\ln t / n_a}\right]\), where \(\hat{\mu}_a\) is the empirical mean reward of action \(a\) and \(n_a\) its trial count. The exploration bonus grows as action \(a\) is under-tried (small \(n_a\)), and the parameter \(c\) scales how strongly that bonus drives selection toward unexplored actions.
The table shows how three representative values of \(c\) move the operating point along the exploration-exploitation spectrum and the resulting regret profile.
| \(c\) Value | Exploration | Exploitation | Regret Profile |
|---|---|---|---|
| 0.5 | Low | High | Fast convergence, possible local optimum |
| 1.0 | Medium | Medium | Balanced |
| 2.0 | High | Low | Slow convergence, global exploration |
Low \(c\) minimizes short-term regret (exploit current best). High \(c\) minimizes long-term regret (explore alternatives). No single \(c\) optimizes both.
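The effect of \(c\) on action selection can be sketched with standard UCB1 scoring. The healing-action names and the mean/count statistics below are illustrative assumptions.

```python
# UCB1 selection over healing actions: score = empirical mean success
# plus exploration bonus c * sqrt(ln t / n_a); untried actions first.
import math

def ucb_select(stats, t, c=1.0):
    """stats: {action: (mean_reward, n_tries)}; t: total trials so far."""
    for action, (_, n) in stats.items():
        if n == 0:
            return action  # try every action at least once
    return max(stats, key=lambda a: stats[a][0]
               + c * math.sqrt(math.log(t) / stats[a][1]))

# Well-tried reliable action vs. under-tried alternatives.
stats = {"restart": (0.9, 50), "reconfigure": (0.5, 5), "rebuild": (0.2, 2)}
```

With these statistics, a small \(c\) exploits the well-characterized restart, while a large \(c\) steers selection toward the barely-tried rebuild — the table's two ends of the spectrum.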
Trade-off 4: Healing Depth vs. Cascade Risk
Multi-objective formulation:
Deeper healing actions are more thorough but touch more shared resources, so the objective balances thoroughness utility against the probability of triggering secondary failures, parameterized by healing depth \(d\).
where \(d\) is the healing action depth.
The table below shows how the three canonical healing depths compare on thoroughness, cascade risk, and the expected fraction of root-cause failures resolved.
| Healing Depth | Thoroughness | Cascade Risk | Recovery Completeness |
|---|---|---|---|
| Shallow (restart) | Low | Low | 0.60 |
| Medium (reconfigure) | Medium | Medium | 0.80 |
| Deep (rebuild) | High | High | 0.95 |
Deeper healing is more thorough but risks triggering cascades. Shallow healing is safer but may not resolve root cause.
Cost Surface: Healing Under Resource Constraints
The total cost of a healing action decomposes into three terms — the direct action cost, the connectivity-dependent coordination cost, and the opportunity cost of resources diverted from mission — each of which varies with the connectivity regime \(\Xi\).
where:
- the first term is the direct cost of healing action \(a\)
- the second is the coordination cost under connectivity \(\Xi\)
- the third is the opportunity cost of resources diverted from mission
The coordination cost grows with both the scope of the healing action and the degradation of connectivity: local actions in full connectivity cost \(O(1)\), cluster-wide actions under degraded connectivity cost \(O(\log n)\), fleet-wide actions under intermittent connectivity cost \(O(n)\), and fleet-wide actions under denied connectivity are infeasible.
Resource Shadow Prices
Shadow prices quantify the marginal value of each scarce resource to the healing system; a higher shadow price means that resource is the binding constraint on healing performance, so relaxing it yields the greatest improvement.
| Resource | Shadow Price \(\lambda\) (c.u.) | Interpretation |
|---|---|---|
| Healing compute | 0.15/action | Value of faster recovery |
| Coordination bandwidth | 1.80/sync | Value of coordinated healing |
| Mission capacity | 2.50/%-hr | Cost of diverted resources |
| Redundancy margin | 4.00/node | Value of spare capacity |
(Shadow prices in normalized cost units (c.u.) — illustrative relative values; ratios convey healing resource scarcity ordering. Healing compute (0.15 c.u./action) is the reference unit. Calibrate to platform-specific operational costs.)
Irreducible Trade-off Summary
Each row names a fundamental design tension, the two objectives that pull against each other, and the specific outcome that no implementation can achieve regardless of engineering effort.
| Trade-off | Objectives in Tension | Cannot Simultaneously Achieve |
|---|---|---|
| Speed-Stability | Fast recovery vs. no overshoot | Instant recovery with zero risk |
| Local-Coordinated | Fast initiation vs. optimal decision | Both under partition |
| Explore-Exploit | Short-term vs. long-term optimality | Both with finite samples |
| Depth-Cascade | Thorough healing vs. cascade safety | Deep healing with zero cascade risk |
Cognitive Map: These four trade-offs are structural — no implementation eliminates them. The stability gain condition quantifies the speed-stability boundary: faster feedback reduces \(\tau\), allowing higher \(K_{\text{ctrl}}\). The local-coordinated trade-off collapses under partition to a binary choice: act now with local information or wait for consensus that may never arrive. The explore-exploit trade-off requires knowing the time horizon: UCB with \(c = 1\) is near-optimal with respect to the \(\sqrt{KT}\) regret bound; contextual bandits and deep RL shift the efficient frontier by exploiting state structure. The depth-cascade trade-off is managed by the Resource Priority Matrix (Definition 43) and the cascade prevention quota — these bound the cascading risk of deep healing without eliminating it. Every design choice in the framework above is a position on one or more of these Pareto fronts.
Related Work
Autonomic computing and self-adaptive systems. The vision of computing systems that manage themselves without human intervention was articulated by Kephart and Chess [1], who identified the Monitor-Analyze-Plan-Execute loop as the canonical closed-loop structure. IBM’s architectural blueprint [2] gave that vision engineering form, specifying the MAPE-K reference model that Definition 36 directly instantiates. Huebscher and McCann [3] survey the subsequent decade of autonomic computing research, cataloguing degrees of autonomy and the modelling approaches that followed. The narrower but complementary literature on self-adaptive software — systems that modify their own behaviour at runtime in response to observed context — is surveyed by Salehie and Tahvildari [6], who identify feedback-loop architectures as the dominant design pattern and establish the connection to classical control theory that Proposition 22 exploits. This article’s edge-specific contributions — gain scheduling by connectivity regime, staleness-aware threshold suppression, and the self-throttling survival law — address failure modes that arise in contested, disconnected environments outside the scope of those foundational works.
Control Barrier Functions and safety-critical control. Control Barrier Functions as a unified framework for enforcing safety constraints on continuous-time systems were introduced by Ames et al. [7]. The same group extended the theory to application-focused settings and provided the convergence and composition results that motivate the discrete-time adaptation in Definition 39 [8]. The connection to gain scheduling — selecting controller parameters as a function of operating regime — is classical; Rugh and Shamma [9] survey the theoretical foundations and engineering practice. The discrete Control Barrier Function formulation in Definition 39, the CBF-derived refractory bound of Proposition 31, and the Nonlinear Safety Guardrail of Proposition 25 adapt these continuous-time results to the tick-driven MAPE-K execution model, where safety can only be checked at discrete sample instants rather than continuously.
Sequential change detection and CUSUM. The problem of detecting an abrupt change in a sequentially observed process was formally posed by Page [12], whose cumulative sum (CUSUM) statistic remains the canonical solution. Basseville and Nikiforov [13] provide the comprehensive treatment of change-point detection theory that underlies the sentinel formulation in the Proposition 31 discussion, including the ARL (average run length) analysis used to calibrate alarm thresholds \(h = 5k\) for the RAVEN platform. The application here — detecting persistent drift in the \(\rho_q\) stability margin rather than a jump in a scalar observation — is a direct instantiation of the two-sided CUSUM structure, with the slack parameter \(k\) derived from the \(3\sigma\) detection criterion and the rolling noise estimate serving as the reference value.
Edge computing and self-healing in contested environments. Satyanarayanan [4] and Shi et al. [5] establish the architectural argument for edge computing: latency constraints and bandwidth asymmetry make cloud offload structurally unviable for a significant class of real-time applications, requiring autonomous decision-making at the network edge. The self-healing framework in this article is a concrete response to that requirement — providing the stability guarantees and resource-management discipline that make autonomy safe rather than merely possible. The chaos engineering methodology used to validate Proposition 37 (the Weibull circuit breaker) follows Basiri et al. [14], who established fault injection against production systems as the standard practice for validating resilience claims. Anti-fragility — the property that systems improve under stress rather than merely recovering — is the organizing concept drawn from Taleb [11]; the lexicographic priority hierarchy (Survival \(\succ\) Autonomy \(\succ\) Coherence \(\succ\) Anti-fragility) embeds this concept as the highest tier of the capability stack.
Closing
Drone 23 landed safely. CONVOY vehicle 4 was towed to the objective. OUTPOST sensors reconfigured around the failed node. HYPERSCALE healed microservice failures autonomously. SMARTBLDG maintained comfort through central server outages.
The common thread: each system detected its own faults, selected a remediation strategy, and executed recovery without waiting for human authorization. The MAPE-K control loop—operating continuously at the speed of local computation, not the speed of communication—enabled this autonomy.
Three conditions made autonomous healing tractable. First, anomaly detection provided calibrated confidence estimates rather than binary alerts, enabling the confidence-threshold framework of Proposition 84. Second, the capability hierarchy gave healing a clear priority ordering: MVS components before non-MVS components, survival capability before mission capability. Third, the stability condition of Proposition 22 tied controller gain to feedback delay, preventing healing from oscillating.
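The first condition can be sketched concretely. A minimal illustration of the confidence gate \(c > \theta^*(a)\), where higher-severity actions demand higher confidence; the action names and threshold values below are illustrative assumptions, not the derived \(\theta^*\) values of Proposition 84:

```python
# Confidence-gated healing: an action executes only when the detection
# confidence clears that action's severity-dependent threshold. Cheap,
# easily reverted actions get low thresholds; disruptive ones get high.
# Thresholds here are hypothetical placeholders for theta*(a).

THETA = {
    "clear_cache": 0.55,           # cheap to revert -> low threshold
    "increase_gossip_rate": 0.60,
    "restart_fusion_node": 0.90,   # disruptive -> high threshold
    "isolate_cluster": 0.95,
}

def gated_actions(confidence):
    """Return the healing actions whose threshold the current
    detection confidence clears, least severe first."""
    return [a for a, th in sorted(THETA.items(), key=lambda kv: kv[1])
            if confidence > th]
```

At moderate confidence only the cheap, reversible actions fire; the disruptive actions remain gated until the detector is nearly certain, which is the cost-asymmetry argument in operational form.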
What this framework does not address: healing succeeds locally, but independent local decisions can produce globally inconsistent state. When RAVEN's eastern cluster lost contact during the Drone 23 healing sequence, both clusters made correct decisions given their information. Their records diverged. That divergence — and the problem of reconciling it — is a distinct challenge from healing itself, one that requires different mechanisms. Fleet Coherence Under Partition addresses exactly that: CRDTs, causal ordering, and the authority tiers that determine who wins when clusters disagree.
References
[1] Kephart, J.O., Chess, D.M. (2003). “The Vision of Autonomic Computing.” IEEE Computer, 36(1), 41–50. [doi]
[2] IBM Research (2006). “An Architectural Blueprint for Autonomic Computing.” IBM White Paper, 4th Ed.
[3] Huebscher, M.C., McCann, J.A. (2008). “A Survey of Autonomic Computing — Degrees, Models, and Applications.” ACM Computing Surveys, 40(3), Article 7. [doi]
[4] Satyanarayanan, M. (2017). “The Emergence of Edge Computing.” IEEE Computer, 50(1), 30–39. [doi]
[5] Shi, W., Cao, J., Zhang, Q., Li, Y., Xu, L. (2016). “Edge Computing: Vision and Challenges.” IEEE Internet of Things Journal, 3(5), 637–646. [doi]
[6] Salehie, M., Tahvildari, L. (2009). “Self-Adaptive Software: Landscape and Research Challenges.” ACM Trans. Autonomous and Adaptive Systems, 4(2), Article 14. [doi]
[7] Ames, A.D., Xu, X., Grizzle, J.W., Tabuada, P. (2017). “Control Barrier Function Based Quadratic Programs for Safety Critical Systems.” IEEE Transactions on Automatic Control, 62(8), 3861–3876. [doi]
[8] Ames, A.D., Coogan, S., Egerstedt, M., Notomista, G., Sreenath, K., Tabuada, P. (2019). “Control Barrier Functions: Theory and Applications.” Proc. European Control Conference (ECC), 3420–3431. [doi]
[9] Rugh, W.J., Shamma, J.S. (2000). “Research on Gain Scheduling.” Automatica, 36(10), 1401–1425. [doi]
[10] Avizienis, A., Laprie, J.-C., Randell, B., Landwehr, C. (2004). “Basic Concepts and Taxonomy of Dependable and Secure Computing.” IEEE Transactions on Dependable and Secure Computing, 1(1), 11–33. [doi]
[11] Taleb, N.N. (2012). Antifragile: Things That Gain From Disorder. Random House.
[12] Page, E.S. (1954). “Continuous Inspection Schemes.” Biometrika, 41(1/2), 100–115. [doi]
[13] Basseville, M., Nikiforov, I.V. (1993). Detection of Abrupt Changes: Theory and Application. Prentice Hall. [pdf]
[14] Basiri, A., Behnam, N., de Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., Rosenthal, C. (2016). “Chaos Engineering.” IEEE Software, 33(3), 35–41. [doi]