
Self-Measurement Without Central Observability


Prerequisites

The measurement problem addressed here emerges directly from the framework in Why Edge Is Not Cloud Minus Bandwidth. It does not exist independently of that foundation.

Three results from that foundation shape everything in this article. First, the Semi-Markov connectivity model ( Definition 12 ) establishes when measurement becomes the system’s only source of truth. During the None regime ( \(C(t) = 0\) ), there is no external observability infrastructure - no central monitoring, no cloud metrics, no human operator in the loop. Every judgment the system makes about its own health must be drawn from local evidence alone. Self-measurement is not about building better dashboards; it is about survival during partition.

Second, the capability hierarchy ( Definition 3 , Why Edge Is Not Cloud Minus Bandwidth) establishes what measurement must protect. A system that cannot assess its own capability level cannot make sound decisions about which recovery actions to attempt, how aggressively to heal, or when to shed load. Accurate health knowledge is the prerequisite for any subsequent autonomous action.

Third, the inversion thesis - “design for disconnected, enhance for connected” - establishes the design constraint. The observation mechanisms developed here must function in complete isolation from day one. Reporting to a central collector, when connectivity permits, is an enhancement. It is never a dependency.

The state variables \(\Sigma(t)\) and \(\mathbf{H}(t)\) defined in Why Edge Is Not Cloud Minus Bandwidth exist formally in the model. This article addresses how they are actually estimated: from local sensor readings, from inter-node gossip when the mesh is intact, and from statistical inference when observations age. These mechanisms require survival capability already in place — stable power, safe-state defaults, and basic mission function in complete isolation. Anomaly detection on a node that cannot maintain power or safe state is unsound regardless of detection accuracy.

Throughout this series: Node = a logical autonomous control unit (a drone, vehicle, or embedded MCU running the autonomic stack). Device = the physical platform hosting a node. Sensor = a data source on a device. These are distinct roles: a single device may host one node and dozens of sensors.


Memory budget prerequisite. All definitions and propositions in this article assume the zero-tax autonomic tier ( Definitions 95–97 , The Constraint Sequence and the Handover Boundary) is active, providing a ~200-byte baseline for health tracking. On platforms where the full measurement stack cannot fit (< 4 KB SRAM), the definitions below cannot activate and the system degrades to heartbeat-only L0 observation.

Architectural prerequisite. This article assumes the Inversion Threshold condition ( Proposition 2 , Why Edge Is Not Cloud Minus Bandwidth) is satisfied: \(P(C(t) = 0) > \tau^*\). When partition probability falls below \(\tau^*\), cloud-connected diagnostics dominate and the local autonomy stack described here is over-engineered for that deployment profile.

Overview

Self-measurement lets autonomous systems know their own state without external infrastructure. Each concept in this article connects directly to a design consequence:

| Concept | What It Tells You | Design Consequence |
|---|---|---|
| Anomaly Detection | Flag anomaly when \(\lvert z_t \rvert > \theta^*(t)\) | Set detection sensitivity from error costs |
| Gossip Propagation | Convergence in \(O(\log n)\) rounds | Size fleet from acceptable propagation delay |
| Staleness Theory | Bound on observation age | Bound observation age from acceptable drift and diffusion coefficient |
| Byzantine Tolerance | Bound on adversarial influence | Trust-weight nodes to bound adversarial influence |

Decoding the overview table. Each row pairs a concept with the quantity it lets the system estimate and the design decision that quantity drives; later sections derive each in turn.

This builds on fault detection (Cristian, 1991) and epidemic algorithms (also called gossip protocol or epidemic dissemination — terms used interchangeably in this series) [1] , applying both in contested edge environments where central infrastructure is unavailable.


Opening Narrative: OUTPOST Under Observation

Early morning. OUTPOST BRAVO’s 127-sensor perimeter mesh has been operating for 43 days. Without warning, the satellite uplink goes dark - no graceful degradation. Seconds later, Sensor 47 stops reporting. Last transmission: routine, battery at 73%, mesh connectivity strong. Then silence.

OUTPOST needs to answer: how do you diagnose this failure without external systems?

Hardware failure routes around the sensor; communication failure attempts alternative paths; environmental occlusion waits and retries; adversarial action triggers alert and defensive posture.

Each diagnosis implies different response. Without central observability, OUTPOST must diagnose itself — analyze patterns, correlate with neighbors, assess probabilities, decide on response. All locally. All autonomously.

The sensor cannot speak for itself. Everything OUTPOST knows about Sensor 47 must be inferred from what Sensor 47 is not saying, what its neighbors observed last, and what the baseline predicted it should be doing right now.

This is self-measurement: assessing health and diagnosing anomalies without external assistance. You can’t heal what you haven’t diagnosed, and you can’t diagnose what you haven’t measured.


The Self-Measurement Challenge

Cloud-native observability assumes continuous connectivity. The diagram below traces that pipeline from raw metrics to human-driven remediation. Every arrow is a network call — and every one fails when connectivity is denied.

    
    graph LR
    A[Metrics] -->|"network"| B[Collector]
    B -->|"network"| C[Storage]
    C -->|"network"| D[Analysis]
    D -->|"network"| E[Alerting]
    E -->|"network"| F[Human Operator]
    F -->|"network"| G[Remediation]

    style A fill:#e8f5e9
    style F fill:#ffcdd2
    linkStyle 0,1,2,3,4,5 stroke:#f44336,stroke-width:2px,stroke-dasharray: 5 5

Every arrow represents a network call. For edge systems, this architecture fails at the first arrow - when connectivity is denied, the entire observability pipeline is severed.

Read the diagram. Every arrow is a network call. In the Denied regime, the first arrow fails — and with it, the entire pipeline. Human operators and cloud analysis are not “later steps”; they are load-bearing columns. Remove them and the structure collapses.

The edge alternative inverts the data flow [2] : sensors, analysis, and actuation all reside on the device, and the feedback loop closes locally without any external network call.

    
    graph LR
    A[Local Sensors] --> B[Local Analyzer]
    B --> C[Health State]
    C --> D[Autonomic Controller]
    D --> E[Self-Healing Action]
    E -->|"feedback"| A

    style A fill:#e8f5e9
    style B fill:#c8e6c9
    style C fill:#fff9c4
    style D fill:#ffcc80
    style E fill:#ffab91

Analysis happens locally. Alerting goes to an autonomic controller, not human operators. The loop closes locally without external connectivity.

Read the diagram. No arrow leaves the device. The feedback loop closes from local sensors through local analysis to autonomic action and back. Cloud reporting, when connectivity permits, is an out-of-band enhancement — not a dependency.

The table below shows how every dimension of the observability problem differs between the two architectures — not just the technical constraints but also the economic cost structure of errors.

| Aspect | Cloud Observability | Edge Self-Measurement |
|---|---|---|
| Analysis location | Central service | Local device |
| Alerting target | Human operator | Autonomic controller |
| Training data | Abundant historical data | Limited local samples |
| Ground truth | Labels from past incidents | Uncertain, inferred |
| Compute budget | Elastic (scale up) | Fixed (device limits) |
| Memory budget | Practically unlimited | Constrained (MB range) |
| Response latency | Minutes acceptable | Seconds required |

Analysis must happen locally, and alerting must be autonomous. Waiting for human operators or external services is not an option. The system must detect, diagnose, and decide within the constraints of local compute and memory.

Physical translation. The asymmetry of error costs is the design driver. A false positive triggers an unnecessary healing action — wasteful but recoverable. A false negative leaves a failure undetected — potentially cascading in environments where one missed sensor failure enables adversarial exploitation of the coverage gap. This asymmetry determines the detection threshold — the cost model drives the architecture.

Cognitive Map — Section 1. Three results from Why Edge Is Not Cloud Minus Bandwidth motivate self-measurement: Denied regime eliminates external observability, capability hierarchy requires accurate health to trigger recovery, inversion thesis requires measurement to function in complete isolation \(\to\) cloud observability pipeline has six network dependencies, all failing together \(\to\) edge local loop has zero external dependencies \(\to\) false-negative cost asymmetry drives detection threshold design.


Local Anomaly Detection

The Detection Problem

Anomaly detection is signal classification. The sensor produces a sequence of scalar observations \(x_1, x_2, \ldots, x_t\) indexed by time, where each \(x_t\) is a single reading (voltage, temperature, signal strength, etc.) at step \(t\).

At each timestep, the local analyzer must decide: is this observation normal, or anomalous?

This is a binary classification under uncertainty:

Definition 19 (Local Anomaly Detection Problem). Given a time series generated by process \(P\), the local anomaly detection problem is to determine, for each observation \(x_t\), whether \(P\) has transitioned from nominal behavior \(P_0\) to anomalous behavior \(P_1\), subject to: (i) O(1) computation per observation, (ii) O(1) memory independent of stream length, and (iii) online operation with no access to future observations.

In other words, the detector must classify each incoming reading as normal or anomalous using only a fixed-size memory footprint and constant work per sample — ruling out batch methods or model retraining on the fly.

Mode-Transition Safety Requirement. Upon any capability-level transition ( Definition 3 , Why Edge Is Not Cloud Minus Bandwidth), the anomaly detector must immediately recompute both the dynamic threshold \(\theta^*(t)\) and the stability-region lower bound \(\delta_q\) using the new mode’s parameters (\(T_\text{tick}(q)\), Stability Region from Definition 4 ). Do not carry over \(\theta^*(t)\) from the previous mode; recompute it with the current \(T_\text{acc}\) value and the new \(\delta_q\) floor. This ensures the detector never operates outside the new mode’s Stability Region boundary, as required by the series’ mode-switching stability guarantee.

T_acc carry-over. Since \(T_\text{acc}\) does not reset on mode transitions ( Definition 15 , Why Edge Is Not Cloud Minus Bandwidth), the threshold recomputation uses the old \(T_\text{acc}\) value with the new mode’s \(\delta_q\) and \(\gamma_\text{FN}\). The net effect: entering a more degraded mode mid-partition produces a tighter threshold (the new mode’s smaller Stability Region means \(\delta_q\) consumes more headroom) applied to the same accumulated partition age.

Simultaneous transition + partition crossing. When a mode transition and a partition boundary crossing occur in the same MAPE-K tick (e.g., battery drops to L0 while connectivity fails simultaneously): apply the mode transition first (compute new \(\theta^*(t)\) using the new mode’s \(\delta_q\) and current \(T_\text{acc}\)), then evaluate partition-dependent triggers ( Proposition 37 circuit breaker, threshold tightening). This order ensures the new mode’s Stability Region is respected before partition-specific thresholds are applied. Reversing the order would evaluate the circuit breaker against a pre-transition \(\delta_q\), potentially firing at the wrong threshold.
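The ordering rule above can be sketched as a tick handler. This is a minimal illustration, not the series’ implementation: `ModeParams`, `recompute_threshold`, and `on_tick` are hypothetical names, and `theta_base`, `gamma_fn`, `q95_s` stand in for the article’s \(\theta^*_0\), \(\gamma_\text{FN}\), \(Q_{95}\).

```python
# Sketch of the ordering rule for a MAPE-K tick that observes both a
# capability-mode transition and a partition boundary crossing in the same
# tick: apply the mode transition first, then partition-dependent triggers.
from dataclasses import dataclass

@dataclass
class ModeParams:
    delta_q: float   # stability-region guard band for this mode
    t_tick_s: float  # MAPE-K tick interval, seconds

def recompute_threshold(theta_base, t_acc_s, q95_s, gamma_fn, delta_q):
    # Time-varying threshold theta*(t), floored at the mode's guard band.
    theta = theta_base / (1.0 + gamma_fn * min(t_acc_s, q95_s) / q95_s)
    return max(theta, delta_q)

def on_tick(new_mode, t_acc_s, q95_s, gamma_fn, theta_base):
    # Step 1: mode transition first -- new mode's delta_q, carried-over T_acc.
    theta = recompute_threshold(theta_base, t_acc_s, q95_s, gamma_fn,
                                new_mode.delta_q)
    # Step 2: only then evaluate partition-dependent triggers.
    breaker_fired = t_acc_s > q95_s  # Proposition 37 circuit breaker
    return theta, breaker_fired
```

Reversing the two steps would evaluate the breaker against the pre-transition guard band, which is exactly the failure mode the ordering rule rules out.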

| Constraint | Cloud Detection | Edge Detection |
|---|---|---|
| Compute | GPU clusters, distributed | Single CPU, milliwatts |
| Memory | Terabytes for models | Megabytes for everything |
| Training data | Petabytes historical | Days of local history |
| Ground truth | Labels from incident response | Inference from outcomes |
| FP cost | Human review time | Unnecessary healing action |
| FN cost | Delayed response | Undetected failure, potential loss |

The asymmetry of costs is critical. A false positive triggers an unnecessary healing action - wasteful but recoverable. A false negative leaves a failure undetected - potentially catastrophic in contested environments where undetected failures cascade.

The diagram below shows the local anomaly detection pipeline: raw sensor data flows through the Kalman estimator, the residual is compared to the adaptive threshold, and an anomaly triggers an alert while a normal reading updates the baseline.

    
    sequenceDiagram
    participant S as Sensor
    participant K as Kalman Estimator
    participant T as Threshold Engine
    participant A as Alert Bus
    S->>K: raw measurement x(t)
    K->>K: predict: x_hat(t|t-1)
    K->>K: update: gain, state, variance
    K->>T: residual z(t) = x(t) - x_hat(t|t-1)
    T->>T: compare |z(t)| vs theta_star(t)
    alt anomaly detected
        T->>A: raise alert(severity, timestamp)
    else normal
        T->>K: update baseline
    end

Statistical Approaches

MAPE-K [Kephart & Chess, 2003] (Monitor–Analyse–Plan–Execute with Knowledge Base): the four-phase autonomic control loop executing periodically at tick interval T_tick(q). The Monitor phase collects measurements; Analyse applies anomaly detection; Plan selects healing actions; Execute issues commands. The Knowledge Base K provides shared state across all four phases.

Edge anomaly detection requires algorithms that are computationally lightweight (O(1) per observation), memory-efficient (constant or logarithmic memory), adaptive (capable of adjusting to changing baselines without retraining), and interpretable (providing confidence values rather than binary classification).

Three approaches meet these requirements:

Exponential Weighted Moving Average (EWMA)

The simplest effective approach. The two equations below update the running mean \(\mu_t\) and running variance \(\sigma_t^2\) after each new observation \(x_t\), where \(\alpha \in (0,1)\) is the smoothing weight that trades recency for stability.

\[\mu_t = \alpha\, x_t + (1 - \alpha)\, \mu_{t-1}\]
\[\sigma_t^2 = (1 - \alpha)\left(\sigma_{t-1}^2 + \alpha\,(x_t - \mu_{t-1})^2\right)\]

Physical translation: \(\mu_t\) is a weighted average that remembers recent readings at weight \(\alpha\) and forgets old ones at rate \((1 - \alpha)\). \(\sigma_t^2\) tracks how much the signal bounces around its own recent mean. Together they define “what normal looks like right now” — updating in two multiply-adds per observation tick.

The smoothing weight \(\alpha\) typically falls between 0.05 and 0.3 (a deployment choice — the right value depends on the noise variance of the target metric); at \(\alpha = 0.1\) (illustrative value) the effective window spans approximately 10 samples, and larger values accelerate adaptation at the cost of noisier baselines.

Mode-indexed smoothing — the stability-region coupling. When \(T_\text{tick}(q)\) changes across capability levels ( Definition 3 in Why Edge Is Not Cloud Minus Bandwidth), a fixed \(\alpha\) produces windows of different physical duration. At L1 with \(T_\text{tick}\) doubled, \(\alpha = 0.1\) tracks a 100-second window instead of 50 seconds, halving baseline responsiveness exactly when the system’s Stability Region ( Definition 4 in Why Edge Is Not Cloud Minus Bandwidth) is smallest.

To preserve a constant physical time constant across modes, the smoothing weight must be mode-indexed:

\[\alpha_q = 1 - e^{-\kappa_\text{drift}\, T_\text{tick}(q)}\]

Physical translation: The smoothing coefficient grows with the tick interval — in survival mode where \(T_\text{tick}\) is longer, \(\alpha\) is larger, meaning the baseline updates faster per tick but is still updated less frequently in wall-clock time. This prevents a sudden switch to a low-frequency survival schedule from freezing the baseline at a stale value: the first post-switch observation still gets appropriate weight.

where \(\kappa_\text{drift}\) is the process drift rate in \(\text{s}^{-1}\), calibrated from stationary field data.

Example: \(\kappa_\text{drift} = 0.02\,\text{s}^{-1}\) (illustrative value) gives \(\alpha_{L3} \approx 0.095\) (5 s tick, 53 s effective window) and \(\alpha_{L1} \approx 0.181\) (10 s tick, 55 s effective window) — windows matched to within 4% (illustrative value).
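The mode-indexed weight and its effective window can be checked in a few lines. This sketch uses the illustrative \(\kappa_\text{drift} = 0.02\,\text{s}^{-1}\) and 5 s / 10 s tick intervals from the worked example; the function names are mine, not the series’ API.

```python
import math

# Mode-indexed EWMA smoothing weight: alpha_q = 1 - exp(-kappa_drift * T_tick(q)).
# Keeping alpha tied to the tick interval holds the effective window
# (~ T_tick / alpha) roughly constant in wall-clock seconds across modes.

def alpha_for_mode(t_tick_s, kappa_drift=0.02):
    return 1.0 - math.exp(-kappa_drift * t_tick_s)

def effective_window_s(t_tick_s, kappa_drift=0.02):
    # Effective EWMA window in wall-clock seconds.
    return t_tick_s / alpha_for_mode(t_tick_s, kappa_drift)

w_l3 = effective_window_s(5.0)   # ~53 s at the 5 s L3 tick
w_l1 = effective_window_s(10.0)  # ~55 s at the 10 s L1 tick
```

A fixed \(\alpha = 0.1\) at a 10 s tick would instead stretch the window to 100 s, which is the drift the correction removes.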

Without this correction, a baseline that is appropriately tuned at L3 becomes systematically underresponsive at L1, suppressing \(z_t\) below the detection threshold precisely when the healing loop’s corrective authority ( Definition 40 in Self-Healing Without Connectivity) is most reduced.

\(\kappa_\text{drift}\) (process drift rate, \(\text{s}^{-1}\)) is the Kalman model noise rate calibrated from stationary field data. Distinct from: Weibull scale \(\lambda_i\) ( Definition 13 ), gossip contact rate \(f_g\) ( Proposition 12 ), and information decay rate \(\lambda_c\) ( Definition 5b ).

\(\lambda\) notation legend. \(\lambda\) carries six distinct roles in this article; subscripts and function notation are the sole disambiguators at each occurrence:

  1. Gossip contact rate — \(f_g > 0\) (exchanges per second per node); see Definition 24 and Propositions 12–13 .
  2. Weibull scale / drift rate — \(\lambda_i\) or \(\lambda_W\); Weibull scale parameter for partition duration ( Definition 13 in Why Edge Is Not Cloud Minus Bandwidth).
  3. Adaptive L2 regularization coefficient — \(\lambda_\text{reg}\); used in the online SVM weight update.
  4. Distributional shift decay constant — \(\xi_\text{shift}\); model accuracy decay under covariate shift.
  5. Eigenvalue notation — \(\lambda_i(\cdot)\), \(\lambda_{\max}(\cdot)\); standard linear algebra convention throughout.
  6. Shadow price — Lagrangian multiplier on observability constraint \(g_i\).

Key distinction: \(\kappa_\text{drift}\) (local baseline drift rate, calibrated from stationary field data) and \(\lambda_c\) (partition-induced information decay rate, Why Edge Is Not Cloud Minus Bandwidth) are separate quantities. During partition, \(\lambda_c\) governs staleness of received peer data; \(\kappa_\text{drift}\) governs the local baseline update.

Physical translation: A drift rate of \(0.02\,s^{-1}\) sounds negligible — but in 60 seconds the baseline has shifted by a factor of \(3{\times}\) without adaptive correction. This is why the EMA window must track wall-clock elapsed time, not tick count: at L2 throttling (double the tick interval), the window must double too, or baseline drift appears as a false anomaly.

Where \(\alpha \in (0, 1)\) controls the decay rate. Smaller \(\alpha\) means longer memory. Note: the variance update uses \(\mu_{t-1}\) to keep the estimate independent of \(x_t\), consistent with the anomaly score calculation.

The anomaly score \(z_t\) normalizes the current observation’s deviation from the running mean by the running standard deviation, yielding a dimensionless measure of surprise that can be compared against a fixed threshold regardless of the signal’s units or scale:

\[z_t = \frac{x_t - \mu_{t-1}}{\sigma_{t-1}}\]

Physical translation: Divide the current reading’s distance from the running mean by the running standard deviation. A result of 2.5 means this observation is 2.5 standard deviations from what the sensor has been reporting recently — statistically unlikely under a normal distribution regardless of the signal’s units or absolute scale.
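As a concrete sketch, the EWMA baseline and z-score described above fit in a small class. The \(\alpha\) value, initial state, and the 3-sigma flag in the usage example are illustrative choices, not prescriptions from the series.

```python
class EwmaDetector:
    """O(1)-time, O(1)-memory running baseline with z-score anomaly scoring.

    Each reading is scored against the previous estimates (mu_{t-1},
    sigma_{t-1}) so it cannot mask its own deviation, then folded into the
    baseline with two multiply-adds.
    """

    def __init__(self, alpha=0.1, mu0=0.0, var0=1.0):
        self.alpha = alpha
        self.mu = mu0    # running mean
        self.var = var0  # running variance

    def score(self, x):
        sigma = self.var ** 0.5
        z = (x - self.mu) / sigma if sigma > 0.0 else 0.0
        delta = x - self.mu
        self.mu += self.alpha * delta
        self.var = (1.0 - self.alpha) * (self.var + self.alpha * delta * delta)
        return z

det = EwmaDetector(alpha=0.1, mu0=10.0, var0=0.01)
for _ in range(200):
    det.score(10.0)        # nominal readings keep |z| small
spike_z = det.score(14.0)  # a sudden jump scores far above a 3-sigma flag
```

Because the variance also decays under exponential smoothing, a long run of steady readings tightens the baseline, so the same absolute jump scores higher the longer the signal has been quiet.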

Anomaly Classification Decision Problem:

Objective Function: The decision rule selects the binary decision \(d\) (flag or not flag) that maximizes expected utility given the current observation \(x_t\), where the false positive cost \(C_\text{FP}\) and the false negative cost \(C_\text{FN}\) are the key parameters.

where \(d = 1\) indicates “anomaly detected”.

Optimal Decision Rule:

The system flags an observation as anomalous when \(|z_t| > \theta^*\). The optimal threshold \(\theta^*\) is the inverse-normal quantile that balances the prior probabilities of the two hypotheses against their respective error costs, shifting the decision boundary toward sensitivity when missed detections are more costly than false alarms.

Throughout this article, \(\theta^*\) denotes the anomaly-classification threshold (a dimensionless z-score boundary). In Why Edge Is Not Cloud Minus Bandwidth, \(\tau^*\) denotes the Inversion Threshold (a partition-probability boundary). When citing results from both articles in the same context, use \(\theta^*\) for the anomaly threshold and \(\tau^*\) for the inversion threshold.

Physical translation: \(\theta^*\) shifts the alarm boundary based on how bad each type of mistake is. If missing a fault costs ten times more than raising a false alarm ( \(C_\text{FN} = 10\,C_\text{FP}\) ), the detector triggers on weaker signals, accepting more false alarms to avoid catastrophic misses. When both costs are equal, the formula reduces to the standard 50th-percentile boundary.

Proposition 9 (Optimal Anomaly Threshold). Applies the anomaly classification problem of Definition 19 to derive: given asymmetric error costs \(C_\text{FP}\) for false positives and \(C_\text{FN}\) for false negatives, the optimal detection threshold \(\theta^*\) satisfies the likelihood ratio condition:

\[\frac{f_1(\theta^*)}{f_0(\theta^*)} = \frac{C_\text{FP}}{C_\text{FN}}\]

Flag an anomaly when the anomalous distribution is more likely than normal, scaled by the relative cost of each mistake.

where \(f_1\) is the probability density under \(H_1\) (anomaly) and \(f_0\) under \(H_0\) (normal). The decision boundary lies where the anomaly likelihood exceeds the normal likelihood by the cost ratio.

In other words, the detector should flag an observation as anomalous exactly when the data is more likely to have come from the anomalous distribution \(H_1\) than from the normal distribution \(H_0\), scaled by the relative cost of each type of error.

Proof: The expected cost is the sum of false-positive cost weighted by the false-positive rate and false-negative cost weighted by the false-negative rate, both functions of the chosen threshold \(\theta\):

\[\mathbb{E}[C(\theta)] = C_\text{FP}\, P_\text{FP}(\theta) + C_\text{FN}\, P_\text{FN}(\theta)\]

Setting the derivative of the expected cost with respect to \(\theta\) to zero gives the first-order condition, where \(f_0\) and \(f_1\) are the probability densities under \(H_0\) and \(H_1\) evaluated at the boundary point \(x_\theta\):

\[C_\text{FP}\, f_0(x_\theta) = C_\text{FN}\, f_1(x_\theta)\]

This yields the Neyman-Pearson condition. Equivalently, flagging when \(|z_t| > \theta^*\) is identical to the posterior-probability rule below: declare anomaly when the probability of \(H_1\) given the current score exceeds the ratio of false-positive cost to total error cost (see Proposition 9 ).

Both formulations select the same decision boundary; the z-score form is computationally convenient while the posterior form makes the cost trade-off explicit. For tactical edge systems where \(C_\text{FN} \gg C_\text{FP}\), both shift toward sensitive detection.
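Under the equal-variance Gaussian assumption used in the derivation, the likelihood-ratio condition has a closed-form boundary. This is a sketch with illustrative parameters (`mu1`, `sigma`, and the costs are example values, not calibrated figures):

```python
import math

# Closed-form boundary for f1(x)/f0(x) = C_FP/C_FN when H0 ~ N(0, sigma^2)
# and H1 ~ N(mu1, sigma^2). Taking logs of the ratio:
#   (x*mu1 - mu1^2/2) / sigma^2 = ln(C_FP / C_FN)
# and solving for x gives the decision boundary.

def lr_boundary(mu1, sigma, c_fp, c_fn):
    return mu1 / 2.0 + (sigma ** 2 / mu1) * math.log(c_fp / c_fn)

equal_costs = lr_boundary(mu1=4.0, sigma=1.0, c_fp=1.0, c_fn=1.0)
costly_miss = lr_boundary(mu1=4.0, sigma=1.0, c_fp=1.0, c_fn=20.0)
# Equal costs put the boundary at the midpoint between the two means; a 20x
# miss cost pulls it toward the nominal distribution, i.e. toward sensitivity.
```

Note that heavy-tailed noise breaks the closed form, which is exactly the Gaussian caveat the derivation carries.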

Partition-Duration-Aware Threshold ( Definition 13 extension): Under the Weibull partition model, the relative cost of false negatives rises as partition duration grows — missed anomalies cannot be externally remediated. Let \(T_\text{acc}\) be the partition duration accumulator ( Definition 15 ) and \(Q_{95}\) the P95 planning threshold ( Definition 13 ). The cost ratio evolves as:

\[\frac{C_\text{FN}(t)}{C_\text{FP}} = \frac{C_\text{FN}(0)}{C_\text{FP}}\left(1 + \gamma_\text{FN}\,\frac{T_\text{acc}}{Q_{95}}\right)\]

Physical translation: As the device stays disconnected longer, missing a fault becomes costlier — there is no central system to compensate. This term scales the false-negative cost upward proportionally with partition age, so the detector automatically becomes more sensitive the longer it operates in isolation.

where \(\gamma_\text{FN}\) is the false-negative cost escalation rate (deployment parameter; \(\gamma_\text{FN} = 0\) recovers the static threshold). Substituting into the Neyman-Pearson condition yields the time-varying optimal threshold:

\[\theta^*(t) = \frac{\theta^*_0}{1 + \gamma_\text{FN}\, T_\text{acc}/Q_{95}}\]

Physical translation: At partition start, \(\theta^*(t) = \theta^*_0\) — baseline sensitivity. As partition age approaches the P95 planning horizon \(Q_{95}\), the alarm threshold shrinks by a factor of \((1 + \gamma_\text{FN})\). When help is farthest away, the detector is most alert.

where \(\theta^*_0\) is the static baseline threshold from the likelihood ratio condition above.

Interpretation: As \(T_\text{acc} \to 0\), \(\theta^*(t) \to \theta^*_0\) (partition just started, baseline sensitivity). As \(T_\text{acc} \to Q_{95}\), \(\theta^*(t) \to \theta^*_0/(1 + \gamma_\text{FN})\) — the detector becomes \((1 + \gamma_\text{FN})\) times more sensitive. For \(T_\text{acc} > Q_{95}\): the circuit breaker ( Proposition 37 ) fires, the system enters L0, and \(\theta^*(t)\) is frozen at its current value — further threshold drift is irrelevant since healing actions are suspended.

The numbers in this calibration block are illustrative values chosen for the OUTPOST scenario; they are not theoretical bounds.

OUTPOST calibration: With \(\gamma_\text{FN} = 2\) (example value — replace with your measured figure), the threshold drops from \(\theta^*_0\) at partition start to \(\theta^*_0/3\) (theoretical bound given the example value above) at the P95 boundary. This triggers more sensitive detection of health anomalies precisely when external recovery is least available.

Calibration example ( RAVEN ). Missed motor degradation costs 1000 J (lost drone); false alarm costs 50 J (wasted diagnostics). Cost ratio \(C_\text{FN}/C_\text{FP} = 20\) (illustrative value). With \(P(H_0) = 0.95\) (illustrative value) (nominal operation 95% of flight time), the optimal threshold \(\theta^*\) requires a posterior anomaly probability \(\geq 95.2\%\) — substantially tighter than a naive 50% threshold. The cost-ratio condition yields \(\theta^* \approx 2.5\,\hat{\sigma}\) (illustrative value — derived from the \(C_\text{FN}/C_\text{FP} = 20\) and \(P(H_0) = 0.95\) choices above), where \(\hat{\sigma}\) is the estimated noise standard deviation from the baseline estimator.

As partition age grows toward \(Q_{95}\), the denominator reaches \((1 + \gamma_\text{FN})\): \(\theta^*(Q_{95}) = \theta^*_0/(1 + \gamma_\text{FN})\). For OUTPOST with \(\gamma_\text{FN} = 2\) (illustrative value), a threshold initially set at \(2.5\hat{\sigma}\) (illustrative value) becomes \(0.83\hat{\sigma}\) (theoretical bound given the illustrative inputs above) — nearly three times more sensitive. The system grows progressively trigger-happy as partition age approaches the planning horizon, prioritising missed-fault detection over false-alarm suppression.

Circuit breaker interaction. The threshold tightening in this proposition applies only while \(T_\text{acc} \leq Q_{95}\). When \(T_\text{acc}\) exceeds \(Q_{95}\) (the P95 partition duration from Definition 13 ), Proposition 37 (Self-Healing Without Connectivity) fires — the Weibull circuit breaker suspends all healing actions and halts threshold tightening. The system enters L0 monitoring-only mode regardless of currently detected anomalies. Treat \(Q_{95}\) as the hard ceiling on \(T_\text{acc}\) for the purposes of this proposition.

Worked example — OUTPOST with n = 127 sensors. For OUTPOST Weibull parameters \(k = 1.4\) (illustrative value) (shape) and \(\lambda_W = 14400\,\text{s}\) (illustrative value) (scale, 4 h), the P95 partition duration is \(Q_{95} = \lambda_W\,(-\ln 0.05)^{1/k} \approx 31{,}500\,\text{s}\) (theoretical bound under the illustrative Weibull parameters above) (about 9 hours). With \(\theta^*_0 = 2.5\) (illustrative value) and \(\gamma_\text{FN} = 2.0\) (illustrative value):

The \(\theta^*(t)\) state transitions follow four cases. On partition onset, \(\theta^*(t)\) tightens as \(T_\text{acc}\) accumulates, converging toward \(\theta^*_0/(1 + \gamma_\text{FN})\). At the circuit-breaker boundary (\(T_\text{acc} > Q_{95}\)), \(\theta^*(t)\) is frozen at its current value and healing actions are suspended. On reconnection (\(T_\text{acc}\) resets), \(\theta^*(t)\) unfreezes and resumes from its frozen value — the prior partition’s accumulated sensitivity is preserved as the new baseline. On re-partition with a previously frozen \(\theta^*\), tightening resumes from the frozen value, so a node that survived a prior long partition starts the new one already at heightened sensitivity.

Reconnection recovery rule. When the partition ends and \(T_\text{acc}\) resets to zero, the threshold does not snap back to \(\theta^*_0\). Instead it decays back toward baseline via:

\[\theta^*(T_\text{connected}) = \max\!\left(\theta^*_\text{frozen},\; \theta^*_0\left(1 - e^{-\kappa_\text{rec}\, T_\text{connected}}\right)\right)\]

where \(\kappa_\text{rec}\) is a decay constant (default: \(1/T_\text{quarantine}\)) and \(T_\text{connected}\) is elapsed time since reconnection. The \(\max\) ensures the threshold does not fall below the frozen sensitivity level until the node has observed sufficient connected-regime data to justify a looser threshold, preventing chronic false alarms after repeated partitions while preserving the heightened vigilance earned through long isolation.
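The tighten / freeze / recover lifecycle can be sketched as a small state holder. This assumes one plausible form of the recovery decay — the frozen value as a floor under an exponential rise back to baseline — and `ThresholdState` plus all numeric values are illustrative, not the series’ implementation.

```python
import math

class ThresholdState:
    """Sketch of the theta*(t) lifecycle: tighten during partition, freeze
    past the Q95 circuit breaker, decay back toward baseline on reconnect."""

    def __init__(self, theta_base, q95_s, gamma_fn, kappa_rec):
        self.theta_base = theta_base  # theta*_0
        self.q95_s = q95_s            # P95 partition duration
        self.gamma_fn = gamma_fn      # false-negative cost escalation rate
        self.kappa_rec = kappa_rec    # recovery decay constant
        self.frozen = None            # frozen theta* once the breaker fires

    def during_partition(self, t_acc_s):
        if t_acc_s > self.q95_s:      # circuit breaker: freeze at the limit
            if self.frozen is None:
                self.frozen = self.theta_base / (1.0 + self.gamma_fn)
            return self.frozen
        theta = self.theta_base / (1.0 + self.gamma_fn * t_acc_s / self.q95_s)
        # A re-partition after a prior freeze resumes from the frozen value.
        return min(theta, self.frozen) if self.frozen is not None else theta

    def after_reconnection(self, t_connected_s):
        floor = self.frozen if self.frozen is not None else 0.0
        recover = self.theta_base * (1.0 - math.exp(-self.kappa_rec * t_connected_s))
        return max(floor, recover)    # never looser than earned vigilance
```

The `max` floor reproduces the rule that post-reconnection sensitivity loosens only as connected-regime evidence accumulates.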

(Notation: \(\delta\)-subscripted symbols in this section carry distinct roles: \(\delta_q\) is the monitoring guard band (stability margin floor); two \(\delta\)-subscripted bounds belong to the chi-squared test ( Proposition 11 ); another is the SVM weight-norm bound; two more are gossip convergence time parameters. Each is dimensionally distinct; subscripts are the sole disambiguator.)

Stability-region lower bound on \(\theta^*(t)\). The time-varying threshold tightens \(\theta^*(t)\) as partition grows. However, there is a lower bound below which tightening itself destabilizes the healing loop: if \(\theta^*(t)\) drops too far, the error signal grows even for an unchanged underlying signal, consuming stability margin without any healing action firing. Under the Stability Region framework ( Definition 4 in Why Edge Is Not Cloud Minus Bandwidth), the minimum safe threshold is bounded below by the mode’s guard band \(\delta_q\) plus a small margin \(\varepsilon\), with the scaling set by the largest Lyapunov eigenvalue \(\lambda_{\max}\).

Physical translation: Every stability region has a minimum guard band \(\delta_q\). Tightening the alarm threshold below \(\delta_q\) makes the detector itself a source of instability — small deviations in \(z_t\) generate large error signals that consume control authority without triggering any healing action. Stop tightening at \(\delta_q\).

Here \(\delta_q\) is the monitoring guard band for mode \(q\), \(\lambda_{\max}\) is the largest eigenvalue of the Lyapunov matrix, and \(\varepsilon > 0\) is a small margin. Tightening \(\theta^*(t)\) past \(\delta_q\) pushes the initial error state toward the Stability Region boundary before any healing action fires — the detector becomes a destabilizing input. In practice the guard band is tighter for L1 thermal throttle than for RAVEN L3 parameters (illustrative values) — tighter exactly when the stability region is smaller.

What this means in practice: The optimal detection threshold is not a tuning knob — it is determined by the cost ratio between wrong action types. If a false positive costs \(10\times\) less than a missed detection, the boundary shifts toward sensitivity; if costs are equal, it reduces to the 50% posterior threshold.

Analogy: A smoke detector vs. a fire — set too sensitive, every burnt piece of toast triggers it; set too lenient, real fires go undetected. The threshold is calibrated from the relative cost of each error type, not from an arbitrary sensitivity preference.

Logic: Proposition 9 derives \(\theta^*\) via the Neyman-Pearson likelihood ratio condition: flag when \(f_1(\theta)/f_0(\theta) = C_{\text{FP}}/C_{\text{FN}}\), shifting the boundary toward sensitivity as the missed-detection cost grows relative to false-alarm cost.

Watch out for: the likelihood-ratio formula assumes both \(H_0\) and \(H_1\) are Gaussian; heavy-tailed sensor noise (vibration transients, electromagnetic interference) shifts the optimal boundary beyond what the Gaussian derivation predicts, so the actual false-negative rate exceeds what the cost ratio implies.

Constraint Set: Any algorithm implementing the optimal decision rule must satisfy these three resource limits; they rule out batch-processing and unbounded-memory approaches that would otherwise be valid statistical choices.

State Transition Model: The paired update rule below shows how the EWMA accumulates the new observation \(x_t\) into the running mean and variance estimates in a single constant-time step:

\[\mu_t = \alpha\, x_t + (1 - \alpha)\, \mu_{t-1}, \qquad \sigma_t^2 = (1 - \alpha)\left(\sigma_{t-1}^2 + \alpha\,(x_t - \mu_{t-1})^2\right)\]

EWMA updates take O(1) time (two multiply-adds) and O(1) memory (storing only \(\mu\) and \(\sigma^2\)); adaptation is automatic through exponential decay with no retraining step.

Holt-Winters for Seasonal Patterns

For signals with periodic structure (day/night cycles, shift patterns), Holt-Winters captures level, trend, and seasonality. The three equations below update, respectively, the deseasonalized level \(L_t\), the local trend \(T_t\), and the seasonal correction \(S_t\), each controlled by its own smoothing coefficient (\(\alpha\), \(\beta_\text{hw}\), \(\gamma_\text{hw}\)) and a period length \(p\). (We write \(\beta_\text{hw}\) for the trend coefficient rather than bare \(\beta\) to distinguish it from the bandwidth asymmetry ratio \(\beta = B_b/B_l\) used in the ingress filter later in this article.)

\[L_t = \alpha\,(x_t - S_{t-p}) + (1 - \alpha)(L_{t-1} + T_{t-1})\]
\[T_t = \beta_\text{hw}\,(L_t - L_{t-1}) + (1 - \beta_\text{hw})\, T_{t-1}\]
\[S_t = \gamma_\text{hw}\,(x_t - L_t) + (1 - \gamma_\text{hw})\, S_{t-p}\]

Physical translation: \(L_t\) strips the seasonal swing to expose the true underlying level. \(T_t\) tracks whether that level is rising or falling. \(S_t\) records the repeating up-and-down pattern for this time of day or week. Combine all three for a one-step-ahead forecast — flag an anomaly when the actual reading diverges from that forecast beyond the calibrated threshold.

Where \(L_t\) is level, \(T_t\) is trend, \(S_t\) is seasonal component, and \(p\) is period length.

Holt-Winters updates all three components continuously, requiring O(1) time per observation and O(p) memory to store one period of seasonal factors. The period length \(p\) is scenario-specific: RAVEN flight telemetry has no meaningful seasonality (\(p = 1\); use EWMA instead); CONVOY uses \(p = 24\) hours (illustrative value) for communication quality and \(p = 8\) hours (illustrative value) for engine thermal cycles; OUTPOST uses \(p = 24\) hours (illustrative value) for solar and thermal cycles and \(p = 7\) days (illustrative value) for perimeter activity patterns.
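The three coupled updates can be sketched as follows (additive seasonal form assumed; the coefficient values and the `holt_winters_update` name are illustrative):

```python
def holt_winters_update(level, trend, seasonal, x, t, p,
                        alpha=0.2, beta_hw=0.05, gamma=0.1):
    """One additive Holt-Winters step.

    level, trend: scalars; seasonal: length-p list of corrections;
    t: observation index. Coefficients are illustrative.
    Returns (level, trend, seasonal, one_step_forecast).
    """
    i = t % p                                            # seasonal slot
    new_level = alpha * (x - seasonal[i]) + (1 - alpha) * (level + trend)
    new_trend = beta_hw * (new_level - level) + (1 - beta_hw) * trend
    seasonal[i] = gamma * (x - new_level) + (1 - gamma) * seasonal[i]
    forecast = new_level + new_trend + seasonal[(t + 1) % p]
    return new_level, new_trend, seasonal, forecast
```

On a purely periodic signal the forecast converges to the next observation, so the anomaly residual (actual minus forecast) settles near zero; the O(p) memory is the `seasonal` list.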

Isolation Forest Sketch for Multivariate

For multivariate anomaly detection with limited memory, streaming isolation forest assigns each point an anomaly score based on how quickly it can be isolated in a random tree ensemble. The formula below maps expected isolation path length \(E[h(x)]\) — normalized by the average path length \(c(n)\) in a random tree of \(n\) points — to a score between 0 and 1, where scores near 1 indicate anomalies that isolate unusually quickly.

Physical translation: Points that are easy to isolate (short path through random trees) score near 1 — they stand apart from the crowd and are anomalies. Normal points are hard to isolate (long paths through dense regions) and score near 0.5. The exponential mapping compresses path-length ratios into a 0–1 range regardless of tree depth or dataset size.

Where \(h(x)\) is path length to isolate \(x\), and \(c(n)\) is average path length in a random tree.

Scoring each new point requires O(log n) time per query and O(t) time per tree build, with \(O(t \times d)\) total memory for \(t\) trees at depth limit \(d\); reservoir sampling updates the tree ensemble online without retraining from scratch.
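The score mapping itself is tiny; a sketch using the standard normalizer \(c(n)\) from the isolation-forest literature (function names are illustrative):

```python
import math

def c(n):
    """Average path length of an unsuccessful BST search on n points,
    the standard isolation-forest normalizer."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649  # harmonic number approximation
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Map expected isolation depth E[h(x)] to a score in (0, 1):
    near 1 = isolates quickly (anomaly), near 0.5 = typical point."""
    return 2.0 ** (-mean_path_length / c(n))
```

A point isolated in depth 2 within a 256-point sample scores well above 0.8, while a point whose path length equals \(c(n)\) scores exactly 0.5.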

Parameter derivation for CONVOY:

Under assumption set (memory budget \(M \leq 32\)KB, anomaly rate \(\pi_1 \approx 0.02\), feature dimension \(d = 12\)), the three equations below derive the number of trees \(t\), the maximum tree depth, and the resulting memory footprint \(M\) from the stated constraints.

Detection rate derivation: Under these assumptions, the formula below lower-bounds the true positive rate as a function of the tree count \(t\) and the depth limit; the approximation evaluates to roughly 0.85 (illustrative value) for these parameters.

False positive rate: approximately 5% by construction (illustrative value) for an alarm threshold set at the 95th percentile of normal scores.

CUSUM for Change-Point Detection

When the goal is detecting when a change occurred (not just that it occurred), CUSUM provides optimal detection for shifts in mean [3] . The statistic \(S_t\) accumulates evidence of a positive shift above the slack \(k\) relative to nominal mean \(\mu_0\), resetting to zero whenever evidence goes negative, and triggers an alarm when it exceeds threshold \(h\).

Physical translation: \(S_t\) accumulates evidence of a persistent upward shift. Each sample contributes \(x_t - \mu_0 - k\): negative when the reading is within the allowable slack \(k\), positive when it consistently exceeds it. The \(\max(0, \cdot)\) reset forgets evidence when readings return to normal, so CUSUM only alarms when the shift is sustained — not just occasional — making it ideal for catching slow-onset degradation that EWMA adapts to and misses.

where \(\mu_0\) is the nominal mean and \(k\) is the allowable slack. Alarm when \(S_t > h\).
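A minimal sketch of the accumulate-and-reset rule (parameter values illustrative):

```python
def cusum_step(S, x, mu0, k):
    """One CUSUM update for an upward shift: accumulate evidence above
    the slack k, resetting to zero when readings return to normal."""
    return max(0.0, S + (x - mu0 - k))

def cusum_detect(samples, mu0, k, h):
    """Return the 1-based index of the first alarm (S > h), or None."""
    S = 0.0
    for i, x in enumerate(samples, start=1):
        S = cusum_step(S, x, mu0, k)
        if S > h:
            return i
    return None
```

A sustained shift of 1.5 above a zero mean with slack \(k = 0.5\) adds 1.0 per sample, so a threshold \(h = 4\) alarms on the fifth sample; in-band readings never accumulate.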

Detection speed comparison: For a shift of magnitude \(\delta\), the formula below gives the expected number of samples CUSUM needs before triggering, as a function of alarm threshold \(h\), slack \(k\), and shift \(\delta\).

EWMA with smoothing \(\alpha\) detects when the smoothed statistic crosses the control limit \(\mu_0 \pm L\sigma\sqrt{\alpha/(2-\alpha)}\). The average run length under \(H_1\) (ARL\(_1\)) depends on both \(\delta\) and \(\alpha\) and has no simple closed form — it is computed from Markov chain approximations or simulation. For standard operating parameters (\(\delta = \sigma\), \(\alpha = 0.3\), control limit \(L = 2.5\)), the standard ARL table entry is shown below.

Detection speedup analysis:

Under these assumptions, CUSUM's expected detection delay is several times shorter than EWMA's ARL\(_1\). The speedup increases with shift magnitude because CUSUM is parameterized for a known step change (\(k = \delta/2\) is optimal for shift \(\delta\)) while EWMA is a general-purpose smoother not tuned to any specific shift. CUSUM’s optimality for step changes (proven by Moustakides, 1986) means it dominates EWMA for the abrupt sensor failure scenario.

Error rate derivation:

With threshold \(z_\alpha = 2.5\) (illustrative value) applied to the anomaly score, the two-sided normal distribution gives the following false positive and false negative rates; the FNR assumes a known shift magnitude under the anomalous distribution.

Detection latency: EWMA's effective memory spans roughly \(1/\alpha\) observations. For \(\alpha = 0.3\) (illustrative value): \(N \in [3, 5]\) (illustrative value) observations contribute meaningfully to the statistic.

OUTPOST Sensor 47 uses EWMA for primary detection: temperature, motion intensity, battery voltage each tracked independently. Cross-sensor correlation uses a lightweight covariance estimate between Sensor 47 and its mesh neighbors.

Adaptive Change-Point Detection: From Static to Kalman-Optimal Baseline

The CUSUM statistic above uses a fixed nominal mean \(\mu_0\). Sensor baselines drift in practice: OUTPOST thermal sensors track diurnal temperature cycles, CONVOY engine metrics shift with load and altitude, and RAVEN RF interference patterns change with formation geometry. A stale \(\mu_0\) turns baseline drift into a continuous false alarm stream. The fix is to replace the static \(\mu_0\) with a Kalman-optimal adaptive estimator that tracks “normal” as it evolves.

Definition 20 (Adaptive Baseline Estimator). Given a sensor time series \(\{x_t\}\), the adaptive baseline is the Kalman-optimal estimate of the true instantaneous mean \(\mu_t\) under the first-order drift model [4] :

The recursive Kalman update at each timestep is:

The Kalman anomaly score (normalized innovation) is:

Under \(H_0\) (no anomaly) at steady state: \(z_t^K \sim \mathcal{N}(0, 1)\).

Design parameter \(Q/R\) (drift-to-noise ratio) controls the adaptation rate:

(Notation: \(Q/R\) is the drift-to-noise ratio used in this adaptive estimator. This is distinct from \(\rho = T_d/T_s\) — the compute-to-transmit energy ratio defined in Why Edge Is Not Cloud Minus Bandwidth’s Notation Legend.)

Connection to EWMA: The state update \(\hat{\mu}_t = \hat{\mu}_{t-1} + K_t(x_t - \hat{\mu}_{t-1})\) is structurally identical to the EWMA update \(\mu_t = \mu_{t-1} + \alpha(x_t - \mu_{t-1})\), with \(K_t\) in place of \(\alpha\). The critical difference: EWMA uses a fixed \(\alpha\); the Kalman gain \(K_t\) starts large (high initial uncertainty, learns fast) and converges to a smaller steady-state value \(K_\infty\) (tracks at the optimal rate for the observed noise level). Fixed-\(\alpha\) EWMA is a degenerate Kalman filter with the error covariance forced constant at \(\alpha R/(1-\alpha)\) every step.
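A minimal sketch of the recursive update under the random-walk drift model of Definition 20 (all tuning values illustrative):

```python
def kalman_baseline_step(mu, P, x, Q, R):
    """One scalar Kalman step for an adaptive baseline.

    mu, P: current baseline estimate and its variance.
    Q, R: process (drift) and measurement noise variances.
    Returns (mu, P, K, z): updated state, gain, normalized innovation.
    """
    P_pred = P + Q                  # predict: drift inflates uncertainty
    S = P_pred + R                  # innovation variance
    K = P_pred / S                  # Kalman gain -- the adaptive "alpha"
    innovation = x - mu
    z = innovation / S ** 0.5       # anomaly score, ~N(0,1) under H0
    mu = mu + K * innovation        # EWMA-shaped state update
    P = (1 - K) * P_pred            # posterior variance shrinks
    return mu, P, K, z
```

Run on a steady signal, the gain decays from its large warm-up value toward the steady state \(K_\infty\) (for \(Q = 10^{-4}\), \(R = 10^{-2}\), roughly 0.095, close to the \(\sqrt{Q/R} = 0.1\) small-drift approximation).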

Physical translation: The adaptive estimator tracks the sensor’s “normal” behavior using exponential smoothing. A new reading updates the mean estimate, with the learning rate \(\alpha_q\) scaled by how fast the connectivity regime is changing: when the node is transitioning between regimes (process noise \(Q\) large), the estimator adapts quickly to the new baseline. When connectivity is stable, it adapts slowly — preventing false alarms from random fluctuations around a steady operating point.

Empirical status: The cost ratio \(C_{\text{FN}}/C_{\text{FP}}\) and the RAVEN example value of 20 are engineering estimates from mission-cost analysis, not statistically derived from field data. The Neyman-Pearson threshold formula is exact given the ratio; the ratio itself requires calibration per deployment. The partition-duration escalation parameter has no published empirical baseline — treat it as a design choice until validated in integration testing.

Compute Profile: CPU: \(O(1)\) per sample — four scalar operations (prediction, gain update, state update, variance update). Memory: \(O(1)\) — fixed-size state vector and covariance scalar at steady-state gain. The bottleneck at high sample rates is memory bandwidth from sensor DMA, not filter arithmetic.

Analogy: A hospital monitor tracking a patient’s personal baseline vitals — it alerts when you deviate from your own normal, not the population average. A fit patient with a resting heart rate of 48 bpm will not trigger an alarm at 55 bpm; the system learned that 48 is your normal and adjusts its alert window accordingly.

Logic: Definition 20 replaces the EWMA’s fixed smoothing weight \(\alpha\) with a Kalman gain \(K_t\) that starts large (fast learning during warm-up) and converges to \(K_\infty\), the optimal steady-state value derived from the ratio \(Q/R\) of process noise to measurement noise.

Proposition 10 (Kalman Baseline Convergence Rate). The Kalman gain sequence converges geometrically to the steady-state value \(K_\infty\), where:

The adaptive learning rate settles to the unique value balancing process noise against sensor noise.

Here \(Q\) is the process noise variance and \(R\) is the sensor noise variance; a larger ratio \(Q/R\) produces a gain closer to 1, meaning the filter trusts new measurements more than its own model.

For small drift (\(Q \ll R\)): \(K_\infty \approx \sqrt{Q/R}\) — the steady-state gain scales as the square root of the process-to-noise ratio. Convergence to \(K_\infty\) is geometric with rate \((1 - K_\infty)\) from any initial \(P_0\), reaching steady state in approximately \(1/K_\infty\) samples.

Proof: Substitute \(P_t = P_{t-1} = P_\infty\) into the Riccati recursion and solve the resulting quadratic \(P_\infty^2 + Q P_\infty - QR = 0\). Convergence rate follows from linearization of the recursion near \(P_\infty\). \(\square\)
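The steady-state gain can be computed directly from the scalar Riccati fixed point; a sketch verifying the small-drift approximation (illustrative \(Q\) and \(R\)):

```python
import math

def steady_state_gain(Q, R):
    """Solve the scalar Riccati fixed point P^2 + Q*P - Q*R = 0 for
    P_inf, then form K_inf = (P_inf + Q) / (P_inf + Q + R)."""
    P_inf = (-Q + math.sqrt(Q * Q + 4.0 * Q * R)) / 2.0
    return (P_inf + Q) / (P_inf + Q + R)
```

For \(Q = 10^{-4}\), \(R = 10^{-2}\) the exact gain is about 0.0951, within 10% of the \(\sqrt{Q/R} = 0.1\) approximation, as Proposition 10 predicts for \(Q \ll R\).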

Design consequence: After a transient of approximately \(1/K_\infty\) samples, the false positive rate for the Kalman anomaly score is exactly \(2\Phi(-\theta^*)\) — not an approximation. The EWMA-based score with fixed \(\alpha\) is only asymptotically calibrated and may carry excess false-alarm rate during the warm-up period.

Validity condition — white noise assumption: The Kalman gain convergence (Proposition 10) and the \(H_0\) distribution both require measurement noise \(v_t\) to be approximately i.i.d. Gaussian with stationary variance \(R\). Real MEMS sensors violate this: \(1/f\) noise dominates below ~1 Hz; variance is temperature-correlated (with temperature sensitivity coefficient \(\beta_T\) for thermistors — distinct from the Holt-Winters trend coefficient \(\beta_{\text{hw}}\) above and the bandwidth asymmetry ratio \(\beta\) later in this article); aging causes slow \(R\) drift.

The false-alarm guarantee is void if \(R\) is off by more than 50% — the threshold requires an accurately calibrated sensor noise variance — and the actual false-alarm rate scales with the misestimation ratio \(R_{\text{true}}/R_{\text{assumed}}\).

OUTPOST calibration: Temperature sensors drift slowly relative to their measurement noise (both figures illustrative, datasheet-derived). At the sensor sampling rate, with \(R = 0.01\,\text{K}^2\) (illustrative value), the resulting \(Q\), \(P_\infty\), and \(K_\infty\) give a baseline that adapts on a slow timescale (illustrative) — slow enough to track seasonal drift without following measurement noise.

Empirical status: The steady-state formulas for \(P_\infty\) and \(K_\infty\) are exact given \(Q\) and \(R\). The OUTPOST drift figure and \(R = 0.01\,\text{K}^2\) are representative sensor-datasheet values; actual calibration requires pre-deployment measurement. The “50% \(R\) misestimation voids the false-alarm guarantee” bound is analytically derived, not empirically measured.

Watch out for: \(K_\infty\) and the \((1-K_\infty)^t\) convergence rate both assume stationary measurement variance \(R\); if temperature or aging drives \(R\) beyond 50% of its calibrated value, the steady-state gain is miscalibrated and the false-alarm rate guarantee from the Design Consequence paragraph does not hold until \(R\) is re-estimated.

Definition 21 (Bayesian Surprise Metric). The Bayesian Surprise statistic is the adaptive-baseline generalization of CUSUM, accumulating Kalman log-likelihood ratios:

where the log-likelihood ratio under a \(\delta\)-standard-deviation shift is:

and \(\kappa > 0\) is the allowance that prevents indefinite accumulation. Alert condition: the accumulated statistic exceeds the threshold \(h\).

When the Kalman filter tracks a drifting baseline, each day’s accumulated surprise stays near zero even as the raw sensor mean climbs — the accumulator only fires when a genuine anomaly diverges from the tracked trend, not when the trend itself shifts.

Difference from static CUSUM: The static form uses a fixed \(\mu_0\). Definition 21 replaces \(x_t - \mu_0\) with the normalized innovation \(z_t^K\), computed from the Kalman innovation \(x_t - \hat{\mu}_{t-1}\) divided by the square root of the current innovation variance. When the baseline drifts by \(5\,^\circ\text{C}\) over a season, \(z_t^K\) stays near zero throughout (the Kalman filter tracks the drift), while the static form accumulates \(S_t \propto 5/\sigma\) — triggering continuous false alarms.

Bayesian interpretation: The accumulator is a discounted accumulation of log Bayes factors [5] . An alarm at threshold \(h\) corresponds to posterior odds of roughly \(e^h\) — the detector declares a change when the Bayesian evidence ratio exceeds \(e^h\).
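A sketch of the accumulator on normalized innovations, using the standard Gaussian log-likelihood ratio for a \(\delta\)-sigma shift (the `surprise_step` name and parameter values are illustrative):

```python
def surprise_step(S, z, delta=1.0, kappa=0.5):
    """One Bayesian-surprise step on a normalized innovation z.

    llr = log[ N(z; delta, 1) / N(z; 0, 1) ] = delta*z - delta^2/2 is the
    log-likelihood ratio for a delta-sigma shift; kappa is the allowance
    preventing indefinite accumulation (values illustrative).
    """
    llr = delta * z - delta * delta / 2.0
    return max(0.0, S + llr - kappa)
```

On nominal innovations (\(z \approx 0\)) each step is negative and the max-reset pins the statistic at zero; a sustained 2-sigma shift accumulates steadily toward the alarm threshold.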

Proposition 11 (Sensor Death Override Condition). The Brownian diffusion confidence interval ( Proposition 14 ) is derived under the assumption that the sensor innovation is \(z_t^K \sim \mathcal{N}(0, 1)\). Two sensor death modes violate this assumption in opposite directions; both are detected by the sample chi-squared statistic over window \(w\):

A stuck or exploding sensor is caught by checking whether recent innovation variance is too small or too large.

Under \(H_0\) (alive sensor): \(\chi^2_w(t) \approx 1\). The diffusion model is overridden — and the node is flagged P_CRITICAL regardless of staleness — whenever:

Failure modes detected:

Proof: Under \(H_0\), \(z_t^K \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)\) asymptotically ( Proposition 10 ), so \(w \cdot \chi^2_w(t) \sim \chi^2_w\) (chi-squared with \(w\) degrees of freedom). For window \(w = 30\), the tail probabilities at both override thresholds fall below \(10^{-10}\) — false override rates are negligible. \(\square\)
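The override check itself is a few lines; a sketch with illustrative thresholds (the article's calibrated values are scenario-specific):

```python
def sensor_death_override(z_window, lo=0.3, hi=3.0):
    """Flag flatline or noise death from a window of normalized
    innovations z_t^K (each ~N(0,1) for an alive sensor).

    lo/hi are illustrative bounds on the sample chi-squared statistic;
    the article calibrates them from chi-squared tail probabilities.
    Returns (chi2, override); override=True means flag P_CRITICAL.
    """
    w = len(z_window)
    chi2 = sum(z * z for z in z_window) / w   # ~1.0 for an alive sensor
    return chi2, (chi2 < lo or chi2 > hi)
```

Unit-variance innovations pass; an all-zero window (flatline death) drives the statistic to 0 and fires, and a high-variance window (noise death) drives it far above 1 and also fires, matching the table below.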

Calibration: Set the lower and upper chi-squared thresholds (illustrative values) and the window \(w\) from \(\tau_\text{max}\) (\(\tau_\text{max}\) is the Maximum Useful Staleness bound, formally derived at Proposition 14 later in this article; a provisional calibration value is used here for OUTPOST parameters), where \(f_s\) is the sensor sampling rate in Hz. The window must span a minimum duration (Prop 5) to ensure the chi-squared test has power against slow-onset flatline failures.

Failure mode | \(\chi^2_w(t)\) signature | CI behavior | Override effect
Alive sensor | \(\approx 1.0\) | Correct width | No override
Flatline death | \(\to 0\) | Falsely narrow (high false confidence) | Flags P_CRITICAL
Noise death | \(\gg 1\) | Wide but misleading | Flags P_CRITICAL

OUTPOST scenario: In the 127-sensor perimeter mesh, flatline and noise deaths were the two most common hardware failure modes in field trials, making this chi-squared override the primary health guard for sensor trust decisions.

Empirical status: The two chi-squared thresholds and window \(w = 30\) are analytically justified by chi-squared tail probabilities (\(<10^{-10}\) false-override rate). The claim that flatline and noise deaths dominate OUTPOST failure modes is a scenario illustration, not a measured failure-mode frequency from field data.

Watch out for: the chi-squared tail probabilities in the proof assume \(z_t^K \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,1)\), which holds only after the Kalman gain has converged (Proposition 10); during warm-up, the chi-squared statistic is inflated, so the override fires with higher probability than the \(<10^{-10}\) bound implies.

The minimum viable sensor set required for continued operation and the transition to reduced capability mode under growing P_CRITICAL count are addressed in Self-Healing Without Connectivity ( Definition 121 , Minimum Viable System ).

Optional: Game-Theoretic Extension — Adversarial Threshold Selection

Proposition 9 derives \(\theta^*\) against a non-strategic anomaly distribution. An adversary who controls a compromised sensor (as in the OUTPOST scenario) can hold its output just below the detection threshold continuously, evading detection with zero effort. The correct defense is a randomized threshold - the Nash equilibrium of the inspection game.

Inspection game: Inspector selects threshold \(\theta\) (possibly mixed strategy \(\pi(\theta)\)); evader selects signal pattern to minimize detection probability.

At the Nash equilibrium \((\pi^*, a^*)\), the inspector is indifferent over all thresholds in the support of \(\pi^*\), and the evader is indifferent over all evasion strategies. The equation below states the equilibrium condition: the mixed threshold strategy \(\pi^*(\theta)\) must produce the same expected detection probability against every evasion action \(a\) the adversary might choose, so the adversary gains nothing by switching strategies.

Physical translation: A fixed threshold of \(\theta^* = 2.5\sigma\) is an open invitation — an adversary-controlled sensor need only output exactly \(2.4\sigma\) indefinitely. Using a randomized threshold drawn fresh each round forces the adversary to erase any calibration advantage: they cannot tune their evasion signal to a moving target. The Nash equilibrium mixes over a range of thresholds so that every evasion strategy faces the same expected detection probability, closing the calibration exploit entirely.

Cross-sensor defense: For the OUTPOST mesh, cross-sensor consistency - checking whether sensor \(i\)’s report is consistent with what \(i\)’s neighbors’ models predict - defeats the threshold-calibration attack, since exploiting it requires simultaneous compromise of multiple sensors.

Practical implication: In adversarial settings, draw the threshold \(\theta\) fresh each detection round rather than using a fixed \(\theta^*\). For OUTPOST's 127-sensor mesh, cross-sensor consistency checks are the primary Byzantine detection layer; randomized thresholds are a secondary defense for individual sensor evaluation.
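A sketch of the per-round draw; a uniform mixture over an illustrative range is used here as a simple stand-in for the Nash mixed strategy \(\pi^*\), whose exact support is deployment-specific:

```python
import random

def draw_threshold(theta_lo=2.0, theta_hi=3.0, rng=random):
    """Draw a fresh detection threshold (in sigma units) each round.

    The uniform range [theta_lo, theta_hi] is an illustrative stand-in
    for the equilibrium mixture pi*; the key property is that the
    adversary cannot calibrate to a single fixed boundary.
    """
    return rng.uniform(theta_lo, theta_hi)
```

An adversary sitting at \(2.4\sigma\) now faces a nonzero detection probability every round instead of evading a fixed \(2.5\sigma\) boundary forever.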

Distinguishing Failure Modes

Detection answers “is something wrong?” Diagnosis answers “what is wrong?”

For Sensor 47’s silence, the fusion node must distinguish four failure modes. Sensor hardware failure shows gradual degradation before silence (increasing noise, drifting calibration) with neighboring sensors unaffected and unusual power consumption before failure. Communication failure shows abrupt silence with no prior degradation, multiple sensors in the same mesh region affected, and common relay nodes degraded. Environmental occlusion affects specific sensor types (e.g., optical but not acoustic) in a geographic pattern (flooding, debris) with intermittent function as conditions change. Adversarial action shows precise silence with no RF emissions, a tactical pattern (sensors on approach path silenced), and timing coordinated with other events.

The fusion node maintains causal models for each failure mode. Given observed evidence \(E\), the formula below applies Bayes’ theorem to compute the posterior probability of each candidate cause, combining the prior likelihood of that failure mode with the likelihood of observing the evidence given that cause.

Priors come from historical failure rates. Likelihoods come from the signature patterns.

For Sensor 47, the abrupt silence with no degradation weights against hardware failure; neighbors functioning normally weights against communication failure; a single sensor affected weights against environmental occlusion; and its location on the approach path weights toward adversarial action.
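The posterior computation can be sketched directly from Bayes' theorem; every number below is illustrative, not a calibrated field value:

```python
def diagnose(priors, likelihoods):
    """Posterior over failure modes via Bayes' theorem.

    priors: {mode: P(mode)} from historical failure rates.
    likelihoods: {mode: P(evidence | mode)} from signature patterns.
    """
    joint = {m: priors[m] * likelihoods[m] for m in priors}
    Z = sum(joint.values())                    # evidence normalizer
    return {m: v / Z for m, v in joint.items()}

# Sensor 47 illustration: abrupt silence, healthy neighbors, single
# sensor affected, on the approach path (all probabilities invented)
priors = {"hardware": 0.4, "comms": 0.3, "environment": 0.2, "adversarial": 0.1}
likelihoods = {"hardware": 0.05, "comms": 0.02, "environment": 0.03, "adversarial": 0.6}
posterior = diagnose(priors, likelihoods)
```

Even with a low prior, the adversarial hypothesis dominates once the evidence pattern fits it far better than the alternatives, which is exactly the weighting the paragraph above describes.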

The diagnosis is probabilistic, not certain. Self-measurement provides confidence levels, not ground truth.

Machine Learning Approaches for Edge Anomaly Detection

ML extends detection to multivariate anomalies under edge constraints (<1MB models, <10ms inference, milliwatts power budget).

Lightweight Autoencoder for Multivariate Anomaly Detection

Autoencoders learn to compress and reconstruct normal behavior; anomalies produce high reconstruction error.

Architecture for edge deployment; the bottleneck at the latent layer (dim=3) forces the encoder to discard information that cannot be recovered by the decoder, so reconstruction error is high precisely when the input lies outside the learned normal manifold.

    
    graph LR
        subgraph "Encoder"
            I["Input<br/>d=12 sensors"]
            E1["Dense 8<br/>ReLU"]
            E2["Dense 4<br/>ReLU"]
            L["Latent<br/>dim=3"]
        end
        subgraph "Decoder"
            D1["Dense 4<br/>ReLU"]
            D2["Dense 8<br/>ReLU"]
            O["Output<br/>d=12"]
        end
        I --> E1 --> E2 --> L --> D1 --> D2 --> O
        style L fill:#fff3e0,stroke:#f57c00

Read the diagram: Sensor readings enter as a 12-dimensional vector. The encoder compresses them through layers of decreasing width down to a 3-number latent summary (highlighted). The decoder then tries to reconstruct the original 12 values from those 3 numbers alone. When the system operates normally, reconstruction is accurate and the error is small. When something is wrong, the decoder cannot recover the anomalous pattern — the reconstruction error becomes the anomaly score.

For CONVOY (12-sensor vehicle telemetry), the model takes a 12-dimensional input (engine temp, oil pressure, RPM, coolant, transmission temp, brake pressure, fuel flow, battery voltage, alternator output, exhaust temp, vibration, GPS quality), passes it through the architecture, totaling roughly 280 weights — 280 bytes (illustrative value) when quantized to INT8 — with inference requiring 280 multiply-adds at under 0.1 ms (illustrative value) on ARM Cortex-M4.

The anomaly score for input \(x\) is the squared reconstruction error \(|x - \hat{x}|^2\) normalized by the baseline variance estimated from validation data, so that a score near 1 indicates normal behavior and scores well above 1 indicate anomaly.

where \(\hat{x}\) is the reconstruction and the baseline variance is computed from validation data.
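A sketch of the normalized score (names illustrative):

```python
def reconstruction_score(x, x_hat, sigma2_baseline):
    """Normalized squared reconstruction error: around 1 for normal
    inputs, well above 1 for anomalies. sigma2_baseline is the error
    variance estimated from held-out validation data."""
    err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return err / sigma2_baseline
```

A perfect reconstruction scores 0; a reconstruction off by several baseline standard deviations scores far above 1 and is flagged.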

The edge autoencoder is trained offline on historical normal data with INT8 quantization-aware training to avoid accuracy loss under fixed-point arithmetic. Detection threshold calibration occurs on-device from the first 1000 observations (illustrative value) after the model is active — this is an operational constant, not a design parameter, because the ambient noise floor and sensor characteristics are only fully observable in-deployment.

Performance bounds (derived from model capacity analysis): Under assumption set — anomaly distribution \(P_1\) separable from normal \(P_0\) with overlap \(\epsilon\), model capacity \(C_m\), sample complexity \(n\) — the table below gives worst-case precision and recall bounds, per-inference computational complexity, memory footprint, and drift protection mechanism for each edge-viable detector family.

Method | Precision Bound | Recall Bound | Complexity | Memory | Drift Protection
EWMA (per-sensor) | | | \(O(1)\) | 96 bytes | Kalman baseline tracking
Isolation Forest | \(1 - (1-1/2^d)^t\) | | \(O(\log n)\) | 25 KB | Periodic forest replacement
Autoencoder (INT8) | | | \(O(d^2)\) | 280 bytes | Recalibration
One-Class SVM | \(1 - \nu\) | | \(O(d)\) | 20 bytes | 
Ensemble | | | | 376 bytes | Component-wise drift checks

Utility improvement of autoencoder over EWMA: The net utility gain \(\Delta U\) decomposes into a precision improvement term (joint detection catches more true positives per alarm) minus a recall-reciprocal term penalizing the relative miss rate of each detector weighted by the false-negative cost \(C_{\text{FN}}\).

\(\Delta U > 0\) when sensor deviations are correlated: the autoencoder captures correlated deviations (e.g., simultaneous small shifts in engine temp, oil pressure, RPM) that per-sensor EWMA misses because joint anomaly probability exceeds the product of marginal probabilities.

Tiny Neural Network for Failure Classification

Beyond detection, classification identifies which failure mode is occurring. The formula below gives the probability distribution over failure classes produced by a two-layer network: weights \(W_1, b_1\) map the anomaly feature vector \(x\) to a hidden layer, \(W_2, b_2\) map to class logits, and softmax normalizes these into a probability vector summing to 1.

The RAVEN failure classifier takes an 8-dimensional anomaly feature vector (motor currents, IMU residuals, GPS error, battery voltage deviation), classifies into 5 output classes (motor degradation, sensor drift, communication fault, power issue, unknown) through a two-layer architecture, with weights totaling 78 bytes in INT8.
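The forward pass of such a two-layer softmax classifier can be sketched as follows (shapes and names illustrative; a real deployment would use INT8 arithmetic):

```python
import math

def classify(x, W1, b1, W2, b2):
    """Two-layer softmax classifier forward pass.

    x: anomaly feature vector; W1/b1 map to hidden ReLU units;
    W2/b2 map hidden units to class logits; softmax normalizes the
    logits into a probability vector summing to 1.
    """
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]                      # hidden ReLU layer
    logits = [sum(w * hi for w, hi in zip(row, h)) + b
              for row, b in zip(W2, b2)]                 # class logits
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]             # stable softmax
    Z = sum(exps)
    return [e / Z for e in exps]
```

With identity-like toy weights, a feature vector aligned with class 0 produces a probability vector that sums to 1 and concentrates on class 0.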

Classification accuracy bound: The bound below gives the minimum guaranteed classification accuracy as a function of the number of fault classes \(K\) (distinct from the Kalman gain \(K_t\) in this article and from the EXP3-IX arm count \(K\) in Anti-Fragile Decision-Making at the Edge), the VC dimension of the hidden layer, the training sample count \(n\), and the confidence parameter \(\delta\).

For sufficient training samples (\(n > 100 \cdot K\) (illustrative value)) and well-separated failure modes, the lower bound exceeds \(0.9\).

Diagnosis-to-action mapping. Classification without a prescribed response is observational theater. Each fault class maps directly to a prescribed autonomic action; without this mapping, the classifier is inert.

Fault class to autonomic action. Priority order for simultaneous faults: Power issue > Communication fault > Motor degradation > Sensor drift > Unknown. The highest-priority fault’s prescribed action governs the current MAPE-K tick; lower-severity actions are deferred to the next tick, preventing action thrashing.

Fault classImmediate actionAutonomic consequence
Motor degradationReduce mission authorityFall back to L2 heartbeat-only behavior
Sensor driftReweight sensor to 50% confidence in aggregationEscalate gossip priority for this metric
Communication faultIncrease retry window for peer contactsAssume local isolation; do not propagate stale data
Power issueTrigger load shedding ( Definition 44 )Enter L1 capability level
UnknownEnter Safety Mode ( Definition 53 )Freeze actuator outputs; await self-test pass on two consecutive MAPE-K ticks
Multiple faults simultaneouslyApply highest-severity action above; defer lower-severity actions to next tickPriority order as above

Safety Mode entry (Unknown fault class) persists until two consecutive MAPE-K ticks pass with all signals within normal thresholds and no new anomalies detected ( Definition 53 , Self-Healing Without Connectivity).

One-Class SVM for Novelty Detection

When anomalies are rare and diverse, one-class SVM learns the boundary of normal behavior. The objective below finds the weight vector \(w\) and margin \(\rho\) that enclose the training data as tightly as possible, with the hyperparameter \(\nu \in (0,1)\) bounding the fraction of training points allowed outside the boundary and \(\phi(x_i)\) the feature mapping for point \(x_i\). (\(\rho\) here is the SVM margin hyperparameter. Later in this article \(\rho(\cdot)\) denotes spectral radius of the weight-update Jacobian, \(\rho_i[j]\) denotes the gossip observation-age tracker, and \(\rho_{\text{max}}\) denotes the anti-flood rate ceiling — subscripts and function notation differentiate all four at every occurrence.)

For edge deployment, use linear kernel with explicit feature mapping. The feature vector \(\phi(x)\) below summarizes a raw observation \(x\) into five scalars derived from the running EWMA statistics, trend estimate, CUSUM accumulator, and nearest-neighbor cross-correlation.

This 5-dimensional feature space captures statistical summaries, enabling efficient linear SVM:

Drift-aware weight update: Under non-stationary conditions, the distribution of “normal” shifts over time. Without regularization, weights drift toward the current distribution and lose generalization across connectivity regimes. The regularized gradient penalizes deviation from the baseline weight vector \(w_0\) calibrated on clean training data:

where \(\lambda\) scales with local input variation as an edge-practical proxy for distributional shift. (This is a point-to-point heuristic; a sliding-window variance estimate is more robust when compute permits.)

Jacobian stability check: Define the weight update map \(F(w) = w - \eta \nabla L(w)\). Its Jacobian is \(J = I - \eta H\), where \(H\) is the regularized Hessian. The autonomic layer becomes a noise generator when the spectral radius exceeds 1:

(\(\rho(\cdot)\) here is the spectral radius — see the \(\rho\) notation note at the one-class SVM objective above for the full disambiguation. For \(\lambda\) disambiguation — gossip contact rate, Weibull scale, L2 regularization, distributional shift, eigenvalue, and shadow price — see the \(\lambda\) notation legend in the Notation note above.)

If \(\rho(J) > 1\), weights are diverging and the SVM must be recalibrated or reverted to \(w_0\). For edge deployment with limited compute, estimate the dominant eigenvalue via power iteration:

Convergence rate is geometric in the eigenvalue ratio \(|\lambda_2/\lambda_1|\), where \(1 - |\lambda_2/\lambda_1|\) is the dominant spectral gap. For \(d = 5\) features with well-separated eigenvalues, typically fewer than 10 iterations (illustrative value) suffice.
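A sketch of the power-iteration estimate for small matrices (pure Python for clarity; an MCU implementation would use fixed-point arithmetic):

```python
def spectral_radius(J, iters=10):
    """Estimate the dominant eigenvalue magnitude of a small square
    matrix by power iteration -- the edge-practical stability check."""
    d = len(J)
    v = [1.0] * d
    lam = 0.0
    for _ in range(iters):
        w = [sum(J[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = max(abs(x) for x in w)        # infinity-norm growth estimate
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]            # renormalize the iterate
    return lam
```

A Jacobian with dominant eigenvalue below 1 reports a stable update map; above 1, the drift-detection trigger below fires and the weights revert to \(w_0\).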

Detection rate derivation for OUTPOST:

For one-class SVM with \(\nu\)-parameterization, the fraction of training points outside the boundary is at most \(\nu\). Setting \(\nu = 0.02\) (the expected anomaly rate; illustrative value), the bounds below give the worst-case false positive rate and the minimum true positive rate as functions of the feature-space VC dimension and the training size \(n\).

For \(d=5\) features and \(n > 500\) training samples (illustrative value), the false positive rate stays near the \(\nu = 0.02\) target (theoretical bound). The low FPR is critical for battery-constrained sensors where false positives waste power on unnecessary transmissions.

Drift detection trigger: Recalibrate or revert SVM weights when either condition holds:

Typical parameters: \(\varepsilon = 0.1\) (illustrative value) for the spectral margin, with a comparable tolerance (illustrative value) for the weight-norm condition. The spectral condition catches divergence before it compounds; the weight-norm condition catches slow persistent drift that Jacobian monitoring alone may miss.

RFF Extension: Non-Linear Boundary Approximation

The linear kernel assumption holds when failure modes cluster in linearly separable regions of the 5-dimensional feature space. In practice, bursty RF interference, partial jamming onset, and intermodulation products produce overlapping, non-convex clusters that a flat hyperplane cannot separate. Two drop-in replacements keep the same SRAM footprint.

Definition 22 (Random Fourier Feature Map). For the RBF kernel with bandwidth \(\gamma\), a D-dimensional RFF approximation draws \(\omega_j \sim \mathcal{N}(0, \gamma^{-2} I)\) and \(b_j \sim \text{Uniform}[0, 2\pi]\) offline, then maps each pre-scaled feature vector \(\phi\) to:

The one-class SVM operates on \(z(\phi)\) in place of \(\phi\) with the same linear objective (above), yielding decision value \(w^\top z(\phi) - \rho\). By Bochner’s theorem, \(z(\phi)^\top z(\psi) \approx k(\phi, \psi)\) for all \(\phi, \psi\), so a linear classifier in RFF space approximates the full RBF kernel classifier.

At D=4 on an edge MCU (illustrative value), the working set comprises \(5 \times 4 = 20\) Q15 values for W (40 bytes SRAM), 4 values for b (8 bytes), and 4 values for w (8 bytes) — 56 bytes total, larger than the 20-byte linear weight budget but within a 128-byte SRAM block allocation. The kernel bandwidth \(\gamma\) is set from the empirical variance of \(\phi\) on clean training data; for OUTPOST sensors this gives a scenario-specific value (illustrative). When RFF code overhead is unacceptable, a Decision Stump Forest of five single-threshold stumps — one per \(\phi_i\) at 4 bytes each (2-byte Q15 threshold, 1-bit direction, 1-byte signed weight) for 20 bytes total — provides a piecewise-constant boundary with zero trigonometric computation.

Definition 23 (Q15 Pre-Scaling for RF Feature Dimensions). Before any classification step, each of the five feature dimensions is normalized to the Q15 fixed-point range \([-1, 1)\) using stored per-dimension statistics \(m_i\) (Q15 mean) and \(s_i\) (Q15 encoding of \(3\sigma_i\)):

All subsequent arithmetic — RFF projection, dot product, confidence comparison — operates on 16-bit signed integers. No floating-point unit is required.

Q15 fixed-point is chosen because a 16-bit multiply followed by an arithmetic right-shift in ARM Thumb-2 is exactly two instructions; the division by \(3\sigma_i\) is precomputed as a reciprocal multiplier at calibration time, so inference is multiply-shift only. The calibration constants \(m_i\) and \(s_i\) require only 20 bytes and may reside in flash, leaving SRAM free for working buffers. The 3-sigma clamp also improves classification: concentrating 99.7% of normal-condition samples across the full dynamic range pushes anomalous samples to the rail and increases the decision margin for both the linear and RFF classifiers without touching training data.
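A sketch of the pre-scaling map in floating point (the on-device version replaces the division with the precomputed reciprocal multiply and shift; names are illustrative):

```python
def q15_prescale(x, mean, sigma):
    """Map a raw feature to the Q15 integer range [-32768, 32767] via
    (x - mean) / (3 * sigma), clamped at the 3-sigma rails.

    Float math here stands in for the two-instruction multiply-shift
    pipeline described above; mean and sigma come from calibration."""
    scaled = (x - mean) / (3.0 * sigma)       # reciprocal precomputed on-device
    scaled = max(-1.0, min(1.0, scaled))      # 3-sigma clamp to the rail
    return max(-32768, min(32767, int(round(scaled * 32768))))
```

A reading at the calibrated mean maps to 0, a reading 1.5 sigma above it to 16384, and anything at or beyond the 3-sigma rails saturates at the integer limits.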

Confidence Gate and Safety Mode Fallthrough

The decision value is a signed margin distance. Under ambiguous RF conditions — partial jamming onset, marginal connectivity, sensor degradation — the value may be non-negative but small, indicating a boundary-straddling sample. Acting on an inconclusive score risks both false positives (wasted power on spurious transmissions) and false negatives (missed threat transitions).

Fallthrough rule: The Q15 integer equivalent of a 0.6 confidence threshold is \(\lfloor 0.6 \times 32767 \rfloor = 19660\). When the scaled decision value falls below this gate, the node does not act on the current classification — treating it as inconclusive — and enters Safety Mode by freezing actuator outputs at their last confirmed-safe state, reducing MAPE-K to the minimum viable tick rate ( Definition 52 ), and suppressing non-critical radio transmissions. After one refractory window ( Definition 46 ), if the confidence gate is passed on two consecutive ticks the node resumes normal operation; otherwise it escalates to the Terminal Safety State ( Definition 53 ).
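A minimal sketch of the fallthrough rule, ignoring the refractory-window timing and modeling only the gate comparison and the two-consecutive-tick resume condition; the class name and return labels are hypothetical.

```python
# Hedged sketch of the confidence-gate fallthrough. GATE_Q15 is the Q15
# integer gate from the text: floor(0.6 * 32767) = 19660.

GATE_Q15 = 19660

class ConfidenceGate:
    def __init__(self):
        self.safety_mode = False
        self.consecutive_pass = 0

    def tick(self, decision_q15):
        """Return 'act', 'safety', or 'resume' for one classifier tick."""
        if decision_q15 >= GATE_Q15:
            if not self.safety_mode:
                return "act"                 # confident and already normal
            self.consecutive_pass += 1
            if self.consecutive_pass >= 2:   # two consecutive passing ticks
                self.safety_mode = False
                self.consecutive_pass = 0
                return "resume"
            return "safety"
        # Inconclusive score: freeze outputs and (re)enter Safety Mode.
        self.safety_mode = True
        self.consecutive_pass = 0
        return "safety"
```

A single passing tick after a fallthrough is not enough to resume; the node stays frozen until the gate clears twice in a row, mirroring the prose rule.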

If the confidence gate has not cleared within \(N_\text{ref,max} = 5\) consecutive refractory cycles, the node elevates to the lowest-cost available safe posture locally: it freezes its anomaly classifier outputs (returning only the last known classification), logs a stale-output flag in its health vector, and defers escalation to the Terminal Safety State ( Definition 53 , Self-Healing Without Connectivity) until the next connectivity window. This local dwell bound prevents indefinite Safety Mode persistence during prolonged partitions.

Wall-clock maximum: regardless of connectivity state, if the node has been in Safety Mode for longer than a configured wall-clock maximum (default: 4 hours), it must declare a local emergency state and freeze all outputs at their last known-good values pending connectivity recovery or manual intervention. This prevents indefinite Safety Mode persistence during multi-hour denied-regime partitions.

Once a node enters the local emergency state, four behaviors engage unconditionally. The physical-layer heartbeat (L0 watchdog ping) is never suspended, even in emergency state — a silent node is indistinguishable from a dead node, and the heartbeat allows the fleet to distinguish “alive but isolated” from “failed.” All actuator command outputs hold at the last confirmed-safe value; no new classifications, setpoints, or control actions are issued until the state is exited, and outputs marked stale-output in the health vector remain frozen. A compact distress packet (node ID, \(T_\text{acc}\), last health score) is broadcast on every available radio channel once per \(T_\text{quarantine}\), rate-limited to prevent collapsing the bandwidth budget during fleet-wide emergencies. The emergency state clears automatically when the connectivity regime returns to degraded or connected (i.e., \(T_\text{acc}\) resets) or when an authenticated operator override is received; on exit, the node resumes the normal MAPE-K loop from its frozen state — not from the initial baseline — and applies the reconnection recovery rule for \(\theta^*(t)\).

The 0.6 threshold (illustrative value) is calibrated empirically: at \(\nu = 0.02\) and \(n > 500\) training samples (illustrative value), the margin distribution of normal samples has a median near 0.85 (illustrative value) in the normalized scale; 0.6 sits two standard deviations below the median, catching genuine boundary straddlers while passing well-separated normal readings.

RFF Inference Pipeline: Code Budget Verification

The three sequential stages — pre-scale, RFF projection, and classification with confidence gate — occupy fewer than 150 bytes (illustrative value) of compiled ARM Thumb-2 code. The budget is verified by enumerating the dominant instruction patterns per stage.

| Stage | Algorithm | Thumb-2 instructions | Code budget |
| --- | --- | --- | --- |
| Pre-scale (Definition 23) | 5 Q15 multiply-shifts followed by saturating clamp to \([-32767, +32767]\) | \(\approx 10\) | 20 B |
| RFF projection (Definition 22) | 4 inner products \(W_j^T \varphi + b_j\) (Q15 multiply-accumulate, 5 terms each); 16-entry symmetric cosine LUT with 2-bit quadrant sign flip | \(\approx 28\) | 56 B |
| Classify + confidence gate | Inner product \(w_{\mathrm{svm}}^T z\) (Q15, 4 terms); compare to 19660 (\(\lfloor 0.6 \times 32767 \rfloor\)); tail-call to Safety Mode if below gate | \(\approx 30\) | 60 B |
| Total | | \(\approx 68\) | 136 B — below the 150 B target |

Stack during classification (freed on return): 5 pre-scaled Q15 values \(\varphi[5]\) = 10 B plus 4 RFF features \(z[4]\) = 8 B = 18 B stack. The Safety Mode call is a tail-call with no additional frame. Non-stack SRAM (parameters W, b, w_svm, cosine LUT) totals 72 B and may reside in flash read-only data.

Implementation notes: (i) cos_lut[16] holds one symmetric quarter-period of cosine in Q15 (values from cos(0)=32767 down to cos(\(\pi/2\))=0); full-period wrapping uses the top two index bits (bits 14–15) for quadrant and sign selection. The maximum LUT approximation error sits well above Q15 quantization noise but remains negligible for anomaly detection, where the margin differences that drive classification are far larger. (ii) enter_safety_mode() is a tail-call and does not add to the classification code budget. (iii) Instruction counts assume ARM Cortex-M0+ (no single-cycle 32-bit multiply); Cortex-M4 is cheaper by roughly 30% due to the hardware MAC unit in rff_map.
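The quadrant-folding trick in note (i) can be illustrated at reduced scale. The sketch below uses a 6-bit phase index whose top two bits select the quadrant, mirroring the role the top two bits of the Q15 index play on the MCU; treating the cos(\(\pi/2\)) boundary as a virtual 17th table entry is an implementation choice of this sketch, not part of the text.

```python
import math

# 16-entry quarter-wave cosine table in Q15, as in implementation note (i).
COS_LUT = [round(32767 * math.cos(math.pi / 2 * i / 16)) for i in range(16)]

def _lut(j):
    # Virtual entry 16 is cos(pi/2) = 0, so the 16-entry table suffices.
    return COS_LUT[j] if j < 16 else 0

def q15_cos(idx6):
    """Approximate cos(2*pi*idx6/64) in Q15 via quadrant folding."""
    quadrant = (idx6 >> 4) & 0x3   # top two bits: quadrant select
    i = idx6 & 0xF                 # low four bits: quarter-wave position
    if quadrant == 0:
        return _lut(i)             # falling from +32767 toward 0
    if quadrant == 1:
        return -_lut(16 - i)       # mirror and negate
    if quadrant == 2:
        return -_lut(i)            # negated copy of quadrant 0
    return _lut(16 - i)            # mirror back to positive
```

The four-way branch compiles to the "2-bit quadrant sign flip" of the code-budget table: one table read plus an optional index mirror and sign negation.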

Temporal Convolutional Network (TCN) for Sequence Anomalies

Some anomalies are only visible in temporal patterns - normal individual readings but abnormal sequences. The diagram below shows the tiny TCN architecture: three dilated Conv1D layers with exponentially increasing dilation rates (1, 2, 4) extend the receptive field to 15 timesteps without adding parameters, followed by global average pooling and a sigmoid output that produces an anomaly probability.

    
graph LR
    subgraph "Dilated Convolutions"
        C1["Conv1D<br/>k=3, d=1<br/>8 filters"]
        C2["Conv1D<br/>k=3, d=2<br/>8 filters"]
        C3["Conv1D<br/>k=3, d=4<br/>4 filters"]
    end
    subgraph "Output"
        P["GlobalAvgPool"]
        O["Dense 1<br/>Sigmoid"]
    end
    I["Input<br/>32 timesteps<br/>4 channels"] --> C1 --> C2 --> C3 --> P --> O
    style C1 fill:#e8f5e9,stroke:#388e3c
    style C2 fill:#e8f5e9,stroke:#388e3c
    style C3 fill:#e8f5e9,stroke:#388e3c

Read the diagram: A 32-timestep window of 4 sensor channels enters from the left. Three dilated convolution layers process it with increasing gaps between positions (dilation 1, 2, 4), letting each layer see longer temporal context without adding parameters — the receptive field grows to 15 timesteps at no extra cost. Global average pooling collapses the temporal dimension into a single summary vector; the final sigmoid outputs an anomaly probability between 0 and 1.

The TCN takes \(32\) timesteps \(\times 4\) channels (128 samples per window at 5 Hz) as input, achieves a receptive field of \(1 + 2\times(1+2+4) = 15\) timesteps (3 seconds), requires 388 bytes (illustrative value) of parameters plus a 128-byte ring buffer for a total footprint of approximately 520 bytes (illustrative value), and runs inference in under 1 ms on Cortex-M4 (illustrative value).
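The receptive-field arithmetic above is easy to sanity-check; the sketch below encodes the standard stacked-dilated-Conv1D formula under the diagram’s layer shapes (k=3, dilations 1, 2, 4). The helper name is illustrative.

```python
# Structural arithmetic for the tiny TCN: receptive field and ring buffer.

def receptive_field(kernel, dilations):
    """Stacked dilated Conv1D: RF = 1 + (k - 1) * sum(dilations)."""
    return 1 + (kernel - 1) * sum(dilations)

rf_steps = receptive_field(3, [1, 2, 4])   # 1 + 2*(1+2+4) = 15 timesteps
rf_seconds = rf_steps / 5                  # 3.0 s at the 5 Hz sample rate
ring_buffer_bytes = 32 * 4                 # 32 timesteps x 4 channels, 1 B each
```

The 128-byte ring buffer falls out of the window geometry directly: 32 timesteps of 4 one-byte channels.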

Energy feasibility on RAVEN : Definition 2 establishes local dominance when \(n_c < T_s/T_d\). For the RAVEN platform (\(T_d = 50\,\mu\text{J}\), \(T_s = 5\,\text{mJ}\)) the threshold is \(n_c < 100\) inference passes. The TCN uses \(n_c = 1\) — one forward pass per anomaly check — and the 9,000 internal MACs determine the cost of that pass, not the value of \(n_c\). At approximately 5 nJ per MAC on a Cortex-M4, one TCN inference costs \(9{,}000 \times 5\,\text{nJ} = 45\,\mu\text{J}\). The energy ratio versus a single radio transmission is \(5\,\text{mJ} / 45\,\mu\text{J} \approx 110\) in favor of computation.

Running at 5 Hz, the continuous inference power is \(5\,\text{Hz} \times 45\,\mu\text{J} = 225\,\mu\text{W}\). Avoiding a single unnecessary radio transmission (5 mJ) recovers 22 seconds of continuous inference — a favorable exchange whenever detection accuracy suppresses even one spurious transmission per 22-second window.
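The energy arithmetic in the two paragraphs above reduces to three multiplications; the sketch below reproduces it with the stated illustrative constants (5 nJ/MAC, 9,000 MACs, 5 mJ per transmission).

```python
# Worked energy budget for one TCN inference vs. one radio transmission.

NJ_PER_MAC = 5e-9            # J, illustrative Cortex-M4 MAC energy
MACS_PER_INFERENCE = 9_000   # internal MACs per forward pass
E_TX = 5e-3                  # J, one radio transmission
RATE_HZ = 5                  # continuous inference rate

e_inference = NJ_PER_MAC * MACS_PER_INFERENCE  # 45 uJ per pass
p_continuous = e_inference * RATE_HZ           # 225 uW at 5 Hz
payback_s = E_TX / p_continuous                # ~22 s of inference per avoided TX
```

The payback figure is the source of the "22-second window" claim: one avoided 5 mJ transmission funds about 22 s of continuous 225 µW inference.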

Energy-adaptive scheduling: For deployments where the energy margin is tighter than the RAVEN reference, scale anomaly detection frequency with the connectivity regime. The radio-savings justification weakens as connectivity degrades; inference frequency should follow:

| Regime | Detection rate | Primary method | Energy logic |
| --- | --- | --- | --- |
| Connected (\(C \approx 1.0\)) | 5 Hz | TCN ensemble | Radio available as fallback; ML precision maximizes detection quality |
| Degraded (\(C \approx 0.5\)) | 1 Hz | TCN + EWMA | Reduced inference budget; EWMA fills inter-TCN intervals |
| Intermittent (\(C \approx 0.25\)) | 0.2 Hz | EWMA + CUSUM | Conserve for mission-critical windows only |
| None (\(C = 0\)) | On-demand | CUSUM only | Minimal power; inference triggers only when CUSUM threshold is crossed |

This schedule keeps inference energy below the radio savings that justify it across all connectivity states, by reducing inference frequency proportionally as the radio-savings justification weakens.
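The schedule maps directly onto a lookup; the sketch below is one hedged encoding, with regime keys and method labels chosen here for illustration rather than taken from a published API.

```python
# Connectivity-adaptive detection schedule from the table above.
# rate_hz = None encodes the on-demand, CUSUM-triggered regime.

SCHEDULE = {
    "connected":    (5.0,  "tcn_ensemble"),
    "degraded":     (1.0,  "tcn_plus_ewma"),
    "intermittent": (0.2,  "ewma_plus_cusum"),
    "none":         (None, "cusum_only"),
}

def detection_plan(regime):
    """Return (rate_hz, primary_method) for a connectivity regime."""
    return SCHEDULE[regime]
```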

Application: RAVEN motor anomaly detection. Individual current readings appear normal, but the temporal signature of a failing bearing shows a characteristic oscillation.

Utility improvement of TCN over EWMA: The formula expresses the gain entirely as a recall improvement — since both models produce the same value per true positive, the difference is how many additional anomalies the TCN catches by exploiting temporal context that the per-sample EWMA cannot see.

TCN’s receptive field (15 timesteps) captures oscillation patterns with period \(T \leq 15/f_s\). EWMA operates per-sample without temporal context. For anomalies whose period satisfies \(T \leq 15/f_s\), TCN achieves higher recall by design.

Model Ensemble Strategy

Production edge systems combine multiple models into a single weighted anomaly score. The formula below is a linear combination of the four individual model scores — EWMA z-score, autoencoder reconstruction error, TCN output, and one-class SVM decision value — with weights \(w_i\) that balance each detector’s contribution.

The table below shows the weights learned via logistic regression on validation anomalies, together with the rationale for each model’s relative contribution.

| Model | Weight | Rationale |
| --- | --- | --- |
| EWMA | 0.25 | Fast, catches obvious anomalies |
| Autoencoder | 0.35 | Catches multivariate correlations |
| TCN | 0.25 | Catches temporal patterns |
| One-class SVM | 0.15 | Catches novel out-of-distribution inputs |

Ensemble utility derivation:

For \(K\) independent models with per-model recall \(R_i\), the ensemble recall under the union rule, \(R_{\text{ens}} = 1 - \prod_{i=1}^{K}(1 - R_i)\), is strictly at least as high as the best individual model’s recall, because a true anomaly is detected if any model flags it.

Utility improvement: The net gain from using the ensemble over the single best model equals the additional recall it achieves multiplied by the detection value, minus the overhead cost of running multiple models.

The gain is largest when models detect different anomaly subsets (low correlation). For \(K=4\) models (illustrative value) with \(R_i \approx 0.8\) (illustrative value) under independence: \(R_{\text{ens}} = 1 - (1 - 0.8)^4 \approx 0.998\) (theoretical bound).
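Under the independence assumption, the union-rule bound is a one-line computation; the sketch below makes the "missed only if every model misses" logic explicit.

```python
# Union-rule ensemble recall for K independent detectors.

def ensemble_recall(recalls):
    """An anomaly is missed only when every model misses it simultaneously."""
    miss = 1.0
    for r in recalls:
        miss *= (1.0 - r)
    return 1.0 - miss

bound = ensemble_recall([0.8, 0.8, 0.8, 0.8])  # 1 - 0.2**4, a theoretical bound
```

Real detectors are correlated, so the independence figure is an upper bound on the achievable recall, not an expectation.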

Model Update and Drift Management

Edge models degrade as operating conditions change. Detecting and managing model drift:

Three drift indicators signal when a model needs updating: reconstruction error baseline shift (mean reconstruction error increasing more than 20% over 7 days suggests a stale model), false positive rate increase (tracked via operator feedback loop), and confidence calibration drift (predicted probabilities should match empirical rates).

Update strategies: The table orders four responses by increasing intervention severity; the Connectivity Required column is the key constraint — the first two strategies work entirely offline, while the last requires a connected interval.

| Strategy | Trigger | Method | Connectivity Required |
| --- | --- | --- | --- |
| Threshold adjustment | FP rate >5% | Local recalibration | None |
| Incremental update | Drift detected | Online gradient step | None |
| Full retrain | Major drift | Federated learning | Intermittent |
| Model replacement | Architecture obsolete | OTA update | Connected |

Drift handling strategy derivation:

When covariate shift occurs (\(P_{\text{deploy}}(X) \neq P_{\text{train}}(X)\)), detection accuracy degrades exponentially with the KL-divergence between deployment and training distributions; the decay rate captures the sensitivity of the particular model to distributional shift.

Local threshold adjustment (recomputing \(\theta\) from recent \(X\)) restores accuracy when \(P(Y|X)\) is unchanged. Full retraining is required when \(P(Y|X)\) shifts.
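One of the three drift indicators, a reconstruction-error baseline shift above 20%, can be sketched as a rolling comparison. The class, window granularity, and update cadence are assumptions of this sketch, not part of the text.

```python
from collections import deque

# Hedged sketch of the first drift indicator: flag the model as stale when
# the rolling mean reconstruction error exceeds the clean baseline by >20%.

class DriftDetector:
    def __init__(self, baseline_mean, threshold=0.20, window=7 * 24):
        self.baseline = baseline_mean       # mean error on clean data
        self.threshold = threshold          # 20% shift from the text
        self.recent = deque(maxlen=window)  # e.g. hourly means over 7 days

    def update(self, recon_error):
        """Record one error sample; return True when drift is indicated."""
        self.recent.append(recon_error)
        current = sum(self.recent) / len(self.recent)
        return (current - self.baseline) / self.baseline > self.threshold
```

The deque's fixed length implements the 7-day horizon; a sustained shift trips the flag while a single noisy sample is averaged away.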

Cognitive Map: Local anomaly detection scales from a 2-line EWMA update (constant time, constant memory) through CUSUM (optimal for step changes and sustained drift), Kalman adaptive baselines (optimal when the “normal” baseline itself evolves), and ML ensembles (optimal for correlated multi-sensor faults that no single-sensor method catches). The threshold \(\theta^*\) is not a fixed value — it tightens automatically as partition age grows, and has a hard lower bound at the stability-region guard band \(\delta_q\). Every algorithm in this section fits in under 1 KB and runs on a microcontroller. Next: gossip protocols aggregate these local health scores across the fleet without any central coordinator.


Distributed Health Inference

Phase-0 attestation is the initial fleet commissioning window. Every node registers its identity, public key, and baseline health metrics with its direct neighbors before autonomous operation begins. These records become the trust root for subsequent Byzantine detection and ejection decisions.

Gossip-Based Health Propagation

Individual nodes detect local anomalies. Fleet-wide health requires aggregation without a central coordinator.

Scope note — H\(^\text{fleet}\) vs H\(^\text{node}\): The fleet health vector in this definition is indexed over the \(n\) nodes in the fleet — distinct from the per-component \(\mathbf{H}^\text{node}(t)\) in Why Edge Is Not Cloud Minus Bandwidth, which is indexed over the subsystems within a single node (antenna, battery, CPU, etc.). For RAVEN: \(\mathbf{H}^\text{fleet}\) has 47 elements; \(\mathbf{H}^\text{node}(t)\) has 6 elements. Do not confuse these when sizing gossip payloads or CRDT health-state records.

Definition 24 ( Gossip Health Protocol). A gossip health protocol is a tuple \((\mathbf{H}, f_g, M, T)\) where [1] :

In other words, every node keeps a score between 0 and 1 for each fleet member, periodically swaps that list with a random neighbor, and combines the two copies using a merge rule that discounts older entries via the staleness function \(T\).

Algebraic contract on M. For gossip to converge regardless of message ordering, the merge function must be commutative, associative, and idempotent. Simple averaging and last-write-wins both fail at least one property under out-of-order delivery.

Required properties of the merge function M. Conservative merge \(M(h_a, h_b) = \max(h_a, h_b)\) satisfies all three. The staleness-weighted blend is idempotent only when both observations carry equal staleness weights.

| Property | Formal requirement | Why it matters |
| --- | --- | --- |
| Commutativity | \(M(H_a, H_b) = M(H_b, H_a)\) | Node A merging B’s view must reach the same result as B merging A’s |
| Associativity | \(M(H_a, M(H_b, H_c)) = M(M(H_a, H_b), H_c)\) | Three messages that cross mid-network resolve to the same state regardless of arrival order |
| Idempotence | \(M(H, H) = H\) | Duplicate messages from gossip fan-out do not shift the estimate |
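The three algebraic requirements are mechanically checkable; the snippet below exhaustively verifies them for the conservative merge max(a, b) over a small grid of health scores. The sampling grid is arbitrary.

```python
import itertools

# Property check: conservative merge M(a, b) = max(a, b) is commutative,
# associative, and idempotent, as the merge contract requires.

def merge(a, b):
    return max(a, b)

samples = [0.0, 0.3, 0.7, 1.0]
for a, b, c in itertools.product(samples, repeat=3):
    assert merge(a, b) == merge(b, a)                      # commutativity
    assert merge(a, merge(b, c)) == merge(merge(a, b), c)  # associativity
    assert merge(a, a) == a                                # idempotence
```

The same harness run against simple averaging fails the idempotence and associativity checks under repeated delivery, which is why averaging is ruled out.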

Metadata overhead analysis. Each gossip packet carries two mandatory overhead fields: a Unified Autonomic Header (UAH, ~20 bytes, constant per packet) and a Dotted Version Vector (DVV, \(2 \times N\) bytes, scaling with fleet size). For RAVEN (N = 47), total overhead is ~114 bytes (illustrative value); a 10-byte temperature payload achieves only ~8% (illustrative value) protocol efficiency. For OUTPOST (N = 127), efficiency drops to ~4% (illustrative value).
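The efficiency figures follow from one division; the sketch below reproduces the arithmetic with the stated constants (20-byte UAH, 2 bytes of DVV per node). The function name is illustrative.

```python
# Protocol efficiency = payload / (payload + UAH + DVV) for a gossip packet.

def gossip_efficiency(n_nodes, payload_bytes, uah_bytes=20, dvv_per_node=2):
    overhead = uah_bytes + dvv_per_node * n_nodes
    return payload_bytes / (payload_bytes + overhead)

raven = gossip_efficiency(47, 10)     # 10 / (10 + 114), roughly 8%
outpost = gossip_efficiency(127, 10)  # 10 / (10 + 274), roughly 4%
```

The linear DVV term is what drives the efficiency collapse at larger fleets and motivates the Bloom-clock and epoch-scoped alternatives below.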

DVV scaling warning. Above \(N \approx 100\), the DVV alone can exceed LoRaWAN/BLE MTUs (\(\leq 255\) bytes). Evaluate Bloom-clock or epoch-scoped DVV when fleet size N > 100 and typical health payload size < 20 bytes. The autonomic detection logic is independent of this choice; the overhead affects transport, not detection.

DVV strategy comparison.

StrategyDVV sizeTrade-off
Full DVV2N bytesExact causal ordering; impractical above N \(\approx\) 100
Bloom-clock (probabilistic)16–80 bytes fixed0.1–5% false-positive causal violation rate
Epoch-scoped DVV (partition-local)\(2 \times K\) bytes (K = partition subgroup size)Exact within partition; requires epoch reconciliation on reconnect

Compute Profile: CPU: \(O(n)\) per gossip round — a merge of two \(n\)-entry health vectors plus staleness weight computation. Memory: \(O(n)\) — one health entry per fleet node. The binding cost at large \(n\) is the DVV comparison, not merge arithmetic.

Definition 36 (Synthetic Observability — Proxy-Observer). For legacy hardware without native health APIs, define the proxy health signal:

where:

The proxy confidence bound depends on an error term estimated from calibration measurements. The legacy device is admitted to the autonomic fleet when both the health and confidence thresholds are satisfied, establishing Phase 0 (Hardware Trust) without requiring a native self-health API.

Typical deployments: Modbus RTU sensors, Serial/RS-485 actuators, GPIO-only embedded controllers. Current draw and heartbeat timeout are available from virtually any embedded device; vibration is optional, and when absent its contribution is compensated by doubling the weight of the remaining signals.

The protocol operates in rounds: node \(i\) updates \(h_i\) based on local anomaly detection, selects a random peer \(j\), the two nodes exchange health vectors, and each merges the received vector with its local knowledge.

The diagram below illustrates a single gossip exchange: before the exchange each node holds only its own health vector, and after the merge both nodes hold the combined view of the pair.

    
    graph LR
    subgraph Before Exchange
    A1["Node A: H_A"] -.->|"sends H_A"| B1["Node B: H_B"]
    B1 -.->|"sends H_B"| A1
    end
    subgraph After Merge
    A2["Node A: merge(H_A, H_B)"]
    B2["Node B: merge(H_A, H_B)"]
    end
    A1 --> A2
    B1 --> B2

    style A1 fill:#e8f5e9
    style B1 fill:#e3f2fd
    style A2 fill:#c8e6c9
    style B2 fill:#bbdefb

Read the diagram: Before the exchange (left), Node A knows only its own health and Node B knows only its own. Both send their full vector to the other. After the merge (right), both nodes hold the union of what either knew — each exchange doubles the amount of current knowledge held at each endpoint. Repeat this across the fleet and health propagates like an epidemic.

The merge function must handle three challenges: staleness (older observations are less reliable), conflicts (different nodes may observe different values), and adversarial injection (compromised nodes may inject false health values).

The merge function combines two health estimates for node \(k\) into a single value by taking a trust-weighted average, where each source’s weight \(w\) reflects how recently its observation was made.

Physical translation: When two nodes disagree on a third node’s health, neither report is simply discarded. Both estimates are blended, with the fresher estimate counting more. A 1-second-old reading outweighs a 30-second-old reading by the ratio of their exponential weights.

The weight assigned to each observation decays exponentially with its staleness \(\tau_\text{stale}\) ( Definition 26 ) at rate \(\gamma_s\), so a 10-second-old reading contributes far less than a fresh one.

With \(\tau_\text{stale}\) as time since observation ( Definition 26 ) and \(\gamma_s\) as decay rate (distinct from the gossip rate \(\lambda\), from the false-negative cost escalation rate in Proposition 9 , and from \(\gamma\) in the Holt-Winters equations above where it denotes the seasonal smoothing coefficient).
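A minimal sketch of the staleness-weighted blend described above: two estimates for the same node are combined with weights \(\exp(-\gamma_s \tau_\text{stale})\). The \(\gamma_s\) default here is illustrative, not a value from the text.

```python
import math

# Staleness-weighted merge of two health estimates for the same node k.
# Fresher observations (smaller tau_stale) receive exponentially more weight.

def weighted_merge(h_a, tau_a, h_b, tau_b, gamma_s=0.1):
    w_a = math.exp(-gamma_s * tau_a)
    w_b = math.exp(-gamma_s * tau_b)
    return (w_a * h_a + w_b * h_b) / (w_a + w_b)
```

Equal staleness degenerates to a plain average; a 1-second-old reading dominates a 30-second-old one by the ratio of their exponential weights, as the physical translation states.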

Proposition 12 ( Gossip Convergence). For a gossip protocol with contact rate \(f_g\) and \(n\) nodes in a fully-connected network (any node can reach any other), the expected time for information originating at one node to reach all nodes is [6] :

\(\mathbb{E}[T_{\text{convergence}}] = \frac{2\ln(n-1)}{f_g} = O\!\left(\frac{\ln n}{f_g}\right)\)

The expected time for information to reach all n nodes is \(O(\ln n / f_g)\). For a \(1 - 1/n\) high-probability guarantee (via Markov’s inequality), plan for a \(3\times\) budget, \(3\,\mathbb{E}[T_{\text{convergence}}]\).

Physical translation: Doubling the fleet adds only \(\ln 2 / f_g\) seconds to convergence — not double the time. At \(f_g = 0.2\,\text{Hz}\), adding 47 drones to a 47-drone swarm adds roughly 3.5 seconds (illustrative value) to health convergence, not the 39 seconds (illustrative value) a linear model would predict.

In other words, fleet-wide awareness scales only logarithmically with fleet size: doubling the number of nodes adds a fixed \(\ln 2 / f_g\) seconds to convergence, not a proportional delay.

Analogy: Rumor spreading in an office — even if each person only tells two others, within \(\log_2(N)\) rounds everyone knows, because every informed person becomes a new spreader. The information doubles with each round, so the total time grows only logarithmically with headcount.

Logic: Proposition 12 shows \(\mathbb{E}[T_{\text{convergence}}] = O(\ln n / f_g)\) because each gossip round the informed set grows as \(dI/dt = f_g I(1-I)\) — logistic dynamics whose solution reaches 1 in logarithmic time.

For sparse topologies with network diameter \(D\), convergence scales as \(O(D \ln n / f_g)\) since information must traverse \(D\) hops.

Proof sketch: The information spread follows logistic dynamics \(dI/dt = f_g I(1 - I)\) where \(I\) is the fraction of informed nodes. Solving with initial condition \(I(0) = 1/n\) and computing time to reach \(I = 1 - 1/n\) yields \(T = 2\ln(n-1)/f_g\). Corollary 2. Doubling swarm size adds only \(\ln 2 / f_g\) seconds to convergence time, making gossip protocols inherently scalable for edge fleets.

The lossless fully-connected model of Proposition 12 is a lower bound. Real edge meshes are sparse and contested: OUTPOST operates on a 127-sensor mesh with diameter \(D \approx 8\) hops under sustained jamming at \(p_{\text{loss}} = 0.35\). The actual convergence time is not \(O(\ln n / \lambda)\) but a function of both topology and loss rate.

Empirical status: The \(O(\ln n / f_g)\) bound is a mathematically proven result for fully-connected lossless networks. The RAVEN figure of adding ~3.5 s for 47 additional drones at \(f_g = 0.2\,\text{Hz}\) is a direct application of the formula, not a measurement. Actual gossip convergence in field deployments depends on radio topology and interference — treat this as a lower bound.

Watch out for: the \(O(\ln n / f_g)\) bound assumes a fully-connected topology with no packet loss; in contested meshes with diameter \(D > 1\) and loss probability \(p_{\text{loss}} > 0\), the correct bound is \(O(D \ln n / (f_g(1-p_{\text{loss}})))\) (Proposition 13), which can be larger by orders of magnitude — using the fully-connected bound for a sparse contested topology produces dangerously optimistic convergence estimates.

Proposition 13 ( Gossip Convergence on Lossy Sparse Mesh). Let \(G = (V, E)\) be a connected graph with \(n\) nodes and edge conductance \(\Phi\) (the minimum, over all cuts, of the normalized ratio of boundary edges to cut size).

Sparse topology and packet loss slow gossip multiplicatively; the bottleneck is graph conductance, not just fleet size.

Under push-pull gossip with contact rate \(f_g\) and independent per-message loss probability \(p_{\text{loss}}\), the expected convergence time satisfies

\(\mathbb{E}[T_{\text{convergence}}] = O\!\left(\frac{\ln n}{\Phi\, f_g\,(1 - p_{\text{loss}})}\right)\)

in expectation. For any connected graph with diameter \(D\), the bound \(\Phi \geq 1/D\) gives the operational form \(O\!\left(\frac{D \ln n}{f_g\,(1 - p_{\text{loss}})}\right)\).

Proof sketch: Let \(S_t\) denote the informed set at gossip round \(t\), with \(|S_t| = k\). By definition of \(\Phi\), the number of boundary edges is at least \(\Phi \cdot k(n-k)/n\). Each boundary edge activates — an informed node contacts an uninformed neighbor and the message arrives — with probability \(f_g(1-p_{\text{loss}})/\bar{d}\) per round (\(\bar{d}\) = average degree). The expected growth of \(|S_t|\) is therefore proportional to \(\Phi\, f_g (1-p_{\text{loss}})\, k(n-k)/n\).

This is the discrete logistic equation with rate \(r \propto \Phi f_g (1 - p_{\text{loss}})\). The logistic ODE solution \(dI/dt = r \cdot I(1-I)\) reaches \(I = 1 - 1/n\) from \(I = 1/n\) in \(T = (2\ln(n-1))/r\). Applying \(\Phi \geq 1/D\) gives the diameter bound.

The bound holds in expectation; the stopping time is a non-negative random variable, so by Markov’s inequality \(\Pr[T > c\,\mathbb{E}[T]] \leq 1/c\). For operational planning, use \(\mathbb{E}[T]\) as the minimum safety budget; budget \(3\,\mathbb{E}[T]\) when \(1-1/n\) coverage is required. Chernoff-style analysis with bounded increments improves the tail to sub-Gaussian decay. \(\square\)

Probability tail caveat — Markov vs. Chernoff.

Markov’s inequality gives \(\Pr[T \geq c\,\mathbb{E}[T]] \leq 1/c\) for any non-negative random variable. The \(1-1/n\) probability guarantee therefore holds only at \(c = n\): the convergence time must be bounded by \(n\,\mathbb{E}[T]\), not \(\mathbb{E}[T]\) itself.

At the mean (\(c = 1\)), Markov guarantees only \(\Pr[T \geq \mathbb{E}[T]] \leq 1\) — trivially true but operationally useless for planning.

For a high-probability bound at the \(O(\ln n / \lambda)\) scale, the correct tool is a Chernoff or Azuma-Hoeffding concentration inequality applied to the martingale \(|S_t|/n\). Under the logistic growth model, convergence time concentrates around the mean with sub-Gaussian tails.

Practical implication: budget \(3\,\mathbb{E}[T_{\text{convergence}}]\) as the \(1-1/n\) operational target — the factor-3 overhead covers the gap between median convergence time and the high-probability tail. The OUTPOST calibration table below uses the correct diameter bound directly; the \(1-1/n\) language in the proposition statement should be read as an asymptotic characterization, not a tight guarantee at finite \(n\).

Specializations:

| Graph topology | \(\Phi\) | Expected convergence |
| --- | --- | --- |
| Fully connected, lossless | \(1\) | \(O(\ln n / f_g)\) — recovers Proposition 12 |
| \(k\)-regular expander, lossless | \(\Omega(1)\) | \(O(\ln n / f_g)\) |
| Grid, lossless | \(\Theta(1/\sqrt{n})\) | \(O(\sqrt{n}\,\ln n / f_g)\) |
| OUTPOST mesh (\(D=8\), \(p_\text{loss}=0.35\)) | \(\geq 1/8\) | \(\approx 238\) s (theoretical bound) |

OUTPOST calibration gap: At \(n = 127\), \(f_g = 0.5\,\text{Hz}\), Proposition 12 predicts \(T \approx 9.7\,\text{s}\); Proposition 13 predicts \(T \approx 238\,\text{s}\) (theoretical bound under the illustrative \(D=8\), \(p_\text{loss}=0.35\) parameters) under jamming. Designing for 10-second health awareness and receiving 4-minute convergence is a mission-critical gap — the two propositions formalize the cost of contested topology: each doubling of diameter doubles the bound, and packet loss widens it by the factor \(1/(1-p_\text{loss})\).

Physical translation — Proposition 12 vs. Proposition 13. In a lab environment (full mesh, negligible packet loss), health converges in \(O(\ln n / f_g)\) seconds per Proposition 12. In field conditions (contested mesh, jamming, sparse links), convergence is \(O(D \ln n / (f_g(1 - p_\text{loss})))\) per Proposition 13, where \(D\) is mesh diameter. OUTPOST example: \(D = 8\) hops (illustrative value), \(p_\text{loss} = 0.35\) (illustrative value), and \(\lambda = 0.5\) Hz transform Proposition 12’s predicted 9.7 s into Proposition 13’s theoretical bound of 238 s (4 minutes) under those illustrative parameters.
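The gap between the two propositions is easy to reproduce numerically. The sketch below uses the proof-sketch forms \(T = 2\ln(n-1)/f_g\) and \(T = 2D\ln(n-1)/(f_g(1-p_\text{loss}))\); the rate-inversion helper applies the same diameter bound and is an assumption of this sketch.

```python
import math

def t_full_mesh(n, f_g):
    """Proposition 12 proof-sketch form: T = 2 ln(n-1) / f_g."""
    return 2 * math.log(n - 1) / f_g

def t_lossy_sparse(n, f_g, diameter, p_loss):
    """Proposition 13 diameter bound: T = 2 D ln(n-1) / (f_g (1 - p_loss))."""
    return 2 * diameter * math.log(n - 1) / (f_g * (1 - p_loss))

def min_gossip_rate(n, diameter, p_loss, t_target):
    """Invert the diameter bound: minimum f_g to converge within t_target."""
    return 2 * diameter * math.log(n - 1) / (t_target * (1 - p_loss))

# OUTPOST illustrative parameters: n=127, f_g=0.5 Hz, D=8, p_loss=0.35.
t13 = t_lossy_sparse(127, 0.5, 8, 0.35)   # ~238 s, vs tens of seconds full-mesh
```

Inverting the bound is the content of Corollary 3: given a target convergence time, it yields the minimum gossip rate the mesh must sustain.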

Watch out for: the bound assumes independent Bernoulli packet loss at each edge; correlated loss events — burst jamming or synchronized interference that silences multiple edges simultaneously — violate the independence assumption underlying the conductance argument, causing actual convergence time to exceed the diameter formula.

Corollary 3 (Loss-Rate Gossip Budget). To maintain convergence within target time \(T^*\) under loss probability \(p_{\text{loss}}\) on a diameter-\(D\) mesh, the minimum gossip rate is \(f_g^{\min} = \dfrac{2 D \ln(n-1)}{T^*\,(1 - p_{\text{loss}})}\).

Watch out for: the \(1/(1-p_{\text{loss}})\) scaling factor assumes loss probability is the same on every edge; in adversarial environments where a jammer targets the bottleneck edges connecting different sub-clusters, a few edges dominate convergence time and the network-average \(p_{\text{loss}}\) systematically underestimates the required gossip rate — measure per-edge loss on the critical path, not fleet-average loss, when sizing the gossip budget.

Gossip Rate Selection: Formal Optimization

Objective Function: The formula finds the gossip rate \(\lambda^*\) that best balances convergence speed (which benefits from higher \(\lambda\)) against communication power cost (which scales linearly with \(\lambda\)).

where \(T(\lambda)\) is convergence time and \(P(\lambda)\) is power consumption.

Constraint Set: Three hard limits bound the feasible rate range — the binding constraint (whichever is most restrictive) determines \(\lambda^*\).

Optimal Solution: The optimal rate takes the minimum of the three per-constraint maximum rates; whichever constraint is most restrictive determines \(\lambda^*\), and the other two are automatically satisfied.

Physical translation: Take the three per-constraint speed limits — the maximum rate your power budget allows, the maximum rate your bandwidth allows, and the minimum rate needed to keep data fresh enough — and pick the smallest. Whichever constraint binds first sets \(\lambda^*\); the other two are automatically satisfied.

State Transition Model: The rule below describes how a node’s staleness either resets to zero (when a fresh gossip exchange occurs, with probability proportional to rate \(\lambda\)) or grows by \(\Delta t\) when no exchange happens in the current interval.

For tactical parameters (\(n \sim 50\), \(f_g \sim 0.2\) Hz), Proposition 12 gives \(T = 2\ln(49)/0.2 \approx 39\) seconds (illustrative value) — convergence within 30–40 seconds (illustrative value), fast enough to establish fleet-wide health awareness within a single mission phase. Broadcast approaches scale linearly with \(n\), which is why gossip wins at scale.

For strategic health reporting scenarios where nodes have incentives to misreport, see below.

Empirical status: The Proposition 13 bound is analytically derived. The OUTPOST values \(D = 8\), \(p_{\text{loss}} = 0.35\) are scenario parameters, not measured from a deployed system. The “\(3\times\) overhead” recommendation for operational planning is a conservative engineering heuristic, not derived from measured tail distributions in adversarial radio environments.

Optional: Game-Theoretic Extensions

Strategic Health Reporting

The gossip merge assumes truthful health reporting. Nodes competing for limited healing resources have incentives to under-report health (appear more sick) to attract healing attention.

Cheap-talk game (Crawford-Sobel): Node \(i\) with true health \(h_i\) sends report \(\hat{h}_i\). Healing resources are allocated proportional to reported sickness \(1 - \hat{h}_i\). If node \(i\) values healing resources, the equilibrium report satisfies \(\hat{h}_i < h_i\) - systematic under-reporting.

Crawford-Sobel equilibrium: With \(k\) nodes, reports are only coarsely informative - the equilibrium partitions the health space into \(k\) intervals, revealing only which interval each node’s health falls in, not the exact value.

Incentive-compatible allocation: Replace proportional allocation with a Groves mechanism for healing priority: each node reports health and the mechanism allocates healing proportional to the marginal value of healing (not reported sickness). Truthful reporting becomes a dominant strategy when the node’s healing benefit is fully internalized.

Comparative health reporting — where nodes rank their own health relative to neighbors rather than reporting absolute values — resists strategic manipulation and preserves the ordering needed for healing priority assignment while removing the incentive for absolute-value inflation.

Gossip as a Public Goods Game

The gossip rate optimization assumes a central planner selects \(\lambda\). In an autonomous fleet, each node independently selects its gossip rate - and gossip is a public good: each message costs the sender (power, bandwidth) but benefits all nodes’ health awareness.

Public goods game: Node \(i\) selects rate \(\lambda_i \geq 0\). The formula below expresses aggregate health quality \(Q\) as a function of the mean gossip rate \(\bar{\lambda}\) across all \(n\) nodes, where \(t\) is elapsed time; quality rises toward 1 as \(\bar{\lambda}\) increases, but each individual node bears the full cost of its own transmissions while sharing the benefit equally with all peers.

Physical translation: Fleet health quality is a shared resource that approaches 1 as the average gossip rate rises, but each node pays the full radio and battery cost of its own transmissions while the benefit is split equally among all \(n\) nodes. Every selfish node has an incentive to under-gossip and free-ride on neighbors’ transmissions — exactly as in a public goods game. The result is a fleet that under-monitors itself in the absence of explicit coordination.

Node \(i\) captures only \(1/n\) of the benefit of its own gossip. The Nash equilibrium equates the private marginal benefit \(\frac{1}{n}\,\partial Q/\partial \lambda_i\) with marginal cost, while the social optimum equates the full marginal benefit \(\partial Q/\partial \lambda_i\) with marginal cost. Since \(1/n < 1\), \(\lambda^{\text{Nash}} < \lambda^{\text{opt}}\): the equilibrium rate falls short of the social optimum.

Physical translation: Left to their own devices, autonomous nodes gossip at \(1/n\) of the rate that maximizes fleet health quality. Uncoordinated gossip achieves roughly \(1/n\) of the optimal gossip frequency, yielding convergence times approximately \(n\times\) longer than a coordinated protocol. For RAVEN (\(n=47\)), this means gossip convergence takes roughly \(47\times\) longer than it would under a centrally coordinated schedule — making coordination subsidies economically justified even at significant per-message overhead cost.

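The free-rider gap can be made concrete with a toy model (not the article's exact formulation): take \(Q(\bar{\lambda}) = 1 - e^{-\bar{\lambda}}\) and a linear per-node cost \(c\,\lambda_i\). The function names and the parameter values are illustrative.

```python
import math

def nash_rate(n: int, c: float) -> float:
    """Each node equates its private marginal benefit (1/n)*Q'(lam)
    with marginal cost c, for Q(lam) = 1 - exp(-lam):
    (1/n)*exp(-lam) = c  =>  lam = -ln(n*c), clamped at 0."""
    return max(0.0, -math.log(n * c))

def social_rate(c: float) -> float:
    """The planner equates the full marginal benefit Q'(lam) with c:
    exp(-lam) = c  =>  lam = -ln(c)."""
    return max(0.0, -math.log(c))

n, c = 47, 0.001  # fleet size and per-message cost, both illustrative
print(f"Nash rate: {nash_rate(n, c):.2f}, social optimum: {social_rate(c):.2f}")
```

The exact shortfall depends on the curvature of \(Q\); what the toy model preserves is the structural result that the selfish rate is strictly below the planner's rate whenever \(n > 1\).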

VCG mechanism: A Groves mechanism assigns task-allocation transfers to nodes proportional to their gossip contribution: nodes that gossip more receive fewer computational tasks (reducing effective cost). Under this mechanism, truthful power-budget reporting is a dominant strategy and the social optimum is achieved.

Physical translation: Reward high gossip contributors with lighter computational workloads - the battery cost of extra transmissions is offset by fewer CPU-intensive inference tasks. A node reporting a tight power budget gets fewer gossip assignments but also fewer heavy tasks; reporting a false low battery gains nothing because the transfer schedule makes truthful reporting the optimal strategy. During the next connected window, these rate targets are distributed fleet-wide and hold through the following partition.

Practical implication: During connected intervals, compute gossip rate assignments centrally and distribute them as target rates. The VCG transfer - differential task assignment - incentivizes nodes to maintain their assigned rates during partition. Priority gossip multipliers should be set to cover the \(1/n\) free-rider discount, not arbitrary priority levels.

Commercial Application: AUTODELIVERY Fleet Health

AUTODELIVERY operates autonomous delivery vehicles across a metropolitan area. Vehicles navigate urban canyons, parking structures, and dense commercial districts with intermittent cellular connectivity. Each vehicle must maintain fleet health awareness - vehicle availability, road conditions, charging status - without continuous cloud connectivity.

The gossip architecture implements hierarchical health propagation: local gossip between nearby vehicles, zone aggregation at hub gateways, and fleet-wide propagation when connected.

Local gossip (vehicle-to-vehicle): Vehicles within DSRC range (approximately 300 meters in urban environments) exchange health vectors at 0.5 Hz. Each vehicle maintains the fields below; the Staleness Threshold column gives the maximum age at which each field still supports useful decisions — longer-lived fields like charging station status remain valid for 10 minutes because infrastructure changes slowly.

| Health Field | Size | Update Frequency | Staleness Threshold |
|---|---|---|---|
| Vehicle ID + Position | 12 bytes | Every exchange | 30 s |
| Battery SoC + Range | 4 bytes | Every exchange | 60 s |
| Current Task Status | 8 bytes | On change | 120 s |
| Road Hazard Reports | 16 bytes | On detection | 300 s |
| Charging Station Status | 8 bytes | On visit | 600 s |

Zone-level aggregation: Hub gateways (vehicles stationed at distribution centers) aggregate zone health and gossip between zones via longer-range V2X communication. Zone summaries include:

Fleet-wide propagation: From Proposition 12, gossip convergence completes in \(O(\log n)\) rounds, so typical metropolitan fleets achieve full health convergence in under a minute, enabling real-time rebalancing of delivery assignments.

Position validation in urban environments: AUTODELIVERY faces spoofing risks from GPS multipath in urban canyons and potential adversarial spoofing from competitors or theft attempts. Each vehicle's claimed position \(p_i\) is classified as true (corroborated by a nearby peer), suspect (no peer within validation range), or false (contradicted by a peer's observation beyond the kinematically possible travel distance):

\[
V(p_i) = \begin{cases}
\text{true} & \exists j : \lVert p_i - p_j^{\text{obs}} \rVert \leq \epsilon \\
\text{false} & \min_j \lVert p_i - p_j^{\text{obs}} \rVert > d_{\max} \\
\text{suspect} & \text{otherwise}
\end{cases}
\]

where \(\epsilon = 50\,\text{m}\) (illustrative value) is the validation tolerance and \(d_{\max}\) is the maximum distance the vehicle could have traveled since its last validated position.

Vehicles with sustained position validation failures are flagged for operational review and excluded from sensitive tasks (high-value deliveries, access to secure facilities).
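The three-way classification can be sketched directly; the function name and the flat-plane distance model are our simplifications (a deployment would use geodetic coordinates and per-vehicle kinematic envelopes).

```python
import math

EPSILON_M = 50.0  # validation tolerance, illustrative value from the text

def classify_position(claimed, peer_obs, d_max):
    """Classify a claimed position against peer observations.

    claimed: (x, y) in metres; peer_obs: list of (x, y) positions at
    which peers report observing this vehicle; d_max: maximum
    kinematically possible travel distance since last validation.
    """
    if not peer_obs:
        return "suspect"        # no peer within validation range
    nearest = min(math.dist(claimed, p) for p in peer_obs)
    if nearest <= EPSILON_M:
        return "true"           # corroborated by a nearby peer
    if nearest > d_max:
        return "false"          # contradicts kinematic feasibility
    return "suspect"            # disagreement, but not provably false
```

Note the ordering: corroboration is checked before contradiction, so a single nearby confirming peer outweighs distant disagreeing ones.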

Delivery coordination under partition: When a vehicle enters an underground parking garage (complete cellular blackout), it continues operating with cached task assignments. Upon emergence:

  1. Gossip exchange with first encountered peer
  2. Receive updates accumulated during blackout
  3. Reconcile any conflicting task assignments (first-commit-wins semantics)
  4. Resume normal gossip participation

Average underground dwell time: 4.2 minutes (illustrative value). With 60-second staleness threshold, vehicles emerge with stale but still-useful health data - well within the maximum useful staleness for task rebalancing decisions.
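The first-commit-wins reconciliation in step 3 amounts to a deterministic merge; this sketch assumes each assignment carries a commit timestamp, with ties broken by vehicle ID for determinism (both representation choices are ours).

```python
def reconcile(local: dict, remote: dict) -> dict:
    """First-commit-wins merge of task assignments.

    Each map is task_id -> (commit_ts, vehicle_id). On conflict the
    earlier commit timestamp wins; equal timestamps fall back to
    lexicographic vehicle_id so both sides converge to the same map.
    """
    merged = dict(local)
    for task, claim in remote.items():
        if task not in merged or claim < merged[task]:
            merged[task] = claim
    return merged
```

Because the merge is commutative, associative, and idempotent, a vehicle emerging from blackout reaches the same assignment map regardless of which peer it meets first.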

Priority-Weighted Gossip Extension

Standard gossip treats all health updates equally. In tactical environments, critical health changes (node failure, resource exhaustion, adversarial detection) should propagate faster than routine updates.

Priority classification:

Accelerated propagation protocol:

The gossip rate \(\lambda_p\) for priority-\(p\) messages scales the base rate \(\lambda_0\) in proportion to priority level:

\[ \lambda_p = \lambda_0 \left( 1 + \eta\,\frac{p-1}{2} \right) \]

where \(\eta\) is the acceleration coefficient (typically 2–3, illustrative value) and \(p = 1, 2, 3\) for normal, urgent, and critical messages respectively. With \(\eta = 2\), critical messages gossip at \(3\times\) the normal rate.

Message prioritization in constrained bandwidth:

When bandwidth is limited, each gossip exchange prioritizes by urgency. The protocol proceeds as follows:

Step 1: Merge local and peer health vectors into a unified update set.

Step 2: Sort updates by priority (descending), then by staleness (ascending) within each priority class.

Step 3: Transmit updates in sorted order until the bandwidth budget is exhausted: update \(u_i\) is admitted only when all previously selected updates plus \(u_i\) itself still fit within the budget, \(\sum_{j \leq i} s(u_j) \leq B_\text{budget}\).

Step 4: Critical override — any update carrying \(P_\text{CRITICAL}\) priority is transmitted unconditionally, regardless of whether the budget would be exceeded.

This ensures safety-critical information propagates regardless of bandwidth constraints, accepting temporary budget overrun. In each gossip round, at most \(k_{\max}\) concurrent \(P_\text{CRITICAL}\) overrides are processed (default value illustrative); excess critical messages queue for the next round. This aggregate cap prevents a burst of simultaneous critical events from collapsing the bandwidth budget entirely while still guaranteeing timely propagation of every critical message within \(\lceil N_\text{crit} / k_{\max} \rceil\) additional rounds, where \(N_\text{crit}\) is the number of queued critical messages.
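Steps 1–4 reduce to a short greedy loop. This sketch uses plain dicts and an integer priority (3 = critical); the field names are ours, and the per-round override cap is omitted for brevity.

```python
def select_updates(updates, budget):
    """Greedy bandwidth-budgeted selection for one gossip exchange.

    updates: list of dicts with 'priority' (1=normal .. 3=critical),
    'staleness' (seconds), and 'size' (bytes). Criticals are always
    transmitted (budget override, may overrun); others fill the
    remaining budget in priority-descending, staleness-ascending order.
    """
    ordered = sorted(updates, key=lambda u: (-u["priority"], u["staleness"]))
    sent, used = [], 0
    for u in ordered:
        if u["priority"] == 3:
            sent.append(u)            # critical override: send regardless
            used += u["size"]         # overrun still counts against budget
        elif used + u["size"] <= budget:
            sent.append(u)
            used += u["size"]
    return sent
```

Counting critical overruns against the budget (rather than ignoring them) means a large critical burst silences routine updates for that round - the conservative choice when the radio is the scarce resource.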

Convergence improvement: For RAVEN with \(\eta = 2\), priority-weighted gossip triples the effective critical gossip rate: substituting \(\eta = 2\), \(p = 3\) into the rate equation gives \(\lambda_3 = 3\lambda_0\).

Since convergence time scales inversely with effective rate, the critical-to-normal speedup equals \(\lambda_3 / \lambda_0 = 3\) — a theoretical \(3\times\) speedup for \(n = 47\) drones.

Accounting for message overhead and collision backoff, the expected speedup is approximately \(2.6\times\) (illustrative value). Under the stated assumptions (uniform message sizes, bounded collision probability), critical updates converge in \(O(D/\lambda_3)\) versus \(O(D/\lambda_0)\) for normal updates, where \(D\) is network diameter.

Anti-flood protection: To prevent priority abuse by a Byzantine node that floods \(P_\text{CRITICAL}\) messages, the node's historical rate of critical messages must not exceed the per-source limit:

\[ \hat{\rho}_i(t) \leq \rho_\text{max} \]

where \(\hat{\rho}_i(t)\) is node \(i\)'s observed critical-message rate and \(\rho_\text{max} = 0.01\) messages/second (illustrative value). Exceeding this rate triggers trust decay.

Combined fleet critical-message rate bound: with \(n\) nodes each at the per-source rate limit \(\rho_\text{max}\), the maximum combined critical-message rate is \(n\,\rho_\text{max}\). For RAVEN (\(n=47\), \(\rho_\text{max} = 0.01\)), this yields at most 0.47 forced-critical messages per second (illustrative value) across the fleet — approximately 15–20 bytes/s overhead (illustrative value) assuming 40-byte messages, well within the minimum bandwidth floor \(B_\text{min}\). Deployments with larger \(n\) must lower \(\rho_\text{max}\) proportionally.
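A sliding-window limiter is one way to enforce the per-source bound; the class name and the 600-second window are our choices, with \(\rho_\text{max} = 0.01\) taken from the text.

```python
from collections import deque

class CriticalRateLimiter:
    """Per-source limiter: a node's windowed critical-message rate
    must stay at or below rho_max messages/second."""

    def __init__(self, rho_max=0.01, window_s=600.0):
        self.rho_max = rho_max
        self.window_s = window_s
        self.stamps = deque()  # send times within the current window

    def allow(self, now: float) -> bool:
        # drop timestamps that have aged out of the window
        while self.stamps and now - self.stamps[0] > self.window_s:
            self.stamps.popleft()
        # admitting this message must keep the windowed rate <= rho_max
        if (len(self.stamps) + 1) / self.window_s > self.rho_max:
            return False
        self.stamps.append(now)
        return True
```

With these defaults a single source is limited to 6 critical messages per 10-minute window; a denied message is the trigger point for the trust-decay response described above.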

Bandwidth Asymmetry and Ingress Filtering

The gossip prioritization above assumes backhaul bandwidth is scarce but nonzero. At the extreme — when the radio link is a tiny fraction of the local sensor bus — prioritization alone is insufficient. The node must also decide which metrics are worth transmitting at all.

Define the bandwidth asymmetry ratio

\[ \beta = \frac{B_b}{B_l} \]

where \(B_b\) is backhaul bandwidth (radio uplink) and \(B_l\) is local bus bandwidth (intra-node sensor bus). Typical edge values: \(B_b \approx 1\) Mbps (illustrative value) for a tactical radio uplink and \(B_l \approx 100\) Mbps (illustrative value) for the sensor bus, giving \(\beta = 0.01\) (illustrative value). At \(\beta \leq 0.01\), the backhaul is at most 1% of local capacity. Sending everything the node observes locally is physically impossible.

Physical translation: The backhaul link carries only 1 Mbps when the local sensor bus runs 100 Mbps — a 100:1 gap. You must discard 99% of local observations before they can leave the device. This is not a cost optimisation; it is a physical constraint that makes centralised aggregation architecturally impossible at full sensor resolution.

Definition 25 (Bandwidth-Asymmetry Ingress Filter). The ingress filter \(\Pi\) determines whether metric \(m\) observed at time \(t\) is transmitted:

\[
\Pi(m, t) = \begin{cases}
1 & \text{priority}(m) = P_\text{CRITICAL} \\
1 & t - t_\text{last}(m) \geq \tau_{\max} \\
1 & \dfrac{|m(t) - m_\text{last}|}{R_m} \geq \dfrac{\theta_\Pi}{\beta} \\
0 & \text{otherwise}
\end{cases}
\]

Local processing capacity (bytes/s) bounds how much input the filter can even consider per scheduling window; input beyond that capacity is dropped upstream of \(\Pi\).

Physical translation: On a severely bandwidth-constrained uplink (\(\beta = 0.01\)), only a metric that has changed by more than 10% of its full operating range is worth transmitting — anything smaller is noise relative to the channel’s information capacity. The filter enforces this automatically: safety-critical events always get through (the \(P_\text{CRITICAL}\) override), stale readings are forced through at \(\tau_{\max}\) to keep the MAPE-K loop alive, and everything else is silenced until the change is large enough to be actionable. The result is a roughly 100,000-fold reduction in radio transmissions on a 72-hour mission with no loss of decision-relevant information.

where \(m_\text{last}\) is the last transmitted value of \(m\), \(R_m\) is the metric’s operational dynamic range, \(\theta_\Pi\) is a baseline sensitivity parameter, \(\beta = B_b/B_l\) is the bandwidth asymmetry ratio, and \(\tau_{\max}\) is the maximum useful staleness bound from Proposition 14.

Three conditions trigger transmission (any one suffices): \(P_\text{CRITICAL}\) metrics bypass the filter entirely so safety-critical information always transmits; even a slowly changing metric transmits at least once per \(\tau_{\max}\) so the MAPE-K loop never starves on stale P2/P3 inputs (a metric silent beyond \(\tau_{\max}\) carries zero confidence and must refresh, tying directly to Proposition 14); and as \(\beta \to 0\) the normalized-change threshold \(\theta_\Pi/\beta\) grows without bound, so only extreme deviations transmit in normal operation.
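The three trigger conditions compose into a single predicate. This sketch hard-codes the OUTPOST calibration used below (\(\theta_\Pi = 0.001\), \(\beta = 0.01\)) and an illustrative \(\tau_{\max} = 60\) s; the dict-based metric record is our representation.

```python
def should_transmit(metric: dict, now: float) -> bool:
    """Bandwidth-asymmetry ingress filter, Definition 25 sketch.

    metric: dict with 'value', 'last_tx_value', 'last_tx_time',
    'range' (operational dynamic range), 'critical' (bool).
    """
    THETA_PI, BETA, TAU_MAX = 0.001, 0.01, 60.0  # illustrative calibration
    if metric["critical"]:
        return True                                # P_CRITICAL bypass
    if now - metric["last_tx_time"] >= TAU_MAX:
        return True                                # anti-starvation refresh
    change = abs(metric["value"] - metric["last_tx_value"]) / metric["range"]
    return change >= THETA_PI / BETA               # beta-scaled threshold
```

With this calibration a temperature metric on a \(100\,^\circ\text{C}\) range stays silent for a \(5\,^\circ\text{C}\) drift but transmits on a \(15\,^\circ\text{C}\) excursion, matching the table below.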

Calibration example for OUTPOST (\(\beta = 0.01\), \(\theta_\Pi = 0.001\)):

| Metric | Normal threshold | Filtered threshold (\(\beta = 0.01\)) | Interpretation |
|---|---|---|---|
| Temperature drift | \(0.1\,^\circ\text{C}\) (0.1% of \(100\,^\circ\text{C}\) range) | \(10\,^\circ\text{C}\) (10% of range) | Only transmit on significant excursion |
| Battery state-of-charge | 1% change | 10% change | Coarse reporting only |
| Seismic amplitude | any spike | always (P_CRITICAL) | Bypasses filter |
| Mesh link quality | 5% drop | 50% drop | Catastrophic degradation only |

The filter preserves the P0–P2 observability hierarchy: availability (P0) and resource exhaustion (P1) metrics carry P_CRITICAL priority and are never dropped; performance (P2) and anomaly (P3) metrics are subject to the \(\beta\)-scaled threshold.

Energy connection: Each filtered-out metric saves \(T_s\) joules ( Definition 2 ). Over a 72-hour partition with 1,000 sensor metrics updating at 1 Hz, filtering reduces transmissions from 259 million potential packets to fewer than 2,600 — a 100,000x reduction in radio energy expenditure, directly extending battery life.

Gossip Under Partition

Fleet partition creates isolated gossip domains. Within each cluster, convergence continues at the intra-cluster gossip rate. Between clusters, state diverges until reconnection.

Remark (Partition Staleness). For node \(i\) in cluster \(C_1\) observing node \(j\) in cluster \(C_2\), staleness - the elapsed time since observation - accumulates from partition time \(t_p\):

\[ \tau_\text{stale}(t) = t - t_p, \quad t \geq t_p \]

The staleness grows unboundedly during partition, eventually exceeding any useful threshold.

The diagram below shows two gossip clusters separated by a hard partition: gossip continues normally within each cluster (solid edges), but the severed link (crossed dashed edge) blocks all cross-cluster exchanges.

    
    graph LR
    subgraph Cluster_A["Cluster A (gossip active)"]
    A1[Node 1] --- A2[Node 2]
    A2 --- A3[Node 3]
    A1 --- A3
    end
    subgraph Cluster_B["Cluster B (gossip active)"]
    B1[Node 4] --- B2[Node 5]
    B2 --- B3[Node 6]
    B1 --- B3
    end
    A3 -.-x|"PARTITION<br/>No communication"| B1
    style Cluster_A fill:#e8f5e9
    style Cluster_B fill:#e3f2fd

Read the diagram: Within each cluster (green and blue subgraphs), gossip continues normally — solid edges show active exchanges. The crossed dashed edge between Cluster A and Cluster B is severed; no health information crosses that boundary. Each cluster converges to a locally-consistent but globally-stale view. Cross-cluster staleness accumulates for as long as the partition persists.

Cross-cluster state tracking:

Each node maintains a partition vector \(\rho_i\) that records, for every other node \(j\), either zero (node \(j\) still reachable) or the time of last confirmed contact (if unreachable), enabling staleness calculations when connectivity is later restored:

\[
\rho_i[j] = \begin{cases}
0 & \text{node } j \text{ reachable} \\
t_\text{last}(i, j) & \text{otherwise}
\end{cases}
\]

where \(t_\text{last}(i, j)\) is the HLC timestamp ( Definition 61 ) of the last confirmed contact, recorded as an HLC timestamp rather than wall-clock time to preserve causal ordering across nodes with clock drift.

When \(\rho_i[j] > 0\) and \(t - \rho_i[j] > \tau_{\max}\), node \(i\) marks its knowledge of node \(j\) as uncertain rather than merely stale.

Reconciliation priority:

Upon reconnection, nodes exchange partition vectors. Reconciliation priority for node \(j\) is the product of its partition duration and its operational importance weight \(w_j\):

\[ \text{priority}(j) = (t - \rho_i[j]) \cdot w_j \]

Nodes with longest partition duration and highest importance (cluster leads, critical sensors) reconcile first.
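A minimal sketch of the ordering step, assuming plain numeric timestamps in place of HLC values and a default importance weight of 1.0 (both simplifications are ours):

```python
def reconciliation_order(partition_vector: dict, weights: dict, now: float):
    """Order partitioned peers for post-reconnection reconciliation.

    partition_vector: node_id -> 0 (reachable) or last-contact time;
    weights: node_id -> operational importance (cluster leads and
    critical sensors get higher values). Priority = duration * weight.
    """
    scores = {
        j: (now - last) * weights.get(j, 1.0)
        for j, last in partition_vector.items()
        if last > 0  # entry 0 means still reachable: nothing to reconcile
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A heavily weighted node with a short partition can outrank a long-partitioned routine sensor, which is the intended behavior: importance multiplies, it does not merely tie-break.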

Confidence Intervals on Stale Data

Health observations age. A drone last heard from 30 seconds ago may have changed state since then.

Definition 26 (Staleness). The staleness \(\tau_\text{stale}\) of an observation is the elapsed time since the observation was made. An observation with staleness \(\tau_\text{stale}\) has uncertainty that grows with \(\tau_\text{stale}\) according to the underlying state dynamics.

In other words, the older a health reading is, the less reliable it becomes — not because the data was wrong when recorded, but because the underlying system may have changed in the intervening time.

Model health as a stochastic process. If health evolves with variance \(\sigma^2\) per unit time, the \((1-\alpha)\) confidence interval around the last known health value \(h(t_0)\) is \(h(t_0) \pm z_{1-\alpha/2}\,\sigma\sqrt{\tau_\text{stale}}\); the half-width grows as \(\sqrt{\tau_\text{stale}}\), widening quickly at first and more gradually as the observation ages.

Physical translation: The confidence interval widens as the square root of elapsed time \(\tau_\text{stale}\). A node last seen 4 seconds ago has twice the uncertainty of a node last seen 1 second ago. At 100 seconds, the interval has grown 10x from its initial width. When the interval spans a capability-level boundary, you can no longer make reliable decisions about that node.


Assumption: Health evolves as a Brownian diffusion with variance \(\sigma^2\) per unit time, so the \((1-\alpha)\) confidence half-width grows as \(z_{1-\alpha/2}\,\sigma\sqrt{\tau_\text{stale}}\). This assumption breaks for strongly mean-reverting or bounded metrics (e.g., binary health indicators), where alternative staleness models should be used.

Implications for decision-making:

The CI width grows as \(\sqrt{\tau_\text{stale}}\) — a consequence of the Brownian motion model. This square-root scaling means uncertainty grows quickly at small staleness, then more gradually as observations age further.

When the CI spans a decision threshold (like a capability boundary), you can’t reliably commit to that capability level. The staleness has exceeded the decision horizon for that threshold - the maximum time at which stale data can support the decision.

Different decisions have different horizons. Safety-critical decisions with narrow margins have short horizons. Advisory decisions with wide margins have longer horizons. The system tracks staleness against the relevant horizon for each decision type.

When confidence is insufficient, three response strategies apply: attempt direct communication to get a fresh observation, assume health at the lower bound of the CI as a conservative fallback, or escalate observation priority by increasing the gossip rate for that node.

Proposition 14 (Maximum Useful Staleness). For a health process modeled as Brownian diffusion with volatility \(\sigma\) (as in Definition 26 ), and a decision requiring discrimination at precision \(\Delta h\) with confidence \(1 - \alpha\), the maximum useful staleness is:

\[ \tau_{\max} = \left( \frac{\Delta h}{z_{1-\alpha/2}\,\sigma} \right)^{2} \]
Beyond this age, uncertainty has grown too wide to distinguish normal from anomalous — old data misleads rather than informs.

Physical translation: \(\tau_{\max}\) is the maximum sensor reading age before the confidence interval widens enough to straddle the decision threshold. After \(\tau_{\max}\), the reading contributes less certainty than not measuring at all — old data becomes an active liability, not just a passive gap.

where \(z_{1-\alpha/2}\) is the standard normal quantile and \(\Delta h\) is the acceptable drift. Beyond \(\tau_{\max}\), the confidence interval spans the decision threshold and the observation cannot support the decision.

Proof: Under the diffusion model ( Definition 26 ), health evolves as a Brownian process with variance \(\sigma^2\) per unit time. Given the last observation at time \(t_0\), the current state at time \(t_0 + \tau\) lies within a \((1-\alpha)\) confidence interval of half-width \(z_{1-\alpha/2}\,\sigma\sqrt{\tau}\). Setting this equal to the required decision precision \(\Delta h\) and solving for \(\tau\) gives the result. Under the diffusion model, the staleness bound is independent of observation rate \(\lambda\) — more frequent observations do not reduce uncertainty about the current state between observations, only about the state at each observation time.

Corollary 4. The quadratic relationship \(\tau_{\max} \propto (\Delta h)^2\) implies that tightening decision margins dramatically reduces useful staleness. Systems with narrow operating envelopes must refresh observations more frequently — not because more observations narrow the diffusion uncertainty, but because each observation must occur before the health state drifts by \(\Delta h\).
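The bound is a one-liner to evaluate; the helper name is ours and the default \(\alpha = 0.05\) is illustrative.

```python
from statistics import NormalDist

def max_useful_staleness(delta_h: float, sigma: float, alpha: float = 0.05) -> float:
    """Proposition 14: tau_max = (delta_h / (z_{1-alpha/2} * sigma))^2,
    the observation age at which the (1-alpha) confidence half-width
    z * sigma * sqrt(tau) reaches the decision precision delta_h."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (delta_h / (z * sigma)) ** 2
```

Halving \(\Delta h\) quarters \(\tau_{\max}\), which is Corollary 4 in numeric form.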

Time-varying \(\sigma\) caveat: Proposition 14 assumes constant measurement volatility \(\sigma\). OUTPOST thermistors exhibit temperature-dependent volatility — roughly three times higher (illustrative value) at the cold end of the operating range than at lab temperature.

For sensors with temperature-correlated variance, substitute the worst-case volatility \(\sigma_{\max} = \max_T \sigma(T)\) as a conservative upper bound on \(\sigma\). This produces a shorter, conservative staleness limit.

Decision systems with narrow operating envelopes (small \(\Delta h\)) will find \(\tau_{\max}\) below 10 seconds (illustrative value) in cold conditions — requiring far higher gossip rates than a lab-temperature calibration implies.

Empirical status: The formula is exact given the Brownian diffusion model and the parameters \(\sigma\), \(\Delta h\), \(\alpha\). The OUTPOST temperature-variance model is a representative sensor-datasheet curve; actual values vary by manufacturer and operating history. The “under 10 seconds” characterization for fast-moving deployments is an illustration, not a measured bound.

Watch out for: the Brownian diffusion model captures gradual drift but not abrupt failures; a sensor that jumps discontinuously to a failed state exceeds \(\Delta h\) in a single tick regardless of \(\tau_{\max}\), so the staleness bound does not constrain step-function fault modes.

Byzantine-Tolerant Health Aggregation

In contested environments, some nodes may be compromised and may inject false health values.

Definition 27 (Byzantine Node). A node is Byzantine if it may deviate arbitrarily from the protocol specification, including sending different values to different peers, reporting false observations, or selectively participating in gossip rounds [7] .

In other words, a Byzantine node is one that cannot be assumed to behave honestly in any predictable way — unlike a crashed node, it may actively lie, and it may lie differently to different neighbors simultaneously.

Scope: this threat model applies within a single partition epoch — specifically to gossip health aggregation while nodes are co-partitioned and can observe each other’s messages. Post-reconnection Byzantine treatment (reputation-weighted CRDT admission, quorum-gated state merge) builds on this foundation but introduces additional per-epoch mechanisms addressed in Fleet Coherence Under Partition.

Compute Profile: CPU: \(O(n \log n)\) per aggregation round — trust-weighted trimmed mean requires sorting reporters by trust weight before trimming. Memory: \(O(n \cdot W)\) — consistency-history buffer of \(W\) rounds and \(n\) nodes; this buffer becomes the binding constraint as fleet size grows.

The aggregation function uses a trust-weighted trimmed mean: the bottom and top \(f/n\) weight fractions are excluded before computing the weighted average. This makes the aggregate robust to up to \(f\) Byzantine contributors.

Weighted voting based on trust scores. The aggregate health estimate for member \(k\) is a trust-weighted average of each node's report \(\hat{h}_{i,k}\), where \(T_i\) is the accumulated trust score of reporting node \(i\); nodes with low or decayed trust contribute proportionally less:

\[ \hat{h}_k = \frac{\sum_i T_i\,\hat{h}_{i,k}}{\sum_i T_i} \]

Physical translation: This is a weighted average where each reporter’s vote is scaled by its accumulated trust. A node with trust 0.9 influences the aggregate nine times more than a node with trust 0.1. A node whose trust has decayed to 0.01 — due to repeated inconsistent reports — is almost completely silenced without being formally ejected.

Where \(T_i\) is the trust score of node \(i\). Trust is earned through consistent, verifiable behavior and decays when inconsistencies are detected.
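A sketch of the aggregation, with one simplification flagged: it trims a fixed count \(f\) of reports from each end rather than the \(f/n\) trust-weight fractions described above, which is equivalent only under roughly uniform weights.

```python
def trust_weighted_trimmed_mean(reports, f: int) -> float:
    """reports: list of (value, trust) pairs for one fleet member.

    Drops the f lowest and f highest reported values, then returns the
    trust-weighted average of the remainder - robust to up to f
    Byzantine contributors pushing the extremes.
    """
    if len(reports) <= 2 * f:
        raise ValueError("need more than 2f reports to trim f from each end")
    kept = sorted(reports)[f:len(reports) - f]   # sort by value, trim tails
    total = sum(t for _, t in kept)
    return sum(v * t for v, t in kept) / total
```

Two colluders reporting extreme values (one high, one low, to shift the mean) are both discarded by the trim before their trust weights ever enter the average.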

Outlier detection on received health reports: A report \(\hat{h}_{i,k}\) from node \(i\) about node \(k\) is flagged suspicious when its absolute deviation from the current consensus value exceeds the outlier threshold: \(|\hat{h}_{i,k} - \hat{h}_k| > \theta_\text{out}\).

Repeated suspicious reports decrease trust score for node \(i\).

Isolation protocol for nodes with inconsistent claims:

  1. Track history of claims per node (sliding window of \(W\) rounds; memory: \(O(n \cdot W)\) bits)
  2. Compute consistency score: fraction of claims matching consensus
  3. If consistency below threshold, quarantine node from health aggregation
  4. Quarantined nodes can still participate but their reports are not trusted

Memory bound: For \(n = 50\) nodes (illustrative value) and \(W = 100\) rounds (illustrative value), history storage requires \(50 \times 100 / 8 = 625\) bytes (illustrative value) using 1-bit flags per observation.

Proposition 15 (Byzantine Tolerance Bound). With trust-weighted aggregation, correct health estimation is maintained if the total Byzantine trust weight is bounded [7, 8] :

\[ \sum_{i \in \mathcal{B}} T_i < \frac{1}{3} \sum_{i} T_i \]

Correct consensus holds as long as compromised nodes \(\mathcal{B}\) collectively carry less than one-third of total trust weight (theoretical bound — derived from the structure of trust-weighted BFT aggregation; not an empirical measurement).

This generalizes the classical \(f < n/3\) bound: with uniform trust weights \(T_i = 1\), it reduces to \(f < n/3\) (fewer than one third of nodes are Byzantine). With trust decay on suspicious nodes, Byzantine influence decreases over time, allowing tolerance of more compromised nodes provided their accumulated trust is low.

Equal initial weights give an attacker a free \(1/n\) share of influence from the first gossip round; hardware-attested enrollment delays that influence until the device earns it through observed behavior.

This is not foolproof - a sophisticated adversary who understands the aggregation mechanism can craft attacks that pass consistency checks. Byzantine tolerance provides defense in depth, not absolute security.

Bootstrap dependency: Trust weights \(T_i\) require an initialization source. Without a functional PKI at deployment time, the only option is uniform \(T_i = 1\), which reduces Proposition 15 to the classical \(f < n/3\) bound. A Byzantine node that corrupts its weight record before the reputation system accumulates any observations inflates its influence above the \(1/3\) threshold from the first gossip round.

The operational implication: trust weight initialization requires a hardware root of trust — secure boot attestation or a pre-deployment enrollment step that cryptographically binds \(T_i\) to a device identity. Systems without enrollment have no Byzantine tolerance guarantee at startup. Proposition 15 applies only after each node has accumulated sufficient legitimate observations to build a meaningful trust differential (in practice: \(\geq 20\) (illustrative value) gossip exchanges with a given peer).

Trust accumulation attack: The \(f < n/3\) bound is instantaneous. An adversary can compromise nodes gradually, with each behaving honestly until sufficient trust accumulates. When the coalition's accumulated trust weight approaches the one-third threshold, coordinated Byzantine behavior can dominate aggregation before detection triggers trust decay.

Trust budget decay bounds the maximum trust any coalition can accumulate: total system trust decreases over time unless re-earned through verified behavior.

| Condition | Strategy | Complexity | Byzantine tolerance |
|---|---|---|---|
| < 10 nodes, all trusted | Weighted voting (uniform trust) | \(O(n)\) | \(f < n/3\) with equal weights |
| > 10 nodes, mixed trust history | Trust-weighted trimmed mean + reputation | \(O(n \log n)\) | \(f < n/3\), adaptive over ~20 rounds (illustrative value) |
| Adversarial enrollment suspected | Reputation filter + behavioral fingerprint consistency check | \(O(n) + O(w)\) window | \(f < n/3\) guaranteed after consistency filter |

Empirical status: The \(f < n/3\) bound is a proven impossibility result (Lamport-Shostak-Pease). The trust-weighted generalization is analytically derived. The specific parameter values (all marked illustrative) and the 20-round warm-up figure are engineering design choices, not empirically validated values.

Watch out for: the bound is instantaneous — it holds only while Byzantine trust weight remains below \(1/3\) at every moment; if trust weights are not updated continuously, a slow accumulation attack can cross the one-third threshold before the decay mechanism responds.

Optional: Game-Theoretic Extension — Byzantine Reporting as a Signaling Game

Proposition 15’s fraction bound treats Byzantine behavior as a fixed fraction, not a strategic choice. A strategic Byzantine node maximizes its trust weight to amplify its influence.

Signaling game: Each node \(i\) has true health \(h_i\). A Byzantine node selects reported health \(\hat{h}_i\) to maximize detection error. A trust weight that rewards freshness alone is exploitable: a Byzantine node maintaining \(\tau \approx 0\) (frequent fresh reports) achieves maximum trust weight while reporting inverted health values.

The staleness-decay flaw: The current trust model rewards Byzantine nodes who invest in frequent reporting. The corrected trust weight multiplies the staleness factor by a hard consistency indicator — zero weight is assigned whenever the node’s report contradicts neighbor-model predictions, regardless of how fresh the report is.

Reputation update: An EWMA-like update maintains a reputation score \(r_j(t) \in [0,1]\) for each node, blending the previous score (weight \(\alpha\)) with a binary consistency indicator for the current round (weight \(1-\alpha\)):

\[ r_j(t) = \alpha\,r_j(t-1) + (1-\alpha)\,\mathbb{1}\!\left[\,|\hat{h}_j(t) - \tilde{h}_j(t)| \leq \delta\,\right] \]

where \(\tilde{h}_j(t)\) is the prediction from neighbor models and \(\delta\) is the consistency tolerance. Nodes with consistent reports (honest or genuinely healthy) maintain high \(r_j\); Byzantine nodes whose inversions conflict with neighbor cross-validation see \(r_j \to 0\) over time.
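The update is a two-line function; the default \(\alpha = 0.9\) and \(\delta = 0.1\) are illustrative choices, not values from the text.

```python
def update_reputation(r_prev: float, reported: float, predicted: float,
                      alpha: float = 0.9, delta: float = 0.1) -> float:
    """EWMA reputation update: blend the previous score (weight alpha)
    with a binary consistency indicator comparing this round's report
    against the neighbor-model prediction (tolerance delta)."""
    consistent = 1.0 if abs(reported - predicted) <= delta else 0.0
    return alpha * r_prev + (1.0 - alpha) * consistent
```

With \(\alpha = 0.9\), a node that reports inconsistently every round falls from \(r_j = 1\) to below 0.01 in about 50 rounds - gradual enough to forgive isolated sensor glitches, fast enough to silence a sustained inverter.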

Practical implication: Replace staleness-only trust weights with reputation-weighted trust. For OUTPOST’s 127-sensor mesh, this catches both adversarial Byzantine sensors and genuinely malfunctioning sensors without false Byzantine labels - failing sensors produce noisy (not inverted) reports, which are distinguishable from strategic inversion.

Trust Recovery Mechanisms

Trust decay handles misbehaving nodes, but legitimate nodes may be temporarily compromised (e.g., sensor interference, transient fault) and later recover. A purely decaying trust model permanently punishes temporary failures.

Trust recovery model:

Trust evolves according to a mean-reverting process: each round it either decays multiplicatively toward zero on an inconsistent report or recovers toward \(T_{\max}\) on a consistent one:

\[
T_i(t+1) = \begin{cases}
(1 - \kappa_d)\,T_i(t) & \text{inconsistent report} \\
T_i(t) + \kappa_r\,(T_{\max} - T_i(t)) & \text{consistent report}
\end{cases}
\]

where \(\kappa_d\) (fast decay, illustrative value) and \(\kappa_r \ll \kappa_d\) (slow recovery, illustrative value) control the respective speeds. The asymmetry ensures that building trust takes longer than losing it — appropriate for contested environments.

Recovery conditions:

Trust recovery does not begin immediately after one good report; a node becomes eligible only when its consistency fraction over the recent window \(W\) exceeds the threshold \(\theta_\text{rec}\):

\[ \frac{1}{W} \sum_{t'=t-W+1}^{t} \mathbb{1}[\text{consistent}_{t'}] \geq \theta_\text{rec} \]

where \(W\) is typically 50–100 gossip rounds (illustrative value) and \(\theta_\text{rec} = 0.95\) (illustrative value). A node with even 5% inconsistent reports continues decaying.
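One round of the asymmetric dynamics can be sketched as follows; \(\kappa_d = 0.2\) and \(\kappa_r = 0.02\) are illustrative values chosen only to exhibit the fast-loss, slow-gain asymmetry.

```python
def step_trust(T: float, consistent: bool,
               kappa_d: float = 0.2, kappa_r: float = 0.02,
               T_max: float = 1.0) -> float:
    """One round of mean-reverting trust: multiplicative decay toward 0
    on an inconsistent report, slow recovery toward T_max on a
    consistent one. kappa_r << kappa_d makes trust slow to build and
    fast to lose."""
    if consistent:
        return T + kappa_r * (T_max - T)
    return (1.0 - kappa_d) * T
```

With these values a single bad round at full trust costs 0.2, while a good round at half trust earns only 0.01 - an adversary cycling between attack and good behavior loses trust on net, which is the CONVOY example's point.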

Sybil attack resistance:

An adversary creating multiple fake identities (Sybil attack) can attempt to dominate the trust-weighted aggregation. Four countermeasures compose the defense. Identity binding requires nodes to prove identity through cryptographic challenge-response or physical attestation (GPS position consistency over time). Trust inheritance limits ensure new nodes start with \(T_\text{new} = \beta \cdot T_\text{sponsor}\), where \(\beta < 0.5\), so no node can spawn high-trust children; the minimum inheritance bound \(\beta \geq 0.05\) (illustrative value) prevents a sponsoring node from setting \(\beta \approx 0\) as a denial-of-service against fleet expansion, placing the usable range at \(\beta \in [0.05, 0.5)\) (illustrative value). A global trust budget caps the sum of trust scores across all nodes:

\[ \sum_i T_i(t) \leq T_\text{budget} \]

New node admission therefore requires either trust redistribution or explicit authorization. Behavioral clustering groups nodes with suspiciously correlated behavior (same reports, same timing) and caps their combined trust at the maximum of any single member — not the sum — preventing a coalition of colluders from accumulating more influence than one honest node:

\[ \sum_{i \in \mathcal{C}} T_i \leq \max_{i \in \mathcal{C}} T_i \]
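The behavioral-clustering cap can be applied by splitting the strongest member's trust evenly across the flagged cluster; the even split is our choice of enforcement, the max-not-sum invariant is from the text.

```python
def cap_cluster_trust(trusts: dict, clusters: list) -> dict:
    """Cap each flagged cluster's combined trust at the trust of its
    single strongest member, split evenly, so colluders cannot
    out-vote one honest node. Unclustered nodes are untouched."""
    capped = dict(trusts)
    for members in clusters:
        ceiling = max(trusts[m] for m in members)   # max, not sum
        share = ceiling / len(members)
        for m in members:
            capped[m] = min(trusts[m], share)
    return capped
```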

Trust recovery example:

CONVOY vehicle V3 experiences temporary GPS interference causing inconsistent position reports for 10 minutes, dropping trust from 1.0 to 0.35. After interference clears, trust rises to 0.42 in the first 5 minutes of consistent reports, to 0.58 over the following 10 minutes, to 0.78 by 30 minutes, and returns to 0.95 after 1 hour of consistency. The slow recovery prevents adversaries from rapidly cycling between attack and “good behavior” phases.

Behavioral Fingerprinting and Proof of Useful Work

Trust decay and cross-validation (above) detect nodes that report inconsistent health values. They do not detect a more sophisticated adversary: a node that reports plausible health values — passing every consistency check — while its anomaly detector has been disabled, replaced with a random-number generator, or is producing outputs uncalibrated to actual sensor data. A heartbeat proves the node is alive; cross-validation proves the node is reporting plausibly; neither proves the node is doing useful work.

Definition 28 (Behavioral Fingerprint). The behavioral fingerprint of node \(i\) over observation window \([t - w, t]\) is the tuple \((\mathcal{F}_i, \mathcal{K}_i, \mathcal{R}_i)\): the empirical distribution of its anomaly scores, its spatial correlation signature with neighbors, and its record of protocol actions.

Proof of Useful Work (KS test on anomaly score distribution). Node \(i\) passes the fingerprint test if the Kolmogorov-Smirnov statistic \(D_w\) is below the critical value:

\[ D_w = \sup_z \left| \hat{F}_{i,w}(z) - \Phi(z) \right| < D_{\alpha,w} \]

where \(\hat{F}_{i,w}\) is the empirical CDF of node \(i\)'s anomaly scores over the window, \(\Phi\) is the standard normal CDF, and \(D_{\alpha,w}\) is the KS critical value at significance \(\alpha\) with window \(w\).

What each component catches:

| Adversary behavior | \(\mathcal{F}_i\) signature | \(\mathcal{K}_i\) signature | Detected? |
|---|---|---|---|
| Dead detector (always \(z=0\)) | \(D_w \approx 0.5\) (point mass at 0) | — | Yes — \(\mathcal{F}_i\) fails KS |
| Frozen detector (constant \(z\)) | \(D_w \approx 0.5\) | — | Yes — \(\mathcal{F}_i\) fails KS |
| Spoofed: fake \(\mathcal{N}(0,1)\) draws | \(D_w \approx 0\) — passes | \(\mathcal{K}_i\) wrong | Yes — \(\mathcal{K}_i\) fails |
| Calibrated Byzantine (inverted) | \(D_w \approx 0\) | \(\mathcal{K}_i\) matches | No — actions \(\mathcal{R}_i\) inconsistent |
| Genuine useful work | \(D_w\) small | \(\mathcal{K}_i\) matches | Passes all three |

To pass both (genuine normal distribution) and (correct spatial correlation with all neighbors), an adversary must either run a genuinely calibrated detector on actual sensor data — which is the useful work required — or learn the full spatial correlation structure for all neighbors and generate synthetic correlated noise, which requires communicating with all neighbors continuously, making the adversary detectable via traffic analysis and increasing their energy expenditure above Definition 2 ’s threshold.

Physical translation: A node that is genuinely running an anomaly detector on real sensor data will produce anomaly scores that look like a standard normal distribution (by construction of the Z-score) and will be spatially correlated with neighbors measuring the same physical environment. A node that is dead, frozen, or generating synthetic scores will fail one of these three checks within 30 minutes of data collection. The key practical consequence: this fingerprint catches malfunctioning sensors that accidentally mimic Byzantine behavior — frozen sensors, dead sensors, calibration drift — without requiring any dedicated Byzantine detection protocol.

Connection to Definition 27 ( Byzantine Node): Definition 27 identifies Byzantine nodes as those that may deviate arbitrarily from the protocol. Prop 6’s trust bound stops them from dominating aggregation. Definition 28 adds a third line of defense that complements both: it is triggered not by what a node reports but by whether the reporting process itself is consistent with genuine sensor-coupled inference. A Byzantine node that understands and avoids Prop 6’s trust threshold can still be caught by the fingerprint’s spatial correlation test, provided the physical environment is not under the adversary’s full control.

OUTPOST calibration: Window \(w = 1800\) samples (30 minutes at 1 Hz), \(\alpha = 0.01\) — giving \(D_{\alpha,w} \approx 0.038\). For a dead detector: \(D_w = 0.5 \gg 0.038\) — detected in 30 minutes. For a fake-\(\mathcal{N}(0,1)\) generator: the KS test passes, but the observed neighbor correlation deviates from the expected thermal correlation between adjacent sensors — detected by a Fisher z-test on the correlation difference within the same window.
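The KS check can be reproduced with a few lines of standard-library Python; `ks_statistic` and `ks_critical` are straightforward implementations of the two-sided test against the standard normal CDF, using the asymptotic critical-value approximation:

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ks_statistic(scores):
    """Sup distance between the empirical CDF of scores and the standard
    normal CDF (the D_w statistic of the fingerprint test)."""
    xs = sorted(scores)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        phi = normal_cdf(x)
        d = max(d, abs((i + 1) / n - phi), abs(i / n - phi))
    return d

def ks_critical(alpha, w):
    """Asymptotic two-sided critical value sqrt(-ln(alpha/2) / (2w))."""
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * w))

w, alpha = 1800, 0.01
print(round(ks_critical(alpha, w), 3))   # 0.038 — matches the calibration
print(ks_statistic([0.0] * w))           # 0.5 — dead detector fails KS
```

A genuinely calibrated detector fails this test only at the \(\alpha = 1\%\) false-alarm rate by construction.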

Fingerprint Warm-Up Quarantine

Any node that re-enters the fleet after a partition is placed in Provisional Status for a minimum of \(T_\text{fp,window}\) before its gossip messages are admitted to fleet consensus. During Provisional Status: (1) the node’s messages are forwarded as read-only advisory input — visible to peers but not updating the Welford estimator or gossip health vectors; (2) the rehabilitation gate ( Trust-Root Anchor protocol) applies concurrently, requiring \(r\) consecutive clean observations; (3) both conditions must be satisfied before full fleet membership is restored.

For OUTPOST, the combined entry barrier is 30 minutes regardless of tick rate.

Physical translation: A returning node earns read-only observer status immediately but cannot influence fleet decisions for 30 minutes. This is the same logic a hospital uses for a patient returning from isolation — they can speak and be heard, but cannot handle shared equipment until cleared.

Federated Learning for Distributed Health Models

Individual nodes learn anomaly detection models from local data. But local data is limited - each node sees only its own failures and operating conditions. Federated learning enables fleet-wide model improvement without centralizing sensitive telemetry data.

The Federated Learning Problem for Edge Health:

(Notation: in this section and throughout federated learning, \(\theta\) and \(\theta_k\) denote ML model parameter vectors — weights of the anomaly detection model. This is distinct from \(\theta^*\) in Propositions 9–19 , where it denotes the scalar anomaly detection threshold. Context distinguishes the two; the subscript \(\theta_k\) (node-local model) vs. \(\theta^*\) (optimal threshold) disambiguates when both appear.)

Traditional ML requires centralized data. The objective below finds model parameters that minimize total loss across all \(n\) labeled samples \((x_i, y_i)\) pooled in one place:

\[ \min_{\theta} \; \mathcal{L}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f_\theta(x_i),\, y_i\big) \]

Federated learning distributes the same optimization across \(K\) nodes so that only gradient updates — not raw telemetry — travel over the network. Each node \(k\) minimizes its own local loss over its private dataset \(D_k\) of size \(n_k\), and the contributions are weighted by dataset size to approximate the global optimum:

\[ \mathcal{L}_k(\theta) = \frac{1}{n_k} \sum_{(x_i, y_i) \in D_k} \ell\big(f_\theta(x_i),\, y_i\big), \qquad \mathcal{L}(\theta) = \sum_{k=1}^{K} \frac{n_k}{n}\, \mathcal{L}_k(\theta) \]

Each node \(k\) computes local gradients; only gradients (not data) are shared.

Federated Averaging (FedAvg) for Edge Deployment: The diagram shows one complete round — server broadcast, parallel local SGD on each node, then weighted aggregation back to the server; data never leaves each node, only gradient updates travel.

    
    graph TD
    subgraph "Round t"
        S1["Server broadcasts
global model theta_ml(t)"] C1["Node 1: Local SGD
theta_ml1(t+1) = theta_ml(t) - eta*dL1"] C2["Node 2: Local SGD
theta_ml2(t+1) = theta_ml(t) - eta*dL2"] C3["Node K: Local SGD
theta_mlk(t+1) = theta_ml(t) - eta*dLk"] A1["Server aggregates
theta_ml(t+1) = sum(nk/n)*theta_mlk(t+1)"] end S1 --> C1 S1 --> C2 S1 --> C3 C1 --> A1 C2 --> A1 C3 --> A1 A1 --> S2["Round t+1"] style S1 fill:#e3f2fd,stroke:#1976d2 style A1 fill:#e3f2fd,stroke:#1976d2 style C1 fill:#e8f5e9,stroke:#388e3c style C2 fill:#e8f5e9,stroke:#388e3c style C3 fill:#e8f5e9,stroke:#388e3c

Read the diagram: The server broadcasts the current global model to all nodes (blue). Each node runs several steps of local gradient descent on its own private data (green) — raw telemetry never leaves the node. Nodes return only their updated weights. The server aggregates these weighted by dataset size and produces a better global model for the next round. Raw sensor data stays local; only model weights travel.

Adaptation for contested connectivity:

Standard FedAvg assumes synchronous communication — all nodes participate in each round. Edge systems require asynchronous federated learning with three key adaptations. Each round includes only nodes with connectivity; the formula below aggregates the available subset \(S_t\) using data-size weights \(n_k\), so the update is still an unbiased estimator of the full-data gradient when participation is random:

\[ \theta(t+1) = \sum_{k \in S_t} \frac{n_k}{\sum_{j \in S_t} n_j}\, \theta_k(t+1) \]

where \(S_t\) is the set of participating nodes in round \(t\). Nodes may also contribute gradients computed from stale global models; the staleness-discounted weight \(w_k\) reduces a node’s contribution exponentially with the number of rounds \(\tau_k\) elapsed since its last synchronization:

\[ w_k \propto n_k\, \gamma_{\text{fed}}^{\tau_k} \]

where \(\tau_k\) is the model staleness (rounds since last sync) and \(\gamma_{\text{fed}} \in (0.9, 0.99)\) is the per-round staleness discount for federated gradient aggregation. (We write \(\gamma_{\text{fed}}\) rather than bare \(\gamma\) to distinguish from \(\gamma_s\), the staleness decay rate in the gossip weight formula, and from the Holt-Winters seasonal coefficient \(s_{\text{hw}}\).) For large fleets, two-level hierarchical aggregation reduces coordination: during partition, each connected cluster performs intra-cluster aggregation independently, and cross-cluster aggregation resumes upon reconnection with staleness weighting. The diagram below shows the two-level structure: individual nodes feed cluster aggregators, which in turn feed a fleet-level aggregator that produces the global model \(\theta^*\).
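A sketch of the staleness-discounted aggregation step, assuming plain Python lists as parameter vectors (function and variable names are illustrative):

```python
def aggregate(global_model, updates, gamma_fed=0.95):
    """Staleness-discounted FedAvg over the participating subset S_t.

    updates: list of (theta_k, n_k, tau_k) — local weights, dataset size,
             and rounds since the node last synced. Nodes absent from S_t
             simply do not appear. Weight ~ n_k * gamma_fed**tau_k, normalized.
    """
    if not updates:
        return global_model                  # empty round: keep current model
    raw = [n_k * gamma_fed ** tau_k for _, n_k, tau_k in updates]
    total = sum(raw)
    dim = len(global_model)
    new = [0.0] * dim
    for (theta_k, _, _), w in zip(updates, raw):
        for i in range(dim):
            new[i] += (w / total) * theta_k[i]
    return new

# A fresh node (tau = 0) dominates an equally-sized stale one (tau = 10).
m = aggregate([0.0], [([1.0], 100, 0), ([-1.0], 100, 10)])
print(m[0] > 0)     # True — the staleness discount favors the fresh update
```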

    
    graph TD
    subgraph "Local Clusters"
        N1["Node 1"] --> L1["Cluster 1
Aggregator"] N2["Node 2"] --> L1 N3["Node 3"] --> L2["Cluster 2
Aggregator"] N4["Node 4"] --> L2 end subgraph "Fleet Level" L1 --> G["Fleet
Aggregator"] L2 --> G end G --> M["Global Model
theta*"] style L1 fill:#fff3e0,stroke:#f57c00 style L2 fill:#fff3e0,stroke:#f57c00 style G fill:#e3f2fd,stroke:#1976d2

Read the diagram: During partition, each cluster runs its own local aggregation loop independently (orange nodes). When connectivity returns, cluster-level models flow up to the fleet aggregator (blue), which produces the global model \(\theta^*\). Cluster-internal updates continue at full gossip rate even when the fleet-level link is severed.

CONVOY Federated Anomaly Detection:

12 vehicles, each with local autoencoder (280 bytes). Federated learning improves detection by pooling training data without centralization.

Convergence analysis under the standard FedAvg assumptions:

The table below shows how the expected loss evolves with round count: it decreases at an \(O(1/t)\) rate toward the global optimum \(\mathcal{L}^*\), with a residual term \(O(\sigma^2/K)\) that shrinks as more nodes participate.

| Round \(t\) | Expected Loss | Convergence Bound |
|---|---|---|
| 0 | \(\mathcal{L}_0\) | Baseline (local only) |
| \(t\) | \(\mathcal{L}_t\) | \(\mathcal{L}_t - \mathcal{L}^* = O(1/t) + O(\sigma^2/K)\) |
| \(\infty\) | \(\mathcal{L}^*\) | Optimal for aggregated data |

Utility improvement derivation: The net gain from federated learning decomposes into two terms — the recall improvement from pooling training data across all nodes, minus the gradient-communication cost over \(T\) rounds.

\(\Delta U > 0\) because:

  1. Effective training set size increases from \(n_k\) to \(n = \sum_k n_k\), reducing generalization error from \(O(1/\sqrt{n_k})\) to \(O(1/\sqrt{n})\)
  2. Communication cost \(\approx 560\) bytes/round (illustrative value) is negligible vs. detection value

Communication efficiency: Model updates require \(280 \times 2 = 560\) bytes (illustrative value) per round. For connectivity probability \(p_c = 0.1\), the expected number of completed rounds per vehicle-day scales linearly with \(p_c\).

Differential Privacy for Sensitive Telemetry:

Vehicle telemetry may reveal sensitive information (location patterns, operational tactics). Differential privacy adds Gaussian noise scaled by the gradient clipping bound \(C\) and privacy parameter \(\sigma\) to each gradient \(g_k\) before it is shared, producing a noisy gradient \(\tilde{g}_k\) that prevents inference of individual data points:

\[ \tilde{g}_k = \frac{g_k}{\max\left(1,\, \|g_k\|_2 / C\right)} + \mathcal{N}\big(0,\, \sigma^2 C^2 \mathbf{I}\big) \]

where \(C\) is the gradient clipping threshold and \(\sigma\) is calibrated for \((\epsilon, \delta)\)-differential privacy.
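The clip-and-noise step is the standard Gaussian mechanism; a minimal sketch (function name and defaults are illustrative):

```python
import math
import random

def sanitize(grad, C=1.0, sigma=1.0, rng=None):
    """Clip the gradient to L2 norm C, then add per-coordinate Gaussian
    noise of std sigma*C (the Gaussian mechanism for (eps, delta)-DP)."""
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(g * g for g in grad))
    scale = 1.0 / max(1.0, norm / C)        # clip: ||g~||_2 <= C
    return [g * scale + rng.gauss(0.0, sigma * C) for g in grad]

g = sanitize([3.0, 4.0], C=1.0, sigma=0.0)  # norm 5 -> clipped to norm 1
print([round(x, 2) for x in g])             # [0.6, 0.8]
```

With `sigma = 0` the mechanism reduces to pure clipping, which is why the clipped direction is visible in the example.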

Privacy-utility tradeoff derivation: For \((\epsilon, \delta)\)-differential privacy with the Gaussian mechanism, the required noise multiplier \(\sigma\) is determined by the privacy budget \(\epsilon\) and the failure probability \(\delta\):

\[ \sigma \geq \frac{\sqrt{2 \ln(1.25/\delta)}}{\epsilon} \]

Convergence-privacy relationship: The number of training rounds the noisy private model needs, relative to the baseline, grows with the noise level \(\sigma\) and the model dimension \(d\) relative to the gradient magnitude.

The table below evaluates the convergence-privacy trade-off at four privacy levels, from no noise (\(\epsilon = \infty\)) to strong privacy (\(\epsilon = 0.1\)); the Utility Bound column shows how detection accuracy degrades as noise \(\sigma\) increases relative to gradient magnitude.

| \(\epsilon\) | Noise \(\sigma\) | Slowdown Factor | Utility Bound |
|---|---|---|---|
| \(\infty\) | 0 | \(1\times\) | \(U^*\) (optimal) |
| 10 | \(O(0.1C)\) | \(\approx 1.5\times\) | \(U^* - O(\sigma^2/n)\) |
| 1 | \(O(C)\) | — | \(U^* - O(d\sigma^2/n)\) |
| 0.1 | \(O(10C)\) | \(>10\times\) | Utility-dominated by noise |

For tactical systems, \(\epsilon = 10\) (illustrative value) balances privacy (gradient direction obscured) against utility (convergence within \(1.5\times\) (illustrative value) baseline).

Personalization Layers:

Fleet-wide models capture common failure patterns, but node-specific baselines and environmental offsets require local adaptation. Node \(k\)’s prediction function therefore decomposes into a federally trained shared component and a locally maintained residual parameterized by \(\theta_k\).

The federated layers handle generic feature extraction and common failure patterns; the local (non-shared) layers handle node-specific baselines and environmental adaptation.

For CONVOY autoencoders:

Handling Non-IID Data:

Edge nodes experience different operating conditions (non-IID data). CONVOY vehicles in mountain terrain see different failure modes than those in desert. Three strategies address this. FedProx augments the local loss with a proximal penalty that pulls node \(k\)’s parameters \(\theta\) toward the current global model \(\theta(t)\), with \(\mu\) controlling how tightly local updates are anchored:

\[ \min_{\theta} \; \mathcal{L}_k(\theta) + \frac{\mu}{2} \left\| \theta - \theta(t) \right\|^2 \]
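One FedProx local gradient step can be sketched as follows; the step size `eta` and the toy values are illustrative:

```python
def fedprox_step(theta, theta_global, local_grad, mu=0.1, eta=0.01):
    """One local SGD step on the FedProx objective: the local-loss gradient
    plus the proximal term mu * (theta - theta_global), which anchors local
    updates to the current global model."""
    return [t - eta * (g + mu * (t - tg))
            for t, tg, g in zip(theta, theta_global, local_grad)]

# With zero local gradient, the update is a pure proximal pull toward global.
theta = fedprox_step([1.0], [0.0], [0.0], mu=1.0, eta=0.5)
print(theta)    # [0.5]
```

Larger `mu` means a non-IID node cannot drift far from the fleet consensus model in a single round.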

Clustered Federated Learning identifies node clusters with similar data distributions and federates only within each cluster. The diagram below shows three terrain-based clusters for CONVOY vehicles — mountain, desert, and urban — each maintaining its own shared model while nodes within a cluster exchange gradient updates with each other.

    
    graph LR
    subgraph "Cluster A (Mountain)"
        V1["V1"]
        V3["V3"]
        V7["V7"]
    end

    subgraph "Cluster B (Desert)"
        V2["V2"]
        V5["V5"]
        V9["V9"]
    end

    subgraph "Cluster C (Urban)"
        V4["V4"]
        V6["V6"]
        V8["V8"]
    end

    V1 <--> V3 <--> V7
    V2 <--> V5 <--> V9
    V4 <--> V6 <--> V8

    CA["Model A"] --- V1
    CB["Model B"] --- V2
    CC["Model C"] --- V4

    style CA fill:#e3f2fd,stroke:#1976d2
    style CB fill:#fff3e0,stroke:#f57c00
    style CC fill:#e8f5e9,stroke:#388e3c

Read the diagram: Vehicles operating in similar terrain — mountain, desert, urban — form separate federated clusters. Within each cluster, gradient updates flow between peers (double-headed arrows). Each cluster maintains a distinct shared model calibrated to its terrain’s failure modes. This prevents a vehicle tuned for mountain pass conditions from contaminating the model for urban delivery routes.

Multi-task learning treats each node as a related task, sharing representations while allowing task-specific outputs.

Convergence Guarantees Under Partition:

The bound below gives the expected squared gradient norm after \(T\) rounds of asynchronous federated learning as a function of the participation rate \(p\) and maximum staleness \(\tau_{\text{rounds}}\) — written \(\tau_{\text{rounds}}\) rather than \(\tau_{\max}\) to distinguish it from the maximum useful staleness bound in physical seconds ( Proposition 14 ), which counts wall-clock seconds rather than synchronization rounds; both terms decrease with \(T\), confirming that convergence is guaranteed whenever staleness is bounded:

\[ \min_{t \leq T} \mathbb{E}\left\| \nabla \mathcal{L}(\theta_t) \right\|^2 \leq O\!\left(\frac{1}{\sqrt{pT}}\right) + O\!\left(\frac{\tau_{\text{rounds}}^2}{T}\right) \]

Key insight: convergence slows but is still guaranteed if staleness is bounded. For CONVOY with \(\tau_{\text{rounds}} = 5\) rounds (illustrative value) and \(p = 0.6\) (illustrative value), convergence to \(\epsilon = 0.05\) (illustrative value) gradient norm requires \(T \approx 30\) rounds (illustrative value) — achievable within one operational month.

Under the Weibull connectivity model ( Definition 13 , Why Edge Is Not Cloud Minus Bandwidth), the expected participation rate \(p\) follows from \(F_W\), the Weibull CDF evaluated at the tick interval, and maximum round-staleness \(\tau_{\text{rounds}}\) is bounded by the Weibull P95 partition duration expressed in synchronization rounds. Denser connectivity (lower \(T_{\text{acc}}\)) yields higher \(p\) and lower \(\tau_{\text{rounds}}\), improving convergence.

Cognitive Map: Gossip is the protocol that turns per-node anomaly scores into fleet-wide health awareness — without any central server. The convergence guarantee degrades gracefully from logarithmic-round convergence in ideal meshes to substantially slower convergence in contested sparse ones. Staleness imposes a hard deadline \(\tau_{\max}\) on every health observation (physical seconds; Proposition 14 ); Byzantine tolerance requires the compromised fraction of trust weight to stay below \(1/3\). Federated learning completes the picture: nodes improve their local detectors together without sharing raw telemetry. Next: what measurements matter most, and in what order, when resources are scarce.


The Observability Constraint Sequence

Hierarchy of Observability

With limited resources, what should be measured first?

The observability constraint sequence prioritizes metrics by importance, ensuring that the most survival-critical information is collected first whenever resources are scarce. The table below lists the five priority levels from P0 (fundamental liveness) to P4 (root-cause diagnosis), together with representative metrics and their resource cost tier.

| Level | Category | Examples | Resource Cost |
|---|---|---|---|
| P0 | Availability | Is it alive? Responding? | Minimal (heartbeat) |
| P1 | Resource exhaustion | Power, memory, storage remaining | Low (counters) |
| P2 | Performance degradation | Latency, throughput, error rates | Medium (aggregates) |
| P3 | Anomaly patterns | Unusual behavior, drift | Medium-High (models) |
| P4 | Root cause indicators | Why is it behaving this way? | High (correlation) |

A well-resourced system implements all levels. A constrained system implements as many as resources allow, starting from P0.

Resource Budget for Observability

Observability competes with the primary mission for resources. The hard budget constraint below states that combined observability cost and mission cost must not exceed the device’s total available resources:

\[ C_{\text{obs}} + C_{\text{mission}} \leq R_{\text{total}} \]

where \(C_{\text{obs}}\) is the resource cost of the observation layer, \(C_{\text{mission}}\) the cost of primary mission function, and \(R_{\text{total}}\) the device’s total resource budget.

The objective below selects the allocation that maximizes the combined value of mission output and health-knowledge quality, subject to that budget constraint:

\[ \max_{C_{\text{obs}},\, C_{\text{mission}}} \; V_{\text{mission}}(C_{\text{mission}}) + V_{\text{health}}(C_{\text{obs}}) \quad \text{subject to} \quad C_{\text{obs}} + C_{\text{mission}} \leq R_{\text{total}} \]

Typically, mission value exhibits diminishing returns — more resources yield proportionally less capability — while health value has threshold effects: below a minimum allocation health knowledge becomes useless, and above it marginal gains decrease.

The optimal allocation gives sufficient resources to observability for reliable health knowledge, then allocates remainder to mission.

OUTPOST allocation example: a 15% (illustrative value) observability overhead enables reliable self-measurement while preserving 85% (illustrative value) of resources for mission function.

Commercial Application: PREDICTIX Manufacturing Monitoring

PREDICTIX monitors CNC machines in aerospace manufacturing. Each machine generates continuous telemetry: spindle vibration, coolant temperature, tool wear, power consumption. Challenge: detect failures before costly component defects, with unreliable plant floor connectivity. Machines must self-diagnose during disconnection.

The observability constraint sequence for PREDICTIX maps directly to manufacturing priorities:

| Level | Category | PREDICTIX Implementation | Business Impact |
|---|---|---|---|
| P0 | Machine alive | Heartbeat every 5s, cycle counter incrementing | Production stoppage detection |
| P1 | Resource exhaustion | Coolant level, tool life remaining, chip bin capacity | Prevent mid-cycle failures |
| P2 | Quality degradation | Surface finish variance, dimensional drift | Catch defects before scrap |
| P3 | Anomaly patterns | Vibration signatures, power profile deviations | Predict failures 2-8 hours ahead |
| P4 | Root cause | Cross-machine correlation, supply chain integration | Systemic issue identification |

Local anomaly detection architecture: Each edge controller routes its four sensor streams through three parallel detection algorithms that feed a weighted-voting aggregator. The diagram below shows which sensors feed which detectors and how the outputs are fused into a single anomaly score.

    
    graph LR
    subgraph "Sensor Inputs"
        VIB["Vibration
10 kHz sampling"] TEMP["Temperature
1 Hz sampling"] PWR["Power
100 Hz sampling"] DIM["Dimensional
Per-part"] end subgraph "Detection Algorithms" EWMA["EWMA
Fast drift detection
15 KB memory"] HW["Holt-Winters
Shift pattern modeling
48 KB memory"] IF["Isolation Forest
Multivariate anomaly
200 KB memory"] end subgraph "Fusion" AGG["Anomaly Aggregator
Weighted voting"] end VIB --> EWMA TEMP --> EWMA PWR --> HW VIB --> IF TEMP --> IF PWR --> IF DIM --> IF EWMA --> AGG HW --> AGG IF --> AGG style EWMA fill:#e8f5e9,stroke:#388e3c style HW fill:#fff3e0,stroke:#f57c00 style IF fill:#e3f2fd,stroke:#1976d2 style AGG fill:#fce4ec,stroke:#c2185b

Read the diagram: Four sensor streams enter from the left. Vibration and temperature feed EWMA (fast scalar spike detection). Power feeds Holt-Winters (shift pattern modeling with 8-hour and 24-hour seasonal components). All four streams feed the Isolation Forest for correlated multivariate anomalies. The three algorithm outputs feed the weighted-voting aggregator (pink) which produces a single anomaly score for the MAPE-K loop.

Algorithm selection rationale: EWMA covers thermal and vibration baselines, catching sudden shifts in spindle temperature or bearing vibration within 5–10 samples (illustrative value) at 15 KB (illustrative value) for 50 monitored parameters. Holt-Winters covers power consumption, capturing 8-hour shift patterns, 24-hour maintenance cycles, and weekly production planning effects at 48 KB (illustrative value) for 168-hour seasonality. Isolation Forest covers multivariate anomalies — detecting combinations such as normal vibration and normal temperature alongside abnormal power consumption, which signals an imminent bearing seizure — at 200 KB (illustrative value) for a 50-tree ensemble.

Anomaly confidence fusion: The combined score is a weighted sum of the three algorithm outputs, where each weight reflects that algorithm’s relative detection power and score stability for PREDICTIX ’s failure modes.

Weight derivation: Weights are set proportional to each detector’s AUC divided by its score variance (\(w_i \propto \mathrm{AUC}_i / \sigma_i^2\)), rewarding detectors that are both accurate and consistent. Applying this rule to PREDICTIX’s three detectors sets the aggregator weights.

Isolation Forest receives highest weight because multivariate detection (joint anomaly space) has higher AUC than marginal detectors for correlated failure modes.
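The AUC-over-variance weighting rule can be sketched directly; the AUC and variance figures below are hypothetical placeholders, not measured PREDICTIX values:

```python
def fusion_weights(stats):
    """Fusion weight w_i proportional to AUC_i / var_i, normalized to sum
    to 1. stats: dict detector -> (auc, score_variance)."""
    raw = {k: auc / var for k, (auc, var) in stats.items()}
    s = sum(raw.values())
    return {k: v / s for k, v in raw.items()}

w = fusion_weights({
    "ewma":         (0.85, 1.0),   # hypothetical AUC / variance figures
    "holt_winters": (0.80, 0.8),
    "iforest":      (0.93, 0.5),
})
print(max(w, key=w.get))           # iforest — highest AUC, lowest variance
print(round(sum(w.values()), 6))   # 1.0
```

The combined anomaly score is then the dot product of these weights with the three detector outputs.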

Staleness and decision authority: As staleness increases and the cloud becomes unreachable, the edge controller assumes progressively more decision authority. The table below maps each staleness band to the corresponding cloud connectivity state and the scope of decisions the local controller takes autonomously.

| Staleness | Cloud Status | Local Authority |
|---|---|---|
| 0-30s | Connected | Advisory only; cloud makes decisions |
| 30s-5min | Degraded | Local P0-P2 decisions; P3+ queue for cloud |
| 5-30min | Intermittent | Local P0-P3 decisions; aggressive caching |
| >30min | Denied | Full local authority; conservative thresholds |

At 30+ minutes disconnection, the edge controller tightens anomaly thresholds by 20% (illustrative value), accepting more false positives in exchange for fewer missed failures, because the cost of a missed failure without cloud backup is higher than during connected operation.
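The staleness-to-authority mapping reduces to a small lookup; the band edges follow the table, and the 0.8 multiplier encodes the 20% threshold tightening (its placement here is illustrative):

```python
def local_authority(staleness_s):
    """Map cloud-link staleness (seconds) to local decision scope and an
    anomaly-threshold multiplier (tightened 20% in the denied regime)."""
    if staleness_s <= 30:
        return "advisory", 1.0     # connected: cloud decides
    if staleness_s <= 300:
        return "P0-P2", 1.0        # degraded: queue P3+ for cloud
    if staleness_s <= 1800:
        return "P0-P3", 1.0        # intermittent: aggressive caching
    return "full", 0.8             # denied: full authority, tighter thresholds

print(local_authority(45))      # ('P0-P2', 1.0)
print(local_authority(3600))    # ('full', 0.8)
```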

Cross-machine gossip: Machines on the same production line exchange health summaries via local Ethernet every 30 seconds. Each machine’s summary is a three-field tuple capturing current anomaly level, estimated time to next maintenance, and parts produced since last inspection.

The line controller combines per-machine summaries into a single production-line health score that is bottlenecked by the weakest machine’s availability and averaged over all machines’ quality scores.

Line health below threshold triggers automatic workload rebalancing - shifting high-precision parts away from degraded machines.
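A minimal sketch of the line health computation described above — minimum availability times mean quality — with invented machine values:

```python
def line_health(machines):
    """Line health = min availability (the weakest machine bottlenecks the
    line) times the mean quality score across all machines.
    machines: list of (availability, quality) pairs in [0, 1]."""
    avail = min(a for a, _ in machines)
    quality = sum(q for _, q in machines) / len(machines)
    return avail * quality

h = line_health([(0.99, 0.97), (0.70, 0.95), (0.98, 0.96)])
print(round(h, 3))    # 0.672 — one degraded machine bottlenecks the line
```

Comparing this score against the rebalancing threshold is what triggers shifting high-precision parts away from degraded machines.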

Utility analysis:

System utility \(U\) is production value minus the three cost categories that self-measurement can reduce: unplanned downtime, scrapped parts, and inspection overhead.

Utility improvement from self-measurement: The net economic gain decomposes into three terms — downtime avoided, scrap avoided, and false-alarm inspection cost — so the break-even condition for each cost type can be evaluated separately.

\(\Delta U > 0\) when the value of detected failures exceeds the inspection cost of false alarms.

For manufacturing where detection value far exceeds false-alarm cost (as for high-value components), even moderate recall (\(R > 0.8\)) with a false positive rate below 0.1 yields \(\Delta U > 0\). The observability constraint sequence delivers economic value when detection value exceeds false alarm cost.


RAVEN Self-Measurement Protocol

The RAVEN drone swarm requires self-measurement at two levels: individual drone health and swarm-wide coordination state.

Per-Drone Local Measurement

Each drone continuously monitors:

Power State

Sensor Health

Link Quality

Mission Progress

EWMA tracking on each metric with \(\alpha = 0.1\) (illustrative value; roughly a 10-second effective memory at 1 Hz sampling). Anomaly threshold at \(3\sigma\) (illustrative value) for critical metrics (power, flight controls), \(2\sigma\) (illustrative value) for secondary metrics (sensors, links).

Swarm-Wide Health Inference

Gossip protocol parameters:

Relationship: The staleness threshold (30s) marks where data begins degrading meaningfully - decisions based on 30s-old data have ~90% confidence. The maximum useful staleness (60s) marks where confidence falls below 50% - beyond this, the data provides little more than a guess. These are design parameters chosen for the RAVEN mission envelope; from Proposition 14 , \(\tau_{\max}\) scales as \((\sigma/\Delta h)^2\) so tightening decision margins reduces useful staleness rapidly.

Health vector per drone contains:

Merge function uses timestamp-weighted average for numeric values, latest-timestamp-wins for categorical values. In contested environments where clock drift is measurable, replace wall-clock LWW with HLC -aware merge ( Definition 62 , Fleet Coherence Under Partition) using the Hybrid Logical Clock ordering to determine recency; node-ID tiebreakers resolve simultaneous HLC values.
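The merge rule can be sketched as follows; the health-vector layout (field → (value, timestamp)) and the timestamp-weighting scheme are simplifications for illustration:

```python
def merge(local, remote):
    """Merge two health-vector entries field by field: timestamp-weighted
    average for numerics, latest-timestamp-wins for categoricals.
    Each entry maps field -> (value, timestamp)."""
    out = {}
    for f in set(local) | set(remote):
        if f not in local:
            out[f] = remote[f]
            continue
        if f not in remote:
            out[f] = local[f]
            continue
        (va, ta), (vb, tb) = local[f], remote[f]
        if isinstance(va, (int, float)) and not isinstance(va, bool):
            w = ta / (ta + tb)                 # newer timestamp, more weight
            out[f] = (w * va + (1 - w) * vb, max(ta, tb))
        else:
            out[f] = (va, ta) if ta >= tb else (vb, tb)
    return out

m = merge({"battery": (0.8, 100), "mode": ("patrol", 100)},
          {"battery": (0.6, 300), "mode": ("rtb", 300)})
print(m["mode"][0])    # rtb — latest timestamp wins for categoricals
```

In the HLC variant mentioned above, the wall-clock timestamps would be replaced by Hybrid Logical Clock values with node-ID tiebreakers.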

Convergence guarantees: With logarithmic propagation dynamics, fleet-wide health convergence occurs within 30–40 seconds (illustrative value) — fast enough to track operational state changes while remaining robust to individual message losses.

Anomaly Detection and Self-Diagnosis

Cross-sensor correlation matrix maintained locally. Example correlations:

Self-diagnosis follows a structured decision process, mapping each combination of anomaly type and cross-sensor correlation to the most likely failure cause and the recommended autonomous response.

Observation PatternDiagnosisAction
Power anomaly with neighbors unaffected or recent maneuverLocal power issueReduce power consumption, report to swarm
Sensor anomaly with cross-sensor consistencyEnvironmental conditionContinue with degraded confidence
Sensor anomaly with cross-sensor inconsistencySensor failureDisable sensor, rely on alternatives
Communication anomaly affecting multiple neighborsEnvironmental interference or jammingIncrease transmit power, switch frequencies
Communication anomaly affecting only selfLocal radio failureAttempt radio restart, fall back to minimal beacon

The diagnosis is probabilistic - the table represents the most likely paths, but confidence levels are maintained throughout.


CONVOY Self-Measurement Protocol

The CONVOY ground vehicle network operates with different constraints: vehicles have more resources than drones but face different failure modes.

Convoy-Level Health Inference

Hierarchical aggregation runs in two modes: in primary mode the lead vehicle collects health from all vehicles, computes the aggregate, and distributes a summary; if the lead is unreachable, the convoy falls back to peer-to-peer gossip among reachable vehicles.

Lead vehicle aggregation:

Fallback gossip parameters:

Anomaly Detection Focus

Position spoofing detection:

Each vehicle tracks its own position via GPS, INS, and dead reckoning, and also receives claimed positions from neighbors. The discrepancy below measures how far vehicle \(i\)’s claimed position deviates from where neighbor \(j\) independently observed it; a large discrepancy across multiple neighbors indicates spoofing:

\[ d_{ij} = \left\| x_i^{\text{claimed}} - \hat{x}_i^{(j)} \right\| \]

If \(d_{ij}\) exceeds threshold for vehicle \(i\) as observed by multiple neighbors \(j\), vehicle \(i\) is flagged for position anomaly.

Vehicle \(i\) is flagged as potentially spoofed if \(\geq k\) neighbors (\(k = \lceil n/3 \rceil\)) each independently report \(d_{ij} > d_{\max}\). A suitable threshold is \(d_{\max} = v_{\max}\, T_{\text{gossip}}\) — the maximum distance a vehicle can travel during one gossip period — since any discrepancy larger than this cannot be explained by legitimate movement.
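The quorum rule can be sketched as follows; the convoy size, maximum speed, and gossip period are invented example values:

```python
import math

def flag_spoof(discrepancies, n_neighbors, v_max, t_gossip):
    """Flag vehicle i if >= ceil(n/3) neighbors each observe a position
    discrepancy larger than v_max * t_gossip, the maximum legitimate
    travel in one gossip period. discrepancies: per-neighbor values (m)."""
    d_max = v_max * t_gossip
    k = math.ceil(n_neighbors / 3)
    return sum(1 for d in discrepancies if d > d_max) >= k

# 12-vehicle convoy, 20 m/s max speed, 5 s gossip period -> d_max = 100 m
print(flag_spoof([450, 380, 410, 520, 90], 12, 20.0, 5.0))
# True — four neighbors exceed d_max, meeting the k = ceil(12/3) = 4 quorum
```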

Communication anomaly classification:

Distinguishing jamming from terrain effects requires examining correlation patterns: jamming affects all frequencies simultaneously, correlates with adversarial activity, and affects multiple vehicles; terrain effects are path-specific, correlate with geographic features, and are predictable from maps.

Use convoy’s position history to build terrain propagation model. Deviations from model suggest adversarial interference.

Integration with Semi-Markov connectivity model:

From the Semi-Markov connectivity model, the expected sojourn time distributions and embedded transition probabilities are known. A transition is flagged as anomalous when its model-predicted probability falls below threshold, meaning the observed regime change is too abrupt or too frequent to be explained by natural causes.

Unexpectedly rapid transitions from connected to denied suggest adversarial action rather than natural degradation.


OUTPOST Self-Measurement Protocol

The OUTPOST sensor mesh operates with the most extreme constraints: ultra-low power, extended deployment durations (30+ days), and fixed positions that make physical inspection impractical.

Per-Sensor Local Measurement

Each sensor node continuously monitors with minimal power:

Power State

Environmental Monitoring

Proposition 16 (Power-Aware Measurement Scheduling). For a sensor with solar charging profile \(P_{\text{solar}}(t)\) and measurement cost \(C_m\) per measurement, the optimal measurement schedule maximizes information gain while maintaining positive energy margin:

\[ \max_{\{m_t\}} \; \sum_t I(m_t) \quad \text{subject to} \quad B(t) \geq E_{\text{reserve}} \;\; \forall t \]

Measure as often as information justifies, subject to never depleting the battery below a safety reserve.

where \(I(m_t)\) is the information gain from measurement at time \(t\), \(B(t)\) is the battery level under the schedule and solar profile, and \(E_{\text{reserve}}\) is the required energy reserve.

A greedy heuristic — sort measurements by information-gain-per-watt ratio \(I(m)/C_m\) and schedule in decreasing order until the power budget is exhausted — achieves \(O(n \log n)\) computation complexity. The gap to the optimal schedule depends on environmental correlation structure.
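The greedy heuristic is a few lines; the candidate names, gains, and costs below are invented:

```python
def greedy_schedule(candidates, budget):
    """Sort candidate measurements by information gain per joule and take
    them greedily until the energy budget is exhausted (O(n log n)).
    candidates: list of (name, info_gain, cost_joules)."""
    chosen, spent = [], 0.0
    for name, gain, cost in sorted(candidates,
                                   key=lambda c: c[1] / c[2], reverse=True):
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen

sched = greedy_schedule([("seismic", 5.0, 0.1),   # passive: cheap, high ratio
                         ("acoustic", 3.0, 0.2),
                         ("radar",    8.0, 2.0)], budget=0.5)
print(sched)    # ['seismic', 'acoustic'] — radar priced out of the budget
```

As the text notes, the gap between this greedy schedule and the true optimum depends on the environmental correlation structure.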

Physical translation: A sensor reading carries value only if the energy budget allows it to be processed before the battery forces a downgrade. The proposition computes the exact measurement rate that maximizes information gain within the energy envelope — measuring faster when power is abundant and slowing to a trickle in survival mode rather than stopping completely, which would blind the healing loop entirely.

Empirical status: The optimization formulation is exact given the solar profile and cost model \(C_m\). The OUTPOST greedy prioritization (passive seismic through radar) is an illustrative ordering, not derived from a measured information-gain-per-watt experiment. Solar profiles vary by geography, season, and panel orientation.

Watch out for: the schedule is optimal given the forecast solar profile \(P_{\text{solar}}(t)\); forecast error shifts the schedule against the available energy budget, and the positive-energy-margin constraint can be violated mid-mission if actual irradiance falls below the predicted profile.

Observer Parsimony: The Overhead Budget Constraint

Proposition 16 schedules measurements to preserve energy margin. A deeper constraint applies to the autonomic layer itself: the Observer Parsimony Condition. If the monitoring layer consumes more CPU and RAM than the margin separating the current capability level from the next downgrade, the observer becomes the cause of the level transition it exists to prevent. This is the Heisenberg Observation Problem for resource-constrained autonomic systems — the act of observing accelerates the failure being observed.

Definition 29 (Autonomic Overhead Budget). For each capability level \(\mathcal{L}_q\), the overhead budget is a triple \((\pi_q,\, M_q,\, T_q)\) specifying the maximum CPU fraction, RAM reservation, and minimum MAPE-K tick interval permitted to the entire autonomic observation layer:

| Level | Regime | CPU \(\pi_q\) | RAM \(M_q\) | Tick floor \(T_q\) | Permitted algorithm tier |
|---|---|---|---|---|---|
| \(\mathcal{L}_0\) | Survival — complete isolation | 0.1% | 4 KB | 60 s | Q15 fixed-point EWMA + CUSUM only; no FPU; no gossip |
| \(\mathcal{L}_1\) | Minimal mission | 0.5% | 64 KB | 10 s | Q15 fixed-point EWMA + CUSUM; no Kalman; no Isolation Forest |
| \(\mathcal{L}_2\) | Degraded operation | 1.0% | 256 KB | 5 s | Floating-point EWMA; Holt-Winters; no Isolation Forest |
| \(\mathcal{L}_3\) | Normal operation | 2.0% | 1 MB | 1 s | Full scalar Kalman; Isolation Forest sketch (\(\leq 50\) trees) |
| \(\mathcal{L}_4\) | Full connectivity | No local limit | No local limit | 0.1 s | Any algorithm; offload to cloud |

The budgets are monotone: \(\pi_0 \leq \pi_1 \leq \cdots \leq \pi_4\) and \(M_0 \leq M_1 \leq \cdots \leq M_4\), while the tick floors \(T_q\) shrink as capability rises. The ceiling of 2% CPU and 1 MB RAM at \(\mathcal{L}_3\) is the design constraint for the full Self-Measurement stack described in this article.

Physical translation: At \(\mathcal{L}_0\) (battery critically low, all connectivity severed), a node running floating-point Kalman updates at 1 Hz would consume roughly 60 FP multiply-adds per second on an ARM Cortex-M4 — not catastrophic in isolation, but the covariance update, gossip Welford estimators, and Isolation Forest inference stack together into a measurable fraction of a 64 MHz core. The Q15 fixed-point EWMA replaces all of this with 12 integer operations per tick — under \(1\,\mu\text{s}\) at 64 MHz — leaving the CPU in WFI (Wait For Interrupt) sleep for the remaining 59.999 seconds of each tick interval.

Proposition 17 (Observer Parsimony Condition). Let \(u(t) \in [0,1]\) be the node’s current CPU utilization fraction, \(u_q\) the utilization fraction that triggers downgrade from \(\mathcal{L}_q\) to \(\mathcal{L}_{q-1}\), and \(\pi_q\) the observation overhead fraction at level \(q\). The autonomic observation layer satisfies the parsimony condition iff:

\[ u(t) + \pi_q < u_q \]

Observing must consume less CPU headroom than remains before the next forced downgrade — the watcher must not trigger what it watches for.

That is, the overhead consumed by observing must be strictly less than the CPU margin remaining before the next forced downgrade. If the condition is violated, observing at level \(q\) guarantees a transition to \(\mathcal{L}_{q-1}\).

Note: \(u(t)\) denotes CPU utilization fraction here (distinct from the compute-to-transmit energy ratio \(\rho = T_d/T_s\) from Proposition 1 in Why Edge Is Not Cloud Minus Bandwidth, which is a hardware constant, not a runtime measurement).

Proof: If \(u(t) + \pi_q \geq u_q\), then enabling level-\(q\) observation pushes utilization past the downgrade trigger \(u_q\), causing an immediate transition to \(\mathcal{L}_{q-1}\). This transition is itself observed by the MAPE-K Monitor loop, potentially triggering further recursive observation overhead — a positive feedback path to \(\mathcal{L}_0\). \(\square\)

Physical translation: An autonomic manager that spends 3% of CPU on health monitoring when the CPU is already at 98% utilization will itself push the node over the edge into survival mode. The observation budget must be sized against the current CPU margin, not against the theoretical maximum. Proposition 17 is the formal statement of “the doctor must not kill the patient with the examination.”

(Scope note: The parsimony condition uses CPU utilization fraction \(u(t)\) — a single-resource signal intentionally scoped to CPU only. The capability-level downgrade trigger in Proposition 7 uses the composite resource state \(R(t)\) (battery SOC, free memory, CPU). These gates are complementary rather than redundant: the composite \(R(t)\) gate prevents downgrade under adequate-CPU-but-depleted-battery conditions; the parsimony condition prevents the observation layer itself from consuming the last CPU margin. A node can satisfy the composite \(R(t)\) gate yet violate parsimony if CPU is near \(u_q\) regardless of battery state. Both conditions must be evaluated at each MAPE-K tick.)

Assumption set: CPU utilization \(u(t)\) is measured by a hardware cycle counter (DWT on ARM Cortex-M) with negligible overhead (\(<0.001\%\)), not by the autonomic layer itself. The downgrade thresholds \(u_q\) are static design parameters, not runtime estimates.

OUTPOST scenario: At \(\mathcal{L}_3\), the full Self-Measurement stack consumes approximately 2% (illustrative value) of the 64 MHz Cortex-M4 core; violating parsimony near the 80% (illustrative value) CPU threshold would trigger an immediate cascade to survival mode.

Empirical status: The parsimony condition is analytically exact. The OUTPOST figure of ~2% CPU for the Self-Measurement stack at \(\mathcal{L}_3\) and the 80% downgrade threshold are design parameters, not measured CPU profiles; actual overhead depends on MCU clock speed, compiler optimization, and workload mix.

Watch out for: the parsimony condition is evaluated at a point in time; if CPU utilization \(u(t)\) is measured by a procedure that itself consumes overhead \(\pi_q\), the condition becomes circular — measuring whether you can observe consumes the headroom it is trying to protect.
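The parsimony gate reduces to a one-line predicate plus a search for the richest observation tier that still fits the remaining CPU margin. A minimal Python sketch (function names, thresholds, and overhead values are illustrative, not OUTPOST calibration):

```python
def parsimony_ok(u: float, pi_q: float, u_q: float) -> bool:
    """Proposition 17: observing at level q is safe iff u(t) + pi_q < u_q."""
    return u + pi_q < u_q

def highest_safe_tier(u, tiers):
    """Pick the richest observation tier whose overhead fits the CPU margin.

    `tiers` is a list of (pi_q, u_q) pairs ordered from cheapest to richest;
    returns the index of the highest tier satisfying parsimony, or -1.
    """
    best = -1
    for q, (pi_q, u_q) in enumerate(tiers):
        if parsimony_ok(u, pi_q, u_q):
            best = q
    return best

# At 98% utilization, only the cheapest tier is safe (illustrative numbers)
assert highest_safe_tier(0.98, [(0.001, 0.999), (0.03, 1.0)]) == 0
```

The key property is that the budget is sized against the current margin \(u_q - u(t)\), not the theoretical maximum, exactly as the physical translation above demands.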

Switched-Mode Stability Extension: Limit-Cycle Prevention

Proposition 17 identifies when a single downgrade transition occurs. It does not bound the number of transitions: if utilization oscillates near the downgrade threshold \(u_q\), the system can enter an infinite downgrade-upgrade cycle — dropping to \(\mathcal{L}_{q-1}\), recovering, promoting back to \(\mathcal{L}_q\), then downgrading again — which permanently precludes a stable operating level. The extension below reframes the multi-level transition logic as a discrete-time switched linear system, constructs a Common Lyapunov Function, and derives the minimum dwell-time that eliminates limit cycles regardless of workload variability or changes in \(q\).

Definition 30 (Discrete-Time Switched Utilization System). Let \(q(t)\) be the active capability level. In mode \(q\) with tick interval \(T_q\) (Definition 29), CPU utilization evolves as:

\[ u(t+1) = \alpha\, u(t) + \pi_q + w(t) \]

where \(\alpha \in (0,1)\) is the per-tick utilization decay factor (workload dissipation), \(\pi_q\) is the observer overhead at level \(q\) ( Definition 29 ), and \(w(t)\) is bounded external disturbance. The unique mode-\(q\) equilibrium is:

\[ \bar{u}_q = \frac{\pi_q}{1 - \alpha} \]

Switching rules: downgrade when \(u(t) + \pi_q \geq u_q\) ( Proposition 17 violated); upgrade eligible only when \(u(t) < \Delta u_{\text{safe}}\) AND dwell-time \(\Delta t_{\text{dwell}}\) has elapsed since the last transition.

Here \(\alpha = e^{-T_q/\tau_{\text{cpu}}}\), where \(\tau_{\text{cpu}}\) is the CPU load decay time constant; for burst compute tasks on OUTPOST sensors, \(\tau_{\text{cpu}}\) scales with the tick interval at each level (illustrative value), giving \(\alpha \approx 0.80\) independent of \(q\).

Proposition 18 (Switched-Mode CLF Stability and Dwell-Time Bound). Under Definition 30 with \(\alpha \in (0,1)\) and disturbance bound \(|w(t)| \leq \bar{w}\):

Each observation mode is individually stable; switching between modes is safe provided the node waits a minimum dwell time before switching again.

(i) Intra-mode contraction. Let \(e(t) = u(t) - \bar{u}_q\). Then \(V(e) = e^2\) is a Lyapunov function for each mode in isolation:

\[ V(e(t+1)) = \alpha^2\, V(e(t)) < V(e(t)) \quad \text{for } e(t) \neq 0,\ w \equiv 0 \]

The error decays geometrically at rate \(\alpha\) per tick:

\[ |e(t+k)| = \alpha^k\, |e(t)| \]

The number of ticks to reach \(|e| \leq \epsilon\) is \(k = \lceil \ln(\epsilon / |e(0)|) / \ln \alpha \rceil\), independent of the value of \(T_q\). The wall-clock convergence time scales with \(T_q\), but the Lyapunov decrease per tick — and therefore the tick count — does not.

(ii) Inter-mode boundedness. At a downgrade \(\mathcal{L}_q \to \mathcal{L}_{q-1}\), the state \(u(t)\) is continuous. The error in the new mode satisfies:

\[ |e_{q-1}(t)| = |u(t) - \bar{u}_{q-1}| \;\leq\; |e_q(t)| + |\bar{u}_q - \bar{u}_{q-1}| \]

Each downgrade injects at most \(|\bar{u}_q - \bar{u}_{q-1}|\) of error. Since there are at most four downgrades, total error injection across the full sequence is bounded by \(\sum_{q=1}^{4} |\bar{u}_q - \bar{u}_{q-1}|\).

(iii) Dwell-time lower bound. Define the safe re-upgrade margin:

\[ \Delta u_{\text{safe}} = u_q - h \]

where \(h\) is the Schmitt-trigger hysteresis width (\(h = 0.04\), illustrative value, in the calibration below). A limit cycle between levels \(q\) and \(q-1\) requires an upgrade attempt before \(u(t)\) has decayed below \(\Delta u_{\text{safe}}\) — which immediately triggers another downgrade. The minimum dwell-time at level \(q-1\) that eliminates this possibility is:

\[ \Delta t_{\text{dwell}} = k^*\, T_{q-1}, \qquad k^* = \left\lceil \frac{\ln\!\big( (\Delta u_{\text{safe}} - \bar{u}_{q-1}) \,/\, (u_q - \bar{u}_{q-1}) \big)}{\ln \alpha} \right\rceil \]

Proof of (i): From Definition 30 with \(w(t) = 0\):

\[ e(t+1) = u(t+1) - \bar{u}_q = \alpha\, u(t) + \pi_q - \bar{u}_q = \alpha\, e(t), \]

using \(\bar{u}_q = \alpha \bar{u}_q + \pi_q\). Therefore \(V(e(t+1)) = \alpha^2 V(e(t))\) and \(V(e(t+1)) < V(e(t))\) for \(e(t) \neq 0\). The convergence rate \(\alpha\) does not appear in the expression for \(T_q\), so changing \(T_q\) does not affect the tick-by-tick Lyapunov decrease. With bounded disturbance \(|w(t)| \leq \bar{w}\), the bounded real lemma gives ultimate boundedness: \(|e(t)| \leq \bar{w}/(1-\alpha)\) in steady state. \(\square\)

Proof of (iii): After a downgrade at \(t_0\) with \(u(t_0) = u_q\), the trajectory at level \(q-1\) is:

\[ u(t_0 + k) = \bar{u}_{q-1} + \alpha^k \big( u_q - \bar{u}_{q-1} \big) \]

Setting this equal to \(\Delta u_{\text{safe}}\) and solving for \(k\) gives the stated expression. Any upgrade attempted before tick \(k^*\) satisfies \(u(t) > \Delta u_{\text{safe}}\), so the parsimony condition at level \(q\) fails immediately on re-entry, and the Schmitt-trigger upgrade gate ( Definition 47 in Self-Healing Without Connectivity) with hysteresis blocks the transition. \(\square\)

OUTPOST calibration (\(\alpha = 0.80\) (illustrative value), \(h = 0.04\) (illustrative value)):

| Transition | \(u_q\) | \(\bar{u}_{q-1}\) | \(\Delta u_{\text{safe}}\) | \(k^*\) | \(\Delta t_{\text{dwell}}\) | Tick floor \(T_{q-1}\) | Auto-satisfied? |
|---|---|---|---|---|---|---|---|
| \(\mathcal{L}_3 \to \mathcal{L}_2\) | 0.80 | 0.05 | 0.76 | 1 | 5 s | 5 s | Yes — 1 tick |
| \(\mathcal{L}_2 \to \mathcal{L}_1\) | 0.60 | 0.025 | 0.56 | 1 | 10 s | 10 s | Yes — 1 tick |
| \(\mathcal{L}_1 \to \mathcal{L}_0\) | 0.50 | 0.005 | 0.46 | 1 | 60 s | 60 s | Yes — 1 tick |

The dwell-time requirement is always met by waiting one tick at the lower level. The tick floor of Definition 29 is the natural dwell-time enforcer — no additional timer infrastructure is required. The existing MAPE-K tick clock implements the constraint by construction.
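The one-tick claim can be checked directly by stepping the Definition 30 dynamics once from each downgrade threshold. A minimal Python sketch using the illustrative calibration (\(\alpha = 0.80\), the Definition 29 overheads for the lower level, and the tabulated safe margins):

```python
def step(u, alpha, pi_q, w=0.0):
    """One tick of the switched utilization dynamics: u' = alpha*u + pi_q + w."""
    return alpha * u + pi_q + w

alpha = 0.80  # illustrative per-tick decay factor

# (u_q at the old level, pi at the new lower level, safe re-upgrade margin)
transitions = [
    (0.80, 0.010, 0.76),  # into the 1.0%-overhead level
    (0.60, 0.005, 0.56),  # into the 0.5%-overhead level
    (0.50, 0.001, 0.46),  # into the 0.1%-overhead level
]

for u_q, pi_lo, du_safe in transitions:
    # Worst case: downgrade fires exactly at u = u_q; take one tick of decay
    u_next = step(u_q, alpha, pi_lo)
    assert u_next < du_safe  # one tick already clears the margin: no cycle
```

With these parameters a single tick of decay lands strictly below \(\Delta u_{\text{safe}}\) in every row, which is why the tick floor alone enforces the dwell-time constraint.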

Gain-bound invariance across \(T_q\) changes. The MAPE-K loop stability criterion ( Proposition 22 in Self-Healing Without Connectivity) requires the control gain \(K\) to satisfy a bound that depends on the tick interval. As \(T_q\) increases (capability downgrade), the bound relaxes monotonically: a gain \(K\) satisfying the tightest constraint at the highest level automatically satisfies the constraint at all lower levels. If \(K\) is calibrated at \(\mathcal{L}_3\) (\(T_3 = 1\) s), it remains stable at \(\mathcal{L}_0\) (\(T_0 = 60\) s) without recalibration.

Corollary (Error Convergence Under Changing \(q\)). Suppose the switching sequence undergoes \(m \leq 4\) downgrades before reaching stable level \(q_f\). The error measured \(k\) ticks after the last transition satisfies:

\[ |e(t)| \;\leq\; \alpha^{k} \left( |e(0)| + \sum_{j=1}^{m} |\bar{u}_{q_j} - \bar{u}_{q_j - 1}| \right) \]

The error converges to zero exponentially regardless of the path and regardless of how many times \(q\) changed along the way. The convergence exponent \(\alpha\) is a hardware characteristic of the MCU’s workload dissipation, not a function of \(q\).

Empirical status: The Lyapunov stability result for individual modes is mathematically exact given the linear model. The OUTPOST dwell-time table (\(\alpha = 0.80\) (illustrative value), \(h = 0.04\) (illustrative value)) uses illustrative design parameters; the actual convergence rate \(\alpha\) is a property of the MCU workload profile. The “one tick is sufficient” result holds only when the tick floor exactly equals the computed \(\Delta t_{\text{dwell}}\).

Watch out for: the dwell-time bound is derived under the assumption that mode transitions respect the minimum inter-switch interval; if the healing loop triggers successive transitions faster than \(\Delta t_{\text{dwell}}\) allows, the CLF value at each new mode entry has not decreased by the required amount, and the Lyapunov stability argument fails.

Fixed-Point Algorithm Tier (\(\mathcal{L}_0\)/\(\mathcal{L}_1\))

Definition 31 (Fixed-Point Sensor State). At \(\mathcal{L}_0\) and \(\mathcal{L}_1\), the scalar Kalman filter (Definition 20) is replaced by a Q15 fixed-point EWMA. The sensor observation record for sensor \(i\) is a triple of 16-bit integers:

\[ \big( \hat{\mu}_i^{(15)},\ \hat{v}_i^{(15)},\ S_i^{(15)} \big) \]

where the superscript \((15)\) denotes Q15 fixed-point representation (1 sign bit, 15 fractional bits; range \([-1, +1)\), resolution \(2^{-15}\)). The Q15 EWMA update for one observation \(x^{(15)}\) is:

\[ \hat{\mu}_i^{(15)} \;\leftarrow\; \hat{\mu}_i^{(15)} + \Big( a^{(15)} \cdot \big( x^{(15)} - \hat{\mu}_i^{(15)} \big) \Big) \gg 15 \]

where \(a^{(15)} = \lfloor \alpha \cdot 2^{15} \rfloor\) is the pre-computed integer smoothing weight and \(\gg 15\) is a right-shift (equivalent to dividing by \(2^{15}\)). Variance update follows the same pattern with the squared deviation. The CUSUM accumulator uses integer subtraction and a hardware-assisted compare-with-reset.

The implementation cost is minimal: 6 16-bit multiply-shifts plus 2 comparisons give 12 ARM Cortex-M cycles (illustrative value) per sensor per tick with no FPU required, and at \(\mathcal{L}_0\) the state for the full 127-sensor OUTPOST mesh occupies 762 bytes (illustrative value) — a small fraction of the 4 KB \(M_0\) budget. The Q15 approximation tracks the true Kalman steady-state mean to within a small fractional error for OUTPOST thermal parameters; at \(\mathcal{L}_0\), where the only question is “still alive?” rather than “how anomalous?”, this precision is entirely sufficient.

Physical translation: Replacing the Kalman filter with a Q15 EWMA at \(\mathcal{L}_0\) is not a degraded fallback — it is the correct algorithm for the task. The Kalman covariance update adds precision in tracking the noise model, but at \(\mathcal{L}_0\) the noise model does not matter: the binary question is whether the CUSUM statistic has crossed its alarm threshold. Two multiply-shifts answer this question; six floating-point operations do not answer it better.
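The update in Definition 31 is a handful of integer operations. A minimal Python sketch of the Q15 arithmetic (the smoothing weight \(\alpha = 0.1\) is illustrative); on-target code would use 16-bit registers, but the multiply-and-shift structure is identical:

```python
Q15_ONE = 1 << 15  # 1.0 in Q15 (as an unsigned scale; values live in [-1, 1))

def q15_ewma_update(mu_q15: int, x_q15: int, a_q15: int) -> int:
    """Integer-only EWMA update: mu += (a * (x - mu)) >> 15. No floating point."""
    return mu_q15 + ((a_q15 * (x_q15 - mu_q15)) >> 15)

# Illustrative smoothing weight alpha = 0.1 -> a = round(0.1 * 2^15) = 3277
a = round(0.1 * Q15_ONE)

mu = 0
for _ in range(200):                     # feed a constant input of 0.5 in Q15
    mu = q15_ewma_update(mu, Q15_ONE // 2, a)
# mu converges toward 0.5 * 2^15 = 16384, stalling a few LSBs short
# because the right-shift truncates sub-LSB increments
```

The truncation floor (the filter stalls within a few LSBs of the target) is the fractional-error term mentioned above; at \(\mathcal{L}_0\) it is far below the CUSUM alarm threshold.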

Variable-Fidelity Monitor Schedule

Definition 32 (Variable-Fidelity Monitor Schedule). The MAPE-K Monitor phase at level \(q\) selects an algorithm tier based on current CPU availability \(1 - u(t)\) (where \(u(t)\) is CPU utilization as in Proposition 17), independently of the capability level transition threshold. (We write \(\ell_{\text{tier}}\) rather than \(\tau\) because \(\tau\) is reserved for time-valued staleness quantities throughout this article.)

Default tier thresholds are static design parameters set at deployment. Tier \(\ell_{\text{tier}}(t)\) is re-evaluated at each MAPE-K tick before the Monitor phase executes — the manager is itself autonomic, scaling its own cost to the available margin. Tier selection takes 2 integer comparisons (\(<10\) cycles) and precedes all other Monitor work.

The variable-fidelity schedule operates at finer granularity than capability level transitions: a node running a temporary burst computation drops to a lower tier during the burst without triggering a level transition, preserving detection coverage at minimal cost rather than either suspending detection or forcing a premature downgrade. A dead-band applies to tier transitions (Schmitt Trigger hysteresis — Definition 47 in Self-Healing Without Connectivity) to prevent oscillation between tiers during CPU load ramps.

Physical translation: Variable fidelity means the monitoring system trades accuracy for energy: during degraded conditions, it samples less frequently and uses simpler detection algorithms. A RAVEN drone on 20% battery switches from the full anomaly detector to a lightweight threshold check — it may miss subtle anomalies, but it will still catch gross failures while preserving enough power to complete the mission.
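Tier selection with a dead-band can be sketched in a few lines. A minimal Python version (tier thresholds and the 0.05 dead-band are illustrative, not OUTPOST calibration values):

```python
def select_tier(current_tier, cpu_free, thresholds, deadband=0.05):
    """Pick the algorithm tier from free CPU, with a dead-band on upgrades.

    `thresholds[k]` is the minimum free-CPU fraction required for tier k
    (ascending). Downgrades apply immediately; upgrading requires clearing
    the target tier's threshold by `deadband`, preventing oscillation
    during load ramps (Schmitt-trigger style hysteresis).
    """
    target = 0
    for k, need in enumerate(thresholds):
        if cpu_free >= need:
            target = k
    if target > current_tier and cpu_free < thresholds[target] + deadband:
        return current_tier  # inside the dead-band: hold the lower tier
    return target
```

A node hovering at the tier-2 boundary stays in tier 1 until it clears the boundary plus the dead-band, while a load spike drops it immediately: `select_tier(1, 0.52, [0.0, 0.2, 0.5])` holds tier 1, `select_tier(1, 0.60, [0.0, 0.2, 0.5])` promotes to tier 2.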

Event-Driven vs. Polling

The MAPE-K Monitor tick fires at a fixed interval (polling). At \(\mathcal{L}_0\) with \(T_0 = 60\,\text{s}\), this already achieves near-optimal power on ARM Cortex-M by keeping the CPU in WFI between ticks. One further refinement eliminates the tick entirely for sensors whose readings remain in the normal band: hardware ADC threshold interrupts.

Under hardware-interrupt-driven sampling — where the CPU remains in wait-for-interrupt sleep until a threshold crossing is detected — the expected CPU activity for sensors that remain in-band (the typical case at \(\mathcal{L}_0\)) is:

\[ \mathbb{E}[\text{cycles/s}] \;=\; f_s\, \pi_1\, c_{\text{ISR}} + \frac{c_{\text{tick}}}{60\,\text{s}} \]

where \(f_s\) is the ADC sampling rate, \(\pi_1 \approx 0.02\) is the anomalous-sample fraction, \(c_{\text{ISR}} \approx 12\) cycles/event, and \(c_{\text{tick}}\) is the cost of the 60-second housekeeping tick.

Physical translation: At 1 Hz ADC sampling with 2% anomalous fraction, the interrupt-driven ISR runs 1.2 times per minute — 12 cycles each — consuming a CPU fraction of roughly \(3.8 \times 10^{-9}\), or about 0.0000004% CPU. The 0.1% budget is larger by a factor of roughly \(2.7 \times 10^{5}\). The autonomic layer at \(\mathcal{L}_0\) is, in every practical sense, invisible to the host application.

DMA-assisted ADC ring buffer: Under DMA-driven sampling, the transfer engine fills a ring buffer autonomously while the CPU reads only the tail pointer — one load instruction. This separates sampling (hardware-driven, zero CPU) from analysis (software-triggered by timer or threshold interrupt), allowing the CPU budget for the \(\mathcal{L}_0\) monitor to approach its theoretical floor.
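The two-term expected-activity expression evaluates directly. A minimal Python sketch with illustrative parameters (1 Hz sampling, 2% anomalous fraction, 12-cycle ISR, and an assumed 200-cycle housekeeping tick; the tick cost is an assumption for illustration, not a measured figure):

```python
def isr_cycles_per_second(f_s, p_anom, c_isr, c_tick, tick_s):
    """Expected CPU cycles/s under threshold-interrupt sampling.

    In-band samples cost zero CPU (DMA + hardware comparator); anomalous
    crossings fire a short ISR; a periodic housekeeping tick does the rest.
    """
    return f_s * p_anom * c_isr + c_tick / tick_s

# Illustrative evaluation on a 64 MHz core
cycles = isr_cycles_per_second(f_s=1.0, p_anom=0.02, c_isr=12,
                               c_tick=200, tick_s=60.0)
frac = cycles / 64e6
assert frac < 1e-5  # orders of magnitude inside the 0.1% L0 CPU budget
```

Even with the housekeeping tick dominating the sum, the fraction remains far below the \(\pi_0 = 0.1\%\) ceiling, which is the point of the event-driven refinement.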

Per-Algorithm Complexity Audit

The table below audits each algorithm in this article against the Definition 29 overhead budgets, identifying the minimum capability level at which each algorithm is viable and the correct \(\mathcal{L}_0\)/\(\mathcal{L}_1\) replacement.

| Algorithm | CPU complexity | RAM | Min level | \(\mathcal{L}_0\)/\(\mathcal{L}_1\) replacement |
|---|---|---|---|---|
| Q15 EWMA + CUSUM (Def 84) | \(O(1)\), 12 cycles | 6 B/sensor | \(\mathcal{L}_0\) | — (this is the replacement) |
| FP EWMA (Def 23, simplified) | \(O(1)\), ~20 cycles | 8 B/sensor | \(\mathcal{L}_2\) | Q15 EWMA |
| Scalar Kalman (Def 23) | \(O(1)\), ~60 cycles FP | 32 B/sensor | \(\mathcal{L}_3\) | Q15 EWMA |
| CUSUM, static \(\mu_0\) | \(O(1)\), 10 cycles | 4 B/sensor | \(\mathcal{L}_0\) | — (already integer-safe) |
| Holt-Winters (period \(p\)) | \(O(1)\), ~80 cycles FP | \(8p\) B | \(\mathcal{L}_2\) | Skip (period data unavailable at \(\mathcal{L}_0\)–\(\mathcal{L}_1\)) |
| Isolation Forest (\(t\) trees, depth \(d\)) | \(O(t)\), ~200 cycles | \(t \cdot d \cdot 4d\) B | \(\mathcal{L}_3\) | Skip (19 KB at \(t=50, d=8\)) |
| Bayesian Surprise (Def 24) | \(O(1)\), ~40 cycles FP | 8 B/sensor | \(\mathcal{L}_3\) | CUSUM (\(S_t^K \to S_t\) at \(\mathcal{L}_0\)) |
| Gossip (Def 5, per round) | \(O(1)\) local, \(O(\ln n)\) fleet | 6 B/peer | Suspend at \(\mathcal{L}_0\) | Frozen weights from Def 60 |
| Welford estimator (Def 58) | \(O(1)\), ~30 cycles FP | 24 B/peer | Freeze at \(\mathcal{L}_0\) | Trust-root anchor (Def 60) |

Notes on the audit:

Fleet-level complexity: Gossip convergence is \(O(\ln n / \lambda)\) ( Proposition 12 ). Per-node gossip state is \(O(n)\) — the Welford estimator ( Definition 33 ) requires \(24\,\text{bytes} \times n\) peers. For OUTPOST (\(n=127\)): \(127 \times 24 = 3{,}048\) bytes while gossip is active, frozen at 0 bytes at \(\mathcal{L}_0\). The \(O(n)\) growth is bounded by fleet size, which is a fixed deployment parameter — not a runtime variable. Fleet size does not affect per-observation compute complexity (which remains \(O(1)\)) and only affects the static RAM allocation.

OUTPOST calibration: \(127 \times 6\,\text{bytes}\) (Q15 state) = 762 bytes RAM (illustrative value). DMA ring buffer: \(64 \times 2\,\text{bytes}\) = 128 bytes (illustrative value). Gossip state: frozen, 0 bytes active. Total autonomic observation footprint: 890 bytes (illustrative value) against the 4 KB \(M_0\) budget (22% utilization (illustrative value)). CPU: one active tick per 60 s at \(127 \times 12 \approx 1{,}500\) cycles (illustrative value), a CPU fraction of roughly \(4 \times 10^{-7}\) at 64 MHz, three orders of magnitude below the 0.1% ceiling (illustrative value). The observer parsimony condition ( Proposition 17 ) is satisfied with a margin of \(> 99.99\%\).
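The footprint arithmetic can be replayed in a few lines. A minimal Python sketch (all constants illustrative, mirroring this section's calibration; the CPU figure reuses the 12 cycles/sensor Q15 cost from Definition 31):

```python
# Illustrative audit of the OUTPOST L0 observation footprint against Def 29
sensors = 127
q15_state = sensors * 6        # 6 B/sensor of Q15 EWMA + CUSUM state
dma_ring = 64 * 2              # 64-sample, 16-bit DMA ADC ring buffer
gossip_state = 0               # gossip frozen at L0
total_ram = q15_state + dma_ring + gossip_state

assert total_ram == 890                  # bytes
assert total_ram / (4 * 1024) < 0.25     # ~22% of the 4 KB M_0 budget

tick_cycles = sensors * 12               # 12 cycles/sensor per Q15 update
cpu_frac = tick_cycles / 60 / 64e6       # one tick per 60 s on a 64 MHz core
assert cpu_frac < 0.001                  # well inside the 0.1% pi_0 ceiling
```

The RAM budget, not CPU, is the closer constraint at \(\mathcal{L}_0\), and even it sits near 22% utilization.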

Mesh-Wide Health Inference

OUTPOST uses hierarchical aggregation with fusion nodes; the diagram shows a simplified three-layer structure where the backup link between fusion nodes (F1 to F2) allows intra-mesh health exchange to continue even when the satellite uplink is severed.

    
    graph TD
    subgraph Sensors["Sensor Layer (distributed)"]
    S1[Sensor 1]
    S2[Sensor 2]
    S3[Sensor 3]
    S4[Sensor 4]
    S5[Sensor 5]
    S6[Sensor 6]
    end
    subgraph Fusion["Fusion Layer (aggregation)"]
    F1[Fusion A]
    F2[Fusion B]
    end
    subgraph Command["Command Layer (satellite)"]
    U[Uplink to HQ]
    end
    S1 --> F1
    S2 --> F1
    S3 --> F1
    S4 --> F2
    S5 --> F2
    S6 --> F2
    F1 --> U
    F2 --> U
    F1 -.->|"backup link"| F2

    style U fill:#c8e6c9
    style F1 fill:#fff9c4
    style F2 fill:#fff9c4
    style Sensors fill:#e3f2fd
    style Fusion fill:#fff3e0
    style Command fill:#e8f5e9

Read the diagram: Sensors report to their local fusion node (Fusion A or B, yellow). Both fusion nodes forward to the satellite uplink (green). The dashed backup link between F1 and F2 enables intra-mesh health exchange to continue even when the uplink is severed — the mesh can still maintain local awareness and consensus even under complete satellite denial.

(Simplified illustration; OUTPOST operates 127 sensors in practice)

Gossip parameters for OUTPOST (power-optimized for extended deployment):

Tamper Detection

Fixed sensor positions make physical tampering a significant threat. Multi-layer detection:

Physical indicators:

Logical indicators:

Response protocol:

  1. Log tamper indicators with timestamp
  2. Increase reporting frequency if power permits
  3. Alert fusion node with tamper confidence level
  4. Continue operation unless tamper confidence exceeds threshold
  5. When accumulated tamper confidence, computed as a Bayesian update over the listed physical and logical indicators, exceeds the quarantine threshold: switch to quarantine mode (report but don’t trust own data)
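The accumulated confidence in step 5 can be sketched as a Bayesian update in odds form over the observed indicators. A minimal Python sketch; the prior and likelihood-ratio values are illustrative assumptions, not calibrated OUTPOST numbers:

```python
def update_tamper_confidence(prior, likelihood_ratios):
    """Accumulate tamper confidence as a Bayesian update in odds form.

    Each observed indicator contributes a likelihood ratio
    P(indicator | tamper) / P(indicator | no tamper), assumed independent.
    Returns the posterior probability of tamper.
    """
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)

# Illustrative: 1% prior, two moderately suspicious indicators (LR = 5 each)
conf = update_tamper_confidence(0.01, [5.0, 5.0])  # roughly 0.20
```

Working in odds keeps the per-indicator update to one multiply, which matters on the same CPU budget that constrains every other observer computation.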

Cross-Sensor Validation

OUTPOST leverages overlapping sensor coverage for cross-validation. The formula below computes a confidence score for sensor \(s_i\) as the trust-weighted fraction of agreement with its coverage-overlapping neighbors \(N(i)\), where \(w_{ij}\) is the link trust weight and \(a_{ij}\) measures detection correlation between the two sensors.

\[ \text{conf}(s_i) \;=\; \frac{\sum_{j \in N(i)} w_{ij}\, a_{ij}}{\sum_{j \in N(i)} w_{ij}} \]

where \(N(i)\) is the set of sensors with overlapping coverage, and \(a_{ij}\) is the fraction of common time windows where both sensors agree on event presence/absence within a spatial-temporal tolerance window.

Low confidence triggers:

Cross-validation doesn’t determine which sensor is correct - it identifies sensors requiring investigation.
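The confidence score is a weighted average and can be sketched directly. A minimal Python version (sensor names, trust weights, and agreement fractions are illustrative):

```python
def cross_validation_confidence(trust, agreement):
    """Trust-weighted agreement with coverage-overlapping neighbours.

    trust[j]:     link trust weight for neighbour j
    agreement[j]: fraction of common windows where both sensors agree
    Returns 0.0 when a sensor has no overlapping neighbours (no evidence).
    """
    total = sum(trust.values())
    if total == 0:
        return 0.0
    return sum(trust[j] * agreement[j] for j in trust) / total

# Illustrative: a sensor validated by two neighbours of unequal trust
conf_47 = cross_validation_confidence({"s45": 0.9, "s48": 0.6},
                                      {"s45": 0.95, "s48": 0.50})
```

Note the zero-neighbour case returns the lowest confidence rather than raising: a sensor with no overlapping coverage simply cannot be cross-validated, which is itself a flag for investigation.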


The Limits of Self-Measurement

Self-measurement has boundaries. Recognizing these limits is essential for correct system design.

Novel Failure Modes

Anomaly detection learns from historical data. A failure mode never seen before - outside the training distribution - may not be detected as anomalous.

Example: OUTPOST sensors are trained on hardware failures, communication failures, and known environmental conditions. A new adversarial technique - acoustic disruption of MEMS sensors - produces sensor behavior within “normal” ranges but with corrupted data. The anomaly detector sees normal statistics; the semantic content is compromised.

Mitigation: Defense in depth. Multiple detection mechanisms with different assumptions. Cross-validation between sensors. Periodic ground-truth verification when connectivity allows.

Adversarial Understanding

An adversary who understands the detection algorithm can craft attacks that evade detection.

If the adversary knows we use EWMA with \(\alpha = 0.1\), they can introduce gradual drift that stays within \(2\sigma\) at each step but accumulates to significant deviation over time. The “boiling frog” attack.

Mitigation: Ensemble of detection algorithms with different sensitivities. Long-term drift detection (comparing current baseline to baseline from days ago). Randomized detection parameters.

Cascading Failures

Self-measurement assumes the measurement infrastructure is functional. But the measurement infrastructure can fail too.

If the power management system fails, anomaly detection may lose power before it can detect the power anomaly. If the communication subsystem fails, gossip cannot propagate health. The failure cascades faster than measurement can track.

Mitigation: P0/P1 monitoring on dedicated, ultra-low-power subsystem. Watchdog timers that trigger even if main processor fails. Hardware-level health indicators independent of software.

The Judgment Horizon

When should the system distrust its own measurements?

At the judgment horizon, self-measurement must acknowledge its limits. The system should:

  1. Log that it has reached measurement uncertainty limits
  2. Fall back to conservative assumptions
  3. Request human input when connectivity allows
  4. Avoid irreversible actions until confidence is restored

Sensor 47 Resolution

Return to our opening scenario. Sensor 47 went silent. How did OUTPOST diagnose the failure?

The fusion node applied the diagnostic framework from Section 2.3. Signature analysis found the abrupt silence with no prior degradation inconsistent with gradual sensor element failure but consistent with abrupt power regulation failure. Correlation check confirmed sensors 45, 46, 48, and 49 all operational, ruling out regional communication failure. Environmental context showed no known jamming indicators and nominal weather, lowering adversarial probability. The staleness trajectory of Sensor 47’s last 10 readings showed normal variance with no drift, ruling out slow degradation.

Diagnosis: localized hardware failure (most likely power regulation) with 78% confidence (illustrative value). The fusion node routed Sensor 47’s coverage zone to neighbors (Sensors 45 and 48), flagged the unit for physical inspection on next patrol, and updated its anomaly detection baseline to reduce reliance on Sensor 47’s historical patterns.

Post-reconnection analysis (satellite uplink restored 6 hours later): Sensor 47’s voltage regulator had failed suddenly - a known failure mode for this component batch. The diagnosis was correct. The system had self-measured, self-diagnosed, and self-healed without human intervention.

Cognitive Map: Self-measurement has four hard limits — novel failures outside the training distribution, adversarial attacks calibrated to evade the detector, measurement infrastructure failures that cut the observation loop before an anomaly is logged, and the judgment horizon where confidence intervals are too wide to support any decision. Recognition of these limits is not a weakness of the design; it is the feature that prevents the system from taking irreversible actions on uninformative data.

Learning from Measurement Failures

Measurement failures provide training data for improved detection. The four-step post-hoc process below turns each failure event into a concrete improvement to the detection catalog, threshold settings, or classifier model.

| Step | Action | Output |
|---|---|---|
| 1 | Document failure mode | Failure signature added to catalog |
| 2 | Extract detection features | New features for anomaly detector |
| 3 | Adjust thresholds if false negative | Updated detection thresholds |
| 4 | Retrain models | Updated classifier with new case |

Measurable improvement: detection probability for that failure class increases after each logged failure of that type.


Model Scope and Failure Envelope

Each mechanism has bounded validity. When assumptions fail, so does the mechanism.

Validity Domain: Summary

Each mechanism is valid only within a domain defined by its assumptions. Outside that domain the protocol continues to run, but its correctness and performance guarantees no longer hold. The table below shows, for each component, which assumptions must hold, what breaks when they fail, and what mitigations exist.

| Component | Assumptions | Breaks When | Mitigation |
|---|---|---|---|
| Gossip (Prop 4) | Connected network; delivery prob \(> 0.5\); uniform peer selection | Partition isolates clusters; high message loss; biased peer selection | Hierarchical gossip; bridge detection; priority messages; random peer forcing |
| Detection (Prop 3) | Abrupt step-like shift; stable baseline; Gaussian noise | Gradual drift; unstable baseline; heavy-tailed noise | Dual-CUSUM for drift; adaptive windowing; robust statistics |
| Byzantine (Prop 6) | Byzantine minority (\(f < n/3\)); honest nodes truthful; attacker cannot predict trimming | \(f \geq n/3\); coordinated alignment past trimming; compromised honest node | Hierarchical trust; random trimming; continuous trust reassessment |
| Staleness (Prop 5) | Brownian diffusion model with accurate \(\sigma\); reliable timestamps | Volatility misestimate; clock spoofing; strongly mean-reverting metrics | Adaptive volatility estimation; authenticated time; relative ordering |

Counter-scenarios: Adversary who selectively jams inter-cluster gossip creates divergent health views undetectable within each cluster — detection requires cross-cluster comparison on reconnection. Adversary who compromises exactly \(n/3\) sensors gradually stays below instantaneous detection thresholds — detection requires trend analysis of trust scores, not just instantaneous counts.

Summary: Claim-Assumption-Failure Table

The table below consolidates the key correctness claims from this article, the assumptions each relies on, and the specific conditions under which each breaks down.

| Claim | Key Assumptions | Valid When | Fails When |
|---|---|---|---|
| Gossip converges in \(O(\ln n)\) | Connected network, uniform peer selection | Network mostly connected | Partition isolates clusters |
| CUSUM detects faster than EWMA | Abrupt shift | Step-like anomalies | Gradual drift; tiny shifts |
| Trimmed mean tolerates \(f\) Byzantine | \(f < n/3\) | Byzantine minority | \(f \geq n/3\); coordinated attack |
| Confidence degrades as \(e^{-\gamma \tau}\) | State evolution follows Brownian motion model | Stable volatility | Volatility spikes; regime change |
| Ensemble improves detection | Models capture different anomaly types | Anomaly diversity | All anomalies same type |

Irreducible Trade-offs

No design eliminates these tensions. The architect selects a point on each Pareto front.

Trade-off 1: Detection Sensitivity vs. False Positive Rate

Multi-objective formulation: Raising \(\theta\) decreases false positives but also decreases true positives; any threshold choice trades one against the other, and the cost-weighted decision below makes the price of each direction explicit.

ROC trade-off: No threshold \(\theta\) achieves both TPR = 1 and FPR = 0 for overlapping distributions.

Cost-weighted decision: The threshold is chosen to minimize total expected error cost,

\[ \theta^* = \arg\min_\theta \Big[\, C_{\text{FN}}\,\big(1 - \text{TPR}(\theta)\big) + C_{\text{FP}}\,\text{FPR}(\theta) \,\Big] \]

explicitly weighting missed detections by \(C_{\text{FN}}\) and false alarms by \(C_{\text{FP}}\).

Pareto front derivation (Gaussian anomaly model):

For normally distributed metrics with anomaly shift \(\delta\) (in units of \(\sigma\)), the two closed-form expressions below give the TPR and FPR as a function of threshold \(\theta\) in units of \(\sigma\); the gap between the two curves widens as \(\delta\) increases relative to \(\sigma\).

\[ \text{TPR}(\theta) = \Phi(\delta - \theta), \qquad \text{FPR}(\theta) = \Phi(-\theta) \]

The table below evaluates these expressions at four operating points and shows the cost-ratio condition under which each is optimal; the Use-when column is the key reference for practitioners choosing a threshold.

| Threshold | TPR | FPR | Use when \(C_{\text{FN}}/C_{\text{FP}}\) is |
|---|---|---|---|
| \(1.5\sigma\) | \(\Phi(\delta - 1.5)\) | \(0.067\) | High (FN aversion; tolerate FP) |
| \(2.0\sigma\) | \(\Phi(\delta - 2.0)\) | \(0.023\) | Medium |
| \(2.5\sigma\) | \(\Phi(\delta - 2.5)\) | \(0.006\) | Low |
| \(3.0\sigma\) | \(\Phi(\delta - 3.0)\) | \(0.001\) | Very low (FP aversion) |

Optimal threshold selection: \(\theta^*\) is the quantile of the cost-normalized false-positive fraction, placing the decision boundary where the marginal cost of a false positive equals the marginal cost of a false negative.

For tactical edge where \(C_{\text{FN}} \gg C_{\text{FP}}\): the cost-optimal threshold is the \(1.5\sigma\) row of the table above (FPR = 0.067, high sensitivity), accepting false positives to minimize missed detections.
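The Gaussian operating points in the table follow directly from the standard normal CDF. A minimal Python sketch (thresholds in units of \(\sigma\)):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def fpr(theta):
    """One-sided false-positive rate at threshold theta (units of sigma)."""
    return 1.0 - phi(theta)

def tpr(theta, delta):
    """True-positive rate for an anomaly mean shifted by delta sigma."""
    return 1.0 - phi(theta - delta)

# The tabulated operating points: fpr(1.5) ~ 0.067, fpr(2.0) ~ 0.023,
# fpr(2.5) ~ 0.006, fpr(3.0) ~ 0.001; TPR depends on the shift delta
```

For a large shift (\(\delta = 3\sigma\)) the high-sensitivity \(1.5\sigma\) threshold catches over 93% of anomalies while paying the 6.7% false-alarm rate, which is exactly the trade the tactical-edge cost ratio calls for.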

Trade-off 2: Staleness vs. Bandwidth Cost

Multi-objective formulation: Increasing the gossip rate \(f_g\) improves data freshness but consumes proportionally more bandwidth; the optimization problem is to choose the rate that balances staleness \(1/f_g\) against bandwidth \(f_g \cdot m\), where \(f_g\) is the gossip rate and \(m\) the message size.

Confidence-bandwidth surface derivation: The table below samples the Pareto front at four gossip rates, showing bandwidth in units of \(f_g \cdot m\) (message-rate times message-size) and confidence as \(e^{-\gamma \tau}\), where \(\gamma\) is the staleness decay rate.

With gossip contact rate \(f_g\), mean staleness is \(\tau = 1/f_g\) and confidence is \(C(f_g) = e^{-\gamma / f_g}\); the two expressions below make the linear bandwidth cost and exponential confidence gain explicit as functions of the single design parameter \(f_g\).

\[ B(f_g) = f_g \cdot m, \qquad C(f_g) = e^{-\gamma / f_g} \]

| Gossip Rate \(f_g\) | Staleness \(1/f_g\) | Bandwidth | Confidence |
|---|---|---|---|
| \(0.1\)/s | \(10\) s | \(0.1\, m\)/s | \(e^{-10\gamma}\) |
| \(0.5\)/s | \(2\) s | \(0.5\, m\)/s | \(e^{-2\gamma}\) |
| \(1.0\)/s | \(1\) s | \(1.0\, m\)/s | \(e^{-\gamma}\) |
| \(2.0\)/s | \(0.5\) s | \(2.0\, m\)/s | \(e^{-0.5\gamma}\) |

Diminishing returns analysis: \(\frac{dC}{df_g} = \frac{\gamma}{f_g^2}\, e^{-\gamma/f_g} \to 0\) as \(f_g \to \infty\). The marginal confidence gain from doubling \(f_g\) shrinks as \(f_g\) increases.
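The diminishing-returns claim is easy to verify numerically under the exponential-confidence model. A minimal Python sketch (\(\gamma = 0.5\) is an illustrative decay rate):

```python
from math import exp

def bandwidth(f_g, m):
    """Bandwidth grows linearly in gossip contact rate f_g (message size m)."""
    return f_g * m

def confidence(f_g, gamma):
    """Confidence at mean staleness 1/f_g, decaying at rate gamma."""
    return exp(-gamma / f_g)

# Each doubling of f_g buys less additional confidence (illustrative gamma)
gains = [confidence(2 * f, 0.5) - confidence(f, 0.5)
         for f in (0.5, 1.0, 2.0, 4.0)]
assert all(g > 0 for g in gains)
assert all(a > b for a, b in zip(gains, gains[1:]))  # shrinking marginal gain
```

Bandwidth, by contrast, doubles with every doubling of \(f_g\), which is why the knee of this curve, not its asymptote, is the sensible operating point.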

Trade-off 3: Model Complexity vs. Adaptability

Multi-objective formulation: Choosing model \(m\) simultaneously determines accuracy on known patterns, adaptability to novel ones, and compute cost — the three-way trade-off that prevents any single model from dominating.

Pareto front derivation (bias-variance tradeoff): The decomposition below shows that accuracy is limited by two terms moving in opposite directions as model complexity increases — bias decreases while variance increases.

\[ \mathbb{E}\big[(\hat{f} - f)^2\big] \;=\; \text{Bias}^2(C_m) + \text{Var}(C_m) + \sigma^2_{\text{noise}} \]
The table below evaluates all three trade-off dimensions for the four models used in this article; no row dominates all others, confirming the Pareto structure.

| Model | Capacity \(C_m\) | Adaptability \(\propto 1/C_m\) | Compute \(O(\cdot)\) | Memory |
|---|---|---|---|---|
| EWMA | \(O(1)\) | High | \(O(1)\) | 96 B |
| Isolation Forest | \(O(t \log n)\) | Medium | \(O(\log n)\) | 25 KB |
| Autoencoder | \(O(d^2)\) | Low | \(O(d^2)\) | 280 B |
| Ensemble | combined | Medium | sum of members | 25 KB |

No model dominates: High-capacity models (autoencoder) achieve lower bias but higher variance on novel distributions. Low-capacity models (EWMA) adapt to drift but miss complex patterns. The Pareto frontier is convex - gains on one objective require losses on another.

Trade-off 4: Byzantine Tolerance vs. Latency

Multi-objective formulation: Tolerating more Byzantine nodes \(f\) requires more participating nodes and more message rounds, directly increasing aggregation latency:

\[ n \geq 3f + 1, \qquad \text{rounds} = f + 1, \qquad \text{latency} = O(f) \]

where \(f\) is the number of Byzantine nodes tolerated. The table below shows how tolerating each additional faulty node requires \(3\) more participating nodes and one additional message round, so the latency cost is linear in \(f\).

| Tolerance Level | Required Nodes | Message Rounds | Latency |
|---|---|---|---|
| \(f = 0\) | \(n \geq 1\) | 1 | Low |
| \(f = 1\) | \(n \geq 4\) | 2 | Medium |
| \(f = 2\) | \(n \geq 7\) | 3 | High |
| \(f = k\) | \(n \geq 3k+1\) | \(k+1\) | \(O(k)\) |

Higher Byzantine tolerance requires more nodes and more communication rounds, increasing latency. Cannot achieve high tolerance with low latency and few nodes.
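The \(n \geq 3f + 1\) sizing rule and the trimmed-mean aggregate it protects can be sketched together. A minimal Python example (sensor readings are illustrative):

```python
def nodes_required(f):
    """Minimum cluster size to tolerate f Byzantine nodes: n >= 3f + 1."""
    return 3 * f + 1

def trimmed_mean(values, f):
    """Drop the f highest and f lowest readings, average the rest.

    With n >= 3f + 1 honest-majority readings, f colluding outliers are
    guaranteed to fall inside the trimmed tails.
    """
    if len(values) < nodes_required(f):
        raise ValueError("not enough nodes for this tolerance level")
    core = sorted(values)[f:len(values) - f]
    return sum(core) / len(core)

# Two colluding outliers cannot drag the aggregate with n = 7, f = 2
readings = [20.1, 20.3, 19.9, 20.0, 20.2, 99.0, 99.0]
assert abs(trimmed_mean(readings, 2) - 20.2) < 1e-9
```

The latency cost is not visible in this single-round sketch: each extra unit of \(f\) also adds a message round to collect the larger quorum, which is the row-by-row growth shown in the table.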

Cognitive Map: The four trade-offs — sensitivity vs. false positive rate, freshness vs. bandwidth, accuracy vs. adaptability, Byzantine tolerance vs. latency — are Pareto frontiers, not optimization problems with a single correct answer. Every deployment chooses a point on each frontier based on its mission cost ratios. The optimal threshold ( Proposition 9 ), gossip rate (Corollary 3), and ensemble composition (Section 2) are the three levers that move those operating points. Adjust them together, not independently.

Cost Surface: Measurement Under Resource Constraints

The total cost of measurement as a function of sampling rate \(f_s\) and fidelity \(q\) takes the form below: the first term is the resource cost of sampling (quadratic in fidelity), while the second term is the staleness cost that increases as the rate drops.

\[J(f_s, q) = c_s \, f_s \, q^2 + \frac{c_\tau}{f_s}\]

The first term represents sampling cost (higher rate and fidelity cost more). The second term represents staleness cost (lower rate increases staleness). The optimal operating point balances these competing costs.
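A numeric sketch of this trade-off, with assumed coefficient names (`c_sample`, `c_stale`) standing in for the cost weights; the closed-form minimizer comes from setting the rate derivative to zero:

```python
import math

def measurement_cost(rate: float, fidelity: float,
                     c_sample: float = 1.0, c_stale: float = 4.0) -> float:
    """Total measurement cost: a sampling term (linear in rate, quadratic
    in fidelity) plus a staleness term (inverse in rate). Coefficient
    names and values are illustrative, not from the article."""
    return c_sample * rate * fidelity**2 + c_stale / rate

def optimal_rate(fidelity: float, c_sample: float = 1.0,
                 c_stale: float = 4.0) -> float:
    """Minimizer: d/d(rate) = c_sample*q^2 - c_stale/rate^2 = 0."""
    return math.sqrt(c_stale / (c_sample * fidelity**2))
```

With the illustrative coefficients above, `optimal_rate(1.0)` returns 2.0, and perturbing the rate in either direction raises the total cost, which is the "balances these competing costs" claim in miniature.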

Resource Shadow Prices

The shadow price quantifies how much additional utility one more unit of each resource delivers at the current operating point; a high shadow price identifies the binding constraint where investment yields the largest return.

| Resource | Shadow Price \(\zeta_i\) (c.u.) | Interpretation |
|---|---|---|
| Gossip bandwidth | 2.10/KB-hr | Value of an additional kilobyte per hour of health-synchronization capacity |
| Detection compute | 0.05/inference | Value of one additional detection inference pass at current anomaly rate |
| Sensor power | 0.80/mW-hr | Value of one additional milliwatt-hour sustaining continuous sensing |
| Memory | 0.01/KB | Value of one additional kilobyte enabling longer observation history |

(Shadow prices in normalized cost units (c.u.) — illustrative relative values; ratios convey resource scarcity ordering. Detection compute (0.05 c.u./inference) is the reference unit. Calibrate to actual platform resource costs.)

Irreducible Trade-off Summary

The four trade-offs developed in this section are irreducible: no design choice eliminates the tension, only shifts the operating point along the Pareto frontier.

| Trade-off | Objectives in Tension | Cannot Simultaneously Achieve |
|---|---|---|
| Sensitivity-Precision | Catch all anomalies vs. no false alarms | Perfect TPR and zero FPR |
| Freshness-Bandwidth | Current information vs. low network cost | Both with limited bandwidth |
| Accuracy-Adaptability | High accuracy vs. novel anomaly detection | Both without ensemble overhead |
| Tolerance-Latency | Byzantine resilience vs. fast aggregation | Both with few nodes |

Reputation-Based Consensus at the Measurement Layer

The Byzantine framework established in Definitions 27 and 22 and Proposition 15 handles fault tolerance at the aggregation layer: trust-weighted trimmed means guard against nodes whose reports fall outside the statistical envelope. What it does not handle is a slower attack surface: a node whose divergence \(D_j(t)\) drifts systematically but stays within the trimming threshold, corrupting the gossip health vector ( Definition 24 ) and the adaptive baseline ( Definition 20 ) over many observation cycles before the damage becomes detectable.

Three mechanisms close this gap. A per-peer running estimator flags anomalous divergence history ( Definition 33 ). A local ejection rule removes the offending node from trust-weighted merges without requiring a fleet-wide vote ( Definition 34 ). A Phase-0-anchored trust-root maintains calibrated weights during the Denied regime ( Definition 35 ) — the connectivity state where gossip cannot propagate reputation updates at all.

Definition 33 (Divergence Sanity Bound). Node \(i\) maintains a Welford running estimator over divergence observations from peer \(j\): sample count \(n_j\), running mean \(\mu_j\), and second moment \(M_{2,j}\). On each new divergence observation \(d = D_j(t)\) the estimator updates as:

\[n_j \leftarrow n_j + 1, \qquad \delta = d - \mu_j, \qquad \mu_j \leftarrow \mu_j + \frac{\delta}{n_j}, \qquad M_{2,j} \leftarrow M_{2,j} + \delta \, (d - \mu_j), \qquad \sigma_j^2 = \frac{M_{2,j}}{n_j - 1}\]

Cold-start handling: when \(n_j < 2\), the variance term is undefined. For the first observation, a prior variance \(\sigma_\text{prior}^2\) is substituted; the ejection gate is suppressed entirely for \(n_j = 0\), and the prior variance is used for \(n_j = 1\).

Observation \(d\) is flagged anomalous if \(|d - \mu_j| > k \, \sigma_j\), where \(k\) is a fleet-wide policy parameter (\(k = 3\) (illustrative value) for standard operation; \(k = 4\) (illustrative value) for high-safety contexts). The estimator is seeded from the Phase-0 attestation window ( Definition 35 ), not from cold-start zeros. \(D_j(t)\) is the per-node scalar approximation of Definition 57 ’s pairwise divergence metric, computed against the last-known fleet consensus snapshot received via gossip ( Definition 24 ).

Physical translation: Each node silently tracks how far each peer’s divergence deviates from that peer’s own history. A peer whose divergence was normally around 0.04 and suddenly hits 0.31 is flagged — not because of any absolute threshold, but because it is behaving three standard deviations outside its own baseline. The Welford update requires no stored history, only three running scalars per peer, making it viable on constrained hardware. Cold-starting from zeros would produce false flags during warm-up; seeding from Phase-0 eliminates this entirely.
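The three-scalar Welford update and the \(k\sigma\) flag can be sketched as follows; the prior-variance value is an illustrative cold-start assumption:

```python
import math

class PeerDivergenceTracker:
    """Welford running estimator over one peer's divergence history
    (Definition 33): three running scalars, no stored samples."""

    def __init__(self, prior_var: float = 1e-4):
        self.n = 0                  # sample count n_j
        self.mean = 0.0             # running mean mu_j
        self.m2 = 0.0               # sum of squared deviations (second moment)
        self.prior_var = prior_var  # assumed Phase-0 cold-start variance

    def variance(self) -> float:
        # Cold-start handling: variance undefined for n < 2, use the prior
        return self.m2 / (self.n - 1) if self.n >= 2 else self.prior_var

    def observe(self, d: float, k: float = 3.0) -> bool:
        """Return True if d is flagged anomalous (|d - mu| > k*sigma).
        The gate is suppressed entirely for the very first observation."""
        flagged = (self.n >= 1 and
                   abs(d - self.mean) > k * math.sqrt(self.variance()))
        # Welford update: three scalars, O(1) per observation
        self.n += 1
        delta = d - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (d - self.mean)
        return flagged
```

Fed a peer whose divergence hovers near 0.04, the tracker stays quiet; a jump to 0.31 lands far outside \(k\sigma\) of the peer's own history and is flagged on arrival.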

Definition 34 (Soft-Quorum Ejection). Node \(i\) maintains a per-peer suspicion counter \(v_j\), a sliding-window flag count \(F_j(w)\) over the last \(w\) observations, and reputation weight \(w_j \in [0, w_0]\), where \(w_0\) is the Phase-0 calibrated baseline. The update rules are: a flagged observation sets \(v_j \leftarrow v_j + 1\); a clean observation reduces suspicion rather than resetting it: \(v_j \leftarrow \max(0,\, v_j - \delta_v)\) for a partial decay step \(0 < \delta_v < 1\). When \(v_j \geq m\), node \(i\) sets \(w_j \leftarrow 0\) (soft-eject). Secondary rate-based trigger: if \(F_j(w) / w > r_\text{eject}\) over a sliding window of \(w\) observations, node \(i\) soft-ejects peer \(j\) regardless of consecutive-run length; default values \(r_\text{eject} = 0.4\) (illustrative value) and \(w = 10\) (illustrative value). The decaying counter prevents a Byzantine node that alternates flagged and clean observations from keeping \(v_j\) perpetually near zero; the rate-based trigger catches sustained low-grade misbehavior that would evade an \(m\)-consecutive-flags gate alone. After \(r\) consecutive post-eject clean observations, \(w_j \leftarrow w_0\) (reinstatement). No broadcast is emitted; the decision is purely local. A soft-ejected peer \(j\) is excluded from Definition 58b’s Reputation-Weighted Merge and from Definition 65’s Reputation-Weighted Fleet Coherence Vote. Peer \(j\) remains in the denominator of Definition 66’s Logical Quorum — a cascade of ejections cannot erode the quorum threshold.

Physical translation: When a peer starts reporting values that are consistently far outside its own historical norm, the local node stops trusting it for aggregation decisions — silently, without coordinating with anyone else. The ejection is soft: the suspect peer’s count still contributes to quorum thresholds (so the fleet cannot be reduced to a rump quorum by ejecting a majority), but its reports no longer influence health consensus. Ejection is not permanent; it is a probationary state.
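A sketch of the per-peer update rules, under the assumption of a 0.5 decay step per clean observation (the article fixes only that the decay is partial, not the constant); reinstatement logic is omitted for brevity:

```python
class SuspicionCounter:
    """Per-peer soft-ejection state from Definition 34: a decaying
    suspicion counter plus a sliding-window rate trigger."""

    def __init__(self, w0: float = 1.0, m: int = 5, decay: float = 0.5,
                 r_eject: float = 0.4, window: int = 10):
        self.v = 0.0            # suspicion counter v_j
        self.w = w0             # reputation weight w_j
        self.m = m              # suspicion level that triggers ejection
        self.decay = decay      # assumed partial-decay step per clean obs
        self.r_eject = r_eject  # flag-rate ejection threshold
        self.window = window    # sliding-window length w
        self.recent = []        # last `window` flag bits

    def update(self, flagged: bool) -> bool:
        """Apply one observation; return True if the peer is soft-ejected."""
        # flagged: increment; clean: decay rather than reset to zero
        self.v = self.v + 1.0 if flagged else max(0.0, self.v - self.decay)
        # rate trigger: sustained low-grade misbehavior over a full window
        self.recent = (self.recent + [flagged])[-self.window:]
        full = len(self.recent) == self.window
        rate = sum(self.recent) / len(self.recent)
        if self.v >= self.m or (full and rate > self.r_eject):
            self.w = 0.0        # soft-eject: drop from merges, keep in quorum
        return self.w == 0.0
```

Five consecutive flags trip the suspicion trigger; a peer that alternates flagged and clean observations is instead caught by the rate trigger once the window fills, exactly the evasion the decaying counter is meant to close.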

Rehabilitation: An ejected node may re-enter the quorum after reconnecting to a trust-root anchor ( Definition 35 ) and passing \(r\) consecutive validation rounds without triggering the ejection predicate. Re-entry is additionally gated by the reconnecting node’s Bayesian surprise score falling below \(\kappa / 2\) for a full gossip convergence period ( Proposition 12 ), where \(\kappa\) is the fleet-wide anomaly threshold used in Definition 33 . This prevents a briefly compliant Byzantine node from gaming the \(r\)-clean-observation window and re-entering at full weight \(w_0\).

Proposition 19 (False-Positive Ejection Bound). Under Gaussian divergence residuals and independent peers, the probability that an honest peer \(j\) is wrongly soft-ejected by node \(i\) within \(T\) observation steps satisfies:

\[P(\text{false eject within } T) \;\leq\; T \cdot \big(1 - \Phi(k)\big)^m\]
Requiring multiple consecutive flags before ejection makes accidental exclusion of honest peers astronomically unlikely.

where \(\Phi\) is the standard normal CDF. At \(k = 3, m = 5\) (illustrative value): \((1 - \Phi(3))^5 \approx 4.5 \times 10^{-15}\) per streak position (theoretical bound), negligible for any realistic operating window.

Physical translation: With a 3-sigma threshold and requiring 5 consecutive flags to eject, the probability that a normally-behaving peer gets wrongly kicked out is so small that it would not be expected to happen once in the lifetime of any conceivable fleet deployment. The requirement for five consecutive flags (not just five total) is the key: clean observations decay the counter, so transient sensor noise cannot accumulate into a false ejection.

Watch out for: this bound holds under the assumption of Gaussian divergence residuals with approximately independent peers; in the Denied regime where \(T_{\text{acc}}\) exceeds the Weibull P95 partition threshold, temporal autocorrelation violates both the Gaussianity and independence assumptions, causing actual false-ejection rates to exceed the stated bound.

Proof sketch. A false ejection requires \(m\) consecutive anomalous flags on an honest peer. Under Gaussian residuals, each flag occurs independently with probability \(1 - \Phi(k)\). The \(m\)-consecutive-flag probability is \((1 - \Phi(k))^m\); a union bound over \(T\) possible streak-ending positions gives the stated result. \(\square\)

OUTPOST scenario: With \(k = 3\) (illustrative value), \(m = 5\) (illustrative value), and a 14-day operating window (\(T \approx 1.2 \times 10^6\) steps at 1 Hz) (illustrative value), the fleet-wide expected false ejections remain below \(10^{-8}\) (theoretical bound) — effectively zero over any mission duration.

Empirical status: The \(P(\text{false eject})\) formula is exact under the Gaussian independence assumption. Real divergence residuals are not perfectly Gaussian or fully independent — temporal autocorrelation in sensor streams can cause streak lengths longer than independence predicts. The \(4.5 \times 10^{-15}\) figure for \(k=3,\,m=5\) is analytically correct under the model; empirical false-ejection rates in non-Gaussian environments have not been characterized.
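The bound is easy to evaluate numerically; the sketch below reproduces the \(4.5 \times 10^{-15}\) per-streak figure using only the standard normal CDF:

```python
import math

def phi(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def false_eject_bound(k: float, m: int, T: float) -> float:
    """Proposition 19 union bound: P(false eject in T steps) <= T*(1-Phi(k))^m."""
    return T * (1.0 - phi(k)) ** m

per_streak = (1.0 - phi(3.0)) ** 5             # ~4.5e-15 at k = 3, m = 5
window_14d = false_eject_bound(3.0, 5, 1.2e6)  # below 1e-8 over 14 days at 1 Hz
```

Note that both figures inherit the Gaussian-independence assumption; as the empirical-status caveat above says, autocorrelated residuals can push the real rate above the bound.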

Definition 35 (Trust-Root Anchor). At Phase-0, each node \(i\) generates an attestation record signed by its hardware TPM.

The fleet trust-root for node \(i\) is the set of Phase-0 attestation records collected from all fleet peers. During the Denied connectivity regime ( Definition 6 ), gossip cannot propagate reputation updates. Node \(i\) applies the following defaults: a Phase-0-attested peer receives weight \(w_j = w_0\); a peer introduced post-Phase-0 without re-attestation receives a reduced weight \(w_j < w_0\).

Physical translation: Before the mission begins, every node in the fleet exchanges hardware-attested identity records. During the mission, this pre-shared roster is the only trust reference available in the Denied regime — there is no way to verify a newcomer or revoke a known node without connectivity. Nodes added after Phase-0 (replacements, reinforcements) operate at reduced trust weight until they can be re-attested through a connected phase. The anchor does not prevent compromise; it bounds how much damage a compromised or new node can do without coordination.

Anchor compromise: If the trust-root anchor is itself Byzantine — for example, if Phase-0 attestation records were forged or the TPM was compromised before deployment — the fleet degrades to pairwise Proposition 15 (Byzantine Tolerance Bound): no quorum decisions are made until a new anchor is established through an out-of-band key ceremony. Individual nodes continue operating on local anomaly detection ( Definition 19 ) and can still form local soft-quorums among mutually trusting peers, but fleet-wide consensus requires re-establishing a verified trust root.

When hardware TPM attestation is unavailable (air-gapped deployments, legacy hardware, or emergency re-commissioning): the system falls back to a software-only attestation using a pre-shared symmetric key exchanged during physical installation. Software attestation provides weaker guarantees — it is vulnerable to a compromised node that knows the key — so deployments using software fallback must set \(f_\text{max} < n/4\) (one fewer faulty node tolerance) rather than the standard \(n/3\) Byzantine tolerance bound.

Proposition 20 (Isolated-Node Trust Guarantee). A node operating in the Denied regime for duration \(\tau\) maintains calibration error bounded by:

\[\|\mathbf{w}(\tau) - \mathbf{w}(0)\| \;\leq\; \varepsilon_{\text{drift}} \cdot \tau\]
Trust weights drift linearly during isolation; a longer Phase-0 calibration window limits how fast that drift can accumulate.

where \(\varepsilon_{\text{drift}} \propto 1/T_0\) for Phase-0 calibration window length \(T_0\), under the assumption that fleet divergence statistics are approximately stationary over the partition duration.

Proof sketch: In the Denied regime, the Welford estimators are frozen — no gossip updates arrive. Weight drift accumulates only from the last observation window before partition start. The maximum drift rate is bounded by the per-sample update magnitude (the Phase-0 window normalizes the step size). Over duration \(\tau\) with stationary divergence, drift accumulates linearly: \(\|\mathbf{w}(\tau) - \mathbf{w}(0)\| \leq \varepsilon_{\text{drift}} \cdot \tau\). Trust degrades gracefully rather than catastrophically; the Phase-0 anchor prevents unbounded weight drift regardless of partition duration. \(\square\)

Physical translation: A node isolated for 18 hours in the Denied regime will have trust weights that have drifted by at most \(\varepsilon_{\text{drift}} \cdot \tau\) from their calibrated values — a bounded, calculable degradation, not a free fall. The longer the Phase-0 calibration window \(T_0\), the smaller \(\varepsilon_{\text{drift}}\) is, so investing in a thorough pre-mission calibration directly reduces how much the trust model degrades during extended disconnection.

OUTPOST illustration. Day 44 of operation. Sensor 88 begins reporting threat coordinates 340 m (illustrative value) east of the consensus estimate. Its divergence \(D_{88}(t)\) climbs from a historical \(\mu \approx 0.04\) (illustrative value) to 0.31 (illustrative value) over six reporting cycles, exceeding the \(\mu_j + k\sigma_j\) sanity bound (illustrative value). Node 12 soft-ejects Sensor 88 on the fifth consecutive flag (streak \(v_{88} = m = 5\)) and sets \(w_{88} \leftarrow 0\). Nodes 11, 13, and 14 reach the same conclusion independently over the next two gossip cycles. The 126 remaining sensors continue operating; no coordinator is contacted; no fleet-wide vote is called.

At hour 72, sustained jamming drives OUTPOST into the Denied regime. Gossip halts. Node 12’s Welford estimators freeze at their current values. For the 126 Phase-0-attested peers, \(w_j = w_0\) is applied from the trust-root anchor; for a sensor added at day 40 without re-attestation, the reduced default weight applies. Proposition 20 bounds the trust drift over the 18-hour partition at \(\varepsilon_{\text{drift}} \cdot 18\,\text{hr}\) — a calculable, bounded degradation.

| Parameter | Standard OUTPOST | High-Safety OUTPOST |
|---|---|---|
| \(k\) (sigma threshold) | 3 | 4 |
| \(m\) (violations to eject) | 5 | 5 |
| \(r\) (clean to reinstate) | 10 | 20 |
| \(P(\text{false eject})\) per \(10^6\) steps | \(\approx 4.5 \times 10^{-9}\) | \(\approx 3.2 \times 10^{-17}\) |

Empirical status: The drift bound is analytically derived under the stationary-divergence assumption. The OUTPOST illustration values (\(\mu \approx 0.04\), ) are scenario parameters, not measured fleet divergence statistics. The assumption of stationary divergence during partition may not hold if the physical environment changes significantly during an 18-hour blackout.

Watch out for: the bound assumes the operational environment remains stationary during isolation; if the environment shifts sharply during a partition — a new failure mode appears or ambient conditions change — the model drift can exceed \(\varepsilon_{\text{drift}} \cdot \tau\) independently of isolation duration, since the Brownian bound does not cover non-stationary divergence statistics.


Closing: The Measurement-Action Loop

Measurement feeds action; without action, measurement is logging. AUTODELIVERY ’s gossip propagation feeds task assignment; PREDICTIX ’s anomaly detection feeds workload rebalancing.

The diagram below shows the measurement-action loop (the MAPE-K cycle); notice that Execute feeds back into Monitor, meaning the system continuously validates whether its own healing actions had the intended effect rather than assuming success.

    
```mermaid
graph LR
    M["Monitor<br/>(observe state)"] --> A["Analyze<br/>(detect anomaly)"]
    A --> P["Plan<br/>(select action)"]
    P --> E["Execute<br/>(apply healing)"]
    E -->|"feedback loop"| M
    style M fill:#c8e6c9
    style A fill:#fff9c4
    style P fill:#ffcc80
    style E fill:#ffab91
```

Read the diagram: Monitor (green) observes the system state and feeds it to Analyze (yellow), which classifies anomalies. Plan (orange) selects a healing action; Execute (red-orange) applies it. The feedback arrow from Execute back to Monitor is the critical design requirement — every healing action becomes a new observation. Without this loop, the system cannot distinguish “healing worked” from “healing failed but we stopped looking.”

This is the MAPE-K loop (Monitor, Analyze, Plan, Execute, Knowledge) that IBM formalized for autonomic computing [9] .
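A toy rendering of one MAPE-K pass; every name and threshold below is illustrative, not the article's pipeline:

```python
# Minimal MAPE-K cycle sketch. Function names, the threshold, and the
# toy anomaly test are illustrative, not from the article.

def monitor(reading: float, history: list) -> list:
    """Monitor: append the newest observation to local state."""
    return history + [reading]

def analyze(history: list, threshold: float = 3.0) -> bool:
    """Analyze: flag the last reading if it departs sharply from
    the mean of the prior history (a stand-in for local detection)."""
    prior = history[:-1]
    mean = sum(prior) / len(prior)
    return abs(history[-1] - mean) > threshold

def plan(anomalous: bool) -> str:
    """Plan: map the analysis verdict to a healing action."""
    return "reroute_coverage" if anomalous else "no_action"

def execute(action: str) -> float:
    """Execute: apply the action and emit an outcome observation,
    the feedback edge that re-enters Monitor on the next cycle."""
    return 0.0 if action == "reroute_coverage" else float("nan")

history = monitor(9.0, [0.1, 0.2, 0.1])   # abrupt jump, like Sensor 47
action = plan(analyze(history))            # -> "reroute_coverage"
feedback = execute(action)                 # observed on the next pass
```

The last line is the point: the healing action's outcome becomes the next Monitor input, so the loop validates its own effects instead of assuming success.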

Return to OUTPOST BRAVO.

Sensor 47 is silent. The fusion node has measured: abrupt silence, no prior degradation, neighbors fully operational, no correlated regional failure. The analysis suggests localized hardware failure with 78% confidence. The plan: reroute coverage to neighboring sensors, flag for inspection on the next patrol, log for human review when uplink restores.

But measurement alone doesn’t execute this plan. Self-healing must decide: Is 78% confidence sufficient to reroute coverage and degrade mission posture for that sector? What is the cost of a false alarm versus a missed failure? How does the rerouting affect the rest of the mesh?

These are the questions that precise measurement makes it possible to ask. Without a calibrated anomaly score and a staleness -bounded observation, there is no meaningful basis for any healing decision at all.


Gossip and epidemic dissemination. The gossip health protocol in this article ( Definition 24 ) is grounded in the epidemic-algorithm literature. Demers et al. [1] introduced the epidemic model for replicated database reconciliation, demonstrating logarithmic propagation times in fully connected networks — the direct antecedent of the \(O(\ln n / \lambda)\) convergence bound in Proposition 12 . Kermarrec et al. [6] tightened these results for probabilistic reliable dissemination in large-scale systems under message loss, providing the theoretical basis for the lossy sparse-mesh extension in Proposition 13 . Both papers treat the fully-connected and sparse cases separately; the synthesis into the edge mesh model — with explicit mesh diameter \(D\), per-edge loss probability \(p_\text{loss}\), and the conductance bound \(\Phi \geq 1/D\) — is specific to the contested-environment setting developed here.

Anomaly detection and change-point methods. The CUSUM statistic (Page [3] ) remains the standard for optimal detection of step shifts in mean, and its optimality for that specific problem class (proven by Moustakides) motivates its placement alongside EWMA rather than as a replacement. The Kalman-based adaptive baseline ( Definition 20 ) follows the linear filtering framework introduced in Kalman [4] ; the scalar form used here for edge deployment is the degenerate one-dimensional case of the full matrix filter, with the key insight that EWMA with fixed \(\alpha\) is a special case where the Riccati equation is prevented from converging. The Bayesian interpretation of the Surprise metric ( Definition 21 ) connects to the Bayesian framework for inference under uncertainty in Gelman et al. [5] ; the log Bayes factor formulation follows standard practice for sequential hypothesis testing in that tradition.

Byzantine fault tolerance. The classical impossibility result that correct consensus requires more than \(2/3\) of nodes to be honest was established by Lamport, Shostak, and Pease [7] . The trust-weighted generalization in Proposition 15 — replacing the node-count fraction with a trust-weight fraction — is a natural extension; Byzantine trust weight below \(1/3\) is a necessary condition. Castro and Liskov [8] demonstrated practical BFT in asynchronous networks through PBFT, which operates under the same \(f < n/3\) bound but with full replication semantics inapplicable to the gossip-health problem. The trust-decay and behavioral-fingerprint mechanisms ( Behavioral Fingerprint , Soft-Quorum Ejection, and Trust-Root Anchor) address the gap between the instantaneous fraction bound and the slow-accumulation attack that falls below the threshold until a coordinated action.

Autonomic computing and edge self-management. The MAPE-K control loop (Monitor–Analyse–Plan–Execute with Knowledge Base) was formalized by Kephart and Chess [9] as the organizing framework for self-managing systems; the four-phase structure pervades this article’s local anomaly detection pipeline and the closing measurement-action loop. Huebscher and McCann [10] surveyed the degrees and models of autonomic behavior, noting the dependence on local state awareness as a prerequisite for any self-* capability — the precise motivation for treating self-measurement as the foundational problem in this series rather than a supporting component. The edge computing context — limited compute, constrained memory, contested connectivity — follows the landscape surveyed by Satyanarayanan [2] , which identifies local processing and graceful disconnection as the distinguishing design properties of edge deployments relative to cloud architectures.


Three results carry forward from this article. First, the cost-optimal detection threshold ( Proposition 9 ) places the decision boundary where the ratio of anomaly likelihoods equals the ratio of error costs — a concrete, tunable criterion that replaces the ad hoc \(2\sigma\) or \(3\sigma\) thresholds common in practice. Second, gossip convergence in \(O(\ln n / f_g)\) time ( Proposition 12 ) means that fleet-wide health awareness scales gracefully: doubling a 47-drone swarm adds roughly 0.7 seconds to convergence at 1 Hz gossip rate, not a proportional delay. Third, the maximum useful staleness bound ( Proposition 14 ) gives designers a principled way to size observation frequency: the tighter the decision margin \(\Delta h\), the higher the sampling rate must be, in a quadratic relationship that makes aggressive margin requirements expensive.

But measurement is only half the loop. A system that can diagnose a failure with 78% confidence still faces the question of what to do about it: which recovery actions are safe to attempt, in what order, under what resource constraints, and with what guarantees of stability. Those are the questions addressed in Self-Healing Without Connectivity, which develops the formal autonomic control loop, defines healing action severity, and derives the stability conditions under which closed-loop recovery converges rather than oscillates.


References

[1] Demers, A., Greene, D., Hauser, C., Irish, W., Larson, J., Shenker, S., Sturgis, H., Swinehart, D., Terry, D. (1987). “Epidemic Algorithms for Replicated Database Maintenance.” Proc. PODC, 1–12. ACM. [doi]

[2] Satyanarayanan, M. (2017). “The Emergence of Edge Computing.” IEEE Computer, 50(1), 30–39. [doi]

[3] Page, E.S. (1954). “Continuous Inspection Schemes.” Biometrika, 41(1/2), 100–115. [doi]

[4] Kalman, R.E. (1960). “A New Approach to Linear Filtering and Prediction Problems.” Trans. ASME — Journal of Basic Engineering, 82(1), 35–45. [doi]

[5] Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B. (2003). Bayesian Data Analysis, 2nd ed. Chapman & Hall/CRC. [web]

[6] Kermarrec, A.-M., Massoulié, L., Ganesh, A.J. (2003). “Probabilistic Reliable Dissemination in Large-Scale Systems.” IEEE Transactions on Parallel and Distributed Systems, 14(3), 248–258. [doi]

[7] Lamport, L., Shostak, R., Pease, M. (1982). “The Byzantine Generals Problem.” ACM Trans. Programming Languages and Systems, 4(3), 382–401. [doi]

[8] Castro, M., Liskov, B. (1999). “Practical Byzantine Fault Tolerance.” Proc. OSDI, 173–186. [pdf]

[9] Kephart, J.O., Chess, D.M. (2003). “The Vision of Autonomic Computing.” IEEE Computer, 36(1), 41–50. [doi]

[10] Huebscher, M.C., McCann, J.A. (2008). “A Survey of Autonomic Computing — Degrees, Models, and Applications.” ACM Computing Surveys, 40(3), Article 7. [doi]

