Self-Measurement Without Central Observability
Prerequisites
This article builds directly on the contested connectivity framework:
- Connectivity regimes: The four states (Full, Degraded, Intermittent, Denied) and Markov transition model define when self-measurement matters most—during denied regime when central observability is unavailable
- Capability hierarchy (L0-L4): Self-measurement is the foundation enabling capability assessment. Without accurate health knowledge, the system cannot determine its current capability level
- The inversion thesis: “Design for disconnected, enhance for connected” applies directly—self-measurement must function in complete isolation, with central reporting as enhancement when connectivity permits
Self-measurement is the sensory system of autonomic architecture. Just as organisms must sense their internal state before they can respond, edge systems must measure their own health before they can heal. This part develops the engineering principles for that measurement capability.
Theoretical Contributions
This article develops the theoretical foundations for self-measurement in distributed systems under contested connectivity. We make the following contributions:
- Local Anomaly Detection Framework: We formalize the anomaly detection problem as hypothesis testing under resource constraints, establishing optimal threshold selection as a function of asymmetric error costs.
- Gossip-Based Health Propagation: We derive convergence bounds for epidemic protocols in partially-connected networks, proving \(O(\ln n)\) propagation time under standard assumptions.
- Staleness-Confidence Theory: We model health state evolution as a stochastic process and derive the maximum useful staleness for decision-making, establishing the relationship between observation age and confidence degradation.
- Byzantine-Tolerant Aggregation: We extend weighted voting mechanisms to handle adversarial nodes, providing trust-decay models that detect and isolate compromised participants.
- Observability Constraint Sequence: We establish a priority ordering for measurement capabilities based on failure cost analysis, providing resource allocation guidelines for constrained systems.
These contributions connect to and extend prior work on fault detection in distributed systems (Cristian, 1991), epidemic algorithms (Demers et al., 1987), and autonomic computing (Kephart & Chess, 2003), adapting these frameworks for the specific challenges of contested edge environments.
Opening Narrative: OUTPOST Under Observation
Early morning. OUTPOST BRAVO’s 127-sensor perimeter mesh has been operating for 43 days. Without warning, the satellite uplink goes dark—no graceful degradation. Seconds later, Sensor 47 stops reporting. Last transmission: routine, battery at 73%, mesh connectivity strong. Then silence.
OUTPOST needs to answer: how do you diagnose this failure without external systems?
- Hardware failure: Route around the sensor
- Communication failure: Attempt alternative paths
- Environmental occlusion: Wait and retry
- Adversarial action: Alert defensive posture
Each diagnosis implies different response. Without central observability, OUTPOST must diagnose itself—analyze patterns, correlate with neighbors, assess probabilities, decide on response. All locally. All autonomously.
This is self-measurement: assessing health and diagnosing anomalies without external assistance. You can’t heal what you haven’t diagnosed, and you can’t diagnose what you haven’t measured.
The Self-Measurement Challenge
Cloud-native observability assumes continuous connectivity:
graph LR
A[Metrics] -->|"network"| B[Collector]
B -->|"network"| C[Storage]
C -->|"network"| D[Analysis]
D -->|"network"| E[Alerting]
E -->|"network"| F[Human Operator]
F -->|"network"| G[Remediation]
style A fill:#e8f5e9
style F fill:#ffcdd2
linkStyle 0,1,2,3,4,5 stroke:#f44336,stroke-width:2px,stroke-dasharray: 5 5
Every arrow represents a network call. For edge systems, this architecture fails at the first arrow—when connectivity is denied, the entire observability pipeline is severed.
The edge alternative inverts the data flow:
graph LR
A[Local Sensors] --> B[Local Analyzer]
B --> C[Health State]
C --> D[Autonomic Controller]
D --> E[Self-Healing Action]
E -->|"feedback"| A
style A fill:#e8f5e9
style B fill:#c8e6c9
style C fill:#fff9c4
style D fill:#ffcc80
style E fill:#ffab91
Analysis happens locally. Alerting goes to an autonomic controller, not human operators. The loop closes locally without external connectivity.
| Aspect | Cloud Observability | Edge Self-Measurement |
|---|---|---|
| Analysis location | Central service | Local device |
| Alerting target | Human operator | Autonomic controller |
| Training data | Abundant historical data | Limited local samples |
| Ground truth | Labels from past incidents | Uncertain, inferred |
| Compute budget | Elastic (scale up) | Fixed (device limits) |
| Memory budget | Practically unlimited | Constrained (MB range) |
| Response latency | Minutes acceptable | Seconds required |
Analysis must happen locally, and alerting must be autonomous. You can’t wait for human operators or external analysis services. The system must detect, diagnose, and decide—all within the constraints of local compute and memory.
Local Anomaly Detection
The Detection Problem
At its core, anomaly detection is a signal detection problem. The sensor produces a stream of values \(x_1, x_2, \ldots, x_t, \ldots\). At each timestep, the local analyzer must decide: is this observation normal, or anomalous?
This is a binary classification under uncertainty:
- \(H_0\) (null hypothesis): The observation is from the normal distribution
- \(H_1\) (alternative): The observation is from an anomalous process
Definition 4 (Local Anomaly Detection Problem). Given a time series \(\{x_t\}_{t \geq 0}\) generated by process \(P\), the local anomaly detection problem is to determine, for each observation \(x_t\), whether \(P\) has transitioned from nominal behavior \(P_0\) to anomalous behavior \(P_1\), subject to:
- Computational budget \(O(1)\) per observation
- Memory budget \(O(m)\) for fixed \(m\)
- No access to ground truth labels
- Real-time decision requirement
The challenge is performing this classification:
- In real-time, on-device
- With limited compute and memory
- Without access to comprehensive training data
- Without ground truth labels for recent observations
| Constraint | Cloud Detection | Edge Detection |
|---|---|---|
| Compute | GPU clusters, distributed | Single CPU, milliwatts |
| Memory | Terabytes for models | Megabytes for everything |
| Training data | Petabytes historical | Days of local history |
| Ground truth | Labels from incident response | Inference from outcomes |
| FP cost | Human review time | Unnecessary healing action |
| FN cost | Delayed response | Undetected failure, potential loss |
The asymmetry of costs is critical. A false positive triggers an unnecessary healing action—wasteful but recoverable. A false negative leaves a failure undetected—potentially catastrophic in contested environments where undetected failures cascade.
Statistical Approaches
Edge anomaly detection requires algorithms that are:
- Computationally lightweight: O(1) per observation
- Memory-efficient: Constant or logarithmic memory
- Adaptive: Adjust to changing baselines without retraining
- Interpretable: Provide confidence, not just binary classification
Three approaches meet these requirements:
Exponential Weighted Moving Average (EWMA)
The simplest effective approach. Maintain running estimates of mean and variance:
\[
\mu_t = \alpha x_t + (1 - \alpha)\mu_{t-1}, \qquad \sigma_t^2 = \alpha (x_t - \mu_{t-1})^2 + (1 - \alpha)\sigma_{t-1}^2
\]
Where \(\alpha \in (0, 1)\) controls the decay rate. Smaller \(\alpha\) means longer memory. Note: the variance update uses \(\mu_{t-1}\) to keep the estimate independent of \(x_t\), consistent with the anomaly score calculation.
The anomaly score normalizes deviation by the running standard deviation:
\[
z_t = \frac{|x_t - \mu_{t-1}|}{\sigma_{t-1}}
\]
Flag as anomaly if \(z_t > \theta\), where \(\theta\) is typically 2-3 standard deviations.
Proposition 3 (Optimal Anomaly Threshold). Given asymmetric error costs \(C_{\text{FP}}\) for false positives and \(C_{\text{FN}}\) for false negatives, the optimal detection threshold \(\theta^*\) satisfies the likelihood ratio condition:
\[
\frac{p(\theta^* \mid H_1)}{p(\theta^* \mid H_0)} = \frac{C_{\text{FP}}}{C_{\text{FN}}}
\]
For tactical edge systems where \(C_{\text{FN}} \gg C_{\text{FP}}\) (missed failures are catastrophic), the optimal threshold shifts toward more sensitive detection at the cost of increased false positives.
Proof sketch: The expected cost is \(C_{\text{FP}} \cdot P_{\text{FP}}(\theta) + C_{\text{FN}} \cdot P_{\text{FN}}(\theta)\). Differentiating with respect to \(\theta\) and setting the derivative to zero yields the Neyman-Pearson likelihood ratio condition above.
- Compute: O(1) per observation (two multiply-adds)
- Memory: O(1) (store \(\mu\), \(\sigma^2\))
- Adaptation: Automatic through exponential decay
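A minimal sketch of this detector, assuming a generic scalar metric stream; the class name, default parameters, and the voltage example are illustrative, not drawn from any fielded RAVEN or OUTPOST code:

```python
class EWMADetector:
    """Streaming EWMA anomaly detector: O(1) compute and memory per observation."""

    def __init__(self, alpha=0.1, threshold=2.5, initial_value=0.0, initial_var=1.0):
        self.alpha = alpha            # decay rate: smaller alpha = longer memory
        self.threshold = threshold    # anomaly threshold in standard deviations
        self.mean = initial_value     # running mean estimate (mu)
        self.var = initial_var        # running variance estimate (sigma^2)

    def update(self, x):
        """Score the new observation against the prior estimates, then update them."""
        sigma = self.var ** 0.5
        # Score uses the *previous* mean/variance so the estimate is independent of x
        z = abs(x - self.mean) / sigma if sigma > 0 else 0.0
        is_anomaly = z > self.threshold
        # EWMA updates (variance uses the pre-update mean, as in the text)
        self.var = self.alpha * (x - self.mean) ** 2 + (1 - self.alpha) * self.var
        self.mean = self.alpha * x + (1 - self.alpha) * self.mean
        return z, is_anomaly


# Example: a battery-voltage stream with a sudden sag at t = 50
detector = EWMADetector(alpha=0.1, threshold=2.5, initial_value=12.6, initial_var=0.01)
for t in range(100):
    reading = 12.6 if t < 50 else 11.8   # abrupt voltage drop
    score, flagged = detector.update(reading)
    if flagged:
        print(f"t={t}: z={score:.1f} -> anomaly")
```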
Holt-Winters for Seasonal Patterns
For signals with periodic structure (day/night cycles, shift patterns), Holt-Winters captures level, trend, and seasonality:
\[
\begin{aligned}
L_t &= \alpha (x_t - S_{t-p}) + (1 - \alpha)(L_{t-1} + T_{t-1}) \\
T_t &= \beta (L_t - L_{t-1}) + (1 - \beta) T_{t-1} \\
S_t &= \gamma (x_t - L_t) + (1 - \gamma) S_{t-p}
\end{aligned}
\]
Where \(L_t\) is level, \(T_t\) is trend, \(S_t\) is the seasonal component, \(p\) is the period length, and \(\alpha, \beta, \gamma \in (0,1)\) are smoothing factors.
- Compute: O(1) per observation
- Memory: O(p) to store one period of seasonal factors
- Adaptation: Continuous updates to all components
Period examples by scenario:
- RAVEN: p=1 (no meaningful seasonality in flight telemetry)—use EWMA instead
- CONVOY: p=24 hours for communication quality (terrain/atmospheric effects), p=8 hours for engine metrics (thermal cycles)
- OUTPOST: p=24 hours for solar/thermal cycles, p=7 days for activity patterns near defended perimeter
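A minimal sketch of the additive Holt-Winters update described above. The smoothing constants, the hourly sampling, and the toy link-quality series are illustrative assumptions, not CONVOY-calibrated values:

```python
def holt_winters_update(x, level, trend, seasonal, t, p,
                        alpha=0.2, beta=0.05, gamma=0.1):
    """One additive Holt-Winters step.

    seasonal is a list of length p holding one full period of seasonal factors;
    t indexes the current observation, p is the period length (e.g. 24 for hourly
    samples with a daily cycle).
    """
    s_old = seasonal[t % p]
    new_level = alpha * (x - s_old) + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    seasonal[t % p] = gamma * (x - new_level) + (1 - gamma) * s_old
    forecast = new_level + new_trend + seasonal[(t + 1) % p]  # one-step-ahead prediction
    return new_level, new_trend, forecast


# CONVOY-style usage: p = 24 hourly samples for communication quality
p = 24
level, trend = 0.8, 0.0
seasonal = [0.0] * p
for t, x in enumerate([0.8, 0.82, 0.79, 0.75] * 12):   # toy link-quality series
    level, trend, forecast = holt_winters_update(x, level, trend, seasonal, t, p)
    # the residual (x - forecast) can feed the same z-score test used for EWMA
```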
Isolation Forest Sketch for Multivariate
For multivariate anomaly detection with limited memory, a streaming isolation forest maintains a sketch. The anomaly score for a point \(x\) over a sample of size \(n\) is:
\[
s(x, n) = 2^{-\frac{\mathbb{E}[h(x)]}{c(n)}}
\]
Where \(h(x)\) is the path length to isolate \(x\), and \(c(n)\) is the average path length in a random tree; scores near 1 indicate anomalies.
- Compute: O(log n) per query, O(t) per tree
- Memory: O(t × d) for t trees with depth limit d
- Adaptation: Reservoir sampling for tree updates
Concrete parameters for CONVOY: t=50 trees, d=8 depth limit, sample_size=128, contamination=0.02 (expected 2% anomaly rate). This configuration uses ~25KB memory and achieves 85% detection rate with 3% false positive rate on multi-sensor telemetry (engine, transmission, suspension combined).
CUSUM for Change-Point Detection
When the goal is detecting when a change occurred (not just that it occurred), the Cumulative Sum (CUSUM) statistic provides optimal detection for shifts in mean:
\[
S_t = \max\!\left(0,\; S_{t-1} + (x_t - \mu_0 - k)\right), \qquad S_0 = 0
\]
where \(\mu_0\) is the nominal mean and \(k\) is the allowable slack. Alarm when \(S_t > h\). CUSUM detects sustained shifts faster than EWMA but is more sensitive to the choice of \(k\). For RAVEN flight telemetry, CUSUM with \(k = 0.5\sigma\) detects motor degradation 15-20% faster than EWMA, at the cost of a 10% higher false positive rate.
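A sketch of the one-sided CUSUM test; \(\mu_0\), \(k\), \(h\), and the synthetic motor-current stream are illustrative values rather than calibrated RAVEN parameters:

```python
import random

def cusum_detect(samples, mu0, k, h):
    """One-sided CUSUM for upward shifts in mean.

    Returns the index at which the alarm fires (S_t > h), or None.
    A mirrored statistic on (mu0 - x - k) would catch downward shifts.
    """
    s = 0.0
    for t, x in enumerate(samples):
        s = max(0.0, s + (x - mu0 - k))   # accumulate evidence of a sustained shift
        if s > h:
            return t
    return None


# Motor-current example: nominal mean 4.0 A, sigma ~ 0.2 A, +0.3 A shift at t = 60
random.seed(1)
stream = [random.gauss(4.0, 0.2) for _ in range(60)] + \
         [random.gauss(4.3, 0.2) for _ in range(60)]
alarm = cusum_detect(stream, mu0=4.0, k=0.5 * 0.2, h=8 * 0.2)
print("alarm index:", alarm)   # typically fires within ~5-10 samples of the shift
```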
Concrete Error Rates
For RAVEN with anomaly threshold \(\theta = 2.5\sigma\) and base anomaly rate 2%:
- False Positive Rate: 1.2% (healthy flagged as anomaly)
- False Negative Rate: 8% (anomaly missed)
- Detection latency: 3-5 observations (15-25 seconds at 0.2 Hz sampling)
OUTPOST Sensor 47 uses EWMA for primary detection: temperature, motion intensity, battery voltage each tracked independently. Cross-sensor correlation uses a lightweight covariance estimate between Sensor 47 and its mesh neighbors.
Distinguishing Failure Modes
Detection answers “is something wrong?” Diagnosis answers “what is wrong?”
For Sensor 47’s silence, the fusion node must distinguish:
Sensor hardware failure:
- Signature: Gradual degradation before silence (increasing noise, drifting calibration)
- Correlation: Neighboring sensors unaffected
- Battery trend: Unusual power consumption before failure
Communication failure:
- Signature: Abrupt silence, no prior degradation
- Correlation: Multiple sensors in same mesh region affected
- Path analysis: Common relay nodes show degradation
Environmental occlusion:
- Signature: Specific sensor types affected (e.g., optical but not acoustic)
- Correlation: Geographic pattern (flooding, debris)
- Recovery pattern: Intermittent function as conditions change
Adversarial action:
- Signature: Precise silence, no RF emissions
- Correlation: Tactical pattern (sensors on approach path silenced)
- Timing: Coordinated with other events
The fusion node maintains causal models for each failure mode. Given observed evidence \(E\), Bayesian inference estimates the posterior probability of each candidate cause:
\[
P(\text{cause} \mid E) = \frac{P(E \mid \text{cause})\, P(\text{cause})}{\sum_{c} P(E \mid c)\, P(c)}
\]
Priors \(P(\text{cause})\) come from historical failure rates. Likelihoods \(P(E \mid \text{cause})\) come from the signature patterns.
For Sensor 47:
- Abrupt silence (no degradation): Weights against hardware failure
- Neighbors functioning normally: Weights against communication failure
- Single sensor affected: Weights against environmental occlusion
- Location on approach path: Weights toward adversarial action
The diagnosis is probabilistic, not certain. Self-measurement provides confidence levels, not ground truth.
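A sketch of the posterior computation for Sensor 47's silence. The priors and likelihoods are illustrative placeholders; deployed values would come from historical failure rates and the signature models above:

```python
# Candidate causes with illustrative priors (historical failure rates)
priors = {
    "hardware_failure": 0.40,
    "comm_failure": 0.30,
    "environmental": 0.20,
    "adversarial": 0.10,
}

# Likelihood of the observed evidence under each cause:
# E = {abrupt silence, neighbors healthy, single sensor, on approach path}
likelihoods = {
    "hardware_failure": 0.05,   # abrupt silence without prior degradation is atypical
    "comm_failure": 0.02,       # neighbors sharing the mesh region are unaffected
    "environmental": 0.03,      # no geographic pattern, single sensor only
    "adversarial": 0.30,        # consistent with precise, coordinated silencing
}

def posterior(priors, likelihoods):
    """Bayes rule: P(cause | E) = P(E | cause) P(cause) / sum_c P(E | c) P(c)."""
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())
    return {c: joint[c] / evidence for c in joint}

for cause, p in sorted(posterior(priors, likelihoods).items(), key=lambda kv: -kv[1]):
    print(f"{cause:18s} {p:.2f}")
# adversarial comes out on top, but with substantial residual uncertainty;
# the diagnosis is a confidence level, not ground truth
```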
Distributed Health Inference
Gossip-Based Health Propagation
Individual nodes detect local anomalies. Fleet-wide health requires aggregation without a central coordinator.
Definition 5 (Gossip Health Protocol). A gossip health protocol is a tuple \((H, \lambda, M, T)\) where:
- \(H = [h_1, \ldots, h_n]\) is the health vector over \(n\) nodes
- \(\lambda\) is the gossip rate (exchanges per second per node)
- \(M: H \times H \rightarrow H\) is the merge function
- \(T: \mathbb{R}^+ \rightarrow \mathbb{R}^+\) is the staleness decay function
Gossip protocols solve this problem. Each node maintains a health vector:
\[
H = [h_1, h_2, \ldots, h_n]
\]
Where \(h_i\) is node \(i\)'s estimated health state.
The protocol operates in rounds:
- Local update: Node \(i\) updates \(h_i\) based on local anomaly detection
- Peer selection: Node \(i\) selects random peer \(j\)
- Exchange: Nodes \(i\) and \(j\) exchange health vectors
- Merge: Each node merges received vector with local knowledge
graph LR
subgraph Before Exchange
A1["Node A: H_A"] -.->|"sends H_A"| B1["Node B: H_B"]
B1 -.->|"sends H_B"| A1
end
subgraph After Merge
A2["Node A: merge(H_A, H_B)"]
B2["Node B: merge(H_A, H_B)"]
end
A1 --> A2
B1 --> B2
style A1 fill:#e8f5e9
style B1 fill:#e3f2fd
style A2 fill:#c8e6c9
style B2 fill:#bbdefb
The merge function must handle:
- Staleness: Older observations are less reliable
- Conflicts: Different nodes may observe different values
- Adversarial injection: Compromised nodes may inject false health values
A weighted merge using timestamp-based staleness:
\[
h_j^{\text{merged}} = \frac{w_{\text{local}}\, h_j^{\text{local}} + w_{\text{remote}}\, h_j^{\text{remote}}}{w_{\text{local}} + w_{\text{remote}}}
\]
Where weights decay with staleness:
\[
w = e^{-\gamma \tau}
\]
With \(\tau\) as the time since observation and \(\gamma\) as the decay rate (distinct from the gossip rate \(\lambda\)).
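A sketch of the staleness-weighted merge, assuming each health entry is a (value, observation timestamp) pair; the record layout and the \(\gamma\) value are illustrative:

```python
import math
import time

GAMMA = 0.05   # staleness decay rate (per second); distinct from the gossip rate

def staleness_weight(observed_at, now, gamma=GAMMA):
    """Weight decays exponentially with the age of the observation."""
    return math.exp(-gamma * max(0.0, now - observed_at))

def merge_health(local, remote, now=None):
    """Merge two health vectors: {node_id: (health_value, observed_at_timestamp)}.

    Numeric health values are combined with staleness-weighted averaging;
    the merged entry keeps the fresher timestamp.
    """
    now = now if now is not None else time.time()
    merged = {}
    for node_id in set(local) | set(remote):
        if node_id not in remote:
            merged[node_id] = local[node_id]
        elif node_id not in local:
            merged[node_id] = remote[node_id]
        else:
            (hl, tl), (hr, tr) = local[node_id], remote[node_id]
            wl, wr = staleness_weight(tl, now), staleness_weight(tr, now)
            value = (wl * hl + wr * hr) / (wl + wr)
            merged[node_id] = (value, max(tl, tr))
    return merged


# Example: node A's 2-second-old reading dominates node B's 40-second-old one
now = 1000.0
local  = {"drone-7": (0.90, now - 2.0)}
remote = {"drone-7": (0.40, now - 40.0), "drone-9": (0.75, now - 5.0)}
print(merge_health(local, remote, now))
```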
Proposition 4 (Gossip Convergence). For a gossip protocol with rate \(\lambda\) and \(n\) nodes, the expected time for information originating at one node to reach all nodes is:
\[
T_{\text{converge}} = \frac{2 \ln(n-1)}{\lambda} = O\!\left(\frac{\ln n}{\lambda}\right)
\]
Proof sketch: The information spread follows logistic dynamics \(dI/dt = \lambda I(1 - I)\), where \(I\) is the fraction of informed nodes. Solving with initial condition \(I(0) = 1/n\) and computing the time to reach \(I = 1 - 1/n\) yields \(T = 2\ln(n-1)/\lambda\).
Corollary 2. Doubling the swarm size adds only \(2\ln 2 / \lambda \approx 1.39/\lambda\) seconds to convergence time, making gossip protocols inherently scalable for edge fleets.
For tactical parameters (\(n \sim 50\), \(\lambda \sim 0.2\) Hz), the formula yields \(T = 2\ln(49)/0.2 \approx 39\) seconds—convergence within 30-40 seconds, fast enough to establish fleet-wide health awareness within a single mission phase. Broadcast approaches scale linearly with \(n\), which is why gossip wins at scale.
Priority-Weighted Gossip Extension
Standard gossip treats all health updates equally. In tactical environments, critical health changes (node failure, resource exhaustion, adversarial detection) should propagate faster than routine updates.
Priority classification:
- \(P_{CRITICAL}\) (priority 3): Node failure, Byzantine detection, adversarial alert
- \(P_{URGENT}\) (priority 2): Resource exhaustion (<10%), capability downgrade
- \(P_{NORMAL}\) (priority 1): Routine health updates, minor degradation
Accelerated propagation protocol:
For higher-priority messages, the gossip rate is multiplied by an acceleration factor \(\eta\) (typically 2-3), so critical messages gossip at 2-3\(\times\) the normal rate.
Message prioritization in constrained bandwidth:
When bandwidth is limited, each gossip exchange prioritizes by urgency. The protocol proceeds as follows:
Step 1: Merge local and peer health vectors into a unified update set.
Step 2: Sort updates by priority (descending), then by staleness (ascending) within each priority class.
Step 3: Transmit updates in the sorted order until the bandwidth budget for this exchange is exhausted.
Step 4: Critical override: always include \(P_{\text{CRITICAL}}\) updates even if doing so exceeds the budget.
This ensures safety-critical information propagates regardless of bandwidth constraints, accepting a temporary budget overrun. A sketch of this selection logic appears below.
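A sketch of the priority-ordered selection in Steps 2-4, assuming each update record carries a priority, staleness, and encoded size; the field names and byte budget are illustrative:

```python
P_CRITICAL, P_URGENT, P_NORMAL = 3, 2, 1

def select_updates(updates, budget_bytes):
    """Pick updates for one gossip exchange under a bandwidth budget.

    updates: list of dicts with keys 'priority', 'staleness_s', 'size_bytes'.
    Sort by priority (descending), then freshest first within a priority class;
    fill the budget in that order, but always include CRITICAL updates even if
    doing so overruns the budget.
    """
    ordered = sorted(updates, key=lambda u: (-u["priority"], u["staleness_s"]))
    selected, used = [], 0
    for u in ordered:
        if used + u["size_bytes"] <= budget_bytes:
            selected.append(u)
            used += u["size_bytes"]
        elif u["priority"] == P_CRITICAL:
            selected.append(u)          # critical override: accept budget overrun
            used += u["size_bytes"]
    return selected


updates = [
    {"node": "s12", "priority": P_NORMAL,   "staleness_s": 4,  "size_bytes": 64},
    {"node": "s47", "priority": P_CRITICAL, "staleness_s": 1,  "size_bytes": 96},
    {"node": "s03", "priority": P_URGENT,   "staleness_s": 9,  "size_bytes": 64},
    {"node": "s20", "priority": P_NORMAL,   "staleness_s": 30, "size_bytes": 64},
]
print([u["node"] for u in select_updates(updates, budget_bytes=240)])
# -> ['s47', 's03', 's12'] (s20 dropped; a critical update would be kept even over budget)
```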
Convergence improvement: For RAVEN with \(\eta = 2\), critical updates converge in ~15 seconds (vs. 39 seconds for normal updates)—a 2.6× speedup for time-sensitive health information.
Anti-flood protection: To prevent priority abuse (a Byzantine node flooding \(P_{\text{CRITICAL}}\) messages), rate-limit critical messages per source:
\[
\text{rate}_i(P_{\text{CRITICAL}}) \leq \rho_{\text{max}}
\]
where \(\rho_{\text{max}} \approx 0.01\) messages/second. Exceeding this rate triggers trust decay.
Gossip Under Partition
When the fleet partitions into disconnected clusters, gossip behavior changes fundamentally. Within each cluster, convergence continues normally. Between clusters, health state diverges.
Remark (Partition Staleness). For node \(i\) in cluster \(C_1\) observing node \(j\) in cluster \(C_2\), staleness (the elapsed time since observation) accumulates from the partition time \(t_p\):
\[
\tau_{ij}(t) = t - t_p \quad \text{for } t > t_p
\]
The staleness grows without bound during the partition, eventually exceeding any useful threshold.
graph LR
subgraph Cluster_A["Cluster A (gossip active)"]
A1[Node 1] --- A2[Node 2]
A2 --- A3[Node 3]
A1 --- A3
end
subgraph Cluster_B["Cluster B (gossip active)"]
B1[Node 4] --- B2[Node 5]
B2 --- B3[Node 6]
B1 --- B3
end
A3 -.-x|"PARTITION
No communication"| B1
style Cluster_A fill:#e8f5e9
style Cluster_B fill:#e3f2fd
Cross-cluster state tracking:
Each node maintains a partition vector \(\rho_i\) tracking the last known connectivity state to each other node:
\[
\rho_i[j] = \text{timestamp of node } i\text{'s last confirmed contact with node } j,\;\text{or } 0 \text{ if never observed}
\]
When \(\rho_i[j] > 0\) and \(t - \rho_i[j] > \tau_{\text{max}}\), node \(i\) marks its knowledge of node \(j\) as uncertain rather than merely stale.
Reconciliation priority:
Upon reconnection, nodes exchange partition vectors. The reconciliation priority for node \(j\)'s state is proportional to divergence duration:
\[
\text{Priority}(j) \propto (t - \rho_i[j]) \cdot w_j
\]
where \(w_j\) is an importance weight. Nodes with the longest partition duration and highest importance (cluster leads, critical sensors) reconcile first.
Confidence Intervals on Stale Data
Health observations age. A drone last heard from 30 seconds ago may have changed state since then.
Definition 6 (Staleness). The staleness \(\tau\) of an observation is the elapsed time since the observation was made. An observation with staleness \(\tau\) has uncertainty that grows with \(\tau\) according to the underlying state dynamics.
Model health as a stochastic process. If health evolves with variance \(\sigma^2\) per unit time, the confidence interval on stale data is:
\[
h(t) \in h_{\text{last}} \pm z_{\alpha/2}\, \sigma \sqrt{\tau}
\]
Where:
- \(h_{\text{last}}\) = last observed health value
- \(\tau\) = time since observation
- \(\sigma\) = health volatility parameter
- \(z_{\alpha/2}\) = confidence multiplier (1.96 for 95%)
Implications for decision-making:
The CI width grows as \(\sqrt{\tau}\)—a consequence of the Brownian motion model. This square-root scaling means confidence degrades slowly at first but accelerates with staleness.
When the CI spans a decision threshold (like the L2 capability boundary), you can’t reliably commit to that capability level. The staleness has exceeded the decision horizon for that threshold—the maximum time at which stale data can support the decision.
Different decisions have different horizons. Safety-critical decisions with narrow margins have short horizons. Advisory decisions with wide margins have longer horizons. The system tracks staleness against the relevant horizon for each decision type.
Response strategies when confidence is insufficient:
- Active probe: Attempt direct communication to get fresh observation
- Conservative fallback: Assume health at lower bound of CI
- Escalate observation priority: Increase gossip rate for this node
Proposition 5 (Maximum Useful Staleness). For a health process with volatility \(\sigma\) and a decision requiring discrimination at precision \(\Delta h\) with confidence \(1 - \alpha\), the maximum useful staleness is:
\[
\tau_{\text{max}} = \left(\frac{\Delta h}{z_{\alpha/2}\, \sigma}\right)^2
\]
where \(z_{\alpha/2}\) is the standard normal quantile. Beyond \(\tau_{\text{max}}\), the confidence interval spans the decision threshold and the observation cannot support the decision.
Proof: Follows directly from the Brownian motion model \(dh = \sigma\, dW\), which yields variance \(\sigma^2 \tau\) after elapsed time \(\tau\). Setting the CI half-width \(z_{\alpha/2}\sigma\sqrt{\tau}\) equal to \(\Delta h\) and solving for \(\tau\) gives the result.
Corollary 3. The quadratic relationship \(\tau_{\text{max}} \propto (\Delta h / \sigma)^2\) implies that tightening decision margins dramatically reduces useful staleness. Systems with narrow operating envelopes require proportionally higher observation frequency.
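A sketch that evaluates the confidence interval and Proposition 5's maximum useful staleness; the volatility and decision margin are illustrative numbers:

```python
import math

Z = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}   # standard normal quantiles z_{alpha/2}

def confidence_interval(h_last, tau_s, sigma, confidence=0.95):
    """CI on a stale health value under the Brownian-motion model: width grows as sqrt(tau)."""
    half_width = Z[confidence] * sigma * math.sqrt(tau_s)
    return h_last - half_width, h_last + half_width

def max_useful_staleness(delta_h, sigma, confidence=0.95):
    """tau_max = (delta_h / (z * sigma))^2: beyond this, the CI spans the decision margin."""
    return (delta_h / (Z[confidence] * sigma)) ** 2


# Illustrative drone-health parameters: volatility 0.01 health-units/sqrt(s),
# decision margin 0.1 around an L2 capability boundary
sigma, delta_h = 0.01, 0.10
print(confidence_interval(h_last=0.72, tau_s=30, sigma=sigma))   # ~ (0.61, 0.83)
print(max_useful_staleness(delta_h, sigma))                      # ~ 26 seconds
```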
Byzantine-Tolerant Health Aggregation
In contested environments, some nodes may be compromised. They may inject false health values to:
- Mask their own degradation (hide compromise)
- Cause healthy nodes to appear degraded (create confusion)
- Destabilize fleet-wide health estimates (denial of service)
Definition 7 (Byzantine Node). A node is Byzantine if it may deviate arbitrarily from the protocol specification, including sending different values to different peers, reporting false observations, or selectively participating in gossip rounds.
Weighted voting based on trust scores:
\[
\hat{h}_k = \frac{\sum_i T_i\, h_k^{(i)}}{\sum_i T_i}
\]
Where \(h_k^{(i)}\) is node \(i\)'s report of node \(k\)'s health and \(T_i\) is the trust score of node \(i\). Trust is earned through consistent, verifiable behavior and decays when inconsistencies are detected.
Outlier detection on received health reports:
If node \(i\) reports health for node \(k\) that differs significantly from the trust-weighted consensus, flag the report as suspicious:
\[
\left| h_k^{(i)} - \hat{h}_k \right| > \kappa\, \sigma_k
\]
where \(\sigma_k\) is the spread of reports about node \(k\) and \(\kappa\) is a sensitivity threshold.
Repeated suspicious reports decrease trust score for node \(i\).
Isolation protocol for nodes with inconsistent claims:
- Track history of claims per node
- Compute consistency score: fraction of claims matching consensus
- If consistency below threshold, quarantine node from health aggregation
- Quarantined nodes can still participate but their reports are not trusted
Proposition 6 (Byzantine Tolerance Bound). With trust-weighted aggregation, correct health estimation is maintained if the total Byzantine trust weight is bounded:
\[
\sum_{i \in \mathcal{B}} T_i < \frac{1}{3} \sum_{i=1}^{n} T_i
\]
where \(\mathcal{B}\) is the set of Byzantine nodes. This generalizes the classical \(f < n/3\) bound: with uniform trust, it reduces to \(f < n/3\). With trust decay on suspicious nodes, Byzantine influence decreases over time, allowing tolerance of more compromised nodes provided their accumulated trust is low.
This is not foolproof—a sophisticated adversary who understands the aggregation mechanism can craft attacks that pass consistency checks. Byzantine tolerance provides defense in depth, not absolute security.
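A sketch of trust-weighted aggregation with consensus-based outlier flagging; the trust scores, the deviation multiplier, and the decay step are illustrative:

```python
def aggregate_health(reports, trust):
    """Trust-weighted estimate of one node's health.

    reports: {reporter_id: reported_health}, trust: {reporter_id: trust in [0, 1]}.
    """
    total = sum(trust[r] for r in reports)
    return sum(trust[r] * h for r, h in reports.items()) / total

def flag_outliers(reports, trust, kappa=2.0):
    """Flag reporters whose claim deviates strongly from the weighted consensus."""
    consensus = aggregate_health(reports, trust)
    spread = (sum(trust[r] * (h - consensus) ** 2 for r, h in reports.items())
              / sum(trust.values())) ** 0.5 or 1e-6
    return [r for r, h in reports.items() if abs(h - consensus) > kappa * spread]


# Reports about one node's health from four peers; "n4" is lying
reports = {"n1": 0.82, "n2": 0.80, "n3": 0.79, "n4": 0.20}
trust   = {"n1": 1.00, "n2": 0.95, "n3": 0.90, "n4": 0.60}

suspicious = flag_outliers(reports, trust)
for r in suspicious:               # repeated suspicious reports decay trust
    trust[r] *= 0.9
print(aggregate_health(reports, trust), suspicious)
```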
Trust Recovery Mechanisms
Trust decay handles misbehaving nodes, but legitimate nodes may be temporarily compromised (e.g., sensor interference, transient fault) and later recover. A purely decaying trust model permanently punishes temporary failures.
Trust recovery model:
Trust evolves according to a mean-reverting process with decay for misbehavior and recovery for consistent behavior:
\[
T_i(t+1) =
\begin{cases}
(1 - \gamma_{\text{decay}})\, T_i(t) & \text{if node } i \text{ misbehaved in round } t \\
T_i(t) + \gamma_{\text{recover}}\, \big(1 - T_i(t)\big) & \text{if node } i \text{ behaved consistently}
\end{cases}
\]
where \(\gamma_{\text{decay}} \approx 0.1\) (fast decay) and \(\gamma_{\text{recover}} \approx 0.01\) (slow recovery). The asymmetry ensures that building trust takes longer than losing it, which is appropriate for contested environments.
Recovery conditions:
A node must demonstrate sustained consistent behavior before trust recovery activates:
\[
\frac{1}{W} \sum_{r = t - W + 1}^{t} \mathbb{1}\!\left[\text{report}_r \text{ consistent with consensus}\right] \geq \theta_{\text{recovery}}
\]
where \(W\) is typically 50-100 gossip rounds and \(\theta_{\text{recovery}} \approx 0.95\). A node with even 5% inconsistent reports continues decaying.
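A sketch of the asymmetric decay/recovery update gated by the consistency window; the constants follow the values quoted above, while the window bookkeeping is an illustrative choice:

```python
from collections import deque

GAMMA_DECAY, GAMMA_RECOVER = 0.10, 0.01
WINDOW, THETA_RECOVERY = 100, 0.95

class TrustTracker:
    """Per-node trust with fast decay on misbehavior and slow, gated recovery."""

    def __init__(self, initial_trust=1.0):
        self.trust = initial_trust
        self.history = deque(maxlen=WINDOW)   # recent consistency flags (True/False)

    def update(self, report_consistent):
        self.history.append(report_consistent)
        if not report_consistent:
            self.trust *= (1.0 - GAMMA_DECAY)                  # fast decay
            return self.trust
        window_full = len(self.history) == WINDOW
        consistency = sum(self.history) / len(self.history)
        if window_full and consistency >= THETA_RECOVERY:
            self.trust += GAMMA_RECOVER * (1.0 - self.trust)   # slow recovery
        return self.trust


# Vehicle V3: 10 bad rounds during GPS interference, then sustained good behavior
v3 = TrustTracker()
for _ in range(10):
    v3.update(False)          # trust collapses quickly (to ~0.35)
for _ in range(300):
    v3.update(True)           # recovery starts only once the window is clean enough
print(round(v3.trust, 2))
```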
Sybil attack resistance:
An adversary creating multiple fake identities (Sybil attack) can attempt to dominate the trust-weighted aggregation. Countermeasures:
- Identity binding: Nodes must prove identity through cryptographic challenge-response or physical attestation (GPS position consistency over time)
- Trust inheritance limits: New nodes start with \(T_{\text{initial}} = T_{\text{sponsor}} \cdot \beta\) where \(\beta < 0.5\). No node can spawn high-trust children.
- Global trust budget: Total trust across all nodes is bounded: \(\sum_i T_i \leq T_{\text{budget}}\) for a fixed budget \(T_{\text{budget}}\). New node admission requires either trust redistribution or explicit authorization.
- Behavioral clustering: Nodes exhibiting suspiciously correlated behavior (same reports, same timing) are grouped and treated as a single trust entity in the aggregation, so the cluster's combined weight is that of one node.
Trust recovery example:
CONVOY vehicle V3 experiences temporary GPS interference causing inconsistent position reports for 10 minutes. Trust drops from 1.0 to 0.35 during interference. After interference clears:
- Minutes 0-5: Consistent reports, trust rises to 0.42
- Minutes 5-15: Continued consistency, trust rises to 0.58
- Minutes 15-30: Trust rises to 0.78
- After 1 hour of consistency: Trust returns to 0.95
The slow recovery prevents adversaries from rapidly cycling between attack and “good behavior” phases.
The Observability Constraint Sequence
Hierarchy of Observability
With limited resources, what should be measured first?
The observability constraint sequence prioritizes metrics by importance:
| Level | Category | Examples | Resource Cost |
|---|---|---|---|
| P0 | Availability | Is it alive? Responding? | Minimal (heartbeat) |
| P1 | Resource exhaustion | Power, memory, storage remaining | Low (counters) |
| P2 | Performance degradation | Latency, throughput, error rates | Medium (aggregates) |
| P3 | Anomaly patterns | Unusual behavior, drift | Medium-High (models) |
| P4 | Root cause indicators | Why is it behaving this way? | High (correlation) |
P0 is non-negotiable. If a node doesn’t know whether its peers are alive, it cannot make any meaningful decisions. Availability monitoring requires minimal resources—a periodic heartbeat suffices.
P1 catches imminent failures. Resource exhaustion is the most predictable failure mode. If power drops below 10%, failure is imminent regardless of other factors. P1 monitoring prevents surprise crashes.
P2 detects gradual degradation. A sensor that responds but with increasing latency is degrading. P2 catches problems before they become failures—enabling proactive healing.
P3 catches the unexpected. Anomaly detection (Section 2) falls here. It’s more expensive than simple counters but catches failure modes that weren’t explicitly modeled.
P4 explains rather than just detects. Root cause analysis requires correlating multiple signals across time—computationally expensive but essential for learning.
The sequence is priority-ordered, not exclusive. A well-resourced system implements all levels. A constrained system implements as many as resources allow, starting from P0.
Resource Budget for Observability
Observability competes with the primary mission for resources:
\[
R_{\text{observe}} + R_{\text{mission}} \leq R_{\text{total}}
\]
Where:
- \(R_{\text{observe}}\) = resources for self-measurement
- \(R_{\text{mission}}\) = resources for primary function
- \(R_{\text{total}}\) = total available resources
The optimization problem:
\[
\max_{R_{\text{observe}},\, R_{\text{mission}}} \; V_{\text{mission}}(R_{\text{mission}}) + V_{\text{health}}(R_{\text{observe}})
\]
Subject to \(R_{\text{observe}} + R_{\text{mission}} \leq R_{\text{total}}\)
Typically:
- Mission value has diminishing returns: more resources yield proportionally less capability
- Health value has threshold effects: below minimum, health knowledge is useless; above minimum, marginal gains decrease
The optimal allocation gives sufficient resources to observability for reliable health knowledge, then allocates remainder to mission.
OUTPOST allocation example:
- Total compute: 1000 MIPS
- Total bandwidth to fusion: 100 Kbps
Allocation:
- P0-P1 monitoring: 50 MIPS (5%), 5 Kbps (5%)—heartbeats and resource counters
- P2-P3 monitoring: 100 MIPS (10%), 15 Kbps (15%)—performance aggregates, anomaly detection
- Gossip overhead: 0 MIPS local, 20 Kbps (20%)—health propagation
- Mission (sensor processing): 850 MIPS (85%), 60 Kbps (60%)—primary function
This 15% observability overhead enables reliable self-measurement while preserving the majority of resources for the mission.
RAVEN Self-Measurement Protocol
The RAVEN drone swarm requires self-measurement at two levels: individual drone health and swarm-wide coordination state.
Per-Drone Local Measurement
Each drone continuously monitors:
Power State
- Battery voltage, current draw, temperature
- Estimated flight time remaining: \(t_{\text{remain}} = E_{\text{remaining}} / P_{\text{avg}}\)
- Anomaly detection: Sudden voltage drop, unusual current patterns
Sensor Health
- Camera: Image quality metrics, focus response, exposure accuracy
- Radar: Return signal strength, calibration consistency
- GPS: Satellite count, position dilution of precision (PDOP)
- IMU: Gyro drift rate, accelerometer noise floor
Link Quality
- RSSI to each mesh neighbor
- Packet delivery ratio per link
- Latency distribution per link
Mission Progress
- Coverage completion percentage
- Threat detection count
- Position relative to assigned sector
EWMA tracking on each metric with \(\alpha = 0.1\) (10-second effective memory). Anomaly threshold at 3σ for critical metrics (power, flight controls), 2σ for secondary metrics (sensors, links).
Swarm-Wide Health Inference
Gossip protocol parameters:
- Exchange rate: 0.2 Hz (once per 5 seconds)
- Staleness threshold: 30 seconds (confidence drops below 90%)
- Trust decay: \(\gamma = 0.05\) per second
- Maximum useful staleness: 60 seconds (confidence drops below 50%, data essentially stale)
Relationship: The staleness threshold (30s) marks where data begins degrading meaningfully; decisions based on 30-second-old data retain roughly 90% confidence. The maximum useful staleness (60s) marks where confidence falls below 50%; beyond this, the data provides little more than a guess. The 2:1 ratio reflects the square-root growth of uncertainty with observation age (Proposition 5): confidence erodes slowly at first, then collapses once the interval spans the decision margin.
Health vector per drone contains:
- Binary availability (alive/silent)
- Power state (percentage)
- Critical sensor status (functional/degraded/failed)
- Mission capability level (L0-L4)
Merge function uses timestamp-weighted average for numeric values, latest-timestamp-wins for categorical values.
Convergence guarantees: With logarithmic propagation dynamics, fleet-wide health convergence occurs within 30-40 seconds—fast enough to track operational state changes while remaining robust to individual message losses.
Anomaly Detection and Self-Diagnosis
Cross-sensor correlation matrix maintained locally. Example correlations:
- GPS PDOP vs. IMU drift: High PDOP should not correlate with low drift (if they do, likely spoofing)
- Battery voltage vs. current: Should follow known discharge curve (deviation indicates cell degradation)
- Camera image vs. radar return: Consistent threat detections (divergence suggests sensor failure)
Self-diagnosis follows a structured decision process:
| Observation Pattern | Diagnosis | Action |
|---|---|---|
| Power anomaly with neighbors unaffected or recent maneuver | Local power issue | Reduce power consumption, report to swarm |
| Sensor anomaly with cross-sensor consistency | Environmental condition | Continue with degraded confidence |
| Sensor anomaly with cross-sensor inconsistency | Sensor failure | Disable sensor, rely on alternatives |
| Communication anomaly affecting multiple neighbors | Environmental interference or jamming | Increase transmit power, switch frequencies |
| Communication anomaly affecting only self | Local radio failure | Attempt radio restart, fall back to minimal beacon |
The diagnosis is probabilistic—the table represents the most likely paths, but confidence levels are maintained throughout.
CONVOY Self-Measurement Protocol
The CONVOY ground vehicle network operates with different constraints: vehicles have more resources than drones but face different failure modes.
Per-Vehicle Local Measurement
Each vehicle monitors:
Mechanical Systems
- Engine: RPM, temperature, oil pressure, fuel consumption
- Transmission: Gear state, clutch wear indicators
- Suspension: Ride height, damper response
- Brakes: Pad wear, hydraulic pressure
Navigation Systems
- GPS: Position, velocity, satellite count, PDOP
- INS: Accelerometer and gyro readings, drift rate
- Dead reckoning: Wheel encoder counts, heading
- Map matching: Confidence in current road segment
Communication Systems
- Mesh connectivity to other vehicles
- Range to each neighbor
- Bandwidth utilization
- Latency to convoy lead
Anomaly detection uses Holt-Winters for metrics with diurnal patterns (communication quality varies with terrain) and EWMA for stationary metrics (mechanical systems).
Convoy-Level Health Inference
Hierarchical aggregation:
- Primary mode: Lead vehicle collects health from all vehicles, computes aggregate, distributes summary
- Fallback mode: If lead unreachable, peer-to-peer gossip among reachable vehicles
Lead vehicle aggregation:
- Computes minimum capability level across convoy: \(L_{\text{convoy}} = \min_i L_i\)
- Identifies vehicles with critical anomalies
- Determines convoy-wide constraints (e.g., maximum safe speed based on worst vehicle)
Fallback gossip parameters:
- Exchange rate: 0.1 Hz (once per 10 seconds)—lower than RAVEN due to vehicle stability
- Staleness threshold: 60 seconds
- Trust decay: \(\gamma = 0.02\) per second
Anomaly Detection Focus
Position spoofing detection:
Each vehicle tracks its own position via GPS, INS, and dead reckoning. It also receives claimed positions from neighbors. Cross-correlation identifies spoofing:
\[
\Delta_{ij} = \left\| \hat{p}_i^{\,\text{claimed}} - \hat{p}_i^{\,\text{observed by } j} \right\|
\]
If \(\Delta_{ij}\) exceeds a threshold for vehicle \(i\) as observed by multiple neighbors \(j\), vehicle \(i\) is flagged for a position anomaly.
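A sketch of the cross-vehicle position consistency check; the deviation threshold, witness count, and coordinates are illustrative:

```python
import math

def position_deviation(claimed, observed):
    """Euclidean distance between a vehicle's claimed position and a neighbor's estimate of it."""
    return math.dist(claimed, observed)

def flag_position_spoofing(claimed_pos, neighbor_estimates, threshold_m=25.0, min_witnesses=2):
    """Flag vehicle i if enough independent neighbors see it far from where it claims to be.

    neighbor_estimates: {neighbor_id: (x, y)} estimates of vehicle i's position,
    e.g. from ranging or relative sensing.
    """
    disagreeing = [
        j for j, est in neighbor_estimates.items()
        if position_deviation(claimed_pos, est) > threshold_m
    ]
    return len(disagreeing) >= min_witnesses, disagreeing


claimed = (1250.0, 430.0)                       # vehicle V5's self-reported position (meters)
estimates = {"V2": (1248.0, 433.0),             # agrees
             "V3": (1310.0, 395.0),             # disagrees by ~69 m
             "V7": (1305.0, 401.0)}             # disagrees by ~62 m
print(flag_position_spoofing(claimed, estimates))   # (True, ['V3', 'V7'])
```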
Communication anomaly classification:
Distinguish jamming from terrain effects:
- Jamming: Affects all frequencies, correlates with adversarial activity, affects multiple vehicles
- Terrain: Affects specific paths, correlates with geographic features, predictable from maps
Use convoy’s position history to build terrain propagation model. Deviations from model suggest adversarial interference.
Integration with Markov connectivity model:
From the Markov connectivity model, the expected transition rates between regimes are known. Observed transitions that deviate from expectations are flagged:
\[
\left| \hat{P}_{\text{observed}}(s \to s') - P_{\text{model}}(s \to s') \right| > \epsilon
\]
Unexpectedly rapid transitions from connected to denied suggest adversarial action rather than natural degradation.
OUTPOST Self-Measurement Protocol
The OUTPOST sensor mesh operates with the most extreme constraints: ultra-low power, extended deployment durations (30+ days), and fixed positions that make physical inspection impractical.
Per-Sensor Local Measurement
Each sensor node continuously monitors with minimal power:
Power State
- Solar panel voltage and current
- Battery state of charge (SoC): \(\text{SoC} = E_{\text{current}} / E_{\text{capacity}}\)
- Power budget: \(P_{\text{solar}} - P_{\text{load}} = P_{\text{net}}\)
- Anomaly detection: Solar panel degradation, battery cell failure
Environmental Monitoring
- Temperature: Affects sensor calibration and battery performance
- Humidity: Risk of condensation and corrosion
- Vibration: Indicates physical disturbance or tampering
- Ambient light: Validates solar panel output
Sensor Calibration State
- Drift from initial calibration
- Cross-correlation with neighboring sensors
- Response time degradation
- False positive/negative rates for known test patterns
Communication State
- RSSI to fusion node and neighboring sensors
- Successful message delivery rate
- Round-trip latency
- Queue depth for outgoing messages
Proposition 7 (Power-Aware Measurement Scheduling). For a sensor with solar charging profile \(P_{\text{solar}}(t)\) and measurement cost \(C_m\) per measurement, the optimal measurement schedule maximizes information gain while maintaining a positive energy margin:
\[
\max_{\{m_t\}} \sum_t I(m_t) \quad \text{subject to} \quad E(0) + \int_0^{t} P_{\text{solar}}(s)\, ds - \sum_{t' \leq t} C_{m_{t'}} \geq E_{\text{reserve}} \quad \forall t
\]
where \(I(m_t)\) is the information gain from the measurement at time \(t\) and \(E_{\text{reserve}}\) is the required energy reserve.
In practice, this means scheduling high-power measurements (radar, active sensors) during peak solar hours and relying on low-power passive measurements during night and low-light periods.
Greedy heuristic: Sort measurements by information-gain-per-watt ratio \(I(m)/C_m\). Schedule in order until power budget exhausted. For OUTPOST, this yields:
- Passive seismic (0.1W, high info): Always on
- Passive acoustic (0.2W, medium info): Always on
- Active IR scan (2W, high info): Peak solar only (10am-2pm)
- Radar ping (5W, very high info): Midday only (11am-1pm), battery > 80%
This heuristic achieves ~85% of optimal information gain with O(n log n) computation, suitable for embedded deployment.
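A sketch of the information-per-watt greedy schedule; the measurement catalogue mirrors the list above, with illustrative information-gain scores:

```python
def greedy_schedule(measurements, power_budget_w):
    """Order measurements by information gain per watt and fill the power budget.

    measurements: list of (name, power_w, info_gain) tuples.
    Returns the selected names; O(n log n) in the number of candidate measurements.
    """
    ranked = sorted(measurements, key=lambda m: m[2] / m[1], reverse=True)
    selected, used = [], 0.0
    for name, power_w, info in ranked:
        if used + power_w <= power_budget_w:
            selected.append(name)
            used += power_w
    return selected


catalogue = [
    ("passive_seismic", 0.1, 5.0),
    ("passive_acoustic", 0.2, 3.0),
    ("active_ir_scan", 2.0, 8.0),
    ("radar_ping", 5.0, 15.0),
]

# Night-time budget: only the passive sensors fit
print(greedy_schedule(catalogue, power_budget_w=0.5))
# Peak solar budget with a healthy battery: everything fits
print(greedy_schedule(catalogue, power_budget_w=8.0))
```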
Mesh-Wide Health Inference
OUTPOST uses hierarchical aggregation with fusion nodes:
graph TD
subgraph Sensors["Sensor Layer (distributed)"]
S1[Sensor 1]
S2[Sensor 2]
S3[Sensor 3]
S4[Sensor 4]
S5[Sensor 5]
S6[Sensor 6]
end
subgraph Fusion["Fusion Layer (aggregation)"]
F1[Fusion A]
F2[Fusion B]
end
subgraph Command["Command Layer (satellite)"]
U[Uplink to HQ]
end
S1 --> F1
S2 --> F1
S3 --> F1
S4 --> F2
S5 --> F2
S6 --> F2
F1 --> U
F2 --> U
F1 -.->|"backup link"| F2
style U fill:#c8e6c9
style F1 fill:#fff9c4
style F2 fill:#fff9c4
style Sensors fill:#e3f2fd
style Fusion fill:#fff3e0
style Command fill:#e8f5e9
Normal operation: Sensors report to fusion nodes at low frequency (once per minute). Fusion nodes aggregate health and forward summaries via satellite uplink.
Degraded operation: If satellite uplink fails, fusion nodes exchange health via inter-fusion mesh links. Sensors continue local operation with extended buffer storage.
Denied operation: Each sensor operates independently with full local decision authority. Health state cached for post-reconnection reconciliation.
Gossip parameters for OUTPOST:
- Exchange rate: 0.017 Hz (once per minute)—optimized for power
- Staleness threshold: 300 seconds (5 minutes)
- Trust decay: \(\gamma = 0.002\) per second
- Maximum useful staleness: 600 seconds
Tamper Detection
Fixed sensor positions make physical tampering a significant threat. Multi-layer detection:
Physical indicators:
- Accelerometer for movement detection (sensor should be stationary)
- Light sensor for enclosure opening
- Temperature anomaly from human proximity
- Magnetic field disturbance from tools
Logical indicators:
- Sudden calibration drift after stable period
- Communication pattern change (new signal characteristics)
- Behavior inconsistent with neighboring sensors
Response protocol:
- Log tamper indicators with timestamp
- Increase reporting frequency if power permits
- Alert fusion node with tamper confidence level
- Continue operation unless tamper confidence exceeds threshold
- At high confidence: switch to quarantine mode (report but don’t trust own data)
Cross-Sensor Validation
OUTPOST leverages overlapping sensor coverage for validation:
\[
\text{Confidence}(s_i) = \frac{1}{|\mathcal{N}_i|} \sum_{j \in \mathcal{N}_i} \text{Agreement}(s_i, s_j)
\]
Where \(\mathcal{N}_i\) is the set of sensors with coverage overlapping sensor \(s_i\), and \(\text{Agreement}(s_i, s_j)\) measures the correlation between the two sensors' detections.
Low confidence triggers:
- Sensor \(s_i\) reports detection that no neighbors corroborate
- Sensor \(s_i\) fails to report detection that all neighbors report
- Sensor \(s_i\) timing systematically differs from neighbors
Cross-validation doesn’t determine which sensor is correct—it identifies sensors requiring investigation.
The Limits of Self-Measurement
Self-measurement has boundaries. Recognizing these limits is essential for correct system design.
Novel Failure Modes
Anomaly detection learns from historical data. A failure mode never seen before—outside the training distribution—may not be detected as anomalous.
Example: OUTPOST sensors are trained on hardware failures, communication failures, and known environmental conditions. A new adversarial technique—acoustic disruption of MEMS sensors—produces sensor behavior within “normal” ranges but with corrupted data. The anomaly detector sees normal statistics; the semantic content is compromised.
Mitigation: Defense in depth. Multiple detection mechanisms with different assumptions. Cross-validation between sensors. Periodic ground-truth verification when connectivity allows.
Adversarial Understanding
An adversary who understands the detection algorithm can craft attacks that evade detection.
If the adversary knows we use EWMA with \(\alpha = 0.1\), they can introduce gradual drift that stays within 2σ at each step but accumulates to significant deviation over time. The “boiling frog” attack.
Mitigation: Ensemble of detection algorithms with different sensitivities. Long-term drift detection (comparing current baseline to baseline from days ago). Randomized detection parameters.
Cascading Failures
Self-measurement assumes the measurement infrastructure is functional. But the measurement infrastructure can fail too.
If the power management system fails, anomaly detection may lose power before it can detect the power anomaly. If the communication subsystem fails, gossip cannot propagate health. The failure cascades faster than measurement can track.
Mitigation: P0/P1 monitoring on dedicated, ultra-low-power subsystem. Watchdog timers that trigger even if main processor fails. Hardware-level health indicators independent of software.
The Judgment Horizon
When should the system distrust its own measurements?
- When confidence intervals are too wide to support decisions
- When multiple sensors give irreconcilable readings
- When the system is operating outside its training distribution
- When measurement infrastructure itself is compromised
At the judgment horizon, self-measurement must acknowledge its limits. The system should:
- Log that it has reached measurement uncertainty limits
- Fall back to conservative assumptions
- Request human input when connectivity allows
- Avoid irreversible actions until confidence is restored
Sensor 47 Resolution
Return to our opening scenario. Sensor 47 went silent. How did OUTPOST diagnose the failure?
The fusion node applied the diagnostic framework from Section 2.3:
- Signature analysis: Abrupt silence, no prior degradation—inconsistent with hardware failure
- Correlation check: Sensors 45, 46, 48, 49 all operational—not a regional communication failure
- Environmental context: No known jamming indicators, weather nominal
- Staleness trajectory: Sensor 47’s last 10 readings showed normal variance, no drift
Diagnosis: Localized hardware failure (most likely power regulation), with 78% confidence. Although the abrupt signature initially weighted toward adversarial action, the absence of jamming indicators and the clean telemetry right up to the silence shifted the posterior toward a sudden component fault. The fusion node:
- Routed Sensor 47’s coverage zone to neighbors (Sensors 45 and 48)
- Flagged for physical inspection on next patrol
- Updated its anomaly detection baseline to reduce reliance on Sensor 47’s historical patterns
Post-reconnection analysis (satellite uplink restored 6 hours later): Sensor 47’s voltage regulator had failed suddenly—a known failure mode for this component batch. The diagnosis was correct. The system had self-measured, self-diagnosed, and self-healed without human intervention.
Learning from Measurement Failures
Anti-fragile self-measurement improves from its failures. When post-hoc analysis reveals a measurement failure:
- Document the failure mode
- Add detection signature if possible
- Adjust thresholds or algorithms
- Update training data to include this case
Each measurement failure is an opportunity to improve future measurement.
Closing: The Measurement-Action Loop
Self-measurement without self-action is just logging.
You measure in order to act—to heal, adapt, improve. The measurement-action loop drives autonomic architecture:
graph LR
M["Monitor
(observe state)"] --> A["Analyze
(detect anomaly)"]
A --> P["Plan
(select action)"]
P --> E["Execute
(apply healing)"]
E -->|"feedback loop"| M
style M fill:#c8e6c9
style A fill:#fff9c4
style P fill:#ffcc80
style E fill:#ffab91
This is the MAPE-K loop (Monitor, Analyze, Plan, Execute, Knowledge) that IBM formalized for autonomic computing. The self-healing article develops the healing phase in detail.
Return to OUTPOST BRAVO.
Sensor 47 is silent. The fusion node has measured: abrupt silence, neighbors functional, location on approach path. The initial analysis suggests adversarial action with 73% confidence. The plan: increase defensive posture, activate backup sensors in the region, log for human review when the uplink restores.
But measurement alone doesn’t execute this plan. Self-healing must decide: Is 73% confidence sufficient to escalate defensive posture? What is the cost of false alarm versus missed threat? How does the healing action affect the rest of the system?
The next article on self-healing develops the engineering principles for autonomous healing under uncertainty.