
Self-Measurement Without Central Observability


Prerequisites

This article builds directly on the contested connectivity framework:

Self-measurement is the sensory system of autonomic architecture. Just as organisms must sense their internal state before they can respond, edge systems must measure their own health before they can heal. This part develops the engineering principles for that measurement capability.


Theoretical Contributions

This article develops the theoretical foundations for self-measurement in distributed systems under contested connectivity. We make the following contributions:

  1. Local Anomaly Detection Framework: We formalize the anomaly detection problem as hypothesis testing under resource constraints, establishing optimal threshold selection as a function of asymmetric error costs.

  2. Gossip-Based Health Propagation: We derive convergence bounds for epidemic protocols in partially-connected networks, proving \(O(\ln n)\) propagation time under standard assumptions.

  3. Staleness-Confidence Theory: We model health state evolution as a stochastic process and derive the maximum useful staleness for decision-making, establishing the relationship between observation age and confidence degradation.

  4. Byzantine-Tolerant Aggregation: We extend weighted voting mechanisms to handle adversarial nodes, providing trust-decay models that detect and isolate compromised participants.

  5. Observability Constraint Sequence: We establish a priority ordering for measurement capabilities based on failure cost analysis, providing resource allocation guidelines for constrained systems.

These contributions connect to and extend prior work on fault detection in distributed systems (Cristian, 1991), epidemic algorithms (Demers et al., 1987), and autonomic computing (Kephart & Chess, 2003), adapting these frameworks for the specific challenges of contested edge environments.


Opening Narrative: OUTPOST Under Observation

Early morning. OUTPOST BRAVO’s 127-sensor perimeter mesh has been operating for 43 days. Without warning, the satellite uplink goes dark—no graceful degradation. Seconds later, Sensor 47 stops reporting. Last transmission: routine, battery at 73%, mesh connectivity strong. Then silence.

OUTPOST needs to answer: how do you diagnose this failure without external systems?

Each diagnosis implies a different response. Without central observability, OUTPOST must diagnose itself—analyze patterns, correlate with neighbors, assess probabilities, decide on response. All locally. All autonomously.

This is self-measurement: assessing health and diagnosing anomalies without external assistance. You can’t heal what you haven’t diagnosed, and you can’t diagnose what you haven’t measured.


The Self-Measurement Challenge

Cloud-native observability assumes continuous connectivity:

    
    graph LR
    A[Metrics] -->|"network"| B[Collector]
    B -->|"network"| C[Storage]
    C -->|"network"| D[Analysis]
    D -->|"network"| E[Alerting]
    E -->|"network"| F[Human Operator]
    F -->|"network"| G[Remediation]

    style A fill:#e8f5e9
    style F fill:#ffcdd2
    linkStyle 0,1,2,3,4,5 stroke:#f44336,stroke-width:2px,stroke-dasharray: 5 5

Every arrow represents a network call. For edge systems, this architecture fails at the first arrow—when connectivity is denied, the entire observability pipeline is severed.

The edge alternative inverts the data flow:

    
    graph LR
    A[Local Sensors] --> B[Local Analyzer]
    B --> C[Health State]
    C --> D[Autonomic Controller]
    D --> E[Self-Healing Action]
    E -->|"feedback"| A

    style A fill:#e8f5e9
    style B fill:#c8e6c9
    style C fill:#fff9c4
    style D fill:#ffcc80
    style E fill:#ffab91

Analysis happens locally. Alerting goes to an autonomic controller, not human operators. The loop closes locally without external connectivity.

| Aspect | Cloud Observability | Edge Self-Measurement |
| --- | --- | --- |
| Analysis location | Central service | Local device |
| Alerting target | Human operator | Autonomic controller |
| Training data | Abundant historical data | Limited local samples |
| Ground truth | Labels from past incidents | Uncertain, inferred |
| Compute budget | Elastic (scale up) | Fixed (device limits) |
| Memory budget | Practically unlimited | Constrained (MB range) |
| Response latency | Minutes acceptable | Seconds required |

Analysis must happen locally, and alerting must be autonomous. You can’t wait for human operators or external analysis services. The system must detect, diagnose, and decide—all within the constraints of local compute and memory.


Local Anomaly Detection

The Detection Problem

At its core, anomaly detection is a signal detection problem. The sensor produces a stream of values:

At each timestep, the local analyzer must decide: is this observation normal, or anomalous?

This is a binary classification under uncertainty:

Definition 4 (Local Anomaly Detection Problem). Given a time series \(\{x_t\}_{t \geq 0}\) generated by process \(P\), the local anomaly detection problem is to determine, for each observation \(x_t\), whether \(P\) has transitioned from nominal behavior \(P_0\) to anomalous behavior \(P_1\), subject to:

The challenge is performing this classification under edge resource constraints:

| Constraint | Cloud Detection | Edge Detection |
| --- | --- | --- |
| Compute | GPU clusters, distributed | Single CPU, milliwatts |
| Memory | Terabytes for models | Megabytes for everything |
| Training data | Petabytes historical | Days of local history |
| Ground truth | Labels from incident response | Inference from outcomes |
| FP cost | Human review time | Unnecessary healing action |
| FN cost | Delayed response | Undetected failure, potential loss |

The asymmetry of costs is critical. A false positive triggers an unnecessary healing action—wasteful but recoverable. A false negative leaves a failure undetected—potentially catastrophic in contested environments where undetected failures cascade.

Statistical Approaches

Edge anomaly detection requires algorithms that are:

Three approaches meet these requirements:

Exponential Weighted Moving Average (EWMA)

The simplest effective approach. Maintain running estimates of mean and variance:

Where \(\alpha \in (0, 1)\) controls the decay rate. Smaller \(\alpha\) means longer memory. Note: variance uses \(\mu_{t-1}\) to keep the estimate independent of \(x_t\), consistent with the anomaly score calculation.

The anomaly score normalizes deviation by variance:

Flag as anomaly if \(z_t > \theta\), where \(\theta\) is typically 2-3 standard deviations.
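
A minimal Python sketch of this EWMA detector follows; the defaults (\(\alpha = 0.1\), \(\theta = 3\)), the warm-up period, and the variance floor are illustrative choices rather than values prescribed here.

    class EWMADetector:
        """EWMA anomaly detector: running mean/variance with a z-score threshold.

        Sketch only: the parameter defaults below are illustrative.
        """

        def __init__(self, alpha=0.1, theta=3.0, warmup=10, eps=1e-9):
            self.alpha = alpha        # decay rate: smaller alpha = longer memory
            self.theta = theta        # anomaly threshold in standard deviations
            self.warmup = warmup      # samples to observe before flagging anomalies
            self.eps = eps            # variance floor to avoid division by zero
            self.mean = None
            self.var = 0.0
            self.n = 0

        def update(self, x):
            """Return (is_anomaly, z_score) for observation x, then update state."""
            self.n += 1
            if self.mean is None:                 # first observation: just initialize
                self.mean = x
                return False, 0.0

            # Score against the *previous* estimates, so the current observation
            # does not contaminate the baseline it is being judged against.
            z = abs(x - self.mean) / max(self.var, self.eps) ** 0.5
            is_anomaly = self.n > self.warmup and z > self.theta

            # EWMA updates (variance uses mu_{t-1}, consistent with the text)
            self.var = self.alpha * (x - self.mean) ** 2 + (1 - self.alpha) * self.var
            self.mean = self.alpha * x + (1 - self.alpha) * self.mean
            return is_anomaly, z

Scoring against the previous estimates means a single outlier cannot mask itself by inflating the very variance it is judged against.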

Proposition 3 (Optimal Anomaly Threshold). Given asymmetric error costs \(C_{\text{FP}}\) for false positives and \(C_{\text{FN}}\) for false negatives, the optimal detection threshold \(\theta^*\) satisfies the likelihood ratio condition:

For tactical edge systems where \(C_{\text{FN}} \gg C_{\text{FP}}\) (missed failures are catastrophic), the optimal threshold shifts toward more sensitive detection at the cost of increased false positives.

Proof sketch: The expected cost is \(C_{\text{FP}} \cdot P_{\text{FP}}(\theta) + C_{\text{FN}} \cdot P_{\text{FN}}(\theta)\). Taking the derivative and setting to zero yields the Neyman-Pearson lemma condition.
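
Under this Bayes-risk formulation, the likelihood ratio condition can be written as follows (the priors \(\pi_0\) and \(\pi_1\) for nominal and anomalous behavior are notation introduced for this sketch, not defined elsewhere in the text):

\[
\frac{p_1(x_t)}{p_0(x_t)} \;>\; \frac{C_{\text{FP}}\,\pi_0}{C_{\text{FN}}\,\pi_1} \quad\Rightarrow\quad \text{flag } x_t \text{ as anomalous.}
\]

As \(C_{\text{FN}}/C_{\text{FP}}\) grows, the right-hand side shrinks and the rule flags anomalies at smaller deviations, matching the shift toward more sensitive detection described above.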

Holt-Winters for Seasonal Patterns

For signals with periodic structure (day/night cycles, shift patterns), Holt-Winters captures level, trend, and seasonality:

Where \(L_t\) is level, \(T_t\) is trend, \(S_t\) is seasonal component, and \(p\) is period length.

Period examples by scenario:

Isolation Forest Sketch for Multivariate

For multivariate anomaly detection with limited memory, streaming isolation forest maintains a sketch:

Where \(h(x)\) is path length to isolate \(x\), and \(c(n)\) is average path length in a random tree.

Concrete parameters for CONVOY: t=50 trees, d=8 depth limit, sample_size=128, contamination=0.02 (expected 2% anomaly rate). This configuration uses ~25KB memory and achieves 85% detection rate with 3% false positive rate on multi-sensor telemetry (engine, transmission, suspension combined).
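
A configuration sketch using scikit-learn's IsolationForest with the parameters above, assuming scikit-learn is available on the platform; the \(d = 8\) depth limit is implied by the sample size rather than set directly, since the library caps tree depth at roughly \(\lceil \log_2(\text{max\_samples}) \rceil\).

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Sketch: CONVOY-style multivariate detector with the parameters from the text.
    detector = IsolationForest(
        n_estimators=50,      # t = 50 trees
        max_samples=128,      # sample_size = 128 (implies depth ~7-8)
        contamination=0.02,   # expected 2% anomaly rate
        random_state=0,
    )

    # X: rows = time steps, columns = engine, transmission, suspension telemetry
    X_train = np.random.default_rng(0).normal(size=(1000, 3))  # placeholder data
    detector.fit(X_train)

    x_now = np.array([[0.1, -0.3, 4.5]])            # placeholder current reading
    is_anomaly = detector.predict(x_now)[0] == -1   # -1 = anomalous, 1 = normal
    score = detector.decision_function(x_now)[0]    # lower = more anomalous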

CUSUM for Change-Point Detection

When the goal is detecting when a change occurred (not just that it occurred), Cumulative Sum (CUSUM) provides optimal detection for shifts in mean:

where \(\mu_0\) is the nominal mean and \(k\) is the allowable slack. Alarm when \(S_t > h\). CUSUM detects sustained shifts faster than EWMA but is more sensitive to the choice of \(k\). For RAVEN flight telemetry, CUSUM with \(k = 0.5\sigma\) detects motor degradation 15-20% faster than EWMA, at the cost of 10% higher false positive rate.
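
A one-sided CUSUM sketch for upward mean shifts; the nominal statistics and alarm threshold in the usage loop are placeholders, not RAVEN values.

    def cusum_step(s_prev, x, mu0, k, h):
        """One-sided CUSUM update for upward mean shifts.

        mu0 is the nominal mean, k the allowable slack (e.g. 0.5*sigma),
        h the alarm threshold. Returns (new statistic, alarm flag).
        """
        s = max(0.0, s_prev + (x - mu0) - k)
        return s, s > h

    # Usage sketch against a stream of motor-current samples (placeholder values)
    s, mu0, sigma = 0.0, 10.0, 0.5
    for x in [10.1, 10.0, 10.4, 10.9, 11.2, 11.5]:
        s, alarm = cusum_step(s, x, mu0, k=0.5 * sigma, h=5 * sigma)
        if alarm:
            print("sustained upward shift detected")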

Concrete Error Rates

For RAVEN with anomaly threshold \(\theta = 2.5\sigma\) and base anomaly rate 2%:

OUTPOST Sensor 47 uses EWMA for primary detection: temperature, motion intensity, battery voltage each tracked independently. Cross-sensor correlation uses a lightweight covariance estimate between Sensor 47 and its mesh neighbors.

Distinguishing Failure Modes

Detection answers “is something wrong?” Diagnosis answers “what is wrong?”

For Sensor 47’s silence, the fusion node must distinguish:

Sensor hardware failure:

Communication failure:

Environmental occlusion:

Adversarial action:

The fusion node maintains causal models for each failure mode. Given observed evidence \(E\), Bayesian inference estimates posterior probability:

Priors \(P(\text{cause})\) come from historical failure rates. Likelihoods \(P(E | \text{cause})\) come from the signature patterns.

For Sensor 47:

The diagnosis is probabilistic, not certain. Self-measurement provides confidence levels, not ground truth.
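
A minimal sketch of this Bayesian diagnosis step, assuming conditional independence of evidence items given the cause; the priors, likelihoods, and evidence encoding below are illustrative placeholders, not the values behind Sensor 47's actual diagnosis.

    def diagnose(evidence, priors, likelihoods):
        """Posterior over failure causes given observed evidence.

        evidence:    list of observed evidence items (e.g. "abrupt_silence")
        priors:      {cause: P(cause)} from historical failure rates
        likelihoods: {cause: {evidence_item: P(item | cause)}} signature patterns
        """
        unnorm = {}
        for cause, prior in priors.items():
            p = prior
            for e in evidence:
                p *= likelihoods[cause].get(e, 1e-3)  # small floor for unmodeled evidence
            unnorm[cause] = p
        total = sum(unnorm.values()) or 1.0
        return {cause: p / total for cause, p in unnorm.items()}

    # Illustrative numbers only
    priors = {"hardware": 0.05, "comms": 0.03, "occlusion": 0.02, "adversarial": 0.01}
    likelihoods = {
        "hardware":    {"abrupt_silence": 0.6, "neighbors_ok": 0.9},
        "comms":       {"abrupt_silence": 0.4, "neighbors_ok": 0.2},
        "occlusion":   {"abrupt_silence": 0.1, "neighbors_ok": 0.8},
        "adversarial": {"abrupt_silence": 0.5, "neighbors_ok": 0.7},
    }
    posterior = diagnose(["abrupt_silence", "neighbors_ok"], priors, likelihoods)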


Distributed Health Inference

Gossip-Based Health Propagation

Individual nodes detect local anomalies. Fleet-wide health requires aggregation without a central coordinator.

Definition 5 (Gossip Health Protocol). A gossip health protocol is a tuple \((H, \lambda, M, T)\) where:

Gossip protocols solve this problem. Each node maintains a health vector:

Where \(h_i\) is node \(i\)’s estimated health state.

The protocol operates in rounds:

  1. Local update: Node \(i\) updates \(h_i\) based on local anomaly detection
  2. Peer selection: Node \(i\) selects random peer \(j\)
  3. Exchange: Nodes \(i\) and \(j\) exchange health vectors
  4. Merge: Each node merges received vector with local knowledge
    
    graph LR
    subgraph Before Exchange
    A1["Node A: H_A"] -.->|"sends H_A"| B1["Node B: H_B"]
    B1 -.->|"sends H_B"| A1
    end
    subgraph After Merge
    A2["Node A: merge(H_A, H_B)"]
    B2["Node B: merge(H_A, H_B)"]
    end
    A1 --> A2
    B1 --> B2

    style A1 fill:#e8f5e9
    style B1 fill:#e3f2fd
    style A2 fill:#c8e6c9
    style B2 fill:#bbdefb

The merge function must handle:

A weighted merge using timestamp-based staleness:

Where weights decay with staleness:

With \(\tau\) as time since observation and \(\gamma\) as decay rate (distinct from the gossip rate \(\lambda\)).
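
A sketch of the staleness-weighted merge, assuming each health entry carries a (value, timestamp) pair and that weights decay exponentially, \(w(\tau) = e^{-\gamma \tau}\), which is one natural realization of the decay described above.

    import math
    import time

    def merge_health(local, received, gamma=0.05, now=None):
        """Merge two health vectors {node_id: (health, timestamp)}.

        Weights each report by exp(-gamma * staleness), so fresher observations
        dominate. Sketch only: the exponential form and gamma are assumptions.
        """
        now = now if now is not None else time.time()
        merged = {}
        for node_id in set(local) | set(received):
            entries = [v for v in (local.get(node_id), received.get(node_id)) if v]
            weights = [math.exp(-gamma * max(0.0, now - ts)) for _, ts in entries]
            total = sum(weights)
            health = sum(w * h for w, (h, _) in zip(weights, entries)) / total
            latest_ts = max(ts for _, ts in entries)
            merged[node_id] = (health, latest_ts)
        return merged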

Proposition 4 (Gossip Convergence). For a gossip protocol with rate \(\lambda\) and \(n\) nodes, the expected time for information originating at one node to reach all nodes is:

Proof sketch: The information spread follows logistic dynamics \(dI/dt = \lambda I(1 - I)\) where \(I\) is the fraction of informed nodes. Solving with initial condition \(I(0) = 1/n\) and computing the time to reach \(I = 1 - 1/n\) yields \(T = (2 \ln(n-1))/\lambda\).

Corollary 2. Doubling swarm size adds only \(2 \ln 2 / \lambda \approx 1.39/\lambda\) seconds to convergence time, making gossip protocols inherently scalable for edge fleets.

For tactical parameters (\(n \sim 50\), \(\lambda \sim 0.2\) Hz), the formula yields \(T = 2\ln(49)/0.2 \approx 39\) seconds—convergence within 30-40 seconds, fast enough to establish fleet-wide health awareness within a single mission phase. Broadcast approaches scale linearly with \(n\), which is why gossip wins at scale.

Priority-Weighted Gossip Extension

Standard gossip treats all health updates equally. In tactical environments, critical health changes (node failure, resource exhaustion, adversarial detection) should propagate faster than routine updates.

Priority classification:

Accelerated propagation protocol:

For priority \(p\) messages, modify the gossip rate:

where \(\eta\) is the acceleration factor (typically 2-3). Critical messages gossip at \(3\times\) normal rate.

Message prioritization in constrained bandwidth:

When bandwidth is limited, each gossip exchange prioritizes by urgency. The protocol proceeds as follows:

Step 1: Merge local and peer health vectors into a unified update set.

Step 2: Sort updates by priority (descending), then by staleness (ascending) within each priority class.

Step 3: Transmit updates in sorted order until bandwidth budget exhausted:

Step 4: Critical override—always include \(P_{\text{CRITICAL}}\) updates even if over budget:

This ensures safety-critical information propagates regardless of bandwidth constraints, accepting temporary budget overrun.
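
A sketch of the four steps above; the message structure, byte accounting, and priority encoding are illustrative.

    from dataclasses import dataclass

    P_CRITICAL, P_HIGH, P_NORMAL = 2, 1, 0

    @dataclass
    class HealthUpdate:
        node_id: str
        priority: int       # P_CRITICAL / P_HIGH / P_NORMAL
        staleness: float    # seconds since observation
        size_bytes: int

    def select_for_transmission(local_updates, peer_updates, budget_bytes):
        """Steps 1-4: merge, sort by (priority desc, staleness asc), fill the
        bandwidth budget, then force-include any remaining critical updates."""
        # Step 1: merge local and peer views into one update set
        updates = list(local_updates) + list(peer_updates)
        # Step 2: highest priority first, freshest first within a priority class
        updates.sort(key=lambda u: (-u.priority, u.staleness))
        # Step 3: fill the budget in sorted order
        chosen, used = [], 0
        for u in updates:
            if used + u.size_bytes <= budget_bytes:
                chosen.append(u)
                used += u.size_bytes
        # Step 4: critical override -- accept a temporary budget overrun
        for u in updates:
            if u.priority == P_CRITICAL and u not in chosen:
                chosen.append(u)
        return chosen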

Convergence improvement: For RAVEN with \(\eta = 2\), critical updates converge in ~15 seconds (vs. 39 seconds for normal updates)—a 2.6× speedup for time-sensitive health information.

Anti-flood protection: To prevent priority abuse (Byzantine node flooding P_CRITICAL messages), rate-limit critical messages per source:

where \(\rho_{\text{max}} \approx 0.01\) messages/second. Exceeding this rate triggers trust decay.

Gossip Under Partition

When the fleet partitions into disconnected clusters, gossip behavior changes fundamentally. Within each cluster, convergence continues normally. Between clusters, health state diverges.

Remark (Partition Staleness). For node \(i\) in cluster \(C_1\) observing node \(j\) in cluster \(C_2\), staleness—the elapsed time since observation—accumulates from partition time \(t_p\):

The staleness grows unboundedly during partition, eventually exceeding any useful threshold.

    
    graph LR
    subgraph Cluster_A["Cluster A (gossip active)"]
    A1[Node 1] --- A2[Node 2]
    A2 --- A3[Node 3]
    A1 --- A3
    end
    subgraph Cluster_B["Cluster B (gossip active)"]
    B1[Node 4] --- B2[Node 5]
    B2 --- B3[Node 6]
    B1 --- B3
    end
    A3 -.-x|"PARTITION
No communication"| B1 style Cluster_A fill:#e8f5e9 style Cluster_B fill:#e3f2fd

Cross-cluster state tracking:

Each node maintains a partition vector \(\rho_i\) tracking the last known connectivity state to each other node:

When \(\rho_i[j] > 0\) and \(t - \rho_i[j] > \tau_{\text{max}}\), node \(i\) marks its knowledge of node \(j\) as uncertain rather than stale.

Reconciliation priority:

Upon reconnection, nodes exchange partition vectors. The reconciliation priority for node \(j\)’s state is proportional to divergence duration:

Nodes with longest partition duration and highest importance (cluster leads, critical sensors) reconcile first.
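
A sketch of partition-vector bookkeeping and reconciliation ordering; the importance weights and the product form of the priority are assumptions consistent with, but not specified by, the description above.

    import time

    class PartitionTracker:
        """Tracks last-known connectivity per peer and orders reconciliation.

        Sketch: marks a peer 'uncertain' once staleness exceeds tau_max; on
        reconnection, longest-partitioned and most important peers come first.
        """

        def __init__(self, tau_max=60.0):
            self.tau_max = tau_max
            self.last_contact = {}     # rho_i[j]: last time peer j was reachable
            self.importance = {}       # e.g. cluster leads, critical sensors > 1.0

        def mark_contact(self, peer, now=None):
            self.last_contact[peer] = now if now is not None else time.time()

        def status(self, peer, now=None):
            now = now if now is not None else time.time()
            last = self.last_contact.get(peer)
            if last is None:
                return "unknown"
            return "uncertain" if now - last > self.tau_max else "fresh"

        def reconciliation_order(self, now=None):
            """Peers sorted by divergence duration x importance, largest first."""
            now = now if now is not None else time.time()
            def priority(peer):
                divergence = now - self.last_contact.get(peer, now)
                return divergence * self.importance.get(peer, 1.0)
            return sorted(self.last_contact, key=priority, reverse=True)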

Confidence Intervals on Stale Data

Health observations age. A drone last heard from 30 seconds ago may have changed state since then.

Definition 6 (Staleness). The staleness \(\tau\) of an observation is the elapsed time since the observation was made. An observation with staleness \(\tau\) has uncertainty that grows with \(\tau\) according to the underlying state dynamics.

Model health as a stochastic process. If health evolves with variance \(\sigma^2\) per unit time, the confidence interval on stale data is:

Where:

Implications for decision-making:

The CI width grows as \(\sqrt{\tau}\)—a consequence of the Brownian motion model. This square-root scaling means confidence degrades slowly at first but accelerates with staleness.

When the CI spans a decision threshold (like the L2 capability boundary), you can’t reliably commit to that capability level. The staleness has exceeded the decision horizon for that threshold—the maximum time at which stale data can support the decision.

Different decisions have different horizons. Safety-critical decisions with narrow margins have short horizons. Advisory decisions with wide margins have longer horizons. The system tracks staleness against the relevant horizon for each decision type.

Response strategies when confidence is insufficient:

  1. Active probe: Attempt direct communication to get fresh observation
  2. Conservative fallback: Assume health at lower bound of CI
  3. Escalate observation priority: Increase gossip rate for this node

Proposition 5 (Maximum Useful Staleness). For a health process with volatility \(\sigma\) and a decision requiring discrimination at precision \(\Delta h\) with confidence \(1 - \alpha\), the maximum useful staleness is:

where \(z_{\alpha/2}\) is the standard normal quantile. Beyond \(\tau_{\text{max}}\), the confidence interval spans the decision threshold and the observation cannot support the decision.

Proof: Follows directly from the Brownian motion model \(dh = \sigma\, dW\), which yields variance \(\sigma^2 \tau\) after elapsed time \(\tau\). Setting the CI half-width equal to \(\Delta h\) and solving for \(\tau\) gives the result.

Corollary 3. The quadratic relationship \(\tau_{\text{max}} \propto (\Delta h / \sigma)^2\) implies that tightening decision margins dramatically reduces useful staleness. Systems with narrow operating envelopes require proportionally higher observation frequency.
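
A small numeric check of Proposition 5; the volatility and decision margin below are illustrative values, not parameters from the scenarios.

    def max_useful_staleness(delta_h, sigma, z=1.96):
        """tau_max = (delta_h / (z * sigma))**2 from Proposition 5.

        z = 1.96 corresponds to 95% confidence; delta_h and sigma in the
        example below are illustrative.
        """
        return (delta_h / (z * sigma)) ** 2

    # Example: health volatility 0.02 units/sqrt(second), decision margin 0.1 units
    tau_max = max_useful_staleness(delta_h=0.1, sigma=0.02)   # ~6.5 seconds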

Byzantine-Tolerant Health Aggregation

In contested environments, some nodes may be compromised. They may inject false health values to:

Definition 7 (Byzantine Node). A node is Byzantine if it may deviate arbitrarily from the protocol specification, including sending different values to different peers, reporting false observations, or selectively participating in gossip rounds.

Weighted voting based on trust scores:

Where \(T_i\) is the trust score of node \(i\). Trust is earned through consistent, verifiable behavior and decays when inconsistencies are detected.

Outlier detection on received health reports:

If node \(i\) reports health for node \(k\) that differs significantly from the consensus, flag the report as suspicious:

Repeated suspicious reports decrease trust score for node \(i\).
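
A sketch combining the trust-weighted estimate with the consensus-deviation check; the deviation threshold and the multiplicative trust penalty are illustrative choices.

    def aggregate_health(reports, trust, deviation_threshold=2.0, penalty=0.1):
        """Trust-weighted health estimate for one target node, with outlier flagging.

        reports: {reporter_id: reported_health}   trust: {reporter_id: T_i in [0, 1]}
        Sketch: the z-style deviation test and the trust penalty are assumptions.
        """
        total_trust = sum(trust.get(i, 0.0) for i in reports) or 1.0
        consensus = sum(trust.get(i, 0.0) * h for i, h in reports.items()) / total_trust

        # Spread of reports around the trust-weighted consensus
        spread = (sum(trust.get(i, 0.0) * (h - consensus) ** 2
                      for i, h in reports.items()) / total_trust) ** 0.5 or 1e-6

        suspicious = [i for i, h in reports.items()
                      if abs(h - consensus) / spread > deviation_threshold]
        for i in suspicious:
            trust[i] = max(0.0, trust.get(i, 0.0) * (1 - penalty))  # decay trust

        return consensus, suspicious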

Isolation protocol for nodes with inconsistent claims:

  1. Track history of claims per node
  2. Compute consistency score: fraction of claims matching consensus
  3. If consistency below threshold, quarantine node from health aggregation
  4. Quarantined nodes can still participate but their reports are not trusted

Proposition 6 (Byzantine Tolerance Bound). With trust-weighted aggregation, correct health estimation is maintained if the total Byzantine trust weight is bounded:

This generalizes the classical \(f < n/3\) bound: with uniform trust, it reduces to requiring that fewer than one third of nodes be Byzantine. With trust decay on suspicious nodes, Byzantine influence decreases over time, allowing tolerance of more compromised nodes provided their accumulated trust is low.

This is not foolproof—a sophisticated adversary who understands the aggregation mechanism can craft attacks that pass consistency checks. Byzantine tolerance provides defense in depth, not absolute security.

Trust Recovery Mechanisms

Trust decay handles misbehaving nodes, but legitimate nodes may be temporarily compromised (e.g., sensor interference, transient fault) and later recover. A purely decaying trust model permanently punishes temporary failures.

Trust recovery model:

Trust evolves according to a mean-reverting process with decay for misbehavior and recovery for consistent behavior:

where \(\gamma_{\text{decay}} \approx 0.1\) (fast decay) and \(\gamma_{\text{recover}} \approx 0.01\) (slow recovery). The asymmetry ensures that building trust takes longer than losing it—appropriate for contested environments.

Recovery conditions:

A node must demonstrate sustained consistent behavior before trust recovery activates:

where \(W\) is typically 50-100 gossip rounds and \(\theta_{\text{recovery}} \approx 0.95\). A node with even 5% inconsistent reports continues decaying.
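
A sketch of the asymmetric decay/recovery dynamics using the constants above (\(\gamma_{\text{decay}} \approx 0.1\), \(\gamma_{\text{recover}} \approx 0.01\), a 100-round window, 95% consistency); the bookkeeping of consistency outcomes is simplified.

    from collections import deque

    class TrustTracker:
        """Asymmetric trust dynamics: fast decay on inconsistency, slow recovery
        only after a sustained window of consistent behavior."""

        def __init__(self, gamma_decay=0.1, gamma_recover=0.01,
                     window=100, recovery_threshold=0.95):
            self.gamma_decay = gamma_decay
            self.gamma_recover = gamma_recover
            self.recovery_threshold = recovery_threshold
            self.history = deque(maxlen=window)   # recent consistency outcomes
            self.trust = 1.0

        def observe(self, consistent: bool):
            """Update trust after one gossip round's consistency check."""
            self.history.append(consistent)
            if not consistent:
                # fast decay toward zero
                self.trust = max(0.0, self.trust - self.gamma_decay * self.trust)
            elif (len(self.history) == self.history.maxlen and
                  sum(self.history) / len(self.history) >= self.recovery_threshold):
                # slow mean-reverting recovery toward full trust
                self.trust = min(1.0, self.trust + self.gamma_recover * (1.0 - self.trust))
            return self.trust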

Sybil attack resistance:

An adversary creating multiple fake identities (Sybil attack) can attempt to dominate the trust-weighted aggregation. Countermeasures:

  1. Identity binding: Nodes must prove identity through cryptographic challenge-response or physical attestation (GPS position consistency over time)

  2. Trust inheritance limits: New nodes start with \(T_{\text{initial}} = T_{\text{sponsor}} \cdot \beta\) where \(\beta < 0.5\). No node can spawn high-trust children.

  3. Global trust budget: Total trust across all nodes is bounded:

New node admission requires either trust redistribution or explicit authorization.

  4. Behavioral clustering: Nodes exhibiting suspiciously correlated behavior (same reports, same timing) are grouped and treated as a single trust entity:

Trust recovery example:

CONVOY vehicle V3 experiences temporary GPS interference causing inconsistent position reports for 10 minutes. Trust drops from 1.0 to 0.35 during interference. After interference clears:

The slow recovery prevents adversaries from rapidly cycling between attack and “good behavior” phases.


The Observability Constraint Sequence

Hierarchy of Observability

With limited resources, what should be measured first?

The observability constraint sequence prioritizes metrics by importance:

| Level | Category | Examples | Resource Cost |
| --- | --- | --- | --- |
| P0 | Availability | Is it alive? Responding? | Minimal (heartbeat) |
| P1 | Resource exhaustion | Power, memory, storage remaining | Low (counters) |
| P2 | Performance degradation | Latency, throughput, error rates | Medium (aggregates) |
| P3 | Anomaly patterns | Unusual behavior, drift | Medium-High (models) |
| P4 | Root cause indicators | Why is it behaving this way? | High (correlation) |

P0 is non-negotiable. If a node doesn’t know whether its peers are alive, it cannot make any meaningful decisions. Availability monitoring requires minimal resources—a periodic heartbeat suffices.

P1 catches imminent failures. Resource exhaustion is the most predictable failure mode. If power drops below 10%, failure is imminent regardless of other factors. P1 monitoring prevents surprise crashes.

P2 detects gradual degradation. A sensor that responds but with increasing latency is degrading. P2 catches problems before they become failures—enabling proactive healing.

P3 catches the unexpected. Anomaly detection (Section 2) falls here. It’s more expensive than simple counters but catches failure modes that weren’t explicitly modeled.

P4 explains rather than just detects. Root cause analysis requires correlating multiple signals across time—computationally expensive but essential for learning.

The sequence is priority-ordered, not exclusive. A well-resourced system implements all levels. A constrained system implements as many as resources allow, starting from P0.

Resource Budget for Observability

Observability competes with the primary mission for resources:

Where:

The optimization problem:

Subject to \(R_{\text{observe}} + R_{\text{mission}} \leq R_{\text{total}}\)

Typically:

The optimal allocation gives sufficient resources to observability for reliable health knowledge, then allocates remainder to mission.

OUTPOST allocation example:

Allocation:

This 15% observability overhead enables reliable self-measurement while preserving the majority of resources for the mission.


RAVEN Self-Measurement Protocol

The RAVEN drone swarm requires self-measurement at two levels: individual drone health and swarm-wide coordination state.

Per-Drone Local Measurement

Each drone continuously monitors:

Power State

Sensor Health

Link Quality

Mission Progress

EWMA tracking on each metric with \(\alpha = 0.1\) (10-second effective memory). Anomaly threshold at 3σ for critical metrics (power, flight controls), 2σ for secondary metrics (sensors, links).

Swarm-Wide Health Inference

Gossip protocol parameters:

Relationship: The staleness threshold (30s) marks where data begins degrading meaningfully—decisions based on 30s-old data have ~90% confidence. The maximum useful staleness (60s) marks where confidence falls below 50%—beyond this, the data provides little more than a guess. The 2:1 ratio reflects the quadratic confidence decay from Proposition 5.

Health vector per drone contains:

Merge function uses timestamp-weighted average for numeric values, latest-timestamp-wins for categorical values.

Convergence guarantees: With logarithmic propagation dynamics, fleet-wide health convergence occurs within 30-40 seconds—fast enough to track operational state changes while remaining robust to individual message losses.

Anomaly Detection and Self-Diagnosis

Cross-sensor correlation matrix maintained locally. Example correlations:

Self-diagnosis follows a structured decision process:

| Observation Pattern | Diagnosis | Action |
| --- | --- | --- |
| Power anomaly with neighbors unaffected or recent maneuver | Local power issue | Reduce power consumption, report to swarm |
| Sensor anomaly with cross-sensor consistency | Environmental condition | Continue with degraded confidence |
| Sensor anomaly with cross-sensor inconsistency | Sensor failure | Disable sensor, rely on alternatives |
| Communication anomaly affecting multiple neighbors | Environmental interference or jamming | Increase transmit power, switch frequencies |
| Communication anomaly affecting only self | Local radio failure | Attempt radio restart, fall back to minimal beacon |

The diagnosis is probabilistic—the table represents the most likely paths, but confidence levels are maintained throughout.


CONVOY Self-Measurement Protocol

The CONVOY ground vehicle network operates with different constraints: vehicles have more resources than drones but face different failure modes.

Per-Vehicle Local Measurement

Each vehicle monitors:

Mechanical Systems

Navigation Systems

Communication Systems

Anomaly detection uses Holt-Winters for metrics with diurnal patterns (communication quality varies with terrain) and EWMA for stationary metrics (mechanical systems).

Convoy-Level Health Inference

Hierarchical aggregation:

  1. Primary mode: Lead vehicle collects health from all vehicles, computes aggregate, distributes summary
  2. Fallback mode: If lead unreachable, peer-to-peer gossip among reachable vehicles

Lead vehicle aggregation:

Fallback gossip parameters:

Anomaly Detection Focus

Position spoofing detection:

Each vehicle tracks its own position via GPS, INS, and dead reckoning. It also receives claimed positions from neighbors. Cross-correlation identifies spoofing:

If \(\Delta_{ij}\) exceeds threshold for vehicle \(i\) as observed by multiple neighbors \(j\), vehicle \(i\) is flagged for position anomaly.
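
A sketch of the cross-check between claimed and neighbor-estimated positions; the Euclidean discrepancy, the 50 m threshold, and the two-witness quorum are assumptions, not CONVOY parameters.

    import math

    def position_discrepancy(claimed_xy, estimated_xy):
        """Euclidean distance between vehicle i's claimed position and the
        position neighbor j estimates for it (ranging, bearing, sensor fusion)."""
        dx = claimed_xy[0] - estimated_xy[0]
        dy = claimed_xy[1] - estimated_xy[1]
        return math.hypot(dx, dy)

    def flag_spoofing(claims, estimates, threshold_m=50.0, min_witnesses=2):
        """Flag vehicle i if multiple neighbors independently see a large discrepancy.

        claims:    {vehicle_i: (x, y)} positions each vehicle reports for itself
        estimates: {(observer_j, vehicle_i): (x, y)} positions observers estimate
        """
        witnesses = {}
        for (observer, vehicle), est in estimates.items():
            if vehicle in claims:
                if position_discrepancy(claims[vehicle], est) > threshold_m:
                    witnesses.setdefault(vehicle, set()).add(observer)
        return {v for v, obs in witnesses.items() if len(obs) >= min_witnesses}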

Communication anomaly classification:

Distinguish jamming from terrain effects:

Use convoy’s position history to build terrain propagation model. Deviations from model suggest adversarial interference.

Integration with Markov connectivity model:

From the Markov connectivity model, the expected transition rates between regimes are known. Observed transitions that deviate from expectations are flagged:

Unexpectedly rapid transitions from connected to denied suggest adversarial action rather than natural degradation.


OUTPOST Self-Measurement Protocol

The OUTPOST sensor mesh operates with the most extreme constraints: ultra-low power, extended deployment durations (30+ days), and fixed positions that make physical inspection impractical.

Per-Sensor Local Measurement

Each sensor node continuously monitors with minimal power:

Power State

Environmental Monitoring

Sensor Calibration State

Communication State

Proposition 7 (Power-Aware Measurement Scheduling). For a sensor with solar charging profile \(P_{\text{solar}}(t)\) and measurement cost \(C_m\) per measurement, the optimal measurement schedule maximizes information gain while maintaining positive energy margin:

where \(I(m_t)\) is the information gain from measurement at time \(t\) and \(E_{\text{reserve}}\) is the required energy reserve.

In practice, this means scheduling high-power measurements (radar, active sensors) during peak solar hours and relying on low-power passive measurements during night and low-light periods.

Greedy heuristic: Sort measurements by information-gain-per-watt ratio \(I(m)/C_m\). Schedule in order until power budget exhausted. For OUTPOST, this yields:

  1. Passive seismic (0.1W, high info): Always on
  2. Passive acoustic (0.2W, medium info): Always on
  3. Active IR scan (2W, high info): Peak solar only (10am-2pm)
  4. Radar ping (5W, very high info): Midday only (11am-1pm), battery > 80%

This heuristic achieves ~85% of optimal information gain with O(n log n) computation, suitable for embedded deployment.
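
A sketch of the greedy information-gain-per-watt heuristic; the gain values attached to each OUTPOST measurement are illustrative units, not figures from the text.

    def greedy_schedule(measurements, power_budget_w):
        """Greedy selection by information gain per watt (O(n log n) in the sort).

        measurements: list of (name, power_w, info_gain) tuples.
        Returns the selected measurement names within the power budget.
        """
        ranked = sorted(measurements, key=lambda m: m[2] / m[1], reverse=True)
        selected, used = [], 0.0
        for name, power, _gain in ranked:
            if used + power <= power_budget_w:
                selected.append(name)
                used += power
        return selected

    # OUTPOST-style example (gain values are illustrative)
    measurements = [
        ("passive_seismic", 0.1, 5.0),
        ("passive_acoustic", 0.2, 3.0),
        ("active_ir_scan", 2.0, 8.0),
        ("radar_ping", 5.0, 12.0),
    ]
    daytime_plan = greedy_schedule(measurements, power_budget_w=8.0)
    night_plan = greedy_schedule(measurements, power_budget_w=0.5)  # passive only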

Mesh-Wide Health Inference

OUTPOST uses hierarchical aggregation with fusion nodes:

    
    graph TD
    subgraph Sensors["Sensor Layer (distributed)"]
    S1[Sensor 1]
    S2[Sensor 2]
    S3[Sensor 3]
    S4[Sensor 4]
    S5[Sensor 5]
    S6[Sensor 6]
    end
    subgraph Fusion["Fusion Layer (aggregation)"]
    F1[Fusion A]
    F2[Fusion B]
    end
    subgraph Command["Command Layer (satellite)"]
    U[Uplink to HQ]
    end
    S1 --> F1
    S2 --> F1
    S3 --> F1
    S4 --> F2
    S5 --> F2
    S6 --> F2
    F1 --> U
    F2 --> U
    F1 -.->|"backup link"| F2

    style U fill:#c8e6c9
    style F1 fill:#fff9c4
    style F2 fill:#fff9c4
    style Sensors fill:#e3f2fd
    style Fusion fill:#fff3e0
    style Command fill:#e8f5e9

Normal operation: Sensors report to fusion nodes at low frequency (once per minute). Fusion nodes aggregate health and forward summaries via satellite uplink.

Degraded operation: If satellite uplink fails, fusion nodes exchange health via inter-fusion mesh links. Sensors continue local operation with extended buffer storage.

Denied operation: Each sensor operates independently with full local decision authority. Health state cached for post-reconnection reconciliation.

Gossip parameters for OUTPOST:

Tamper Detection

Fixed sensor positions make physical tampering a significant threat. Multi-layer detection:

Physical indicators:

Logical indicators:

Response protocol:

  1. Log tamper indicators with timestamp
  2. Increase reporting frequency if power permits
  3. Alert fusion node with tamper confidence level
  4. Continue operation unless tamper confidence exceeds threshold
  5. At high confidence: switch to quarantine mode (report but don’t trust own data)

Cross-Sensor Validation

OUTPOST leverages overlapping sensor coverage for validation:

Where \(\mathcal{N}_i\) is the set of sensors with overlapping coverage, and \(\text{Agreement}(s_i, s_j)\) measures correlation between sensor detections.

Low confidence triggers:

Cross-validation doesn’t determine which sensor is correct—it identifies sensors requiring investigation.
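
A sketch of the agreement-based validation confidence; the window-level match fraction is one simple choice of agreement measure, not necessarily the correlation metric OUTPOST uses.

    def detection_agreement(series_i, series_j):
        """Fraction of time windows where two overlapping sensors agree on
        detection / no-detection (series are aligned lists of 0/1 flags)."""
        matches = sum(1 for a, b in zip(series_i, series_j) if a == b)
        return matches / max(1, min(len(series_i), len(series_j)))

    def validation_confidence(sensor_id, detections, neighbors):
        """Average agreement between sensor_id and its overlapping neighbors.

        detections: {sensor: [0/1 detection flags per window]}
        neighbors:  {sensor: set of sensors with overlapping coverage}
        A low score marks the sensor for investigation, not as definitively faulty.
        """
        peers = neighbors.get(sensor_id, set())
        if not peers:
            return None   # no overlap: cross-validation not applicable
        scores = [detection_agreement(detections[sensor_id], detections[p])
                  for p in peers if p in detections]
        return sum(scores) / len(scores) if scores else None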


The Limits of Self-Measurement

Self-measurement has boundaries. Recognizing these limits is essential for correct system design.

Novel Failure Modes

Anomaly detection learns from historical data. A failure mode never seen before—outside the training distribution—may not be detected as anomalous.

Example: OUTPOST sensors are trained on hardware failures, communication failures, and known environmental conditions. A new adversarial technique—acoustic disruption of MEMS sensors—produces sensor behavior within “normal” ranges but with corrupted data. The anomaly detector sees normal statistics; the semantic content is compromised.

Mitigation: Defense in depth. Multiple detection mechanisms with different assumptions. Cross-validation between sensors. Periodic ground-truth verification when connectivity allows.

Adversarial Understanding

An adversary who understands the detection algorithm can craft attacks that evade detection.

If the adversary knows we use EWMA with \(\alpha = 0.1\), they can introduce gradual drift that stays within 2σ at each step but accumulates to significant deviation over time. The “boiling frog” attack.

Mitigation: Ensemble of detection algorithms with different sensitivities. Long-term drift detection (comparing current baseline to baseline from days ago). Randomized detection parameters.

Cascading Failures

Self-measurement assumes the measurement infrastructure is functional. But the measurement infrastructure can fail too.

If the power management system fails, anomaly detection may lose power before it can detect the power anomaly. If the communication subsystem fails, gossip cannot propagate health. The failure cascades faster than measurement can track.

Mitigation: P0/P1 monitoring on dedicated, ultra-low-power subsystem. Watchdog timers that trigger even if main processor fails. Hardware-level health indicators independent of software.

The Judgment Horizon

When should the system distrust its own measurements?

At the judgment horizon, self-measurement must acknowledge its limits. The system should:

  1. Log that it has reached measurement uncertainty limits
  2. Fall back to conservative assumptions
  3. Request human input when connectivity allows
  4. Avoid irreversible actions until confidence is restored

Sensor 47 Resolution

Return to our opening scenario. Sensor 47 went silent. How did OUTPOST diagnose the failure?

The fusion node applied the diagnostic framework from Section 2.3:

  1. Signature analysis: Abrupt silence, no prior degradation—rules out gradual hardware degradation, but consistent with sudden component failure
  2. Correlation check: Sensors 45, 46, 48, 49 all operational—not a regional communication failure
  3. Environmental context: No known jamming indicators, weather nominal
  4. Staleness trajectory: Sensor 47’s last 10 readings showed normal variance, no drift

Diagnosis: Localized hardware failure (most likely power regulation), with 78% confidence. The fusion node:

Post-reconnection analysis (satellite uplink restored 6 hours later): Sensor 47’s voltage regulator had failed suddenly—a known failure mode for this component batch. The diagnosis was correct. The system had self-measured, self-diagnosed, and self-healed without human intervention.

Learning from Measurement Failures

Anti-fragile self-measurement improves from its failures. When post-hoc analysis reveals a measurement failure:

  1. Document the failure mode
  2. Add detection signature if possible
  3. Adjust thresholds or algorithms
  4. Update training data to include this case

Each measurement failure is an opportunity to improve future measurement.


Closing: The Measurement-Action Loop

Self-measurement without self-action is just logging.

You measure in order to act—to heal, adapt, improve. The measurement-action loop drives autonomic architecture:

    
    graph LR
    M["Monitor
(observe state)"] --> A["Analyze
(detect anomaly)"] A --> P["Plan
(select action)"] P --> E["Execute
(apply healing)"] E -->|"feedback loop"| M style M fill:#c8e6c9 style A fill:#fff9c4 style P fill:#ffcc80 style E fill:#ffab91

This is the MAPE-K loop (Monitor, Analyze, Plan, Execute, Knowledge) that IBM formalized for autonomic computing. The self-healing article develops the healing phase in detail.

Return to OUTPOST BRAVO.

Sensor 47 is silent. The fusion node has measured: abrupt silence, neighbors functional, location on approach path. The analysis suggests adversarial action with 73% confidence. The plan: increase defensive posture, activate backup sensors in the region, log for human review when uplink restores.

But measurement alone doesn’t execute this plan. Self-healing must decide: Is 73% confidence sufficient to escalate defensive posture? What is the cost of false alarm versus missed threat? How does the healing action affect the rest of the system?

The next article on self-healing develops the engineering principles for autonomous healing under uncertainty.

