wifi-densepose/docs/adr/ADR-176-ruview-swarm-nan-fa...

104 lines
6.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-176: `ruview-swarm` NaN-Fail-Open Safety Review
| Field | Value |
|-------|-------|
| **Status** | Accepted — 4 real safety bugs fixed + pinned; 2 issues documented for follow-up |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **SWARM-FAILCLOSED** |
| **Reviews** | ADR-148 (`ruview-swarm` drone swarm control plane) |
| **Milestone** | #9 (ungated-crate security sweep) — crate 1 of 4 |
## Context
`ruview-swarm` (ADR-148) is the drone swarm control plane — hierarchical-mesh
topology, Raft consensus, MARL, CSI sensing payload, MAVLink/PX4 command
dispatch. It is the highest-stakes of the four never-reviewed v2 crates: a defect
here can produce an **unsafe physical drone command**. It had no prior security
ADR.
### Trust-boundary map
Untrusted input enters via `SwarmOrchestrator::receive_peer_state` /
`receive_peer_detection`, which accept full `DroneState` / `CsiDetection` serde
structs with **f64/f32 fields and no finite-check**, and via
`SwarmConfig`/`FhssConfig`/`Geofence` deserialization. The MAVLink wire formats in
`mavlink_messages.rs` are **integer-encoded** (i32 mm / u8) and provably cannot
carry NaN — so the NaN class is reachable through the **serde struct path, not the
MAVLink decode path**. Commands flow out to a `FlightController` (PX4/ArduPilot).
The unifying bug class found: **IEEE-754 NaN/Inf silently defeating a safety
comparison** (`NaN < threshold` evaluates to `false`), causing safety logic to
**fail OPEN**. This is distinct from but rhymes with the NaN-state-poisoning
class found earlier in calibration/vitals/geo (there, NaN latched into persistent
state; here, NaN slips through a one-shot guard). Both are "non-finite input
defeats logic," and the fix discipline is the same: **reject non-finite at the
trust boundary, fail CLOSED.**
## Decision
Fix the four reachable fail-open bugs by making each safety predicate
non-finite-aware and fail-closed, each pinned by a fails-on-old test. Document
two further genuine issues that need larger, riskier changes rather than churning
them in a security pass.
### Findings fixed (all MEASURED fails-on-old)
| # | Severity | File:line | Issue | Fix | Pin (old behavior) |
|---|----------|-----------|-------|-----|--------------------|
| F1a | **HIGH** | `failsafe/mod.rs:51` | `nearest_neighbor_dist < collision_dist_m` fails open on a NaN peer position **collision avoidance silently disabled** | `!is_finite() ||` `EmergencyDiverge` | `test_nan_neighbor_distance_fails_closed_to_diverge` (old `Nominal`) |
| F1b | **HIGH** | `failsafe/mod.rs:75` | NaN `battery_pct` bypasses every battery check drone stays Nominal on unknown battery | `!is_finite() ||` `ReturnToHome` | `test_nan_battery_fails_closed_to_rth` (old `Nominal`) |
| F2 | **MEDIUM** | `security/geofence.rs:33` | NaN `z` altitude skips the altitude-breach check and point-in-polygon returns `Safe` silent geofence bypass | leading non-finite coord `HardBreach` | `test_nan_altitude_fails_closed` (old `Safe`) |
| F3 | **MEDIUM/DoS** | `security/antijamming.rs:65,71,102` | empty deserialized `channels_mhz` `% 0` **panic** in `next_hop`/`current_channel_mhz`/`evasive_hop`/`tick`, crashing the radio task | `len == 0` early-return (`0.0` sentinel) | `test_empty_channels_does_not_panic` (old panic `divisor of zero`) |
| F4 | **LOW** | `sensing/multiview.rs:70` | NaN `victim_position` passes the `is_some()` filter and propagates into the fused "confirmed victim" location dispatched to the swarm | require finite confidence + position (drop) | `test_nan_victim_position_dropped_from_fusion` (old non-finite fused position) |
### Dimensions confirmed clean (with evidence)
- **MAVLink decode panic-safety** `SwarmNodeState::decode(&[u8;20])` `try_into().unwrap()`s are over fixed const ranges of a fixed-size array provably infallible; no arbitrary-length `&[u8]` decode path exists.
- **UWB/GPS anti-spoofing NaN-safe** `(gps_dist - uwb_dist).abs() <= tol` already fails CLOSED on a NaN range (counts as inconsistent spoof rejected); covered by `test_spoofed_gps_invalid`.
- **Bounded grid / no allocate-from-length-field** `ProbabilityGrid` bounds-checks `cx/cy`; `pos_to_cell` uses saturating `as u32` (no UB).
- **Mesh `nearest_k` NaN-safe sort** `partial_cmp(..).unwrap_or(Equal)` cannot panic on NaN.
- **No hardcoded secrets** `MavlinkSigner` key is constructor-injected `[u8;32]`; grep-confirmed nothing embedded.
### Documented, not fixed (genuine — deferred to avoid churn/regression risk)
1. **Raft `AppendEntries` lacks the Log-Matching consistency check**
(`topology/raft.rs:187`). A follower appends a leader's entries when
`term >= current_term` **without validating `prev_log_index`/`prev_log_term`**,
so a malformed/byzantine leader can corrupt a follower's log a genuine
consensus-safety gap. A correct fix reworks the log-append plus the
caller-side vote-tally contract (the existing `handle_message` delegates
tallying to the caller) a larger change with test-rewrite risk, so it is
recorded here rather than rushed in a security pass.
2. **`MavlinkSigner::verify` uses a non-constant-time tag `==` and has no
replay/timestamp-window rejection** (`security/mavlink_signing.rs:64`). The
module doc already flags the replay limitation as a demo/test simplification.
Hardening (constant-time compare + monotonic timestamp window) is a focused
follow-up.
These two are the recommended scope of the next `ruview-swarm` hardening pass.
## Validation
- `cargo test -p ruview-swarm --no-default-features` **117 → 123** passed, 0 failed (+6 pins).
- All 6 new tests MEASURED fails-on-old (2× `Nominal`, `Safe`, panic `divisor of zero`, non-finite fused position); pass on the fix.
- `cargo test --workspace --no-default-features` **exit 0**, 0 failed.
- `python archive/v1/data/proof/verify.py` **VERDICT: PASS**, hash
`f8e76f21…46f7a` unchanged (ruview-swarm off the signal proof path).
## Consequences
### Positive
- Four reachable fail-open paths in a *physical-safety* control plane (collision
avoidance, battery RTH, geofence, anti-jamming radio task) now fail CLOSED on
hostile/degenerate input, each regression-pinned.
- Extends the "non-finite input defeats logic" defense from the state-poisoning
variant (calibration/vitals/geo) to the fail-open-comparison variant.
### Negative / Neutral
- Two genuine issues (Raft log-matching, MAVLink signer) remain open by choice
see Documented-not-fixed; they define the next hardening pass.
## Links
- ADR-148 `ruview-swarm` drone swarm control system
- ADR-172 core/cli review (where the NaN bug-class root question was settled NO)
- ADR-127 homecore review (sibling NaN/concurrency hardening)