104 lines
6.7 KiB
Markdown
104 lines
6.7 KiB
Markdown
# ADR-176: `ruview-swarm` NaN-Fail-Open Safety Review
|
||
|
||
| Field | Value |
|
||
|-------|-------|
|
||
| **Status** | Accepted — 4 real safety bugs fixed + pinned; 2 issues documented for follow-up |
|
||
| **Date** | 2026-06-15 |
|
||
| **Deciders** | ruv |
|
||
| **Codename** | **SWARM-FAILCLOSED** |
|
||
| **Reviews** | ADR-148 (`ruview-swarm` drone swarm control plane) |
|
||
| **Milestone** | #9 (ungated-crate security sweep) — crate 1 of 4 |
|
||
|
||
## Context
|
||
|
||
`ruview-swarm` (ADR-148) is the drone swarm control plane — hierarchical-mesh
|
||
topology, Raft consensus, MARL, CSI sensing payload, MAVLink/PX4 command
|
||
dispatch. It is the highest-stakes of the four never-reviewed v2 crates: a defect
|
||
here can produce an **unsafe physical drone command**. It had no prior security
|
||
ADR.
|
||
|
||
### Trust-boundary map
|
||
Untrusted input enters via `SwarmOrchestrator::receive_peer_state` /
|
||
`receive_peer_detection`, which accept full `DroneState` / `CsiDetection` serde
|
||
structs with **f64/f32 fields and no finite-check**, and via
|
||
`SwarmConfig`/`FhssConfig`/`Geofence` deserialization. The MAVLink wire formats in
|
||
`mavlink_messages.rs` are **integer-encoded** (i32 mm / u8) and provably cannot
|
||
carry NaN — so the NaN class is reachable through the **serde struct path, not the
|
||
MAVLink decode path**. Commands flow out to a `FlightController` (PX4/ArduPilot).
|
||
|
||
The unifying bug class found: **IEEE-754 NaN/Inf silently defeating a safety
|
||
comparison** (`NaN < threshold` evaluates to `false`), causing safety logic to
|
||
**fail OPEN**. This is distinct from — but rhymes with — the NaN-state-poisoning
|
||
class found earlier in calibration/vitals/geo (there, NaN latched into persistent
|
||
state; here, NaN slips through a one-shot guard). Both are "non-finite input
|
||
defeats logic," and the fix discipline is the same: **reject non-finite at the
|
||
trust boundary, fail CLOSED.**
|
||
|
||
## Decision
|
||
|
||
Fix the four reachable fail-open bugs by making each safety predicate
|
||
non-finite-aware and fail-closed, each pinned by a fails-on-old test. Document
|
||
two further genuine issues that need larger, riskier changes rather than churning
|
||
them in a security pass.
|
||
|
||
### Findings fixed (all MEASURED fails-on-old)
|
||
|
||
| # | Severity | File:line | Issue | Fix | Pin (old behavior) |
|
||
|---|----------|-----------|-------|-----|--------------------|
|
||
| F1a | **HIGH** | `failsafe/mod.rs:51` | `nearest_neighbor_dist < collision_dist_m` fails open on a NaN peer position → **collision avoidance silently disabled** | `!is_finite() ||` → `EmergencyDiverge` | `test_nan_neighbor_distance_fails_closed_to_diverge` (old → `Nominal`) |
|
||
| F1b | **HIGH** | `failsafe/mod.rs:75` | NaN `battery_pct` bypasses every battery check → drone stays Nominal on unknown battery | `!is_finite() ||` → `ReturnToHome` | `test_nan_battery_fails_closed_to_rth` (old → `Nominal`) |
|
||
| F2 | **MEDIUM** | `security/geofence.rs:33` | NaN `z` altitude skips the altitude-breach check and point-in-polygon returns `Safe` → silent geofence bypass | leading non-finite coord → `HardBreach` | `test_nan_altitude_fails_closed` (old → `Safe`) |
|
||
| F3 | **MEDIUM/DoS** | `security/antijamming.rs:65,71,102` | empty deserialized `channels_mhz` → `% 0` **panic** in `next_hop`/`current_channel_mhz`/`evasive_hop`/`tick`, crashing the radio task | `len == 0` early-return (`0.0` sentinel) | `test_empty_channels_does_not_panic` (old → panic `divisor of zero`) |
|
||
| F4 | **LOW** | `sensing/multiview.rs:70` | NaN `victim_position` passes the `is_some()` filter and propagates into the fused "confirmed victim" location dispatched to the swarm | require finite confidence + position (drop) | `test_nan_victim_position_dropped_from_fusion` (old → non-finite fused position) |
|
||
|
||
### Dimensions confirmed clean (with evidence)
|
||
- **MAVLink decode panic-safety** — `SwarmNodeState::decode(&[u8;20])` `try_into().unwrap()`s are over fixed const ranges of a fixed-size array → provably infallible; no arbitrary-length `&[u8]` decode path exists.
|
||
- **UWB/GPS anti-spoofing NaN-safe** — `(gps_dist - uwb_dist).abs() <= tol` already fails CLOSED on a NaN range (counts as inconsistent → spoof rejected); covered by `test_spoofed_gps_invalid`.
|
||
- **Bounded grid / no allocate-from-length-field** — `ProbabilityGrid` bounds-checks `cx/cy`; `pos_to_cell` uses saturating `as u32` (no UB).
|
||
- **Mesh `nearest_k` NaN-safe sort** — `partial_cmp(..).unwrap_or(Equal)` cannot panic on NaN.
|
||
- **No hardcoded secrets** — `MavlinkSigner` key is constructor-injected `[u8;32]`; grep-confirmed nothing embedded.
|
||
|
||
### Documented, not fixed (genuine — deferred to avoid churn/regression risk)
|
||
|
||
1. **Raft `AppendEntries` lacks the Log-Matching consistency check**
|
||
(`topology/raft.rs:187`). A follower appends a leader's entries when
|
||
`term >= current_term` **without validating `prev_log_index`/`prev_log_term`**,
|
||
so a malformed/byzantine leader can corrupt a follower's log — a genuine
|
||
consensus-safety gap. A correct fix reworks the log-append plus the
|
||
caller-side vote-tally contract (the existing `handle_message` delegates
|
||
tallying to the caller) — a larger change with test-rewrite risk, so it is
|
||
recorded here rather than rushed in a security pass.
|
||
2. **`MavlinkSigner::verify` uses a non-constant-time tag `==` and has no
|
||
replay/timestamp-window rejection** (`security/mavlink_signing.rs:64`). The
|
||
module doc already flags the replay limitation as a demo/test simplification.
|
||
Hardening (constant-time compare + monotonic timestamp window) is a focused
|
||
follow-up.
|
||
|
||
These two are the recommended scope of the next `ruview-swarm` hardening pass.
|
||
|
||
## Validation
|
||
|
||
- `cargo test -p ruview-swarm --no-default-features` → **117 → 123** passed, 0 failed (+6 pins).
|
||
- All 6 new tests MEASURED fails-on-old (2× `Nominal`, `Safe`, panic `divisor of zero`, non-finite fused position); pass on the fix.
|
||
- `cargo test --workspace --no-default-features` → **exit 0**, 0 failed.
|
||
- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash
|
||
`f8e76f21…46f7a` unchanged (ruview-swarm off the signal proof path).
|
||
|
||
## Consequences
|
||
|
||
### Positive
|
||
- Four reachable fail-open paths in a *physical-safety* control plane (collision
|
||
avoidance, battery RTH, geofence, anti-jamming radio task) now fail CLOSED on
|
||
hostile/degenerate input, each regression-pinned.
|
||
- Extends the "non-finite input defeats logic" defense from the state-poisoning
|
||
variant (calibration/vitals/geo) to the fail-open-comparison variant.
|
||
|
||
### Negative / Neutral
|
||
- Two genuine issues (Raft log-matching, MAVLink signer) remain open by choice —
|
||
see Documented-not-fixed; they define the next hardening pass.
|
||
|
||
## Links
|
||
- ADR-148 — `ruview-swarm` drone swarm control system
|
||
- ADR-172 — core/cli review (where the NaN bug-class root question was settled NO)
|
||
- ADR-127 — homecore review (sibling NaN/concurrency hardening)
|