ADR-176: ruview-swarm NaN-Fail-Open Safety Review
| Field | Value |
|---|---|
| Status | Accepted — 4 real safety bugs fixed + pinned; 2 issues documented for follow-up |
| Date | 2026-06-15 |
| Deciders | ruv |
| Codename | SWARM-FAILCLOSED |
| Reviews | ADR-148 (ruview-swarm drone swarm control plane) |
| Milestone | #9 (ungated-crate security sweep) — crate 1 of 4 |
Context
ruview-swarm (ADR-148) is the drone swarm control plane — hierarchical-mesh
topology, Raft consensus, MARL, CSI sensing payload, MAVLink/PX4 command
dispatch. It is the highest-stakes of the four never-reviewed v2 crates: a defect
here can produce an unsafe physical drone command. It had no prior security
ADR.
Trust-boundary map
Untrusted input enters via SwarmOrchestrator::receive_peer_state /
receive_peer_detection, which accept full DroneState / CsiDetection serde
structs with f64/f32 fields and no finite-check, and via
SwarmConfig/FhssConfig/Geofence deserialization. The MAVLink wire formats in
mavlink_messages.rs are integer-encoded (i32 mm / u8) and provably cannot
carry NaN — so the NaN class is reachable through the serde struct path, not the
MAVLink decode path. Commands flow out to a FlightController (PX4/ArduPilot).
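The difference between the two input paths can be sketched as follows. This is a minimal illustration, not the crate's code: `PeerStateSketch`, `RejectReason`, `validate_peer_state`, and `mm_to_m` are hypothetical names, and the real `DroneState` has more fields than shown here.

```rust
/// Hypothetical stand-in for a deserialized peer state
/// (NOT the crate's real `DroneState`).
#[derive(Debug, Clone, Copy)]
struct PeerStateSketch {
    position_m: [f64; 3],
    battery_pct: f32,
}

/// Hypothetical rejection reasons for the boundary check.
#[derive(Debug, PartialEq)]
enum RejectReason {
    NonFinitePosition,
    NonFiniteBattery,
}

/// Boundary check for the serde struct path: float fields can carry
/// NaN or infinity, so they are rejected before any safety logic runs.
fn validate_peer_state(s: &PeerStateSketch) -> Result<(), RejectReason> {
    if s.position_m.iter().any(|c| !c.is_finite()) {
        return Err(RejectReason::NonFinitePosition);
    }
    if !s.battery_pct.is_finite() {
        return Err(RejectReason::NonFiniteBattery);
    }
    Ok(())
}

/// Integer-encoded wire value (millimetres): every i32 converts to a
/// finite f64, so this path cannot introduce NaN.
fn mm_to_m(mm: i32) -> f64 {
    f64::from(mm) / 1000.0
}
```

A caller such as a peer-state handler would run the validation first and drop the message on `Err`, so that no downstream comparison ever sees a non-finite value.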
The unifying bug class found: IEEE-754 NaN/Inf silently defeating a safety
comparison (NaN < threshold evaluates to false), causing safety logic to
fail OPEN. This is distinct from — but rhymes with — the NaN-state-poisoning
class found earlier in calibration/vitals/geo (there, NaN latched into persistent
state; here, NaN slips through a one-shot guard). Both are "non-finite input
defeats logic," and the fix discipline is the same: reject non-finite at the
trust boundary, fail CLOSED.
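To make the bug class concrete, here is a small sketch of a fail-open predicate next to a fail-closed one. The function names are illustrative assumptions; only the comparison `nearest_neighbor_dist < collision_dist_m` and the use of `is_finite()` come from this review.

```rust
/// Old-style predicate: every ordered comparison with NaN is false,
/// so a NaN distance reports "no collision risk" (fails OPEN).
fn collision_risk_fail_open(nearest_neighbor_dist: f64, collision_dist_m: f64) -> bool {
    nearest_neighbor_dist < collision_dist_m
}

/// Non-finite-aware predicate: NaN or infinity is treated as a risk,
/// so unknown input triggers the safety action (fails CLOSED).
fn collision_risk_fail_closed(nearest_neighbor_dist: f64, collision_dist_m: f64) -> bool {
    !nearest_neighbor_dist.is_finite() || nearest_neighbor_dist < collision_dist_m
}
```

One design caveat of this sketch: treating infinity as a risk assumes the caller never uses `f64::INFINITY` as a legitimate "no neighbor" value. If it does, that case needs its own explicit representation, such as an `Option`.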
Decision
Fix the four reachable fail-open bugs by making each safety predicate non-finite-aware and fail-closed, each pinned by a fails-on-old test. Document two further genuine issues that need larger, riskier changes rather than churning them in a security pass.
Findings fixed (all MEASURED fails-on-old)
| # | Severity | File:line | Issue | Fix | Pin (old behavior) |
|---|---|---|---|---|---|
| F1a | HIGH | failsafe/mod.rs:51 | `nearest_neighbor_dist < collision_dist_m` fails open on a NaN peer position → collision avoidance silently disabled | `!is_finite() \|\| …` | (old → `Nominal`) |
| F1b | HIGH | failsafe/mod.rs:75 | NaN `battery_pct` bypasses every battery check → drone stays `Nominal` on unknown battery | `!is_finite() \|\| …` | (old → `Nominal`) |
| F2 | MEDIUM | security/geofence.rs:33 | NaN `z` altitude skips the altitude-breach check and point-in-polygon returns `Safe` → silent geofence bypass | leading non-finite coord → `HardBreach` | `test_nan_altitude_fails_closed` (old → `Safe`) |
| F3 | MEDIUM/DoS | security/antijamming.rs:65,71,102 | empty deserialized `channels_mhz` → `% 0` panic in `next_hop`/`current_channel_mhz`/`evasive_hop`/`tick`, crashing the radio task | `len == 0` early-return (`0.0` sentinel) | `test_empty_channels_does_not_panic` (old → panic `divisor of zero`) |
| F4 | LOW | sensing/multiview.rs:70 | NaN `victim_position` passes the `is_some()` filter and propagates into the fused "confirmed victim" location dispatched to the swarm | require finite confidence + position (drop) | `test_nan_victim_position_dropped_from_fusion` (old → non-finite fused position) |
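The F2 and F3 fixes can be illustrated with a short sketch. The type and function names (`GeofenceStatus`, `check_altitude`, `FhssSketch`) and the field layout are assumptions made for illustration; only the `Safe`/`HardBreach` outcomes, the `channels_mhz` field, the `next_hop` name, and the `0.0` sentinel are taken from the table.

```rust
/// Illustrative geofence outcome; the variant names follow the table.
#[derive(Debug, PartialEq)]
enum GeofenceStatus {
    Safe,
    HardBreach,
}

/// F2 sketch: reject any non-finite coordinate BEFORE the altitude
/// comparison, so NaN cannot slip past `z > max_alt_m`.
fn check_altitude(pos: [f64; 3], max_alt_m: f64) -> GeofenceStatus {
    if pos.iter().any(|c| !c.is_finite()) {
        return GeofenceStatus::HardBreach; // fail CLOSED
    }
    if pos[2] > max_alt_m {
        GeofenceStatus::HardBreach
    } else {
        GeofenceStatus::Safe
    }
}

/// F3 sketch: a hypothetical frequency hopper whose channel list may
/// arrive empty from deserialization.
struct FhssSketch {
    channels_mhz: Vec<f32>,
    index: usize,
}

impl FhssSketch {
    /// Advances to the next channel. An empty list returns the 0.0
    /// sentinel instead of evaluating `% 0`, which would panic.
    fn next_hop(&mut self) -> f32 {
        if self.channels_mhz.is_empty() {
            return 0.0;
        }
        self.index = (self.index + 1) % self.channels_mhz.len();
        self.channels_mhz[self.index]
    }
}
```

A sentinel keeps the radio task alive, but callers then have to treat `0.0` as "no valid channel". Rejecting an empty list at configuration load time would be a stricter alternative.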
Dimensions confirmed clean (with evidence)
- MAVLink decode panic-safety: the `try_into().unwrap()`s in `SwarmNodeState::decode(&[u8;20])` are over fixed const ranges of a fixed-size array → provably infallible; no arbitrary-length `&[u8]` decode path exists.
- UWB/GPS anti-spoofing NaN-safe: `(gps_dist - uwb_dist).abs() <= tol` already fails CLOSED on a NaN range (counts as inconsistent → spoof rejected); covered by `test_spoofed_gps_invalid`.
- Bounded grid / no allocate-from-length-field: `ProbabilityGrid` bounds-checks `cx`/`cy`; `pos_to_cell` uses saturating `as u32` (no UB).
- Mesh `nearest_k` NaN-safe sort: `partial_cmp(..).unwrap_or(Equal)` cannot panic on NaN.
- No hardcoded secrets: the `MavlinkSigner` key is a constructor-injected `[u8;32]`; grep-confirmed nothing embedded.
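Two of the clean findings rest on language-level behavior that is easy to check in isolation. The sketch below is a hedged illustration: `ranges_consistent` and `pos_to_cell_sketch` are hypothetical helpers, not the crate's functions. It shows why an "accept if within tolerance" comparison fails closed on NaN, and how float-to-integer `as` casts saturate in Rust.

```rust
/// Acceptance-style check: any comparison with NaN is false, so a NaN
/// range is reported as inconsistent (fails CLOSED) with no extra code.
fn ranges_consistent(gps_dist: f64, uwb_dist: f64, tol: f64) -> bool {
    (gps_dist - uwb_dist).abs() <= tol
}

/// Hypothetical cell mapping. Float-to-int `as` casts saturate:
/// NaN -> 0, negative -> 0, too large -> u32::MAX, so there is no UB.
fn pos_to_cell_sketch(pos_m: f64, cell_size_m: f64) -> u32 {
    (pos_m / cell_size_m) as u32
}
```

The contrast with the fixed findings is the direction of the predicate. A check that grants trust when the comparison is true rejects NaN for free, while a check that raises an alarm when the comparison is true silently ignores it.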
Documented, not fixed (genuine — deferred to avoid churn/regression risk)
- Raft `AppendEntries` lacks the Log-Matching consistency check (topology/raft.rs:187). A follower appends a leader's entries when `term >= current_term` without validating `prev_log_index`/`prev_log_term`, so a malformed or byzantine leader can corrupt a follower's log. This is a genuine consensus-safety gap. A correct fix reworks the log-append plus the caller-side vote-tally contract (the existing `handle_message` delegates tallying to the caller). That is a larger change with test-rewrite risk, so it is recorded here rather than rushed in a security pass.
- `MavlinkSigner::verify` uses a non-constant-time tag `==` and has no replay/timestamp-window rejection (security/mavlink_signing.rs:64). The module doc already flags the replay limitation as a demo/test simplification. Hardening (constant-time compare + monotonic timestamp window) is a focused follow-up.
These two are the recommended scope of the next ruview-swarm hardening pass.
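As a rough orientation for that follow-up pass, the sketch below shows the shape of both deferred checks. It is a simplified illustration under stated assumptions, not a proposed patch: `LogEntry`, `FollowerSketch`, `ct_eq`, and `ReplayWindow` are hypothetical, the log index is assumed 1-based with 0 meaning "no previous entry", and the crate's actual message and signer types are not reproduced here.

```rust
#[derive(Clone, Debug, PartialEq)]
struct LogEntry {
    term: u64,
    command: String,
}

/// Hypothetical follower state (not the crate's Raft node).
struct FollowerSketch {
    current_term: u64,
    log: Vec<LogEntry>,
}

impl FollowerSketch {
    /// Log-Matching check: refuse entries unless the follower holds an
    /// entry at `prev_log_index` whose term equals `prev_log_term`.
    fn append_entries(
        &mut self,
        term: u64,
        prev_log_index: usize, // 1-based; 0 = no previous entry
        prev_log_term: u64,
        entries: &[LogEntry],
    ) -> bool {
        if term < self.current_term {
            return false; // stale leader
        }
        self.current_term = term;
        if prev_log_index > 0 {
            let prev_term = self.log.get(prev_log_index - 1).map(|e| e.term);
            if prev_term != Some(prev_log_term) {
                return false; // gap or term mismatch: log left untouched
            }
        }
        for (i, entry) in entries.iter().enumerate() {
            let idx = prev_log_index + i; // 0-based slot for this entry
            let existing_term = self.log.get(idx).map(|e| e.term);
            if existing_term == Some(entry.term) {
                continue; // already present
            }
            if existing_term.is_some() {
                self.log.truncate(idx); // conflict: drop the divergent suffix
            }
            self.log.push(entry.clone());
        }
        true
    }
}

/// Comparison that inspects every byte regardless of where the first
/// difference is. Illustrative only: the compiler may still optimize
/// it, so production code usually relies on a vetted crate.
fn ct_eq(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut diff = 0u8;
    for (x, y) in a.iter().zip(b.iter()) {
        diff |= x ^ y;
    }
    diff == 0
}

/// Monotonic timestamp window: a message is accepted only if its
/// timestamp is strictly newer than the last accepted one.
struct ReplayWindow {
    last_timestamp: u64,
}

impl ReplayWindow {
    fn accept(&mut self, timestamp: u64) -> bool {
        if timestamp <= self.last_timestamp {
            return false; // replayed or stale
        }
        self.last_timestamp = timestamp;
        true
    }
}
```

The Raft part deliberately omits commit-index handling and the vote-tally contract mentioned above, which is where most of the real rework would sit.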
Validation
- `cargo test -p ruview-swarm --no-default-features` → 117 → 123 passed, 0 failed (+6 pins).
- All 6 new tests MEASURED fails-on-old (2× `Nominal`, `Safe`, panic `divisor of zero`, non-finite fused position); pass on the fix.
- `cargo test --workspace --no-default-features` → exit 0, 0 failed.
- `python archive/v1/data/proof/verify.py` → VERDICT: PASS, hash `f8e76f21…46f7a` unchanged (ruview-swarm off the signal proof path).
Consequences
Positive
- Four reachable fail-open paths in a physical-safety control plane (collision avoidance, battery RTH, geofence, anti-jamming radio task) now fail CLOSED on hostile/degenerate input, each regression-pinned.
- Extends the "non-finite input defeats logic" defense from the state-poisoning variant (calibration/vitals/geo) to the fail-open-comparison variant.
Negative / Neutral
- Two genuine issues (Raft log-matching, MAVLink signer) remain open by choice — see Documented-not-fixed; they define the next hardening pass.
Links
- ADR-148: `ruview-swarm` drone swarm control system
- ADR-172: core/cli review (where the NaN bug-class root question was settled NO)
- ADR-127: homecore review (sibling NaN/concurrency hardening)