wifi-densepose/docs/adr/ADR-176-ruview-swarm-nan-fa...

6.7 KiB
Raw Blame History

ADR-176: ruview-swarm NaN-Fail-Open Safety Review

Field Value
Status Accepted — 4 real safety bugs fixed + pinned; 2 issues documented for follow-up
Date 2026-06-15
Deciders ruv
Codename SWARM-FAILCLOSED
Reviews ADR-148 (ruview-swarm drone swarm control plane)
Milestone #9 (ungated-crate security sweep) — crate 1 of 4

Context

ruview-swarm (ADR-148) is the drone swarm control plane — hierarchical-mesh topology, Raft consensus, MARL, CSI sensing payload, MAVLink/PX4 command dispatch. It is the highest-stakes of the four never-reviewed v2 crates: a defect here can produce an unsafe physical drone command. It had no prior security ADR.

Trust-boundary map

Untrusted input enters via SwarmOrchestrator::receive_peer_state / receive_peer_detection, which accept full DroneState / CsiDetection serde structs with f64/f32 fields and no finite-check, and via SwarmConfig/FhssConfig/Geofence deserialization. The MAVLink wire formats in mavlink_messages.rs are integer-encoded (i32 mm / u8) and provably cannot carry NaN — so the NaN class is reachable through the serde struct path, not the MAVLink decode path. Commands flow out to a FlightController (PX4/ArduPilot).

The unifying bug class found: IEEE-754 NaN/Inf silently defeating a safety comparison (NaN < threshold evaluates to false), causing safety logic to fail OPEN. This is distinct from — but rhymes with — the NaN-state-poisoning class found earlier in calibration/vitals/geo (there, NaN latched into persistent state; here, NaN slips through a one-shot guard). Both are "non-finite input defeats logic," and the fix discipline is the same: reject non-finite at the trust boundary, fail CLOSED.

Decision

Fix the four reachable fail-open bugs by making each safety predicate non-finite-aware and fail-closed, each pinned by a fails-on-old test. Document two further genuine issues that need larger, riskier changes rather than churning them in a security pass.

Findings fixed (all MEASURED fails-on-old)

# Severity File:line Issue Fix Pin (old behavior)
F1a HIGH failsafe/mod.rs:51 nearest_neighbor_dist < collision_dist_m fails open on a NaN peer position → collision avoidance silently disabled `!is_finite()
F1b HIGH failsafe/mod.rs:75 NaN battery_pct bypasses every battery check → drone stays Nominal on unknown battery `!is_finite()
F2 MEDIUM security/geofence.rs:33 NaN z altitude skips the altitude-breach check and point-in-polygon returns Safe → silent geofence bypass leading non-finite coord → HardBreach test_nan_altitude_fails_closed (old → Safe)
F3 MEDIUM/DoS security/antijamming.rs:65,71,102 empty deserialized channels_mhz% 0 panic in next_hop/current_channel_mhz/evasive_hop/tick, crashing the radio task len == 0 early-return (0.0 sentinel) test_empty_channels_does_not_panic (old → panic divisor of zero)
F4 LOW sensing/multiview.rs:70 NaN victim_position passes the is_some() filter and propagates into the fused "confirmed victim" location dispatched to the swarm require finite confidence + position (drop) test_nan_victim_position_dropped_from_fusion (old → non-finite fused position)

Dimensions confirmed clean (with evidence)

  • MAVLink decode panic-safetySwarmNodeState::decode(&[u8;20]) try_into().unwrap()s are over fixed const ranges of a fixed-size array → provably infallible; no arbitrary-length &[u8] decode path exists.
  • UWB/GPS anti-spoofing NaN-safe(gps_dist - uwb_dist).abs() <= tol already fails CLOSED on a NaN range (counts as inconsistent → spoof rejected); covered by test_spoofed_gps_invalid.
  • Bounded grid / no allocate-from-length-fieldProbabilityGrid bounds-checks cx/cy; pos_to_cell uses saturating as u32 (no UB).
  • Mesh nearest_k NaN-safe sortpartial_cmp(..).unwrap_or(Equal) cannot panic on NaN.
  • No hardcoded secretsMavlinkSigner key is constructor-injected [u8;32]; grep-confirmed nothing embedded.

Documented, not fixed (genuine — deferred to avoid churn/regression risk)

  1. Raft AppendEntries lacks the Log-Matching consistency check (topology/raft.rs:187). A follower appends a leader's entries when term >= current_term without validating prev_log_index/prev_log_term, so a malformed/byzantine leader can corrupt a follower's log — a genuine consensus-safety gap. A correct fix reworks the log-append plus the caller-side vote-tally contract (the existing handle_message delegates tallying to the caller) — a larger change with test-rewrite risk, so it is recorded here rather than rushed in a security pass.
  2. MavlinkSigner::verify uses a non-constant-time tag == and has no replay/timestamp-window rejection (security/mavlink_signing.rs:64). The module doc already flags the replay limitation as a demo/test simplification. Hardening (constant-time compare + monotonic timestamp window) is a focused follow-up.

These two are the recommended scope of the next ruview-swarm hardening pass.

Validation

  • cargo test -p ruview-swarm --no-default-features117 → 123 passed, 0 failed (+6 pins).
  • All 6 new tests MEASURED fails-on-old (2× Nominal, Safe, panic divisor of zero, non-finite fused position); pass on the fix.
  • cargo test --workspace --no-default-featuresexit 0, 0 failed.
  • python archive/v1/data/proof/verify.pyVERDICT: PASS, hash f8e76f21…46f7a unchanged (ruview-swarm off the signal proof path).

Consequences

Positive

  • Four reachable fail-open paths in a physical-safety control plane (collision avoidance, battery RTH, geofence, anti-jamming radio task) now fail CLOSED on hostile/degenerate input, each regression-pinned.
  • Extends the "non-finite input defeats logic" defense from the state-poisoning variant (calibration/vitals/geo) to the fail-open-comparison variant.

Negative / Neutral

  • Two genuine issues (Raft log-matching, MAVLink signer) remain open by choice — see Documented-not-fixed; they define the next hardening pass.
  • ADR-148 — ruview-swarm drone swarm control system
  • ADR-172 — core/cli review (where the NaN bug-class root question was settled NO)
  • ADR-127 — homecore review (sibling NaN/concurrency hardening)