diff --git a/CHANGELOG.md b/CHANGELOG.md
index 69112085..6a2ad97e 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+- **Multistatic fusion guard interval is now operator-configurable — fixes permanent trust demotion with WiFi-synced ESP32 nodes (#1049).** Two independently-clocked ESP32-S3 boards on ESP-NOW sync drift 10–150 ms (typ. ~70 ms) — the 100 ms beacon + WiFi-MAC jitter cannot hold them within the published 60 ms default guard, so the governed-trust cycle permanently demoted to `Restricted`, suppressed all pose output, and spun the error counter to 200k+ with **no escape hatch but a container restart**. Added a **direct `WDP_GUARD_INTERVAL_US` override** (+ optional `WDP_SOFT_GUARD_US`) to `multistatic_guard_config_from_env`, so a deployment can lift the hard guard past its measured spread (e.g. `WDP_GUARD_INTERVAL_US=200000`) without having to know its exact TDM schedule. Precedence is most-specific-wins: a direct override beats the existing `WDP_TDM_SLOTS`+`WDP_TDM_SLOT_US` schedule-derived guard, which beats the 60 ms/20 ms default; the override is applied on top of whichever base is selected, the soft band is always clamped strictly below the hard guard, and a malformed/zero value is ignored (falls back to the base rather than breaking fusion). The effective guard is now logged at startup. Pinned by 6 new tests (`multistatic_guard_config_tests`): direct-override-wins / beats-TDM-derived / soft-clamped-below-hard / lowering-hard-pulls-soft-down / malformed-or-zero-falls-back / default-when-unset. `wifi-densepose-sensing-server` bin tests **449 → 455**, 0 failed; Python proof VERDICT PASS, hash unchanged (off the signal proof path).
+
 ### Security
 - **`wifi-densepose-occworld-candle` — beyond-SOTA security + correctness review (Milestone #9, crate 4/4).** (1) **HIGH (MEASURED) — checkpoint-load crash on any int32 tensor** (`model.rs::safetensor_dtype_to_candle`). `safetensors::Dtype::I32` was mapped to `candle_core::DType::I64` and the raw int32 byte buffer (4 bytes/elem) was then handed to `Tensor::from_raw_buffer(.., I64, shape, ..)`. Candle derives `elem_count = data.len() / dtype.size_in_bytes()`, so the I64 path halved the element count while keeping the *original* shape — yielding a tensor whose declared shape claims twice as many elements as its backing storage holds. Reading it **panics** (`range end index 6 out of range for slice of length 3` — slice OOB inside candle-core) on any attacker-supplied or PyTorch-exported checkpoint containing an int32 tensor (common: index/buffer tensors). Fixed by mapping `I32 → DType::I32` (and `I16 → DType::I16`), both first-class candle dtypes. Reproduction recorded on old code; pinned by `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new) plus F32/I64/corrupt-file control cases. (2) **LOW (MEASURED) — `predict()` lacked frame/batch validation at the input boundary** (`inference.rs`). It validated H/W/D but not the externally-supplied frame count; an `f_in > num_frames*2` over-indexed the temporal positional embedding deep in the transformer and surfaced as a cryptic candle "gather" `InvalidIndex` (returned error, not a panic — candle bounds-checks), and a zero frame/batch dim fed a zero-element tensor into the pipeline. Now rejected at the boundary with a clear `ShapeMismatch`. Pinned by `predict_rejects_zero_frames` / `predict_rejects_too_many_frames` / `predict_accepts_frame_count_at_capacity`. (3) **LOW (MEASURED) — divide-by-zero panic on a degenerate input to the public `VQCodebook::encode`** (`vqvae.rs`): a rank-0 / empty-last-dim tensor made `last == 0` and panicked on `elem_count() / last`. Now fails closed with a clear error. Pinned by `encode_rejects_scalar_without_panicking`. **Dimensions confirmed CLEAN with evidence:** panic surface — zero `unwrap()`/`expect()`/`panic!`/`unreachable!` in production code paths (grep evidence; all error handling via `?`/`map_err`); NaN-state-poisoning — N/A (engine is stateless between `predict` calls, input is `u8` class indices so non-finite input is structurally impossible, no persistent world-model buffer to latch into); unbounded-alloc / shape-data mismatch from malformed weights — defended upstream by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared byte range, rejected before reaching candle); secrets — none (grep clean, only `token_h`/`token_w` config fields match). `unsafe_code = forbid` in the crate manifest. **Build/validation status (MEASURED on Windows):** crate builds and tests under `cargo test -p wifi-densepose-occworld-candle --no-default-features` — **29/29 pass** (20 unit + 4 checkpoint_loading + 3 predict_honesty + 2 doc) after fixes; `cargo test --workspace --no-default-features` = 0 failed across all crates (lone `wifi-densepose-desktop` `api_integration` failure was a Windows "Access is denied (os error 5)" file-lock flake — re-ran in isolation **21/21 pass**); Python proof VERDICT PASS, hash `f8e76f21…446f7a` unchanged. *Warrants ADR slot 179 (parent to author).*
 - **`wifi-densepose-wasm-edge` beyond-SOTA closing review — boundary NaN-state-poisoning guard + clean-with-evidence attestation (ADR-040 edge crate, ~70 modules).** Closing pass of the security campaign over the last untouched sizeable crate. **One real finding fixed (LOW / source-analysis + reproduced):** the two WASM↔host frame boundaries (`lib.rs::on_frame`/`on_timer` and `bin/ghost_hunter.rs::on_frame`) read raw IEEE-754 `f32` from the `csi_get_phase`/`csi_get_amplitude`/`csi_get_variance`/`csi_get_motion_energy` host imports **without any finiteness check** — the entire crate had **zero** `is_finite`/`is_nan` guards, and the in-crate `clamp` helpers propagate NaN (`NaN < lo` and `NaN > hi` are both false). A single non-finite value (firmware DSP bug, uninitialised buffer, or hostile host) latches NaN into the long-lived per-module accumulators (EMA, Welford, phasor sums, anomaly baselines); once latched, every downstream comparison evaluates `false`, so detectors fail **degraded** (stuck gate state, silently-disabled anomaly checks) — silent corruption, not a crash (WASM `panic=abort` is *not* tripped: no indexing/`unwrap` on the poisoned value). Threat model is a **semi-trusted** boundary (the Tier-2 DSP firmware supplies the imports, not direct network/JS), hence LOW severity / defense-in-depth. **Fix:** added `sanitize_host_f32()` (maps non-finite→`0.0`, `core`-only so it holds in `no_std`) applied at every `host_get_*` float read — a single chokepoint covering all ~70 downstream modules, mirroring the existing M-01 negative-`n_subcarriers` boundary clamp. **Pinned by** `boundary_tests::{sanitize_passes_finite_values_through, sanitize_maps_non_finite_to_zero, coherence_monitor_nan_latches_without_sanitize_but_not_with}` — the last asserts on the *current* `CoherenceMonitor` that a raw NaN frame latches the smoothed score (documents the hazard) while the boundary-sanitized path stays finite. **Dimensions attested CLEAN with evidence (source-analysis):** (a) **panic-on-input** — every non-test `unwrap()`/`expect()` is either `#[cfg(test)]` or in the `std`-gated RVF *builder* host tool writing to an in-memory `Vec` (infallible); no `panic!`/`unreachable!`/`todo!`/`get_unchecked` in any hot path. (b) **shape/bounds** — all frame-buffer access is `min()`-clamped (`MAX_SC=32`, `DTW_MAX_LEN`, `LCS_WINDOW`, `PATTERN_LEN`), all index-by-cast sites (`feature_id as usize`, `conclusion_id`, `minute_counter`, `plan_step`) are either compile-time-const-bounded or `if idx <`/`%`-guarded; negative `n_subcarriers` already mapped to 0 (M-01). (c) **memory/leak** — no `move ||` closures, no `mem::forget`/`Box::leak`/`.leak()`; the only `Box::new` is in the `std`-gated `skill_registry` (one-time init, bounded). (d) **secrets** — none (grep clean). **MEASURED build/test evidence:** host `cargo test --features std,medical-experimental` = **672 passed / 0 failed** (was 669 pre-fix; +3 new tests); the real deployment artifacts all build clean on the actual target — `cargo build --target wasm32-unknown-unknown --release` (no_std/panic=abort default lib), `--bin ghost_hunter --no-default-features --features standalone-bin`, and `--features medical-experimental` (toolchain 1.89 per `rust-toolchain.toml`). No ADR slot needed — a single LOW defense-in-depth boundary fix; CHANGELOG attestation suffices.
diff --git a/v2/crates/wifi-densepose-sensing-server/src/main.rs b/v2/crates/wifi-densepose-sensing-server/src/main.rs
index de16a72c..466e49ab 100644
--- a/v2/crates/wifi-densepose-sensing-server/src/main.rs
+++ b/v2/crates/wifi-densepose-sensing-server/src/main.rs
@@ -6391,32 +6391,71 @@ fn vitals_snapshots_from_sensing_json(
     }
 }
 
-/// Build the multistatic guard config, optionally derived from the TDM schedule
-/// declared in the environment (#1031).
+/// Build the multistatic guard config from the environment (#1031, #1049).
 ///
-/// When both `WDP_TDM_SLOTS` and `WDP_TDM_SLOT_US` parse as positive integers,
-/// the guard is derived via [`MultistaticConfig::for_tdm_schedule`] so a
-/// deployment can match its exact schedule. Otherwise the published default
-/// (60 ms hard / 20 ms soft) is returned. `min_nodes` is *not* set here — the
-/// caller overrides it for single-node passthrough.
+/// Three precedence layers, most-specific wins:
+/// 1. `WDP_GUARD_INTERVAL_US` (+ optional `WDP_SOFT_GUARD_US`) — a **direct**
+///    hard-guard override. This is the #1049 escape hatch: WiFi/ESP-NOW-synced
+///    ESP32 nodes drift 10–150 ms (the 100 ms beacon + WiFi-MAC jitter cannot
+///    hold two independently-clocked boards within the published default), so a
+///    deployment can simply lift the guard past its measured spread (e.g.
+///    `WDP_GUARD_INTERVAL_US=200000`) without knowing its exact TDM schedule.
+/// 2. `WDP_TDM_SLOTS` + `WDP_TDM_SLOT_US` (both positive) — derive the guard
+///    from the declared schedule via [`MultistaticConfig::for_tdm_schedule`].
+/// 3. Otherwise the published default (60 ms hard / 20 ms soft).
+///
+/// The direct override (1) is applied **on top of** whichever base (2 or 3) is
+/// selected, so `WDP_GUARD_INTERVAL_US` always wins for the hard guard while a
+/// TDM-derived soft band is preserved unless it would exceed the new hard guard.
+/// `min_nodes` is *not* set here — the caller overrides it for single-node
+/// passthrough.
 fn multistatic_guard_config_from_env() -> MultistaticConfig {
     multistatic_guard_config_from(
         std::env::var("WDP_TDM_SLOTS").ok().as_deref(),
         std::env::var("WDP_TDM_SLOT_US").ok().as_deref(),
+        std::env::var("WDP_GUARD_INTERVAL_US").ok().as_deref(),
+        std::env::var("WDP_SOFT_GUARD_US").ok().as_deref(),
     )
 }
 
 /// Pure core of [`multistatic_guard_config_from_env`] for testability.
-fn multistatic_guard_config_from(slots: Option<&str>, slot_us: Option<&str>) -> MultistaticConfig {
-    match (
+fn multistatic_guard_config_from(
+    slots: Option<&str>,
+    slot_us: Option<&str>,
+    guard_us: Option<&str>,
+    soft_us: Option<&str>,
+) -> MultistaticConfig {
+    // Base: TDM-schedule-derived when both slot params are valid, else default.
+    let mut cfg = match (
         slots.and_then(|s| s.trim().parse::().ok()),
         slot_us.and_then(|s| s.trim().parse::().ok()),
     ) {
-        (Some(n), Some(us)) if n >= 1 && us >= 1 => {
-            MultistaticConfig::for_tdm_schedule(n, us)
-        }
+        (Some(n), Some(us)) if n >= 1 && us >= 1 => MultistaticConfig::for_tdm_schedule(n, us),
         _ => MultistaticConfig::default(),
+    };
+
+    // Direct hard-guard override (#1049). Ignored when unset/zero/unparseable so
+    // a malformed env var falls back to the base rather than breaking fusion.
+    if let Some(g) = guard_us
+        .and_then(|s| s.trim().parse::().ok())
+        .filter(|&g| g >= 1)
+    {
+        cfg.guard_interval_us = g;
+        // Keep the soft band strictly below the (possibly lowered) hard guard.
+        if cfg.soft_guard_us >= g {
+            cfg.soft_guard_us = g.saturating_sub(1).max(1);
+        }
     }
+
+    // Optional explicit soft-guard override, always clamped strictly below hard.
+    if let Some(s) = soft_us
+        .and_then(|s| s.trim().parse::().ok())
+        .filter(|&s| s >= 1)
+    {
+        cfg.soft_guard_us = s.min(cfg.guard_interval_us.saturating_sub(1).max(1));
+    }
+
+    cfg
 }
 
 /// Turn a `ProgressiveLoader::new` failure into an actionable diagnostic (#894).
@@ -7485,11 +7524,16 @@ async fn main() {
         pose_tracker: PoseTracker::new(),
         last_tracker_instant: None,
         multistatic_fuser: {
-            // #1031: the default guard (60 ms hard / 20 ms soft) accommodates a
-            // real TDM slot offset. A deployment can override it to match its
-            // own schedule via WDP_TDM_SLOTS + WDP_TDM_SLOT_US (both set ⇒ derive
-            // from the schedule), else the published default is used.
+            // #1031/#1049: the default guard (60 ms hard / 20 ms soft)
+            // accommodates a real TDM slot offset. A deployment overrides it via
+            // WDP_GUARD_INTERVAL_US (direct, e.g. 200000 for WiFi/ESP-NOW sync —
+            // #1049) or WDP_TDM_SLOTS + WDP_TDM_SLOT_US (derive from schedule).
             let cfg = multistatic_guard_config_from_env();
+            info!(
+                "Multistatic fusion guard: {} µs hard / {} µs soft (override via \
+                 WDP_GUARD_INTERVAL_US / WDP_SOFT_GUARD_US, or WDP_TDM_SLOTS+WDP_TDM_SLOT_US)",
+                cfg.guard_interval_us, cfg.soft_guard_us
+            );
             let mut fuser = MultistaticFuser::with_config(MultistaticConfig {
                 min_nodes: 1, // single-node passthrough
                 ..cfg
@@ -7797,6 +7841,72 @@ async fn main() {
     info!("Server shut down cleanly");
 }
 
+#[cfg(test)]
+mod multistatic_guard_config_tests {
+    //! #1049 — the multistatic guard interval must be operator-configurable so a
+    //! WiFi/ESP-NOW deployment (10–150 ms inter-node clock drift) can lift the
+    //! guard past its measured timestamp spread instead of being permanently
+    //! demoted to Restricted with no escape hatch.
+    use super::*;
+
+    #[test]
+    fn default_guard_when_nothing_set() {
+        let cfg = multistatic_guard_config_from(None, None, None, None);
+        assert_eq!(cfg.guard_interval_us, MultistaticConfig::default().guard_interval_us);
+        assert_eq!(cfg.soft_guard_us, MultistaticConfig::default().soft_guard_us);
+    }
+
+    #[test]
+    fn direct_guard_override_wins_and_unblocks_wifi_spread() {
+        // The #1049 reporter's measured ~70 ms spread exceeds the 60 ms default
+        // → permanent demotion. A direct 200 ms override accepts it.
+        let cfg = multistatic_guard_config_from(None, None, Some("200000"), None);
+        assert_eq!(cfg.guard_interval_us, 200_000);
+        assert!(cfg.soft_guard_us < cfg.guard_interval_us);
+        // 70 ms spread now sits inside the guard.
+        assert!(70_000 < cfg.guard_interval_us);
+    }
+
+    #[test]
+    fn direct_guard_override_beats_tdm_derived() {
+        // Both TDM params AND a direct override set → the direct hard guard wins,
+        // the TDM-derived soft band is preserved (still strictly below hard).
+        let cfg = multistatic_guard_config_from(Some("2"), Some("18000"), Some("200000"), None);
+        assert_eq!(cfg.guard_interval_us, 200_000);
+        assert!(cfg.soft_guard_us < cfg.guard_interval_us);
+        assert!(cfg.soft_guard_us >= 1);
+    }
+
+    #[test]
+    fn soft_override_is_clamped_strictly_below_hard() {
+        // A soft guard ≥ hard would be nonsensical → clamped below the hard guard.
+        let cfg = multistatic_guard_config_from(None, None, Some("50000"), Some("999999"));
+        assert_eq!(cfg.guard_interval_us, 50_000);
+        assert!(cfg.soft_guard_us < 50_000);
+    }
+
+    #[test]
+    fn lowering_hard_below_default_soft_pulls_soft_down() {
+        // Override hard to 10 ms (< default 20 ms soft) → soft drops below it.
+        let cfg = multistatic_guard_config_from(None, None, Some("10000"), None);
+        assert_eq!(cfg.guard_interval_us, 10_000);
+        assert!(cfg.soft_guard_us < 10_000);
+    }
+
+    #[test]
+    fn malformed_or_zero_override_falls_back_to_base() {
+        // Garbage / zero must not break fusion — fall back to the base config.
+        for bad in ["", "abc", "0", "-5", "12.5"] {
+            let cfg = multistatic_guard_config_from(None, None, Some(bad), None);
+            assert_eq!(
+                cfg.guard_interval_us,
+                MultistaticConfig::default().guard_interval_us,
+                "override {bad:?} should be ignored"
+            );
+        }
+    }
+}
+
 #[cfg(test)]
 mod node_sync_snapshot_serialization_tests {
     //! ADR-110 iter 24 — JSON public-API contract for the iter 23