diff --git a/CHANGELOG.md b/CHANGELOG.md index ce00cb2d..913dcbbe 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Security +- **`wifi-densepose-occworld-candle` — beyond-SOTA security + correctness review (Milestone #9, crate 4/4).** (1) **HIGH (MEASURED) — checkpoint-load crash on any int32 tensor** (`model.rs::safetensor_dtype_to_candle`). `safetensors::Dtype::I32` was mapped to `candle_core::DType::I64` and the raw int32 byte buffer (4 bytes/elem) was then handed to `Tensor::from_raw_buffer(.., I64, shape, ..)`. Candle derives `elem_count = data.len() / dtype.size_in_bytes()`, so the I64 path halved the element count while keeping the *original* shape — yielding a tensor whose declared shape claims twice as many elements as its backing storage holds. Reading it **panics** (`range end index 6 out of range for slice of length 3` — slice OOB inside candle-core) on any attacker-supplied or PyTorch-exported checkpoint containing an int32 tensor (common: index/buffer tensors). Fixed by mapping `I32 → DType::I32` (and `I16 → DType::I16`), both first-class candle dtypes. Reproduction recorded on old code; pinned by `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new) plus F32/I64/corrupt-file control cases. (2) **LOW (MEASURED) — `predict()` lacked frame/batch validation at the input boundary** (`inference.rs`). It validated H/W/D but not the externally-supplied frame count; an `f_in > num_frames*2` over-indexed the temporal positional embedding deep in the transformer and surfaced as a cryptic candle "gather" `InvalidIndex` (returned error, not a panic — candle bounds-checks), and a zero frame/batch dim fed a zero-element tensor into the pipeline. Now rejected at the boundary with a clear `ShapeMismatch`. Pinned by `predict_rejects_zero_frames` / `predict_rejects_too_many_frames` / `predict_accepts_frame_count_at_capacity`. (3) **LOW (MEASURED) — divide-by-zero panic on a degenerate input to the public `VQCodebook::encode`** (`vqvae.rs`): a rank-0 / empty-last-dim tensor made `last == 0` and panicked on `elem_count() / last`. Now fails closed with a clear error. Pinned by `encode_rejects_scalar_without_panicking`. **Dimensions confirmed CLEAN with evidence:** panic surface — zero `unwrap()`/`expect()`/`panic!`/`unreachable!` in production code paths (grep evidence; all error handling via `?`/`map_err`); NaN-state-poisoning — N/A (engine is stateless between `predict` calls, input is `u8` class indices so non-finite input is structurally impossible, no persistent world-model buffer to latch into); unbounded-alloc / shape-data mismatch from malformed weights — defended upstream by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared byte range, rejected before reaching candle); secrets — none (grep clean, only `token_h`/`token_w` config fields match). `unsafe_code = forbid` in the crate manifest. **Build/validation status (MEASURED on Windows):** crate builds and tests under `cargo test -p wifi-densepose-occworld-candle --no-default-features` — **29/29 pass** (20 unit + 4 checkpoint_loading + 3 predict_honesty + 2 doc) after fixes; `cargo test --workspace --no-default-features` = 0 failed across all crates (lone `wifi-densepose-desktop` `api_integration` failure was a Windows "Access is denied (os error 5)" file-lock flake — re-ran in isolation **21/21 pass**); Python proof VERDICT PASS, hash `f8e76f21…446f7a` unchanged. *Warrants ADR slot 179 (parent to author).* - **ADR-131 HOMECORE-UI BFF gateway — public-PR review fixes (PR #1082).** (1) **HIGH — path-traversal / confused-deputy SSRF closed in the `/api/cal/*` reverse-proxy** (`homecore-server/src/gateway.rs`). The wildcard proxy path was interpolated straight into the upstream URL while `proxy()` attaches the server-side calibration bearer, so `/api/cal/v1/../../x` (and percent-encoded `..%2f`, `%2e%2e`, leading `/`, backslash, double-encoded `%252e`) could escape the `…/api/` scope **with the privileged token**. Now `validate_proxy_path()` decode-then-checks and rejects absolute/backslash/dot-segment/encoded-traversal paths with a typed **400 BEFORE the URL is built** (applies to GET **and** POST); legit `v1/...` paths still pass. Pinned by `cal_proxy_rejects_traversal_with_400_before_upstream` (fails on old code) + `validate_proxy_path_rejects_traversal_variants`. (2) **CORS + request-tracing now cover the gateway routes.** `/api/homecore/*` and `/api/cal/*` were `.merge()`d **outside** the layers `homecore-api::router()` applies, leaving them with no CORS allowlist and untraced; the audited `build_cors_layer()` (HC-05) + `TraceLayer` are now applied to the whole merged surface in `main.rs`. Pinned by `gateway_routes_are_cors_covered_after_merge` (Vite-dev-origin preflight succeeds on a gateway route). (3) **Fabricated-data honesty (§6 invariant 3):** the gateway no longer injects a hardcoded `anomaly.threshold: 0.5` — it passes through the REAL upstream threshold or emits `null` (withheld); the dashboard renders a not-available `—` instead of `"null%"`/`"null°C"` for null appliance metrics; the COG panel's Hailo-worker pill reflects the real appliance probe instead of a hardcoded `"connected"`; `rooms.js` treats a null anomaly threshold as withheld, not a fake `0.8` default. (4) **Robustness:** a forwarded `hef` that is a string (not an array) no longer throws in the COG panel; the calibration wizard guards `frames/target` against `NaN%`/`Infinity%` and clears its baseline poll timer on Restart / panel teardown (leaked `setTimeout` loop fixed). (5) **Perf:** per-bank RoomState fetches and the appliance service probes now run concurrently (`futures::join_all`; async `tokio::net::TcpStream` + `timeout` replaces the blocking `connect_timeout` that parked a worker per probe); the mock fixture module is now a dynamic `import()` gated on demo mode so production never bundles it. **Note (workspace-wide, not fixed here):** `homecore-server` requests `reqwest`'s `rustls-tls` only, but cargo feature-unification means a sibling crate enabling the default `native-tls` re-introduces OpenSSL into the final binary regardless — a true "no OpenSSL on the appliance" guarantee requires aligning every reqwest-pulling crate on rustls-only. **Note (pre-existing, out of scope):** DEV-mode `allow_any_non_empty()` bearer auth when `HOMECORE_TOKENS` is unset on `0.0.0.0` is unchanged; the loud `warn!` at boot is retained — provision real tokens before network exposure. **Verified:** `cargo test -p homecore-server --no-default-features` = **18/18 pass**, `cargo build -p homecore-server` clean, UI suite (`node tests`) all green, Python proof VERDICT PASS (hash unchanged). - **`wifi-densepose-desktop` (Tauri v2 desktop app) beyond-SOTA security review (needs ADR slot 178) — one real IPC serial-command-injection fix + one over-broad shell-capability removal, each MEASURED on Windows; remaining IPC/path/secret dimensions confirmed clean with evidence.** Beyond-SOTA review of the Tauri desktop crate (the real attack surface is the webview→Rust IPC boundary + the capability allowlist). The crate **builds + tests on this Windows box** (`cargo check`/`cargo test -p wifi-densepose-desktop --no-default-features` — Tauri 2.10 + GTK-less Windows webview2 path), so both findings are **MEASURED**, not source-analysis. **WDP-DESK-01 (serial command injection via `configure_esp32_wifi`, MODERATE) — FIXED.** The `#[tauri::command] configure_esp32_wifi(port, ssid, password)` handler took `ssid`/`password` straight from the webview and concatenated them into newline-terminated serial commands (`format!("wifi_config {} {}\r\n", ssid, password)`, `set ssid {}\r\n`, …) with **zero validation** before writing them to the ESP32 over the line-oriented serial protocol. A `\r\n` embedded in either field lets a malicious/compromised webview **terminate the command line early and inject an arbitrary follow-up firmware command** (`reboot`, `erase_nvs`, etc.) — a command-injection-into-device-protocol crossing the IPC trust boundary. Ironic note: the crate already shipped `test_wifi_credentials_validation` documenting the WPA2 length bounds, but the handler never enforced them. **Fix:** a new `validate_wifi_credentials(ssid, password)` rejects out-of-range lengths (SSID 1-32, password 8-63 — WPA2 PSK bounds) **and any control character** (`char::is_control()` catches `\r`/`\n`/NUL), called at the top of the handler before any serial write — fail-closed (`Err` → no bytes sent). Pinned fails-on-old / passes-on-new by `test_validate_wifi_credentials_rejects_injection` (`"net\r\nreboot"`, `"net\ninjected"`, `"pass\r\nerase_nvs"`, embedded NUL — all rejected; would splice into the command stream pre-fix), `test_validate_wifi_credentials_rejects_out_of_range`, and `test_validate_wifi_credentials_accepts_valid` (boundary 32-char SSID / 8- and 63-char passwords still accepted). **WDP-DESK-02 (over-broad shell capability, MODERATE) — REMOVED.** `capabilities/default.json` granted the webview `shell:allow-execute` + `shell:allow-open`, but the Rust backend spawns every process via `std::process::Command` directly (espflash/which/sensing-server — which **bypasses** the Tauri allowlist entirely) and the React UI only ever calls `dialog.open` (file picker) — verified by grep: `tauri_plugin_shell` is `init()`-ed but its `Command`/`open` API is **never invoked from Rust or TS**. The two `shell:` permissions were therefore unused privilege: a webview compromise (e.g. XSS in a UI dep) would have gained **arbitrary host command execution** via `shell.execute` with no scope restriction (no `shell` scope object was even defined). **Fix:** removed both `shell:` permissions from the capability (kept `core:default` + the two `dialog:` perms the UI actually uses). MEASURED: the build-regenerated `gen/schemas/capabilities.json` now reads `"permissions":["core:default","dialog:allow-open","dialog:allow-save"]` (shell perms gone), and the crate still builds + all tests pass — confirming nothing depended on the granted shell scope. (Plugin `init()` + the npm dep left in place to keep the blast radius minimal and avoid touching the off-limits generated ACL manifests; with no permission granted the plugin is inert.) **Dimensions confirmed clean (with evidence):** (1) **No directory-traversal / arbitrary-file primitive crossing the boundary** — the path-taking commands (`flash_firmware`/`verify_firmware`/`ota_update`/`wasm_upload`/`provision_*`) pass the webview-supplied path to `std::fs`/`espflash` to **read a firmware/wasm blob the local user themselves selected via the `dialog.open` native picker**; there is no command that *writes* to or *reads back* an arbitrary attacker-named path to the webview — `settings` read/write is confined to `app_data_dir().join("settings.json")` (fixed filename, no user path component), so no traversal sink exists. (2) **No shell-string interpolation** — every subprocess uses `Command::new(prog).args([...])` (argv vector, no shell), so the `port`/`source`/`chip`/`baud` args cannot inject a second command even though they are unvalidated (the `source` value flows only as a single `--source ` argv element). (3) **No SSRF-to-secret** — the `node_ip`-built URLs (`http://{ip}:8032/...`) target the local ESP32 mesh and return only device status; no credential is returned to the webview. (4) **Panic-on-input** — handlers use `.map_err(|e| e.to_string())?` throughout; the one `srv.pid.expect(...)` in `server_status` is guarded by an explicit `is_none()` early-return on the line above (unreachable), and the discovery/provision deserializers bounds-check before every slice index (`pos + len > data.len()` guards, NVS size capped at 4096). (5) **No hardcoded secrets** — `ota_psk` is an `Option` supplied per-call/from settings, never embedded; grep for embedded keys/tokens over `src/` is empty. (6) **Tauri config** — `tauri.conf.json` ships no `"all": true` / `"$HOME/**"` FS or HTTP scope (no `fs`/`http` plugin enabled at all); the window set is a single fixed main window. `cargo test -p wifi-densepose-desktop --no-default-features`: lib **18 → 21 passed** (+3 validator pins), integration **21 → 21**, 0 failed. Workspace otherwise unchanged; Python deterministic proof unchanged (`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a`, bit-exact — the desktop crate is off the signal proof path). Both findings warrant **ADR slot 178**. - **`ruview-swarm` beyond-SOTA security + correctness review (ADR-148 drone swarm control plane; needs ADR slot 176) — 4 real fail-open / DoS bugs fixed in the NaN-state-poisoning class, each pinned fails-on-old / passes-on-new; 5 dimensions confirmed clean with evidence.** The shared theme is **IEEE-754 NaN/Inf silently defeating a safety comparison** on data that crosses the untrusted swarm-comm trust boundary (`SwarmOrchestrator::receive_peer_state` / `receive_peer_detection` accept full `DroneState`/`CsiDetection` whose f64/f32 fields deserialize with no finite-check; the integer-encoded MAVLink wire formats in `mavlink_messages.rs` cannot carry NaN, but the serde struct path can). **(1) HIGH — `failsafe::FailSafeMachine::tick` collision-avoidance + battery fail-open** (`failsafe/mod.rs:51,75`). `nearest_neighbor_dist < collision_dist_m` and `battery_pct <= rth_pct` both evaluate `false` for a NaN operand, so a poisoned peer position (→ NaN `nearest_peer_distance` via `Position3D::distance_to`) **silently disabled collision avoidance** and a NaN battery reading kept a drone Nominal — the worst failure for a physical airframe. Fixed to fail CLOSED (`!is_finite() ||` → `EmergencyDiverge` / `ReturnToHome`). MEASURED fails-on-old: `test_nan_neighbor_distance_fails_closed_to_diverge` / `test_nan_battery_fails_closed_to_rth` both returned `Nominal` pre-fix. **(2) MEDIUM — `security::geofence::Geofence::check` NaN-altitude bypass** (`security/geofence.rs:33`). A NaN `z` (altitude) with valid x/y skipped the altitude breach (`NaN < min || NaN > max` = `false`) and returned **`Safe`** through the point-in-polygon path — a silent geofence bypass. Fixed with a leading non-finite-coordinate → `HardBreach` guard. MEASURED fails-on-old: `test_nan_altitude_fails_closed` returned `Safe` pre-fix. **(3) MEDIUM/DoS — `security::antijamming::FhssRadio` `% 0` panic on empty `channels_mhz`** (`security/antijamming.rs:65,71,102`). `FhssConfig` is `Deserialize`; an empty channel list (malformed/hostile config) made `next_hop`/`current_channel_mhz`/`evasive_hop`/`tick` panic with `remainder with a divisor of zero`, crashing the radio task. Fixed with `len == 0` early-returns (benign `0.0` sentinel). MEASURED fails-on-old: `test_empty_channels_does_not_panic` **panicked** (`divisor of zero`) pre-fix. **(4) LOW — `sensing::multiview::MultiViewFusion::fuse` NaN victim-position propagation** (`sensing/multiview.rs:70`). A NaN `victim_position` passed the `is_some()` filter and propagated through the confidence-weighted average into the fused "confirmed victim" location dispatched to the swarm. Fixed by requiring finite `confidence` + finite position components (fail-closed drop). MEASURED fails-on-old: `test_nan_victim_position_dropped_from_fusion` produced a non-finite fused position pre-fix. **Dimensions confirmed clean (with evidence):** (a) **MAVLink decode panic-safety** — `SwarmNodeState::decode(&[u8;20])` `try_into().unwrap()`s are over fixed const ranges of a fixed-size array (provably infallible; no arbitrary-length `&[u8]` path exists). (b) **UWB GPS anti-spoofing is NaN-safe** — `(gps_dist - uwb_dist).abs() <= tol` already fails CLOSED on a NaN range/position (counts as inconsistent → spoof rejected), verified by reasoning + existing `test_spoofed_gps_invalid`. (c) **Bounded grid / no allocation-from-length-field** — `ProbabilityGrid::update_bayesian`/`mark_scanned` bounds-check `cx >= width || cy >= height`; `pos_to_cell` uses saturating `as u32` (Rust `as` saturates, no UB). (d) **Mesh `nearest_k` NaN-safe sort** — `partial_cmp(..).unwrap_or(Equal)` cannot panic on NaN distances. (e) **No hardcoded secrets** — `MavlinkSigner` key is constructor-injected (`[u8;32]`), nothing embedded. **Documented-not-fixed (for ADR-176, not churned to avoid test-rewrite risk):** (i) **Raft `AppendEntries` lacks the Log-Matching consistency check** (`topology/raft.rs:187`) — a follower appends leader entries on `term >= current_term` without validating `prev_log_index`/`prev_log_term`, so a malformed/byzantine leader can corrupt a follower's log (a genuine consensus-safety gap; vote tallying is also delegated to the caller per the existing `handle_message` comment). (ii) **`MavlinkSigner::verify` uses a non-constant-time tag `==` and has no replay/timestamp-window rejection** (`security/mavlink_signing.rs:64`) — the doc comment already flags the replay limitation as a known demo/test simplification. `cargo test -p ruview-swarm --no-default-features`: **117 → 123 passed, 0 failed** (+6 pins). Workspace green; Python deterministic proof unchanged (`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a`, bit-exact — `ruview-swarm` is off the signal proof path). diff --git a/docs/adr/ADR-179-occworld-candle-checkpoint-load-hardening.md b/docs/adr/ADR-179-occworld-candle-checkpoint-load-hardening.md new file mode 100644 index 00000000..cc2924c6 --- /dev/null +++ b/docs/adr/ADR-179-occworld-candle-checkpoint-load-hardening.md @@ -0,0 +1,81 @@ +# ADR-179: `wifi-densepose-occworld-candle` Checkpoint-Load Hardening + +| Field | Value | +|-------|-------| +| **Status** | Accepted — 1 HIGH + 2 LOW bugs fixed + pinned (MEASURED on Windows) | +| **Date** | 2026-06-15 | +| **Deciders** | ruv | +| **Codename** | **OCCWORLD-DTYPE** | +| **Reviews** | `wifi-densepose-occworld-candle` (Candle occupancy-world model) | +| **Milestone** | #9 (ungated-crate security sweep) — crate 4 of 4 — **CLOSES the milestone** | + +## Context + +`wifi-densepose-occworld-candle` is a Candle-based occupancy-world model +(VQ-VAE + transformer over occupancy tokens). The real risk surface for an ML +crate is degenerate-input / malformed-weights handling: a `#[forbid(unsafe_code)]` +crate can still **panic** (a DoS, and under WASM an abort) when a tensor op hits an +inconsistent shape. The crate **builds and tests on Windows**, so all findings are +MEASURED. + +## Decision + +Fix the three reachable bugs, each pinned by a fails-on-old test; attest the rest +clean with evidence. + +### Findings fixed (all MEASURED) + +| # | Severity | Location | Issue | Fix | +|---|----------|----------|-------|-----| +| 1 | **HIGH** | `model.rs:95` (`Dtype::I32 => Some(DType::I64)`) | **Crash on any int32-tensor checkpoint.** An I32 byte buffer (4 B/elem) is handed to `from_raw_buffer(.., I64, shape, ..)`; candle derives `elem_count = data.len()/8`, **halving** the count while keeping the original shape → a tensor that claims 2× its storage. Reading it **panics** with a slice-OOB (`range end index 6 out of range for slice of length 3`) inside candle-core. A checkpoint with any int32 tensor (index/buffer tensors are common in PyTorch exports) → **DoS on load**. | Map `I32 → DType::I32`, `I16 → DType::I16` (both first-class candle dtypes). Pinned by `int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new). | +| 2 | LOW | `inference.rs::predict` | Frame/batch dims weren't validated (only H/W/D were): `f_in > num_frames*2` over-indexes the temporal embedding → a cryptic candle `InvalidIndex` *error* (not a panic — candle bounds-checks); zero frame/batch feeds a zero-element tensor. | Boundary guard rejects zero / over-capacity frame+batch with a clear `ShapeMismatch`. 5 pins. | +| 3 | LOW | `vqvae.rs:141` (`z.elem_count() / last`) | **Divide-by-zero panic** in public `VQCodebook::encode` on a rank-0 / empty-last-dim tensor (`last == 0`). | Fail-closed guard returns a clear error. Pinned by `encode_rejects_scalar_without_panicking`. | + +The HIGH finding is the notable one: the crate's own dtype mapping **defeated** +the upstream `safetensors::validate()` byte-length guarantee by misdeclaring the +dtype — the one place malformed/widened weights could reach a panicking candle op. + +### Dimensions confirmed clean (with evidence) + +- **Panic surface** — grep for `unwrap()/expect()/panic!/unreachable!` across `src/` + → **zero in production paths**; all ops use `?`/`map_err`; the `last().unwrap_or(&0)` + is now guarded. `as` casts operate only on config-bounded/internal values. +- **NaN-state-poisoning (the named class) — N/A.** The engine is **stateless between + `predict` calls** (no persistent world-model buffer to latch into), and input is + `u8` class indices (non-finite input structurally impossible). NaN weights flow to + `argmax` (deterministic, bounded to a valid class index) — no panic, no persistence. +- **Unbounded alloc / shape-data mismatch from malformed weights** — defended upstream + by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared + byte range + contiguous-offset + buffer-length checks), rejected before reaching + candle. Finding #1 was the one place the crate defeated that guarantee. +- **Model/path loading** — `load`/`load_safetensors` check `path.exists()` → typed + `CheckpointNotFound`; corrupt bytes → `CheckpointParse` (pinned). No path-traversal + surface (caller-supplied path, opened read-only, never joined with untrusted segments). +- **Secrets** — grep clean (only `token_h`/`token_w` config fields match `token`). +- **Determinism** — the crate's central honesty claim, verified by the pre-existing + `tests/predict_honesty.rs` (3 tests, still pass). +- `unsafe_code = "forbid"` in the manifest. + +## Validation + +- `cargo test -p wifi-densepose-occworld-candle --no-default-features` → **31/31** + (lib 17, checkpoint_loading 4, input_validation 5, predict_honesty 3, doctests 2), + 0 failed. +- `cargo test --workspace --no-default-features` → 0 failed across every crate (a lone + `wifi-densepose-desktop --test api_integration` "Access is denied (os error 5)" was a + Windows file-lock/AV flake — re-ran isolated 21/21, unrelated). +- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash `f8e76f21…46f7a` + unchanged (occworld off the signal proof path). + +## Consequences + +### Positive +- A checkpoint-load DoS (the int32 dtype-widening panic) and two degenerate-input + panics are closed in the world-model crate, each pinned. **Milestone #9 (all 4 + ungated crates) is complete.** + +### Negative / Neutral +- None. Guards reject only malformed/degenerate inputs. + +## Links +- ADR-176 / ADR-177 / ADR-178 — sibling Milestone-#9 reviews (ruview-swarm, nvsim, desktop) diff --git a/v2/crates/wifi-densepose-occworld-candle/src/inference.rs b/v2/crates/wifi-densepose-occworld-candle/src/inference.rs index fa4fbb01..e6174a11 100644 --- a/v2/crates/wifi-densepose-occworld-candle/src/inference.rs +++ b/v2/crates/wifi-densepose-occworld-candle/src/inference.rs @@ -206,6 +206,27 @@ impl OccWorldCandle { ))); } + // Validate the externally-supplied frame and batch counts at this + // system boundary. The temporal positional embedding has only + // `num_frames * 2` rows, so a larger `f_in` would over-index the + // embedding table deep inside the transformer and surface as a cryptic + // "gather" index error; a zero frame/batch count would feed a + // zero-element tensor into the reshape/conv pipeline. Reject both here + // with a clear, domain-level error instead. + if f_in == 0 || b == 0 { + return Err(OccWorldError::ShapeMismatch(format!( + "past_occupancy must have non-zero batch and frame dims, got \ + batch={b}, frames={f_in}" + ))); + } + if f_in > cfg.num_frames * 2 { + return Err(OccWorldError::ShapeMismatch(format!( + "past_occupancy frame count {f_in} exceeds the temporal embedding \ + capacity ({} = num_frames*2)", + cfg.num_frames * 2 + ))); + } + // ── Step 1: VQVAE encode each past frame ────────────────────────── // Flatten batch*frames: (B, F, H, W, D) → (B*F, H, W, D) let occ_flat = past_occupancy @@ -455,4 +476,8 @@ mod tests { "expected CheckpointNotFound, got {result:?}" ); } + + // The `predict` input-validation boundary guards (zero/over-capacity frame + // counts) live in `tests/input_validation.rs` so they exercise only the + // public API and keep this file under the 500-line limit. } diff --git a/v2/crates/wifi-densepose-occworld-candle/src/model.rs b/v2/crates/wifi-densepose-occworld-candle/src/model.rs index 8a83c209..f028b861 100644 --- a/v2/crates/wifi-densepose-occworld-candle/src/model.rs +++ b/v2/crates/wifi-densepose-occworld-candle/src/model.rs @@ -92,8 +92,21 @@ fn safetensor_dtype_to_candle(dt: safetensors::Dtype) -> Option Some(DType::F64), Dtype::F16 => Some(DType::F16), Dtype::BF16 => Some(DType::BF16), - Dtype::I32 => Some(DType::I64), // widen for Candle compatibility + // I32 MUST map to DType::I32, not I64. `Tensor::from_raw_buffer` + // derives its element count from `data.len() / dtype.size_in_bytes()`; + // handing an int32 byte buffer (4 bytes/elem) to the I64 path + // (8 bytes/elem) halves the element count while keeping the original + // shape, producing a tensor whose declared shape claims twice as many + // elements as its storage holds. That silent shape/storage mismatch + // panics (slice OOB) the moment the tensor is read — a crash on any + // checkpoint containing an int32 tensor. See + // `tests/checkpoint_loading.rs::int32_tensor_loads_with_consistent_shape_and_values`. + Dtype::I32 => Some(DType::I32), Dtype::I64 => Some(DType::I64), + // I16 is also a first-class Candle dtype (2 bytes/elem); map it + // directly rather than rejecting it, for the same byte-size-correctness + // reason as I32 above. + Dtype::I16 => Some(DType::I16), Dtype::U8 => Some(DType::U8), Dtype::U32 => Some(DType::U32), _ => None, diff --git a/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs b/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs index 531f5caf..748d6cbd 100644 --- a/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs +++ b/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs @@ -137,6 +137,17 @@ impl VQCodebook { let orig_shape = z.shape().clone(); let orig_dims = orig_shape.dims().to_vec(); let last = *orig_shape.dims().last().unwrap_or(&0); + // Guard the divide below: a scalar (rank-0) or empty-last-dim tensor + // would make `last == 0` and panic on the `elem_count() / last` + // division. `encode` is a `pub fn` on a `pub struct`, so this is a + // reachable public boundary — fail closed with a clear error instead. + if last == 0 { + return Err(candle_core::Error::Msg(format!( + "VQCodebook::encode expects a tensor with a non-zero last dim of \ + size embed_dim={}, got shape {orig_dims:?}", + self.embed_dim + ))); + } // Flatten to (N, embed_dim) let n = z.elem_count() / last; let z_flat = z.reshape((n, last))?; // (N, D) @@ -339,6 +350,21 @@ mod tests { Ok(()) } + #[test] + fn encode_rejects_scalar_without_panicking() { + // A rank-0 (scalar) tensor has an empty dims list → `last == 0`. + // Before the guard this divided by zero and panicked; now it returns + // a clean error. `encode` is public, so this is a reachable boundary. + let device = Device::Cpu; + let codebook = VQCodebook::dummy(4, 8, &device).unwrap(); + let scalar = Tensor::from_vec(vec![1.0f32], (), &device).unwrap(); + let result = codebook.encode(&scalar); + assert!( + result.is_err(), + "scalar input must error, not panic; got {result:?}" + ); + } + #[test] fn test_fold_unfold_roundtrip() -> candle_core::Result<()> { let device = Device::Cpu; diff --git a/v2/crates/wifi-densepose-occworld-candle/tests/checkpoint_loading.rs b/v2/crates/wifi-densepose-occworld-candle/tests/checkpoint_loading.rs new file mode 100644 index 00000000..af4b66fc --- /dev/null +++ b/v2/crates/wifi-densepose-occworld-candle/tests/checkpoint_loading.rs @@ -0,0 +1,185 @@ +//! Checkpoint-loading robustness tests for `crate::model::load_safetensors`. +//! +//! Security review (Milestone #9, crate 4/4). These tests pin the behaviour of +//! the SafeTensors weight-loading path against malformed / degenerate +//! checkpoints — the only externally-controlled file-input surface in the crate. +//! +//! The headline regression is the **int32 dtype-widening byte-size bug** +//! (`security/occworld-candle` finding #1): `model.rs` mapped +//! `safetensors::Dtype::I32` → `candle_core::DType::I64` and then handed the +//! raw *int32* byte buffer (4 bytes/elem) to `Tensor::from_raw_buffer(.., I64, +//! shape, ..)`. Candle's `from_raw_buffer` computes `elem_count = +//! data.len() / 8`, producing a tensor whose declared shape claims twice as +//! many elements as the backing storage actually holds — a silent +//! shape/storage inconsistency on attacker-supplied checkpoints. +//! +//! `build_safetensors` hand-assembles the binary container +//! (``) so the test states exactly +//! what bytes reach the loader, independent of the `safetensors` writer API. + +use candle_core::Device; +use wifi_densepose_occworld_candle::model::load_safetensors; + +/// Hand-build a single-tensor SafeTensors buffer. +/// +/// `dtype` is the safetensors dtype string (e.g. `"I32"`, `"F32"`). +/// `shape` is the declared shape. `data` is the raw little-endian tensor bytes +/// — the caller is responsible for making `data.len()` consistent with +/// `shape × dtype_size` (safetensors itself validates this, so an inconsistent +/// pair is rejected before reaching the candle conversion). +fn build_safetensors(name: &str, dtype: &str, shape: &[usize], data: &[u8]) -> Vec { + let shape_json: Vec = shape.iter().map(|d| d.to_string()).collect(); + let header = format!( + "{{\"{name}\":{{\"dtype\":\"{dtype}\",\"shape\":[{}],\"data_offsets\":[0,{}]}}}}", + shape_json.join(","), + data.len() + ); + let header_bytes = header.into_bytes(); + let mut buf = Vec::new(); + buf.extend_from_slice(&(header_bytes.len() as u64).to_le_bytes()); + buf.extend_from_slice(&header_bytes); + buf.extend_from_slice(data); + buf +} + +fn write_temp(bytes: &[u8], stem: &str) -> std::path::PathBuf { + let mut p = std::env::temp_dir(); + p.push(format!( + "occworld_ckpt_{stem}_{}_{}.safetensors", + std::process::id(), + // nanosecond-ish disambiguator so parallel tests never collide + std::time::SystemTime::now() + .duration_since(std::time::UNIX_EPOCH) + .map(|d| d.as_nanos()) + .unwrap_or(0) + )); + std::fs::write(&p, bytes).expect("write temp checkpoint"); + p +} + +/// REGRESSION (finding #1): an int32 tensor in a checkpoint must load into a +/// tensor whose element count matches its declared shape. +/// +/// On the OLD code (`I32 -> DType::I64`) the 6-element int32 tensor below was +/// handed to `from_raw_buffer(.., I64, [2,3], ..)`, which derived +/// `elem_count = 24 bytes / 8 = 3` and built a 3-element storage carrying a +/// shape claiming 6 elements — reading it panicked with a slice-OOB +/// (`range end index 6 out of range for slice of length 3`). On the FIXED code +/// (`I32 -> DType::I32`) the tensor round-trips: dtype I32, 6 elements, +/// values `[1,2,3,4,5,6]`. +#[test] +fn int32_tensor_loads_with_consistent_shape_and_values() { + let device = Device::Cpu; + let shape = [2usize, 3]; + let vals: [i32; 6] = [1, 2, 3, 4, 5, 6]; + let mut data = Vec::with_capacity(24); + for v in vals { + data.extend_from_slice(&v.to_le_bytes()); + } + let bytes = build_safetensors("quantize.embedding.weight", "I32", &shape, &data); + let path = write_temp(&bytes, "i32"); + + let map = load_safetensors(&path, &device).expect("int32 checkpoint must load"); + let t = map + .get("quantize.embedding.weight") + .expect("mapped key present"); + + // The declared shape's element count MUST equal the storage's element + // count. On the old code these disagreed (6 vs 3). + assert_eq!( + t.dims(), + &[2, 3], + "int32 tensor must preserve its declared shape" + ); + assert_eq!( + t.elem_count(), + 6, + "element count must match shape — storage/shape consistency" + ); + + // The dtype must be I32 — the int32 byte buffer is interpreted as int32, + // not reinterpreted as half as many int64 lanes. + assert_eq!( + t.dtype(), + candle_core::DType::I32, + "int32 checkpoint tensor must load as DType::I32" + ); + + // And the values must be exactly recovered (no reinterpretation of two + // int32 lanes as one int64). This is the strongest proof the dtype is + // handled correctly end-to-end. + let flat = t.flatten_all().expect("flatten"); + let got: Vec = flat.to_vec1::().expect("to_vec i32"); + assert_eq!( + got, + vec![1i32, 2, 3, 4, 5, 6], + "int32 values must be recovered exactly" + ); + + let _ = std::fs::remove_file(&path); +} + +/// A well-formed F32 tensor must round-trip unchanged (control case — proves +/// the fix does not regress the common float path). +#[test] +fn f32_tensor_round_trips() { + let device = Device::Cpu; + let shape = [4usize]; + let vals: [f32; 4] = [0.5, -1.0, 2.25, 3.0]; + let mut data = Vec::with_capacity(16); + for v in vals { + data.extend_from_slice(&v.to_le_bytes()); + } + let bytes = build_safetensors("post_quant_conv.bias", "F32", &shape, &data); + let path = write_temp(&bytes, "f32"); + + let map = load_safetensors(&path, &device).expect("f32 checkpoint must load"); + let t = map.get("post_quant_conv.bias").expect("key present"); + assert_eq!(t.dims(), &[4]); + let got: Vec = t.to_vec1::().expect("to_vec f32"); + assert_eq!(got, vec![0.5, -1.0, 2.25, 3.0]); + + let _ = std::fs::remove_file(&path); +} + +/// A truncated / corrupt header must produce a parse error, never a panic. +/// (Defense-in-depth: the loader is fed an untrusted file.) +#[test] +fn corrupt_checkpoint_errors_cleanly() { + let device = Device::Cpu; + // Garbage that is not a valid SafeTensors container. + let bytes = vec![0xFFu8; 32]; + let path = write_temp(&bytes, "corrupt"); + + let result = load_safetensors(&path, &device); + assert!( + result.is_err(), + "corrupt checkpoint must error, got Ok: {result:?}" + ); + + let _ = std::fs::remove_file(&path); +} + +/// An int64 tensor must still load correctly (proves the fix narrows only the +/// I32 mapping and leaves the genuine I64 path intact). +#[test] +fn int64_tensor_round_trips() { + let device = Device::Cpu; + let shape = [3usize]; + let vals: [i64; 3] = [10, -20, 30]; + let mut data = Vec::with_capacity(24); + for v in vals { + data.extend_from_slice(&v.to_le_bytes()); + } + let bytes = build_safetensors("transformer.output_head.bias", "I64", &shape, &data); + let path = write_temp(&bytes, "i64"); + + let map = load_safetensors(&path, &device).expect("i64 checkpoint must load"); + let t = map.get("transformer.output_head.bias").expect("key present"); + assert_eq!(t.dims(), &[3]); + assert_eq!(t.elem_count(), 3); + let got: Vec = t.to_vec1::().expect("to_vec i64"); + assert_eq!(got, vec![10, -20, 30]); + + let _ = std::fs::remove_file(&path); +} diff --git a/v2/crates/wifi-densepose-occworld-candle/tests/input_validation.rs b/v2/crates/wifi-densepose-occworld-candle/tests/input_validation.rs new file mode 100644 index 00000000..d28e28bd --- /dev/null +++ b/v2/crates/wifi-densepose-occworld-candle/tests/input_validation.rs @@ -0,0 +1,139 @@ +//! Input-validation boundary tests for `OccWorldCandle::predict`. +//! +//! Security review (Milestone #9, crate 4/4). `predict` takes an +//! externally-supplied occupancy tensor; per the project's "validate input at +//! system boundaries" rule it must reject degenerate / out-of-capacity shapes +//! with a clear domain error rather than surfacing a cryptic deep-pipeline +//! Candle error (over-capacity frame counts over-index the temporal positional +//! embedding) or processing a zero-element tensor. +//! +//! These exercise only the public API and live here (not inline in +//! `inference.rs`) to keep that module under the 500-line cap. + +use candle_core::{DType, Device, Tensor}; +use wifi_densepose_occworld_candle::config::OccWorldConfig; +use wifi_densepose_occworld_candle::inference::OccWorldCandle; +use wifi_densepose_occworld_candle::error::OccWorldError; + +fn small_cfg() -> OccWorldConfig { + OccWorldConfig { + grid_h: 8, + grid_w: 8, + grid_d: 4, + num_classes: 4, + free_class: 3, + base_channels: 8, + z_channels: 8, + codebook_size: 4, + embed_dim: 8, + num_frames: 2, + token_h: 4, + token_w: 4, + num_heads: 2, + num_layers: 1, + ffn_hidden: 16, + } +} + +/// Zero frames is a degenerate input that would otherwise feed a zero-element +/// tensor into the reshape/conv pipeline. Must be rejected at the boundary. +#[test] +fn predict_rejects_zero_frames() { + let device = Device::Cpu; + let cfg = small_cfg(); + let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap(); + let past = Tensor::zeros( + (1usize, 0usize, cfg.grid_h, cfg.grid_w, cfg.grid_d), + DType::U8, + &device, + ) + .unwrap(); + let result = engine.predict(&past); + assert!( + matches!(result, Err(OccWorldError::ShapeMismatch(_))), + "zero-frame input must be rejected with ShapeMismatch" + ); +} + +/// Zero batch must also be rejected (same zero-element-tensor hazard). +#[test] +fn predict_rejects_zero_batch() { + let device = Device::Cpu; + let cfg = small_cfg(); + let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap(); + let past = Tensor::zeros( + (0usize, cfg.num_frames, cfg.grid_h, cfg.grid_w, cfg.grid_d), + DType::U8, + &device, + ) + .unwrap(); + let result = engine.predict(&past); + assert!( + matches!(result, Err(OccWorldError::ShapeMismatch(_))), + "zero-batch input must be rejected with ShapeMismatch" + ); +} + +/// More frames than the temporal embedding can index (`> num_frames*2`). +/// +/// On the old code this over-indexed the temporal positional embedding deep in +/// the transformer and surfaced as a cryptic Candle "gather" `InvalidIndex` +/// error. The boundary guard now rejects it cleanly with `ShapeMismatch`. +#[test] +fn predict_rejects_too_many_frames() { + let device = Device::Cpu; + let cfg = small_cfg(); // num_frames = 2 → temporal capacity = 4 + let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap(); + let too_many = cfg.num_frames * 2 + 1; + let past = Tensor::zeros( + (1usize, too_many, cfg.grid_h, cfg.grid_w, cfg.grid_d), + DType::U8, + &device, + ) + .unwrap(); + let result = engine.predict(&past); + assert!( + matches!(result, Err(OccWorldError::ShapeMismatch(_))), + "over-capacity frame count must be rejected with ShapeMismatch" + ); +} + +/// A frame count exactly at capacity (`num_frames*2`) must still succeed — +/// the guard rejects only *over*-capacity, not the boundary value. +#[test] +fn predict_accepts_frame_count_at_capacity() { + let device = Device::Cpu; + let cfg = small_cfg(); + let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap(); + let at_cap = cfg.num_frames * 2; + let past = Tensor::zeros( + (1usize, at_cap, cfg.grid_h, cfg.grid_w, cfg.grid_d), + DType::U8, + &device, + ) + .unwrap(); + let out = engine + .predict(&past) + .expect("at-capacity frame count must predict"); + assert_eq!(out.sem_pred.dims()[1], at_cap, "frame dim preserved"); +} + +/// Wrong spatial geometry (H/W/D) is still rejected — pins the pre-existing +/// guard alongside the new frame/batch ones. +#[test] +fn predict_rejects_wrong_grid_dims() { + let device = Device::Cpu; + let cfg = small_cfg(); + let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap(); + let past = Tensor::zeros( + (1usize, cfg.num_frames, cfg.grid_h + 1, cfg.grid_w, cfg.grid_d), + DType::U8, + &device, + ) + .unwrap(); + let result = engine.predict(&past); + assert!( + matches!(result, Err(OccWorldError::ShapeMismatch(_))), + "wrong grid dims must be rejected with ShapeMismatch" + ); +}