# ADR-179: `wifi-densepose-occworld-candle` Checkpoint-Load Hardening
| Field | Value |
|-------|-------|
| **Status** | Accepted — 1 HIGH + 2 LOW bugs fixed + pinned (MEASURED on Windows) |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **OCCWORLD-DTYPE** |
| **Reviews** | `wifi-densepose-occworld-candle` (Candle occupancy-world model) |
| **Milestone** | #9 (ungated-crate security sweep) — crate 4 of 4 — **CLOSES the milestone** |
## Context
`wifi-densepose-occworld-candle` is a Candle-based occupancy-world model
(VQ-VAE + transformer over occupancy tokens). The real risk surface for an ML
crate is degenerate-input / malformed-weights handling: a `#[forbid(unsafe_code)]`
crate can still **panic** (a DoS, and under WASM an abort) when a tensor op hits an
inconsistent shape. The crate **builds and tests on Windows**, so all findings are
MEASURED.
## Decision
Fix the three reachable bugs, each pinned by a fails-on-old test; attest the rest
clean with evidence.
### Findings fixed (all MEASURED)
| # | Severity | Location | Issue | Fix |
|---|----------|----------|-------|-----|
| 1 | **HIGH** | `model.rs:95` (`Dtype::I32 => Some(DType::I64)`) | **Crash on any int32-tensor checkpoint.** An I32 byte buffer (4 B/elem) is handed to `from_raw_buffer(.., I64, shape, ..)`; candle derives `elem_count = data.len()/8`, **halving** the count while keeping the original shape → a tensor that claims 2× its storage. Reading it **panics** with a slice-OOB (`range end index 6 out of range for slice of length 3`) inside candle-core. A checkpoint with any int32 tensor (index/buffer tensors are common in PyTorch exports) → **DoS on load**. | Map `I32 → DType::I32`, `I16 → DType::I16` (both first-class candle dtypes). Pinned by `int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new). |
| 2 | LOW | `inference.rs::predict` | Frame/batch dims weren't validated (only H/W/D were): `f_in > num_frames*2` over-indexes the temporal embedding → a cryptic candle `InvalidIndex` *error* (not a panic — candle bounds-checks); zero frame/batch feeds a zero-element tensor. | Boundary guard rejects zero / over-capacity frame+batch with a clear `ShapeMismatch`. 5 pins. |
| 3 | LOW | `vqvae.rs:141` (`z.elem_count() / last`) | **Divide-by-zero panic** in public `VQCodebook::encode` on a rank-0 / empty-last-dim tensor (`last == 0`). | Fail-closed guard returns a clear error. Pinned by `encode_rejects_scalar_without_panicking`. |
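
Findings #2 and #3 are both fail-closed boundary guards. The following is a minimal std-only sketch of that pattern, not the crate's code: `GuardError`, `check_frame_batch`, and `rows_for_encode` are hypothetical names, shapes are modelled as plain `&[usize]` slices rather than candle tensors, and the batch upper bound is left out because this ADR does not state it.

```rust
/// Hypothetical error type; the crate's real `ShapeMismatch` may carry other fields.
#[derive(Debug, PartialEq)]
enum GuardError {
    ShapeMismatch(String),
}

/// Finding #2 style guard: reject zero or over-capacity frame/batch dims
/// before any tensor op runs. The `num_frames * 2` capacity is taken from the
/// table above; the batch upper bound is not given there, so only zero is checked.
fn check_frame_batch(batch: usize, f_in: usize, num_frames: usize) -> Result<(), GuardError> {
    let max_frames = num_frames.saturating_mul(2);
    if batch == 0 {
        return Err(GuardError::ShapeMismatch("batch must be non-zero".to_string()));
    }
    if f_in == 0 || f_in > max_frames {
        return Err(GuardError::ShapeMismatch(format!(
            "f_in = {} is outside 1..={}",
            f_in, max_frames
        )));
    }
    Ok(())
}

/// Finding #3 style guard: compute `elem_count / last` only after proving
/// `last != 0`. A rank-0 shape has no last dimension and is rejected too.
fn rows_for_encode(shape: &[usize]) -> Result<usize, GuardError> {
    let last = shape.last().copied().unwrap_or(0);
    if last == 0 {
        return Err(GuardError::ShapeMismatch(
            "encode needs a non-empty last dimension".to_string(),
        ));
    }
    let elem_count: usize = shape.iter().product();
    Ok(elem_count / last)
}
```

In this sketch, `rows_for_encode(&[])` and `rows_for_encode(&[5, 0])` both return an error instead of dividing by zero, which is the behaviour a pin such as `encode_rejects_scalar_without_panicking` would assert.
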

The HIGH finding is the notable one: by misdeclaring the dtype, the crate's own
mapping **defeated** the byte-length guarantee of the upstream `safetensors::validate()`.
That mapping was the one place malformed/widened weights could reach a panicking candle op.
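
The arithmetic behind finding #1 can be reproduced without candle. This is an illustrative sketch only: `ElemType`, `backed_elems`, and `check_storage` are hypothetical stand-ins, not candle or safetensors APIs, and they model just the relationship between byte length, element width, and shape.

```rust
/// Stand-in for a tensor dtype; only the element width matters here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ElemType {
    I16,
    I32,
    I64,
}

impl ElemType {
    fn size_in_bytes(self) -> usize {
        match self {
            ElemType::I16 => 2,
            ElemType::I32 => 4,
            ElemType::I64 => 8,
        }
    }
}

/// Number of elements a buffer of `byte_len` bytes backs when read as `dtype`.
fn backed_elems(byte_len: usize, dtype: ElemType) -> usize {
    byte_len / dtype.size_in_bytes()
}

/// Returns the element count only if the storage matches what the shape claims.
fn check_storage(byte_len: usize, dtype: ElemType, shape: &[usize]) -> Result<usize, String> {
    // Overflow-checked product of the dimensions (a rank-0 shape claims 1 element).
    let claimed = shape
        .iter()
        .try_fold(1usize, |acc, &d| acc.checked_mul(d))
        .ok_or_else(|| "shape product overflows usize".to_string())?;
    if byte_len % dtype.size_in_bytes() != 0 {
        return Err(format!(
            "{} bytes is not a whole number of {:?} elements",
            byte_len, dtype
        ));
    }
    let backed = backed_elems(byte_len, dtype);
    if backed != claimed {
        return Err(format!(
            "shape claims {} elements but storage backs {}",
            claimed, backed
        ));
    }
    Ok(claimed)
}
```

For a 6-element int32 tensor (24 bytes), `check_storage(24, ElemType::I32, &[6])` succeeds, while declaring the same bytes as `ElemType::I64` leaves only 3 backed elements for a shape that claims 6. That is the same 6-versus-3 mismatch that appears in the slice-OOB panic message quoted in the table.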
### Dimensions confirmed clean (with evidence)
- **Panic surface** — grep for `unwrap()/expect()/panic!/unreachable!` across `src/`
→ **zero in production paths**; all ops use `?`/`map_err`; the `last().unwrap_or(&0)`
is now guarded. `as` casts operate only on config-bounded/internal values.
- **NaN-state-poisoning (the named class) — N/A.** The engine is **stateless between
`predict` calls** (no persistent world-model buffer to latch into), and input is
`u8` class indices (non-finite input structurally impossible). NaN weights flow to
`argmax` (deterministic, bounded to a valid class index) — no panic, no persistence.
- **Unbounded alloc / shape-data mismatch from malformed weights** — defended upstream
by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared
byte range + contiguous-offset + buffer-length checks), rejected before reaching
candle. Finding #1 was the one place the crate defeated that guarantee.
- **Model/path loading** — `load`/`load_safetensors` check `path.exists()` → typed
`CheckpointNotFound`; corrupt bytes → `CheckpointParse` (pinned). No path-traversal
surface (caller-supplied path, opened read-only, never joined with untrusted segments).
- **Secrets** — grep clean (only `token_h`/`token_w` config fields match `token`).
- **Determinism** — the crate's central honesty claim, verified by the pre-existing
`tests/predict_honesty.rs` (3 tests, still pass).
- `unsafe_code = "forbid"` in the manifest.
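
The typed-error loading path described under "Model/path loading" could look roughly like the sketch below. It is an assumption-laden illustration, not the crate's implementation: `LoadError`, `load_checkpoint_bytes`, and `check_framing` are hypothetical names, read errors are folded into `CheckpointParse` for brevity, and the framing check (an 8-byte little-endian header length followed by that many header bytes, as in the safetensors file layout) stands in for the full upstream validation.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Hypothetical error type mirroring the two variants named in this ADR.
#[derive(Debug)]
enum LoadError {
    CheckpointNotFound(PathBuf),
    CheckpointParse(String),
}

/// Minimal framing check: the file must start with an 8-byte little-endian
/// header length, and that many header bytes must actually be present.
fn check_framing(bytes: &[u8]) -> Result<(), LoadError> {
    if bytes.len() < 8 {
        return Err(LoadError::CheckpointParse(
            "file is shorter than the 8-byte header length".to_string(),
        ));
    }
    let mut prefix = [0u8; 8];
    prefix.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(prefix);
    let available = (bytes.len() - 8) as u64;
    if header_len > available {
        return Err(LoadError::CheckpointParse(format!(
            "header claims {} bytes but only {} follow",
            header_len, available
        )));
    }
    Ok(())
}

/// Opens the caller-supplied path read-only and returns typed errors
/// instead of panicking on a missing or corrupt file.
fn load_checkpoint_bytes(path: &Path) -> Result<Vec<u8>, LoadError> {
    if !path.exists() {
        return Err(LoadError::CheckpointNotFound(path.to_path_buf()));
    }
    // Simplification: I/O failures are reported as parse errors in this sketch.
    let bytes = fs::read(path).map_err(|e| LoadError::CheckpointParse(e.to_string()))?;
    check_framing(&bytes)?;
    Ok(bytes)
}
```

A caller can then match on the variant: a missing file yields `CheckpointNotFound`, and truncated or inconsistent bytes yield `CheckpointParse`, with no panic on either path.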
## Validation
- `cargo test -p wifi-densepose-occworld-candle --no-default-features` → **31/31**
(lib 17, checkpoint_loading 4, input_validation 5, predict_honesty 3, doctests 2),
0 failed.
- `cargo test --workspace --no-default-features` → 0 failed across every crate. A lone
`wifi-densepose-desktop --test api_integration` failure, "Access is denied (os error 5)",
was a Windows file-lock/AV flake unrelated to this change; re-run in isolation, it passed 21/21.
- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash `f8e76f21…46f7a`
unchanged (occworld off the signal proof path).
## Consequences
### Positive
- A checkpoint-load DoS (the int32 dtype-widening panic) and two degenerate-input
panics are closed in the world-model crate, each pinned. **Milestone #9 (all 4
ungated crates) is complete.**
### Negative / Neutral
- None. Guards reject only malformed/degenerate inputs.
## Links
- ADR-176 / ADR-177 / ADR-178 — sibling Milestone-#9 reviews (ruview-swarm, nvsim, desktop)