# ADR-179: `wifi-densepose-occworld-candle` Checkpoint-Load Hardening

| Field | Value |
|-------|-------|
| **Status** | Accepted — 1 HIGH + 2 LOW bugs fixed + pinned (MEASURED on Windows) |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **OCCWORLD-DTYPE** |
| **Reviews** | `wifi-densepose-occworld-candle` (Candle occupancy-world model) |
| **Milestone** | #9 (ungated-crate security sweep) — crate 4 of 4 — **CLOSES the milestone** |

## Context

`wifi-densepose-occworld-candle` is a Candle-based occupancy-world model
(VQ-VAE + transformer over occupancy tokens). The real risk surface for an ML
crate is degenerate-input / malformed-weights handling: a `#[forbid(unsafe_code)]`
crate can still **panic** (a DoS, and under WASM an abort) when a tensor op hits an
inconsistent shape. The crate **builds and tests on Windows**, so all findings are
MEASURED.

## Decision

Fix the three reachable bugs, each pinned by a fails-on-old test; attest the rest
clean with evidence.

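To make the phrase "pinned by a fails-on-old test" concrete, the sketch below uses a divide-by-zero of the kind listed as finding #3. It is an illustration only: `rows_unguarded`, `rows_guarded`, and the test name are hypothetical stand-ins rather than the crate's code, and the error type is simplified to `String`.

```rust
/// Hypothetical stand-in for the pre-fix row count: it divides by the last
/// dimension without checking it, so a rank-0 shape or a zero last dim panics.
fn rows_unguarded(shape: &[usize]) -> usize {
    let elem_count: usize = shape.iter().product(); // empty product is 1
    let last = *shape.last().unwrap_or(&0);
    elem_count / last // integer division by zero panics when last == 0
}

/// Fail-closed variant: degenerate shapes become an error instead of a panic.
fn rows_guarded(shape: &[usize]) -> Result<usize, String> {
    let elem_count: usize = shape.iter().product();
    match shape.last().copied() {
        Some(last) if last > 0 => Ok(elem_count / last),
        _ => Err(format!("expected a non-empty last dimension, got shape {shape:?}")),
    }
}

/// A pin in the "fails on old, passes on new" sense (hypothetical name).
#[test]
fn degenerate_shapes_are_rejected_without_panicking() {
    assert!(rows_guarded(&[]).is_err());
    assert!(rows_guarded(&[4, 0]).is_err());
    assert_eq!(rows_guarded(&[4, 8]), Ok(4));
}
```

Pointed at `rows_unguarded`, the same degenerate shapes would panic with a divide-by-zero, which is what makes such a test fail on the old code and pass on the new.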
### Findings fixed (all MEASURED)

| # | Severity | Location | Issue | Fix |
|---|----------|----------|-------|-----|
| 1 | **HIGH** | `model.rs:95` (`Dtype::I32 => Some(DType::I64)`) | **Crash on any int32-tensor checkpoint.** An I32 byte buffer (4 B/elem) is handed to `from_raw_buffer(.., I64, shape, ..)`; candle derives `elem_count = data.len()/8`, **halving** the count while keeping the original shape → a tensor that claims 2× its storage. Reading it **panics** with a slice-OOB (`range end index 6 out of range for slice of length 3`) inside candle-core. A checkpoint with any int32 tensor (index/buffer tensors are common in PyTorch exports) → **DoS on load**. | Map `I32 → DType::I32`, `I16 → DType::I16` (both first-class candle dtypes). Pinned by `int32_tensor_loads_with_consistent_shape_and_values` (panics on old, passes on new). |
| 2 | LOW | `inference.rs::predict` | Frame/batch dims weren't validated (only H/W/D were): `f_in > num_frames*2` over-indexes the temporal embedding → a cryptic candle `InvalidIndex` *error* (not a panic — candle bounds-checks); zero frame/batch feeds a zero-element tensor. | Boundary guard rejects zero / over-capacity frame+batch with a clear `ShapeMismatch`. 5 pins. |
| 3 | LOW | `vqvae.rs:141` (`z.elem_count() / last`) | **Divide-by-zero panic** in public `VQCodebook::encode` on a rank-0 / empty-last-dim tensor (`last == 0`). | Fail-closed guard returns a clear error. Pinned by `encode_rejects_scalar_without_panicking`. |

The HIGH finding is the notable one: the crate's own dtype mapping **defeated**
the upstream `safetensors::validate()` byte-length guarantee by misdeclaring the
dtype — the one place malformed/widened weights could reach a panicking candle op.

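The arithmetic behind finding #1 can be reproduced without candle. The following is a minimal sketch under stated assumptions: the `Dtype` enum, `map_old`, `map_new`, and the helper functions are hypothetical and model only the I32-to-I64 case described in the table, not the crate's actual mapping function or candle's internals.

```rust
/// Hypothetical subset of integer dtypes with their element widths.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Dtype {
    I32,
    I64,
}

impl Dtype {
    /// Element width in bytes.
    fn size(self) -> usize {
        match self {
            Dtype::I32 => 4,
            Dtype::I64 => 8,
        }
    }
}

/// Pre-fix mapping as described in finding #1: I32 is declared as I64.
fn map_old(src: Dtype) -> Dtype {
    match src {
        Dtype::I32 => Dtype::I64,
        other => other,
    }
}

/// Post-fix mapping: the on-disk dtype is kept as is.
fn map_new(src: Dtype) -> Dtype {
    src
}

/// Number of elements the byte buffer holds when read as `declared`.
fn storage_elems(byte_len: usize, declared: Dtype) -> usize {
    byte_len / declared.size()
}

/// True when the shape claims exactly as many elements as the storage holds.
fn shape_matches_storage(shape: &[usize], byte_len: usize, declared: Dtype) -> bool {
    shape.iter().product::<usize>() == storage_elems(byte_len, declared)
}
```

For a 2×3 int32 tensor the buffer is 24 bytes. Read with the old mapping it holds 24 / 8 = 3 elements while the shape still claims 6, which matches the `range end index 6 out of range for slice of length 3` panic quoted in the table. With the new mapping both counts are 6.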
### Dimensions confirmed clean (with evidence)

- **Panic surface** — grep for `unwrap()/expect()/panic!/unreachable!` across `src/`
  → **zero in production paths**; all ops use `?`/`map_err`; the `last().unwrap_or(&0)`
  is now guarded. `as` casts operate only on config-bounded/internal values.
- **NaN-state-poisoning (the named class) — N/A.** The engine is **stateless between
  `predict` calls** (no persistent world-model buffer to latch into), and input is
  `u8` class indices (non-finite input structurally impossible). NaN weights flow to
  `argmax` (deterministic, bounded to a valid class index) — no panic, no persistence.
- **Unbounded alloc / shape-data mismatch from malformed weights** — defended upstream
  by `safetensors::validate()` (overflow-checked `nelements*dtype.size()` vs declared
  byte range + contiguous-offset + buffer-length checks), rejected before reaching
  candle. Finding #1 was the one place the crate defeated that guarantee.
- **Model/path loading** — `load`/`load_safetensors` check `path.exists()` → typed
  `CheckpointNotFound`; corrupt bytes → `CheckpointParse` (pinned). No path-traversal
  surface (caller-supplied path, opened read-only, never joined with untrusted segments).
- **Secrets** — grep clean (only `token_h`/`token_w` config fields match `token`).
- **Determinism** — the crate's central honesty claim, verified by the pre-existing
  `tests/predict_honesty.rs` (3 tests, still pass).
- `unsafe_code = "forbid"` in the manifest.

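The byte-length guarantee mentioned in the list above can be sketched as a standalone check. This is an illustration of the kind of validation described (overflow-checked size, byte range within the buffer), not the `safetensors` implementation; `validate_tensor` and its parameters are hypothetical, and the contiguous-offset check is omitted for brevity.

```rust
/// Hypothetical single-tensor check in the spirit of the validation described above.
/// `begin..end` is the tensor's declared byte range inside a buffer of `buffer_len` bytes.
fn validate_tensor(
    shape: &[usize],
    elem_size: usize,
    begin: usize,
    end: usize,
    buffer_len: usize,
) -> Result<(), String> {
    // Multiply the dimensions with overflow checks so a huge shape cannot wrap around.
    let nelements = shape
        .iter()
        .try_fold(1usize, |acc, &dim| acc.checked_mul(dim))
        .ok_or("element count overflows usize")?;
    let nbytes = nelements
        .checked_mul(elem_size)
        .ok_or("byte size overflows usize")?;

    // The declared range must be well ordered and lie inside the buffer.
    if begin > end || end > buffer_len {
        return Err(format!(
            "byte range {begin}..{end} exceeds buffer of {buffer_len} bytes"
        ));
    }
    // The range must hold exactly the bytes that the shape and element size imply.
    if end - begin != nbytes {
        return Err(format!(
            "shape {shape:?} needs {nbytes} bytes, range holds {}",
            end - begin
        ));
    }
    Ok(())
}
```

A 2×3 int32 tensor occupying 24 bytes passes with `elem_size = 4`; the same range checked with `elem_size = 8` is rejected. A check like this only protects a loader that keeps the validated dtype, which is why the re-declaration in finding #1 slipped past it.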
## Validation

- `cargo test -p wifi-densepose-occworld-candle --no-default-features` → **31/31**
  (lib 17, checkpoint_loading 4, input_validation 5, predict_honesty 3, doctests 2),
  0 failed.
- `cargo test --workspace --no-default-features` → 0 failed across every crate (a lone
  `wifi-densepose-desktop --test api_integration` "Access is denied (os error 5)" was a
  Windows file-lock/AV flake — re-ran isolated 21/21, unrelated).
- `python archive/v1/data/proof/verify.py` → **VERDICT: PASS**, hash `f8e76f21…46f7a`
  unchanged (occworld off the signal proof path).

## Consequences

### Positive

- A checkpoint-load DoS (the int32 dtype-widening panic) and two degenerate-input
  bugs (a divide-by-zero panic in `VQCodebook::encode` and a cryptic `InvalidIndex`
  error in `predict`) are closed in the world-model crate, each pinned.
  **Milestone #9 (all 4 ungated crates) is complete.**

### Negative / Neutral

- None. Guards reject only malformed/degenerate inputs.

## Links
- ADR-176 / ADR-177 / ADR-178 — sibling Milestone-#9 reviews (ruview-swarm, nvsim, desktop)