Commit Graph

3 Commits

Author SHA1 Message Date
rUv c859f6f743
security(occworld-candle): int32-checkpoint crash + degenerate-input guards + ADR-179 (closes Milestone #9) (#1101)
* fix(occworld-candle): security review fixes — int32 checkpoint crash + predict input validation

Beyond-SOTA security + correctness review of wifi-densepose-occworld-candle
(Milestone #9, crate 4/4 — the last ungated crate).

Findings fixed:

1. HIGH (MEASURED) — checkpoint-load crash on any int32 tensor.
   model.rs mapped safetensors I32 -> candle DType::I64 and passed the raw
   int32 byte buffer (4 bytes/elem) to Tensor::from_raw_buffer(.., I64, ..).
   Candle derives elem_count = data.len() / dtype.size(), so the I64 path
   halved the count while keeping the original shape -> a tensor whose shape
   claims 2x its storage. Reading it PANICS (slice OOB: "range end index 6
   out of range for slice of length 3") on any checkpoint containing an int32
   tensor. Fixed: I32 -> DType::I32, I16 -> DType::I16 (both first-class
   candle dtypes). Reproduced on old code; pinned in tests/checkpoint_loading.rs.

2. LOW (MEASURED) — predict() lacked frame/batch validation at the input
   boundary. f_in > num_frames*2 over-indexed the temporal embedding (surfacing
   as a cryptic candle "gather" error), and a zero frame or batch count fed a
   zero-element tensor in. Both are now rejected with a clear ShapeMismatch.
   Pinned in tests/input_validation.rs.

3. LOW (MEASURED) — divide-by-zero panic in the public VQCodebook::encode on a
   rank-0 / empty-last-dim tensor (last == 0). Now fails closed with a clear
   error. Pinned in vqvae.rs unit tests.
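
The arithmetic behind finding 1 can be sketched in plain Rust, without Candle. The `elem_count` helper is hypothetical and only mirrors the `data.len() / dtype.size()` rule quoted above; the six-element shape is an assumption chosen to match the numbers in the quoted panic message.

```rust
/// Hypothetical helper mirroring the rule quoted in finding 1:
/// element count = byte length / element size.
fn elem_count(byte_len: usize, elem_size: usize) -> usize {
    byte_len / elem_size
}

fn main() {
    // Assumed example: a checkpoint tensor of shape [6] stored as int32.
    let shape_elems = 6usize;
    let raw: Vec<u8> = vec![0u8; shape_elems * std::mem::size_of::<i32>()];

    // Old mapping (I32 -> I64): 24 bytes / 8 = 3 elements of storage,
    // while the shape still claims 6.
    let as_i64 = elem_count(raw.len(), std::mem::size_of::<i64>());
    // Fixed mapping (I32 -> I32): 24 bytes / 4 = 6 elements, matching the shape.
    let as_i32 = elem_count(raw.len(), std::mem::size_of::<i32>());

    println!("shape claims {shape_elems}, I64 storage {as_i64}, I32 storage {as_i32}");
    assert_ne!(as_i64, shape_elems); // the mismatch behind the slice-OOB panic
    assert_eq!(as_i32, shape_elems); // consistent after the fix
}
```

Reading index 5 of a three-element buffer is the out-of-range access; keeping the dtype at its stored width keeps shape and storage in agreement.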

Dimensions confirmed clean with evidence: panic surface (no unwrap/expect/
panic in prod paths), NaN-state-poisoning (N/A — stateless engine, u8 input),
unbounded-alloc/shape-data mismatch (defended upstream by safetensors::
validate), secrets (none). unsafe_code = forbid.

Validation (MEASURED, Windows): crate 31/31 pass; workspace 0 failed (lone
desktop api_integration "Access is denied" file-lock flake passes 21/21 in
isolation); Python proof VERDICT PASS, hash f8e76f21…446f7a unchanged.

Warrants ADR slot 179 (to be authored by the parent).

Co-Authored-By: claude-flow <ruv@ruv.net>

* docs(adr): ADR-179 — occworld-candle checkpoint-load hardening (closes Milestone #9)

Records the HIGH int32-checkpoint crash fix (I32→I64 dtype widening caused a
slice-OOB panic on load, i.e. a DoS) plus the two LOW degenerate-input fixes
from 5e77f47e5. The engine is stateless (NaN poisoning N/A), unsafe code is
forbidden, and safetensors validate() guards against unbounded allocation
upstream. occworld tests: 31/31 pass. Final ungated crate; Milestone #9 complete.
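
The frame/batch guard among the LOW fixes recorded here can be sketched as follows. This is a minimal illustration, not the crate's code: `SketchError`, `validate_input`, and the parameter names are hypothetical, and the bound of `num_frames * 2` rows is taken from the description of the over-indexing bug.

```rust
/// Hypothetical error type; the crate's real ShapeMismatch variant may carry
/// different fields.
#[derive(Debug, PartialEq)]
enum SketchError {
    ShapeMismatch(String),
}

/// Assumed boundary check: reject zero batch/frame counts and any f_in larger
/// than the temporal embedding table (taken here to hold num_frames * 2 rows).
fn validate_input(batch: usize, f_in: usize, num_frames: usize) -> Result<(), SketchError> {
    if batch == 0 || f_in == 0 {
        return Err(SketchError::ShapeMismatch(format!(
            "batch ({batch}) and f_in ({f_in}) must be non-zero"
        )));
    }
    let max_frames = num_frames * 2;
    if f_in > max_frames {
        return Err(SketchError::ShapeMismatch(format!(
            "f_in ({f_in}) exceeds temporal embedding size ({max_frames})"
        )));
    }
    Ok(())
}

fn main() {
    // Illustrative values only; they are not taken from the crate's config.
    assert!(validate_input(1, 4, 8).is_ok());
    assert!(validate_input(0, 4, 8).is_err()); // zero batch
    assert!(validate_input(1, 17, 8).is_err()); // 17 > 8 * 2
    println!("ok");
}
```

Checking at the boundary turns a late, cryptic indexing failure into an early error that names the offending dimension.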

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-15 12:35:29 -04:00
ruv 2754af804e feat(occworld): real conv encoder/decoder forward pass + honesty flag
Replace the `Tensor::randn` stubs in occworld-candle's VQVAE encoder
(`encode_occupancy`) and decoder (`decode_to_logits`) with a real,
deterministic, input-dependent convolutional forward pass. Previously
`predict()` emitted trajectory waypoints + confidence that were a function
of RANDOM NOISE, independent of the input and silently presented as model
output — the exact "AI slop" the project must eliminate.

occworld-candle:
- New `cnn.rs`: `Encoder2D` (3× Conv2d + GELU, interpolate2d to pin the
  token grid) and `Decoder2D` (upsample_nearest2d + Conv2d + 1×1 head).
  Both are deterministic functions of the input — same input → identical
  output; different input → different output. No randn in any forward path.
- Deterministic weight init (`det_fill`, seeded xorshift64*) across all
  `dummy()` constructors (encoder/decoder, VQ codebook, quant-convs,
  transformer), so untrained engines are bit-for-bit reproducible.
- `InferenceOutput.weights_trained: bool` — honest disclosure flag. `false`
  for `dummy()` (real but untrained net), `true` only after `load()` reads a
  real checkpoint. Priors are always from the real forward pass, never faked.
- VQ codebook + quant/post-quant convs kept and wired encoder→VQ→decoder.
- Centerpiece tests in `tests/predict_honesty.rs` (input-dependence,
  run-to-run + cross-engine determinism, untrained flag). All three FAIL on
  the old randn stub (verified by temporarily reinstating randn).
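
A seeded xorshift64* fill of the kind described above might look like the sketch below. The generator step is the standard xorshift64* recurrence; `det_fill_sketch`, its signature, the zero-seed fallback, and the mapping to `[-scale, scale)` are assumptions for illustration and are not taken from the crate's `det_fill`.

```rust
/// One xorshift64* step: three xor-shifts on the state, then a multiply.
fn xorshift64star(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x >> 12;
    x ^= x << 25;
    x ^= x >> 27;
    *state = x;
    x.wrapping_mul(0x2545_F491_4F6C_DD1D)
}

/// Hypothetical deterministic fill: `n` values in [-scale, scale).
fn det_fill_sketch(seed: u64, n: usize, scale: f32) -> Vec<f32> {
    // xorshift state must be non-zero; substitute a fixed constant for seed 0.
    let mut state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
    (0..n)
        .map(|_| {
            let bits = xorshift64star(&mut state) >> 40; // keep the top 24 bits
            let unit = bits as f32 / (1u32 << 24) as f32; // uniform in [0, 1)
            (unit * 2.0 - 1.0) * scale
        })
        .collect()
}

fn main() {
    let a = det_fill_sketch(42, 8, 0.02);
    let b = det_fill_sketch(42, 8, 0.02);
    let c = det_fill_sketch(43, 8, 0.02);
    assert_eq!(a, b); // same seed gives identical weights
    assert_ne!(a, c); // a different seed gives different weights
    println!("{a:?}");
}
```

Because the generator uses only integer operations and a fixed seed, the same values come out on every run, which is the property the determinism tests depend on.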

pointcloud:
- Optimize `to_gaussian_splats` hot path: 9 separate `.iter().sum()` passes
  per voxel → 2 fused accumulation passes. Bit-identical output.
- `benches/splats_bench.rs` (criterion) measures old 9-pass vs new 2-pass
  with a parity guard. ~1.3× faster on representative cloud sizes.
- Confirmed: no `randn`/placeholder in any claimed production path. The
  remaining synthetic generators (`send_test_frames`, `demo_depth_cloud`)
  and honestly-flagged heuristics (`heuristic_pose_from_amplitude`,
  luminance pseudo-depth fallback) are explicitly disclosed, not faked output.
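
The pass-fusion idea can be illustrated with a small sketch. The nine per-voxel quantities are not listed in this message, so the example assumes six (per-axis mean and variance) and fuses them into two passes; `Point`, `stats_multi_pass`, and `stats_fused` are hypothetical names, and a non-empty input is assumed.

```rust
type Point = [f32; 3];

/// Multi-pass style: one `.iter().sum()` per statistic (six passes here).
fn stats_multi_pass(pts: &[Point]) -> ([f32; 3], [f32; 3]) {
    let n = pts.len() as f32;
    let mean_of = |k: usize| pts.iter().map(|p| p[k]).sum::<f32>() / n;
    let mean = [mean_of(0), mean_of(1), mean_of(2)];
    let var_of = |k: usize| {
        pts.iter()
            .map(|p| {
                let d = p[k] - mean[k];
                d * d
            })
            .sum::<f32>()
            / n
    };
    (mean, [var_of(0), var_of(1), var_of(2)])
}

/// Fused style: pass 1 accumulates all three sums, pass 2 all three
/// squared deviations. Each accumulator sees its values in the same order
/// as the multi-pass version.
fn stats_fused(pts: &[Point]) -> ([f32; 3], [f32; 3]) {
    let n = pts.len() as f32;
    let mut sum = [0.0f32; 3];
    for p in pts {
        for k in 0..3 {
            sum[k] += p[k];
        }
    }
    let mean = [sum[0] / n, sum[1] / n, sum[2] / n];
    let mut sq = [0.0f32; 3];
    for p in pts {
        for k in 0..3 {
            let d = p[k] - mean[k];
            sq[k] += d * d;
        }
    }
    (mean, [sq[0] / n, sq[1] / n, sq[2] / n])
}

fn main() {
    let pts: Vec<Point> = vec![
        [1.0, 2.0, 3.0],
        [2.0, 4.0, 6.0],
        [3.0, 6.0, 9.0],
        [0.5, 0.25, 0.125],
    ];
    // Parity guard in the spirit of the benchmark: both variants must agree.
    assert_eq!(stats_multi_pass(&pts), stats_fused(&pts));
    println!("{:?}", stats_fused(&pts));
}
```

Fusing reduces the number of traversals over the points without changing the order in which each accumulator adds its values, which is why the two variants agree on this input.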

DATA-GATED: a trained checkpoint is not yet available. An untrained-but-real
net is the honest deliverable; accuracy is flagged via `weights_trained`,
never claimed.

Tests: occworld 16 unit + 3 integration + 2 doc, pointcloud 18 — all pass
(CPU `Device::Cpu`; CUDA feature is GPU-gated and untouched).

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-11 21:47:19 -04:00
ruv 9ad550d95f feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6)
Candle native port — wifi-densepose-occworld-candle v0.3.0:
- config.rs: OccWorldConfig (14 params matching occworld.py)
- vqvae.rs: ClassEmbedding(18→64), VQCodebook(512×512, squared-L2),
  QuantConv/PostQuantConv(1×1 Conv2d), fold_3d_to_2d helpers.
  ResNet encoder/decoder are documented stubs (Phase 5 checkpoint pending).
- transformer.rs: full Candle MHA transformer (2 layers, temporal+spatial
  cross-attention, FFN, pre-norm residuals)
- inference.rs: OccWorldCandle::dummy() + ::load() + predict()
  InferenceOutput: sem_pred(1,15,200,200,16) + trajectory_priors
- 14/14 tests pass (12 lib + 2 doctests)
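
A squared-L2 codebook lookup of the kind `VQCodebook` performs can be sketched in plain Rust. This uses nested `Vec`s rather than Candle tensors; `nearest_code` is a hypothetical name, and the empty-input check is an addition of this sketch, not a description of the v0.3.0 code.

```rust
/// Returns the index of the codebook row closest to `v` by squared L2 distance.
/// Returns an error instead of panicking on an empty vector or codebook.
fn nearest_code(codebook: &[Vec<f32>], v: &[f32]) -> Result<usize, String> {
    if v.is_empty() {
        return Err("input vector has a zero-length last dimension".to_string());
    }
    let mut best: Option<(usize, f32)> = None;
    for (i, row) in codebook.iter().enumerate() {
        if row.len() != v.len() {
            return Err(format!("code {i} has dim {}, expected {}", row.len(), v.len()));
        }
        // Squared L2 distance: sum of squared component differences.
        let d: f32 = row.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
        if best.map_or(true, |(_, best_d)| d < best_d) {
            best = Some((i, d));
        }
    }
    best.map(|(i, _)| i).ok_or_else(|| "empty codebook".to_string())
}

fn main() {
    // Toy 3x2 codebook; the real one is described above as 512x512.
    let codebook = vec![vec![0.0, 0.0], vec![1.0, 1.0], vec![2.0, 2.0]];
    assert_eq!(nearest_code(&codebook, &[0.9, 1.1]), Ok(1));
    assert!(nearest_code(&codebook, &[]).is_err());
    println!("ok");
}
```

The square root is skipped because it does not change which code is nearest.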

GCP GPU scripts — scripts/gcp/:
- provision_training.sh: a2-highgpu-8g (8×A100 40GB) for Phase 5 retraining
- run_training.sh: rsync + torchrun 8-GPU train + checkpoint download
- provision_cosmos.sh: a2-ultragpu-1g (A100 80GB) for Cosmos evaluation
- cosmos_eval.sh: run Cosmos-Transfer2.5 inference, download results
- teardown.sh: safe checkpoint download + instance delete

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-29 20:52:51 -04:00