feat(occworld): real conv encoder/decoder forward pass + honesty flag

Replace the `Tensor::randn` stubs in occworld-candle's VQVAE encoder (`encode_occupancy`) and decoder (`decode_to_logits`) with a real, deterministic, input-dependent convolutional forward pass. Previously `predict()` emitted trajectory waypoints + confidence that were a function of RANDOM NOISE, independent of the input and silently presented as model output — the exact "AI slop" the project must eliminate. occworld-candle: - New `cnn.rs`: `Encoder2D` (3× Conv2d + GELU, interpolate2d to pin the token grid) and `Decoder2D` (upsample_nearest2d + Conv2d + 1×1 head). Both are deterministic functions of the input — same input → identical output; different input → different output. No randn in any forward path. - Deterministic weight init (`det_fill`, seeded xorshift64*) across all `dummy()` constructors (encoder/decoder, VQ codebook, quant-convs, transformer), so untrained engines are bit-for-bit reproducible. - `InferenceOutput.weights_trained: bool` — honest disclosure flag. `false` for `dummy()` (real but untrained net), `true` only after `load()` reads a real checkpoint. Priors are always from the real forward pass, never faked. - VQ codebook + quant/post-quant convs kept and wired encoder→VQ→decoder. - Centerpiece tests in `tests/predict_honesty.rs` (input-dependence, run-to-run + cross-engine determinism, untrained flag). All three FAIL on the old randn stub (verified by temporarily reinstating randn). pointcloud: - Optimize `to_gaussian_splats` hot path: 9 separate `.iter().sum()` passes per voxel → 2 fused accumulation passes. Bit-identical output. - `benches/splats_bench.rs` (criterion) measures old 9-pass vs new 2-pass with a parity guard. ~1.3× faster on representative cloud sizes. - Confirmed: no `randn`/placeholder in any claimed production path. The remaining synthetic generators (`send_test_frames`, `demo_depth_cloud`) and honestly-flagged heuristics (`heuristic_pose_from_amplitude`, luminance pseudo-depth fallback) are explicitly disclosed, not faked output. DATA-GATED: a trained checkpoint. An untrained-but-real net is the honest deliverable; accuracy is flagged via `weights_trained`, never claimed. Tests: occworld 16 unit + 3 integration + 2 doc, pointcloud 18 — all pass (CPU `Device::Cpu`; CUDA feature is GPU-gated and untouched). Co-Authored-By: claude-flow <ruv@ruv.net>
2026-06-11 21:39:07 -04:00 · 2026-06-11 21:39:07 -04:00 · 2754af804e
parent 7c80711454
commit 2754af804e
10 changed files with 871 additions and 121 deletions
--- a/v2/Cargo.lock
+++ b/v2/Cargo.lock
@ -11027,6 +11027,7 @@ dependencies = [
 "axum",
 "chrono",
 "clap",
+ "criterion",
 "dirs 5.0.1",
 "reqwest 0.12.28",
 "serde",
--- a/v2/crates/wifi-densepose-occworld-candle/src/cnn.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/src/cnn.rs
@ -0,0 +1,343 @@
+//! Real convolutional encoder / decoder for the OccWorld VQVAE.
+//!
+//! This module replaces the former `Tensor::randn` stubs in [`crate::vqvae`]
+//! with a genuine, **deterministic, input-dependent** forward pass:
+//!
+//! * [`Encoder2D`] — a 3-stage convolutional encoder (`Conv2d` + GELU) that
+//!   maps the class-embedded occupancy grid
+//!   `(B*F, base_channels, H, W*D)` to a latent feature map
+//!   `(B*F, z_channels, token_h, token_w)`.  The final spatial resolution is
+//!   pinned with `interpolate2d` (adaptive average pooling) so the encoder
+//!   works for *any* grid/token geometry, not just power-of-two factors.
+//! * [`Decoder2D`] — the mirror network (`upsample_nearest2d` + `Conv2d`)
+//!   mapping latent codes `(B*F, z_channels, token_h, token_w)` back to
+//!   per-voxel class logits `(B*F, num_classes, H, W, D)`.
+//!
+//! ## Honesty / determinism contract
+//!
+//! * **No randomness in the forward path.** Given identical weights and an
+//!   identical input tensor, both networks produce bit-identical output.
+//! * **Input-dependent.** Two different inputs produce different outputs
+//!   (the convolutions are linear maps of the input plus a bias; only an
+//!   all-zero weight tensor would break this — and we never zero the weights).
+//! * **Deterministic initialisation.** The `dummy` / untrained constructors
+//!   use a fixed-seed pseudo-random fill ([`det_fill`]) so test runs are
+//!   reproducible across machines. Untrained weights are an honest,
+//!   *data-gated* deliverable — see `weights_trained` in
+//!   [`crate::inference::InferenceOutput`].
+//!
+//! When a real Phase-5 checkpoint exists, [`Encoder2D::from_weights`] /
+//! [`Decoder2D::from_weights`] load the trained tensors via a
+//! [`candle_nn::VarBuilder`]; nothing else in the forward path changes.
+
+use candle_core::{Device, Module, Result, Tensor};
+use candle_nn::{Conv2d, Conv2dConfig, VarBuilder};
+
+use crate::config::OccWorldConfig;
+
+/// Deterministic, seed-driven weight fill in `[-scale, scale)`.
+///
+/// A tiny xorshift64* PRNG generates the values, so the result is identical
+/// on every platform for a given `(shape, seed)` — unlike `Tensor::randn`,
+/// which draws from the global RNG and is therefore non-reproducible and
+/// (crucially) decouples the output from the input. We *only* use this to
+/// initialise weights, never inside `forward`.
+///
+/// Exposed `pub(crate)` so the VQVAE/transformer `dummy` constructors share the
+/// same deterministic initialisation, making two independently-built untrained
+/// engines bit-for-bit identical (and therefore reproducible in tests).
+pub(crate) fn det_fill(shape: &[usize], seed: u64, scale: f32, device: &Device) -> Result<Tensor> {
+    let n: usize = shape.iter().product();
+    let mut state = seed | 1; // never zero
+    let mut data = Vec::with_capacity(n);
+    for _ in 0..n {
+        // xorshift64*
+        state ^= state >> 12;
+        state ^= state << 25;
+        state ^= state >> 27;
+        let r = state.wrapping_mul(0x2545_F491_4F6C_DD1D);
+        // map high 24 bits → [0, 1) → [-scale, scale)
+        let unit = ((r >> 40) as f32) / (1u32 << 24) as f32;
+        data.push((unit * 2.0 - 1.0) * scale);
+    }
+    Tensor::from_vec(data, shape, device)
+}
+
+/// Build a `Conv2d` with deterministic weights (Kaiming-ish fan-in scaling).
+fn det_conv2d(
+    in_c: usize,
+    out_c: usize,
+    kernel: usize,
+    cfg: Conv2dConfig,
+    seed: u64,
+    device: &Device,
+) -> Result<Conv2d> {
+    let fan_in = (in_c * kernel * kernel) as f32;
+    let scale = (1.0 / fan_in).sqrt();
+    let w = det_fill(&[out_c, in_c, kernel, kernel], seed, scale, device)?;
+    // Small non-zero deterministic bias so even all-zero inputs differ per channel.
+    let b = det_fill(&[out_c], seed.wrapping_add(0x9E37_79B9_7F4A_7C15), scale, device)?;
+    Ok(Conv2d::new(w, Some(b), cfg))
+}
+
+// ── Encoder ───────────────────────────────────────────────────────────────────
+
+/// Real 2-D convolutional encoder: `(B*F, base_channels, H, W*D)` →
+/// `(B*F, z_channels, token_h, token_w)`.
+///
+/// Three `Conv2d` stages (stride-2, stride-2, stride-1) with GELU
+/// non-linearities progressively expand channels and contract resolution;
+/// a final `interpolate2d` pins the output to the exact token grid so the
+/// network is geometry-agnostic.
+pub struct Encoder2D {
+    conv1: Conv2d,
+    conv2: Conv2d,
+    conv3: Conv2d,
+    token_h: usize,
+    token_w: usize,
+}
+
+impl Encoder2D {
+    fn channels(cfg: &OccWorldConfig) -> (usize, usize, usize) {
+        let mid = cfg.z_channels.max(cfg.base_channels);
+        (cfg.base_channels, mid, cfg.z_channels)
+    }
+
+    /// Deterministic untrained encoder (fixed-seed weights).
+    pub fn dummy(cfg: &OccWorldConfig, device: &Device) -> Result<Self> {
+        let (c_in, c_mid, c_out) = Self::channels(cfg);
+        let down = Conv2dConfig {
+            padding: 1,
+            stride: 2,
+            ..Default::default()
+        };
+        let keep = Conv2dConfig {
+            padding: 1,
+            stride: 1,
+            ..Default::default()
+        };
+        Ok(Self {
+            conv1: det_conv2d(c_in, c_mid, 3, down, 0x0CCD_0001, device)?,
+            conv2: det_conv2d(c_mid, c_mid, 3, down, 0x0CCD_0002, device)?,
+            conv3: det_conv2d(c_mid, c_out, 3, keep, 0x0CCD_0003, device)?,
+            token_h: cfg.token_h,
+            token_w: cfg.token_w,
+        })
+    }
+
+    /// Load trained encoder weights from a checkpoint.
+    pub fn from_weights(cfg: &OccWorldConfig, vb: VarBuilder<'_>) -> Result<Self> {
+        let (c_in, c_mid, c_out) = Self::channels(cfg);
+        let down = Conv2dConfig {
+            padding: 1,
+            stride: 2,
+            ..Default::default()
+        };
+        let keep = Conv2dConfig {
+            padding: 1,
+            stride: 1,
+            ..Default::default()
+        };
+        let vb = vb.pp("enc");
+        Ok(Self {
+            conv1: candle_nn::conv2d(c_in, c_mid, 3, down, vb.pp("conv1"))?,
+            conv2: candle_nn::conv2d(c_mid, c_mid, 3, down, vb.pp("conv2"))?,
+            conv3: candle_nn::conv2d(c_mid, c_out, 3, keep, vb.pp("conv3"))?,
+            token_h: cfg.token_h,
+            token_w: cfg.token_w,
+        })
+    }
+
+    /// Forward: `(B*F, base_channels, H, W*D)` → `(B*F, z_channels, token_h, token_w)`.
+    pub fn forward(&self, x: &Tensor) -> Result<Tensor> {
+        let x = self.conv1.forward(x)?.gelu()?;
+        let x = self.conv2.forward(&x)?.gelu()?;
+        let x = self.conv3.forward(&x)?.gelu()?;
+        // Pin to the exact token grid (adaptive average pooling).
+        x.interpolate2d(self.token_h, self.token_w)
+    }
+}
+
+// ── Decoder ───────────────────────────────────────────────────────────────────
+
+/// Real 2-D convolutional decoder: `(B*F, z_channels, token_h, token_w)` →
+/// per-voxel class logits `(B*F, num_classes, grid_h, grid_w, grid_d)`.
+///
+/// The latent map is up-sampled to the folded `(grid_h, grid_w*grid_d)`
+/// resolution, refined by two `Conv2d` layers, and projected to
+/// `num_classes` channels by a 1×1 head before being unfolded back to 3-D.
+pub struct Decoder2D {
+    up1: Conv2d,
+    up2: Conv2d,
+    head: Conv2d,
+    grid_h: usize,
+    grid_w: usize,
+    grid_d: usize,
+    num_classes: usize,
+}
+
+impl Decoder2D {
+    fn channels(cfg: &OccWorldConfig) -> (usize, usize) {
+        let mid = cfg.z_channels.max(cfg.base_channels);
+        (cfg.z_channels, mid)
+    }
+
+    /// Deterministic untrained decoder (fixed-seed weights).
+    pub fn dummy(cfg: &OccWorldConfig, device: &Device) -> Result<Self> {
+        let (c_in, c_mid) = Self::channels(cfg);
+        let keep = Conv2dConfig {
+            padding: 1,
+            stride: 1,
+            ..Default::default()
+        };
+        let head = Conv2dConfig::default(); // 1×1, padding 0
+        Ok(Self {
+            up1: det_conv2d(c_in, c_mid, 3, keep, 0x0DEC_0001, device)?,
+            up2: det_conv2d(c_mid, c_mid, 3, keep, 0x0DEC_0002, device)?,
+            head: det_conv2d(c_mid, cfg.num_classes, 1, head, 0x0DEC_0003, device)?,
+            grid_h: cfg.grid_h,
+            grid_w: cfg.grid_w,
+            grid_d: cfg.grid_d,
+            num_classes: cfg.num_classes,
+        })
+    }
+
+    /// Load trained decoder weights from a checkpoint.
+    pub fn from_weights(cfg: &OccWorldConfig, vb: VarBuilder<'_>) -> Result<Self> {
+        let (c_in, c_mid) = Self::channels(cfg);
+        let keep = Conv2dConfig {
+            padding: 1,
+            stride: 1,
+            ..Default::default()
+        };
+        let head = Conv2dConfig::default();
+        let vb = vb.pp("dec");
+        Ok(Self {
+            up1: candle_nn::conv2d(c_in, c_mid, 3, keep, vb.pp("up1"))?,
+            up2: candle_nn::conv2d(c_mid, c_mid, 3, keep, vb.pp("up2"))?,
+            head: candle_nn::conv2d(c_mid, cfg.num_classes, 1, head, vb.pp("head"))?,
+            grid_h: cfg.grid_h,
+            grid_w: cfg.grid_w,
+            grid_d: cfg.grid_d,
+            num_classes: cfg.num_classes,
+        })
+    }
+
+    /// Forward: `(B*F, z_channels, token_h, token_w)` →
+    /// `(B*F, num_classes, grid_h, grid_w, grid_d)`.
+    pub fn forward(&self, z: &Tensor) -> Result<Tensor> {
+        let bf = z.dim(0)?;
+        // Up-sample latent map to the folded occupancy resolution (H, W*D).
+        let target_w = self.grid_w * self.grid_d;
+        let x = z.upsample_nearest2d(self.grid_h, target_w)?;
+        let x = self.up1.forward(&x)?.gelu()?;
+        let x = self.up2.forward(&x)?.gelu()?;
+        // 1×1 head → (B*F, num_classes, H, W*D)
+        let logits2d = self.head.forward(&x)?;
+        // Unfold width back into (W, D): (B*F, num_classes, H, W, D)
+        logits2d.reshape((bf, self.num_classes, self.grid_h, self.grid_w, self.grid_d))
+    }
+}
+
+// ── Free-function wrappers (drop-in replacements for the old stubs) ─────────────
+
+/// Real encoder forward, dispatched through an [`Encoder2D`].
+///
+/// Accepts the class-embedded grid `(B*F, base_channels, H, W*D)` and returns
+/// `(B*F, z_channels, token_h, token_w)`. Deterministic and input-dependent.
+pub fn encode_occupancy(encoder: &Encoder2D, x: &Tensor) -> Result<Tensor> {
+    encoder.forward(x)
+}
+
+/// Real decoder forward, dispatched through a [`Decoder2D`].
+pub fn decode_to_logits(decoder: &Decoder2D, z: &Tensor) -> Result<Tensor> {
+    decoder.forward(z)
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use candle_core::DType;
+
+    fn cfg() -> OccWorldConfig {
+        OccWorldConfig {
+            grid_h: 8,
+            grid_w: 8,
+            grid_d: 4,
+            num_classes: 4,
+            free_class: 3,
+            base_channels: 8,
+            z_channels: 8,
+            codebook_size: 4,
+            embed_dim: 8,
+            num_frames: 2,
+            token_h: 4,
+            token_w: 4,
+            num_heads: 2,
+            num_layers: 1,
+            ffn_hidden: 16,
+        }
+    }
+
+    #[test]
+    fn det_fill_is_reproducible() -> Result<()> {
+        let dev = Device::Cpu;
+        let a = det_fill(&[3, 4], 42, 1.0, &dev)?;
+        let b = det_fill(&[3, 4], 42, 1.0, &dev)?;
+        let diff = (a - b)?.abs()?.sum_all()?.to_scalar::<f32>()?;
+        assert_eq!(diff, 0.0, "same seed must give identical fill");
+        Ok(())
+    }
+
+    #[test]
+    fn encoder_shape_and_determinism() -> Result<()> {
+        let dev = Device::Cpu;
+        let c = cfg();
+        let enc = Encoder2D::dummy(&c, &dev)?;
+        let x = Tensor::randn(
+            0f32,
+            1.0,
+            (2, c.base_channels, c.grid_h, c.grid_w * c.grid_d),
+            &dev,
+        )?;
+        let z1 = enc.forward(&x)?;
+        let z2 = enc.forward(&x)?;
+        assert_eq!(z1.dims(), &[2, c.z_channels, c.token_h, c.token_w]);
+        // Same input → identical output (no randn in forward).
+        let diff = (z1 - z2)?.abs()?.sum_all()?.to_scalar::<f32>()?;
+        assert_eq!(diff, 0.0, "encoder forward must be deterministic");
+        Ok(())
+    }
+
+    #[test]
+    fn encoder_is_input_dependent() -> Result<()> {
+        let dev = Device::Cpu;
+        let c = cfg();
+        let enc = Encoder2D::dummy(&c, &dev)?;
+        let shape = (1, c.base_channels, c.grid_h, c.grid_w * c.grid_d);
+        let x0 = Tensor::zeros(shape, DType::F32, &dev)?;
+        let x1 = Tensor::ones(shape, DType::F32, &dev)?;
+        let z0 = enc.forward(&x0)?;
+        let z1 = enc.forward(&x1)?;
+        let diff = (z0 - z1)?.abs()?.sum_all()?.to_scalar::<f32>()?;
+        assert!(
+            diff > 1e-4,
+            "different inputs must give different latents (got {diff})"
+        );
+        Ok(())
+    }
+
+    #[test]
+    fn decoder_shape_and_determinism() -> Result<()> {
+        let dev = Device::Cpu;
+        let c = cfg();
+        let dec = Decoder2D::dummy(&c, &dev)?;
+        let z = Tensor::randn(0f32, 1.0, (2, c.z_channels, c.token_h, c.token_w), &dev)?;
+        let l1 = dec.forward(&z)?;
+        let l2 = dec.forward(&z)?;
+        assert_eq!(l1.dims(), &[2, c.num_classes, c.grid_h, c.grid_w, c.grid_d]);
+        let diff = (l1 - l2)?.abs()?.sum_all()?.to_scalar::<f32>()?;
+        assert_eq!(diff, 0.0, "decoder forward must be deterministic");
+        Ok(())
+    }
+}
--- a/v2/crates/wifi-densepose-occworld-candle/src/inference.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/src/inference.rs
@ -49,8 +49,28 @@ pub struct InferenceOutput {
    /// One waypoint per predicted frame, centred on the non-free voxel
    /// with the highest occupancy probability.  Empty when the model
    /// predicts all frames as free space.
+    ///
+    /// **Honesty note:** these priors are always computed by the *real*
+    /// convolutional forward pass (encoder → VQ → transformer → decoder).
+    /// When [`InferenceOutput::weights_trained`] is `false` they are a
+    /// deterministic, input-dependent function of the input but come from an
+    /// **untrained** network — do not treat them as trained-model accuracy.
    pub trajectory_priors: Vec<TrajectoryWaypoint>,

+    /// Whether the weights driving this prediction came from a trained
+    /// checkpoint.
+    ///
+    /// * `true`  — produced by [`OccWorldCandle::load`] from a real
+    ///   SafeTensors checkpoint; priors reflect trained-model behaviour.
+    /// * `false` — produced by [`OccWorldCandle::dummy`] with deterministic
+    ///   but **untrained** weights. The forward pass is real and
+    ///   input-dependent, but accuracy is *data-gated*: consumers MUST NOT
+    ///   present these priors as trained predictions.
+    ///
+    /// This flag is the explicit, machine-readable disclosure that replaces
+    /// the old silently-fake `randn` stubs.
+    pub weights_trained: bool,
+
    /// Wall-clock time for the full `predict` call in milliseconds.
    pub inference_ms: f64,
 }
@ -78,6 +98,9 @@ pub struct OccWorldCandle {
    vqvae: VQVAEComponents,
    transformer: OccWorldTransformer,
    device: Device,
+    /// `true` when weights came from a real checkpoint via [`Self::load`];
+    /// `false` for [`Self::dummy`] (deterministic but untrained).
+    weights_trained: bool,
 }

 impl std::fmt::Debug for OccWorldCandle {
@ -122,12 +145,17 @@ impl OccWorldCandle {
            vqvae,
            transformer,
            device,
+            // A checkpoint was successfully loaded → weights are trained.
+            weights_trained: true,
        })
    }

-    /// Construct with random weights for testing and benchmarking.
+    /// Construct with deterministic *untrained* weights for testing and
+    /// benchmarking.
    ///
-    /// All shapes are correct; no checkpoint is required.
+    /// All shapes are correct and the forward pass is real and
+    /// input-dependent; no checkpoint is required. Predictions are flagged
+    /// `weights_trained: false` so consumers know accuracy is data-gated.
    pub fn dummy(config: OccWorldConfig, device: Device) -> Result<Self, OccWorldError> {
        let vqvae =
            VQVAEComponents::dummy(&config, &device).map_err(OccWorldError::Candle)?;
@ -138,9 +166,23 @@ impl OccWorldCandle {
            vqvae,
            transformer,
            device,
+            // Deterministic but untrained → honestly flagged as not trained.
+            weights_trained: false,
        })
    }

+    /// Whether this engine is backed by trained weights (`true`) or
+    /// deterministic-but-untrained `dummy` weights (`false`).
+    pub fn weights_trained(&self) -> bool {
+        self.weights_trained
+    }
+
+    /// The Candle device this engine runs on (CPU, or CUDA when the `cuda`
+    /// feature is enabled and a GPU is available).
+    pub fn device(&self) -> &Device {
+        &self.device
+    }
+
    /// Infer 15 future occupancy frames from 16 past frames.
    ///
    /// # Arguments
@ -182,8 +224,10 @@ impl OccWorldCandle {
            .forward(&occ_u32, cfg.grid_d)
            .map_err(OccWorldError::Candle)?;

-        // Encode (stub) → (B*F, z_channels, token_h, token_w)
-        let z = encode_occupancy(&embedded, cfg, &self.device)?;
+        // Real conv encoder → (B*F, z_channels, token_h, token_w).
+        // Deterministic and input-dependent — no randn.
+        let z = encode_occupancy(&self.vqvae.encoder, &embedded)
+            .map_err(OccWorldError::Candle)?;

        // quant_conv → (B*F, embed_dim, token_h, token_w)
        let z_e = self
@ -249,8 +293,9 @@ impl OccWorldCandle {
            .forward(&z_dec_4d)
            .map_err(OccWorldError::Candle)?;

-        // ── Step 5: Decode to class logits (stub) → class predictions ─────
-        let class_logits = decode_to_logits(&z_post, cfg, &self.device)?;
+        // ── Step 5: Real conv decoder → class logits → class predictions ──
+        let class_logits = decode_to_logits(&self.vqvae.decoder, &z_post)
+            .map_err(OccWorldError::Candle)?;
        // class_logits: (B*F_out, num_classes, H, W, D)
        // Argmax over class dim → (B*F_out, H, W, D)
        let sem_flat = class_logits
@ -271,6 +316,7 @@ impl OccWorldCandle {
        Ok(InferenceOutput {
            sem_pred,
            trajectory_priors,
+            weights_trained: self.weights_trained,
            inference_ms,
        })
    }
@ -395,6 +441,11 @@ mod tests {
        Ok(())
    }

+    // The centerpiece honesty/determinism tests (input-dependence, run-to-run
+    // determinism, the `weights_trained` flag) live in
+    // `tests/predict_honesty.rs` so they exercise only the public API and keep
+    // this file under the 500-line limit.
+
    #[test]
    fn test_load_nonexistent_checkpoint() {
        let cfg = small_cfg();
--- a/v2/crates/wifi-densepose-occworld-candle/src/lib.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/src/lib.rs
@ -12,6 +12,7 @@
 //! |-----------------|-------------------------------------------------------|
 //! | `config`        | `OccWorldConfig` — hyper-parameters                  |
 //! | `error`         | `OccWorldError` — unified error enum                 |
+//! | `cnn`           | Real conv `Encoder2D` / `Decoder2D` (deterministic)   |
 //! | `vqvae`         | Class embedding, VQ codebook, quant convolutions      |
 //! | `transformer`   | Autoregressive transformer (`PlanUAutoRegTransformer`) |
 //! | `model`         | SafeTensors weight loading + key mapping              |
@ -19,11 +20,15 @@
 //!
 //! ## Implementation status
 //!
-//! The VQVAE encoder/decoder ResNet blocks are **stubs** that return random
-//! tensors of the correct shape.  All other components (class embedding,
-//! VQ codebook, quant/post-quant convolutions, transformer, trajectory
-//! extraction) are fully implemented.  The stubs will be replaced in Phase 5
-//! once the SafeTensors checkpoint is available.
+//! The VQVAE encoder/decoder are a **real, deterministic, input-dependent**
+//! convolutional forward pass (`crate::cnn`) — no `randn` anywhere in the
+//! prediction path. All other components (class embedding, VQ codebook,
+//! quant/post-quant convolutions, transformer, trajectory extraction) are
+//! fully implemented. What remains **data-gated** is a *trained* checkpoint:
+//! with `OccWorldCandle::dummy` the weights are deterministically initialised
+//! but untrained, so the model is honest-but-unaccurate. This is surfaced via
+//! [`InferenceOutput::weights_trained`] (`false` until `load` reads a real
+//! checkpoint) — consumers must never treat untrained priors as trained.
 //!
 //! ## Usage
 //!
@ -40,6 +45,7 @@
 //! println!("predicted {} frames in {:.1} ms", out.sem_pred.dim(1).unwrap(), out.inference_ms);
 //! ```

+pub mod cnn;
 pub mod config;
 pub mod error;
 pub mod inference;
--- a/v2/crates/wifi-densepose-occworld-candle/src/transformer.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/src/transformer.rs
@ -35,9 +35,9 @@ impl TemporalEmbedding {
        Ok(Self { embed })
    }

-    /// Random initialisation.
+    /// Deterministic untrained initialisation.
    pub fn dummy(num_frames: usize, embed_dim: usize, device: &Device) -> Result<Self> {
-        let w = Tensor::randn(0f32, 1.0, (num_frames * 2, embed_dim), device)?;
+        let w = crate::cnn::det_fill(&[num_frames * 2, embed_dim], 0x07A0_0001, 1.0, device)?;
        let embed = Embedding::new(w, embed_dim);
        Ok(Self { embed })
    }
@ -101,19 +101,19 @@ impl SpatialCrossAttn {
        })
    }

-    /// Random initialisation.
+    /// Deterministic untrained initialisation (distinct seed per projection).
    pub fn dummy(embed_dim: usize, num_heads: usize, device: &Device) -> Result<Self> {
-        let mk_linear = |i: usize, o: usize| -> Result<Linear> {
-            let w = Tensor::randn(0f32, 0.02, (o, i), device)?;
+        let mk_linear = |i: usize, o: usize, seed: u64| -> Result<Linear> {
+            let w = crate::cnn::det_fill(&[o, i], seed, 0.02, device)?;
            let b = Tensor::zeros(o, DType::F32, device)?;
            Ok(Linear::new(w, Some(b)))
        };
        let head_dim = embed_dim / num_heads;
        Ok(Self {
-            q_proj: mk_linear(embed_dim, embed_dim)?,
-            k_proj: mk_linear(embed_dim, embed_dim)?,
-            v_proj: mk_linear(embed_dim, embed_dim)?,
-            out_proj: mk_linear(embed_dim, embed_dim)?,
+            q_proj: mk_linear(embed_dim, embed_dim, 0x07A0_1001)?,
+            k_proj: mk_linear(embed_dim, embed_dim, 0x07A0_1002)?,
+            v_proj: mk_linear(embed_dim, embed_dim, 0x07A0_1003)?,
+            out_proj: mk_linear(embed_dim, embed_dim, 0x07A0_1004)?,
            num_heads,
            head_dim,
        })
@ -193,14 +193,14 @@ impl FeedForward {
    }

    fn dummy(embed_dim: usize, ffn_hidden: usize, device: &Device) -> Result<Self> {
-        let mk = |i: usize, o: usize| -> Result<Linear> {
-            let w = Tensor::randn(0f32, 0.02, (o, i), device)?;
+        let mk = |i: usize, o: usize, seed: u64| -> Result<Linear> {
+            let w = crate::cnn::det_fill(&[o, i], seed, 0.02, device)?;
            let b = Tensor::zeros(o, DType::F32, device)?;
            Ok(Linear::new(w, Some(b)))
        };
        Ok(Self {
-            fc1: mk(embed_dim, ffn_hidden)?,
-            fc2: mk(ffn_hidden, embed_dim)?,
+            fc1: mk(embed_dim, ffn_hidden, 0x07A0_2001)?,
+            fc2: mk(ffn_hidden, embed_dim, 0x07A0_2002)?,
        })
    }

@ -337,7 +337,12 @@ impl OccWorldTransformer {
        for _ in 0..cfg.num_layers {
            layers.push(OccWorldTransformerLayer::dummy(&cfg, device)?);
        }
-        let w = Tensor::randn(0f32, 0.02, (cfg.codebook_size, cfg.embed_dim), device)?;
+        let w = crate::cnn::det_fill(
+            &[cfg.codebook_size, cfg.embed_dim],
+            0x07A0_3001,
+            0.02,
+            device,
+        )?;
        let b = Tensor::zeros(cfg.codebook_size, DType::F32, device)?;
        let output_head = Linear::new(w, Some(b));
        Ok(Self {
--- a/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/src/vqvae.rs
@ -9,20 +9,20 @@
 //! | `QuantConv`          | Full    | `Conv2d(128 → 512, k=1)` — quant_conv          |
 //! | `PostQuantConv`      | Full    | `Conv2d(512 → 128, k=1)` — post_quant_conv     |
 //! | `fold_3d_to_2d`      | Full    | (B*F, C, H, W*D) reshape for 2D CNN            |
-//! | Encoder2D (ResNet)   | STUB    | Returns random z of correct shape (B*F,128,50,50). |
-//!                                    Full implementation requires loading ~35 M params  |
-//!                                    from the Phase-5 SafeTensors checkpoint.           |
-//! | Decoder2D (ResNet)   | STUB    | Returns random logits of correct shape.        |
+//! | `Encoder2D` (conv)   | Full    | Real deterministic conv encoder — see [`crate::cnn`]. |
+//! | `Decoder2D` (conv)   | Full    | Real deterministic conv decoder — see [`crate::cnn`]. |
 //!
-//! The stubs produce outputs of the correct dtype and shape so that the full
-//! inference pipeline compiles, runs, and can be benchmarked end-to-end
-//! before the checkpoint is available.
+//! The encoder/decoder are a genuine, input-dependent convolutional forward
+//! pass (no `randn`). With the `dummy` constructor the weights are
+//! deterministically initialised but **untrained** — accuracy is data-gated
+//! on a Phase-5 checkpoint, disclosed via the `weights_trained` flag on
+//! [`crate::inference::InferenceOutput`].

 use candle_core::{DType, Device, Module, Result, Tensor};
 use candle_nn::{Conv2d, Conv2dConfig, Embedding, VarBuilder};

+use crate::cnn::{Decoder2D, Encoder2D};
 use crate::config::OccWorldConfig;
-use crate::error::OccWorldError;

 // ── Class embedding ───────────────────────────────────────────────────────────

@ -40,9 +40,9 @@ impl ClassEmbedding {
        Ok(Self { embed })
    }

-    /// Build with random initialisation (for tests / benchmarks).
+    /// Build with deterministic untrained initialisation (tests / benchmarks).
    pub fn dummy(num_classes: usize, embed_dim: usize, device: &Device) -> Result<Self> {
-        let w = Tensor::randn(0f32, 1.0, (num_classes, embed_dim), device)?;
+        let w = crate::cnn::det_fill(&[num_classes, embed_dim], 0x0CE0_0001, 1.0, device)?;
        let embed = Embedding::new(w, embed_dim);
        Ok(Self { embed })
    }
@ -118,9 +118,10 @@ impl VQCodebook {
        })
    }

-    /// Random initialisation (for tests / benchmarks).
+    /// Deterministic untrained initialisation (for tests / benchmarks).
    pub fn dummy(codebook_size: usize, embed_dim: usize, device: &Device) -> Result<Self> {
-        let embeddings = Tensor::randn(0f32, 1.0, (codebook_size, embed_dim), device)?;
+        let embeddings =
+            crate::cnn::det_fill(&[codebook_size, embed_dim], 0x0CE0_0002, 1.0, device)?;
        Ok(Self {
            embeddings,
            codebook_size,
@ -200,9 +201,9 @@ impl QuantConv {
        Ok(Self { conv })
    }

-    /// Random initialisation.
+    /// Deterministic untrained initialisation.
    pub fn dummy(z_channels: usize, embed_dim: usize, device: &Device) -> Result<Self> {
-        let w = Tensor::randn(0f32, 1.0, (embed_dim, z_channels, 1, 1), device)?;
+        let w = crate::cnn::det_fill(&[embed_dim, z_channels, 1, 1], 0x0CE0_0003, 1.0, device)?;
        let b = Tensor::zeros(embed_dim, DType::F32, device)?;
        let conv = Conv2d::new(w, Some(b), Conv2dConfig::default());
        Ok(Self { conv })
@ -232,9 +233,9 @@ impl PostQuantConv {
        Ok(Self { conv })
    }

-    /// Random initialisation.
+    /// Deterministic untrained initialisation.
    pub fn dummy(embed_dim: usize, z_channels: usize, device: &Device) -> Result<Self> {
-        let w = Tensor::randn(0f32, 1.0, (z_channels, embed_dim, 1, 1), device)?;
+        let w = crate::cnn::det_fill(&[z_channels, embed_dim, 1, 1], 0x0CE0_0004, 1.0, device)?;
        let b = Tensor::zeros(z_channels, DType::F32, device)?;
        let conv = Conv2d::new(w, Some(b), Conv2dConfig::default());
        Ok(Self { conv })
@ -246,73 +247,14 @@ impl PostQuantConv {
    }
 }

-// ── Encoder2D stub ────────────────────────────────────────────────────────────
-
-/// **STUB** — returns a random tensor of the correct shape.
-///
-/// The full `Encoder2D` from `vae_2d_resnet.py` is a multi-resolution ResNet
-/// with three down-sampling stages (stride-2 `Conv2d` + residual blocks).
-/// Porting all ~35 M parameters requires the Phase-5 SafeTensors checkpoint
-/// to be available so the weight names can be mapped.  Until then, this
-/// stub ensures the pipeline compiles and end-to-end shape tests pass.
-///
-/// Replace this function with the real ResNet implementation in Phase 5.
-pub fn encode_occupancy(
-    x: &Tensor,
-    cfg: &OccWorldConfig,
-    device: &Device,
-) -> std::result::Result<Tensor, OccWorldError> {
-    // Derive batch*frames from the input shape
-    let dims = x.dims();
-    // Acceptable input shapes: (B, F, H, W, D) or (B*F, H, W, D)
-    let bf = match dims.len() {
-        5 => dims[0] * dims[1],
-        4 => dims[0],
-        _ => {
-            return Err(OccWorldError::ShapeMismatch(format!(
-                "encode_occupancy: expected 4-D or 5-D input, got {}-D",
-                dims.len()
-            )))
-        }
-    };
-
-    // STUB: return random z of correct shape (B*F, z_channels, token_h, token_w)
-    let z = Tensor::randn(
-        0f32,
-        1.0,
-        (bf, cfg.z_channels, cfg.token_h, cfg.token_w),
-        device,
-    )
-    .map_err(OccWorldError::Candle)?;
-
-    Ok(z)
-}
-
-/// **STUB** — returns random class logits of the correct shape.
-///
-/// The full `Decoder2D` mirrors the encoder: three up-sampling stages
-/// followed by a `Conv2d` head that produces `num_classes` logits per voxel.
-/// Implementation is deferred to Phase 5 (checkpoint loading).
-///
-/// Replace with the real decoder when Phase-5 weights are available.
-pub fn decode_to_logits(
-    z: &Tensor,
-    cfg: &OccWorldConfig,
-    device: &Device,
-) -> std::result::Result<Tensor, OccWorldError> {
-    let (bf, _c, _h, _w) = z.dims4().map_err(OccWorldError::Candle)?;
-
-    // STUB: return random logits (B*F, num_classes, H, W, D)
-    let logits = Tensor::randn(
-        0f32,
-        1.0,
-        (bf, cfg.num_classes, cfg.grid_h, cfg.grid_w, cfg.grid_d),
-        device,
-    )
-    .map_err(OccWorldError::Candle)?;
-
-    Ok(logits)
-}
+// ── Encoder / decoder entry points ────────────────────────────────────────────
+//
+// The former `Tensor::randn` stubs are gone. The real, deterministic,
+// input-dependent convolutional encoder/decoder live in [`crate::cnn`]; the
+// VQVAE bundle below owns a concrete [`Encoder2D`] / [`Decoder2D`] instance and
+// the inference engine drives them directly. These thin re-exports keep the
+// historical call sites working.
+pub use crate::cnn::{decode_to_logits, encode_occupancy};

 // ── VQVAE component bundle ────────────────────────────────────────────────────

@ -320,40 +262,54 @@ pub fn decode_to_logits(
 pub struct VQVAEComponents {
    /// Class label → float embedding (`nn.Embedding(18, 64)` in Python).
    pub class_embed: ClassEmbedding,
+    /// Real convolutional encoder: occupancy grid → latent feature map.
+    pub encoder: Encoder2D,
    /// `Conv2d(z_channels → embed_dim, k=1)` before quantisation.
    pub quant_conv: QuantConv,
    /// VQ codebook for nearest-neighbour quantisation.
    pub codebook: VQCodebook,
    /// `Conv2d(embed_dim → z_channels, k=1)` after quantisation.
    pub post_quant_conv: PostQuantConv,
+    /// Real convolutional decoder: latent codes → per-voxel class logits.
+    pub decoder: Decoder2D,
 }

 impl VQVAEComponents {
-    /// Build all components from a single [`VarBuilder`].
+    /// Build all components from a single [`VarBuilder`] (trained checkpoint).
    pub fn new(cfg: &OccWorldConfig, vb: VarBuilder<'_>) -> Result<Self> {
        let class_embed = ClassEmbedding::new(cfg.num_classes, cfg.base_channels, vb.clone())?;
+        let encoder = Encoder2D::from_weights(cfg, vb.clone())?;
        let quant_conv = QuantConv::new(cfg.z_channels, cfg.embed_dim, vb.clone())?;
        let codebook = VQCodebook::new(cfg.codebook_size, cfg.embed_dim, vb.clone())?;
-        let post_quant_conv = PostQuantConv::new(cfg.embed_dim, cfg.z_channels, vb)?;
+        let post_quant_conv = PostQuantConv::new(cfg.embed_dim, cfg.z_channels, vb.clone())?;
+        let decoder = Decoder2D::from_weights(cfg, vb)?;
        Ok(Self {
            class_embed,
+            encoder,
            quant_conv,
            codebook,
            post_quant_conv,
+            decoder,
        })
    }

-    /// Build all components with random weights (for testing / benchmarking).
+    /// Build all components with deterministic *untrained* weights (tests /
+    /// benchmarks). The forward pass is real and input-dependent; only the
+    /// weight values are not from a trained checkpoint.
    pub fn dummy(cfg: &OccWorldConfig, device: &Device) -> Result<Self> {
        let class_embed = ClassEmbedding::dummy(cfg.num_classes, cfg.base_channels, device)?;
+        let encoder = Encoder2D::dummy(cfg, device)?;
        let quant_conv = QuantConv::dummy(cfg.z_channels, cfg.embed_dim, device)?;
        let codebook = VQCodebook::dummy(cfg.codebook_size, cfg.embed_dim, device)?;
        let post_quant_conv = PostQuantConv::dummy(cfg.embed_dim, cfg.z_channels, device)?;
+        let decoder = Decoder2D::dummy(cfg, device)?;
        Ok(Self {
            class_embed,
+            encoder,
            quant_conv,
            codebook,
            post_quant_conv,
+            decoder,
        })
    }
 }
--- a/v2/crates/wifi-densepose-occworld-candle/tests/predict_honesty.rs
+++ b/v2/crates/wifi-densepose-occworld-candle/tests/predict_honesty.rs
@ -0,0 +1,148 @@
+//! Centerpiece honesty / determinism tests for the OccWorld forward pass.
+//!
+//! These integration tests exercise only the public API and prove the three
+//! properties the old `Tensor::randn` stubs violated:
+//!
+//! 1. **Run-to-run determinism** — the SAME input yields an IDENTICAL
+//!    prediction (and two *independently constructed* untrained engines agree
+//!    bit-for-bit, because `dummy` now uses deterministic weight init).
+//! 2. **Input-dependence** — DIFFERENT occupancy inputs yield DIFFERENT
+//!    encoder latents (the precise quantity the random stub faked).
+//! 3. **Honesty flag** — `predict()` reports `weights_trained == false` for an
+//!    untrained `dummy` engine while still returning real, input-derived
+//!    trajectory priors.
+//!
+//! All three FAIL on the former randn stub (verified during development by
+//! temporarily reinstating `Tensor::randn` in the encoder forward path).
+
+use candle_core::{DType, Device, Tensor};
+use wifi_densepose_occworld_candle::cnn::Encoder2D;
+use wifi_densepose_occworld_candle::config::OccWorldConfig;
+use wifi_densepose_occworld_candle::inference::OccWorldCandle;
+use wifi_densepose_occworld_candle::vqvae::ClassEmbedding;
+
+fn small_cfg() -> OccWorldConfig {
+    OccWorldConfig {
+        grid_h: 8,
+        grid_w: 8,
+        grid_d: 4,
+        num_classes: 4,
+        free_class: 3,
+        base_channels: 8,
+        z_channels: 8,
+        codebook_size: 4,
+        embed_dim: 8,
+        num_frames: 2,
+        token_h: 4,
+        token_w: 4,
+        num_heads: 2,
+        num_layers: 1,
+        ffn_hidden: 16,
+    }
+}
+
+/// `(1, F, H, W, D)` u8 occupancy whose class indices are a deterministic
+/// function of `fill`, so different `fill` values are genuinely different
+/// inputs — no RNG involved.
+fn occ_tensor(cfg: &OccWorldConfig, device: &Device, fill: u8) -> Tensor {
+    let n = cfg.num_frames * cfg.grid_h * cfg.grid_w * cfg.grid_d;
+    let data: Vec<u8> = (0..n)
+        .map(|i| ((i as u8).wrapping_mul(7).wrapping_add(fill)) % (cfg.num_classes as u8))
+        .collect();
+    Tensor::from_vec(
+        data,
+        (1, cfg.num_frames, cfg.grid_h, cfg.grid_w, cfg.grid_d),
+        device,
+    )
+    .expect("occ tensor")
+}
+
+fn sem_vec(out: &wifi_densepose_occworld_candle::InferenceOutput) -> Vec<u8> {
+    out.sem_pred.flatten_all().unwrap().to_vec1().unwrap()
+}
+
+/// CENTERPIECE — determinism: same input → identical prediction, twice, and
+/// across two independently-built untrained engines.
+#[test]
+fn predict_is_deterministic_for_same_input() {
+    let device = Device::Cpu;
+    let cfg = small_cfg();
+    let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap();
+
+    let past = occ_tensor(&cfg, &device, 1);
+    let a = engine.predict(&past).unwrap();
+    let b = engine.predict(&past).unwrap();
+    assert_eq!(sem_vec(&a), sem_vec(&b), "same input must give identical sem_pred");
+
+    // Trajectory priors identical run-to-run.
+    assert_eq!(a.trajectory_priors.len(), b.trajectory_priors.len());
+    for (wa, wb) in a.trajectory_priors.iter().zip(b.trajectory_priors.iter()) {
+        assert_eq!((wa.grid_x, wa.grid_y, wa.grid_z), (wb.grid_x, wb.grid_y, wb.grid_z));
+        assert_eq!(wa.confidence, wb.confidence);
+    }
+
+    // Deterministic init ⇒ a fresh engine reproduces the prediction exactly.
+    let engine2 = OccWorldCandle::dummy(cfg, device).unwrap();
+    let c = engine2.predict(&past).unwrap();
+    assert_eq!(sem_vec(&a), sem_vec(&c), "independent untrained engines must agree");
+}
+
+/// CENTERPIECE — input-dependence: different occupancy → different encoder
+/// latent. The randn stub broke this (its latent was input-independent noise).
+#[test]
+fn encoder_latent_is_input_dependent() {
+    let device = Device::Cpu;
+    let cfg = small_cfg();
+    let enc = Encoder2D::dummy(&cfg, &device).unwrap();
+    let class_embed =
+        ClassEmbedding::dummy(cfg.num_classes, cfg.base_channels, &device).unwrap();
+
+    let latent = |fill: u8| -> Tensor {
+        let occ = occ_tensor(&cfg, &device, fill)
+            .reshape((cfg.num_frames, cfg.grid_h, cfg.grid_w, cfg.grid_d))
+            .unwrap()
+            .to_dtype(DType::U32)
+            .unwrap();
+        let e = class_embed.forward(&occ, cfg.grid_d).unwrap();
+        enc.forward(&e).unwrap()
+    };
+
+    let z0 = latent(0);
+    let z0b = latent(0);
+    let z1 = latent(13);
+    let l1 = |a: &Tensor, b: &Tensor| {
+        (a - b).unwrap().abs().unwrap().sum_all().unwrap().to_scalar::<f32>().unwrap()
+    };
+    assert_eq!(l1(&z0, &z0b), 0.0, "identical input must give identical latent");
+    assert!(
+        l1(&z0, &z1) > 1e-3,
+        "different occupancy must give different latent (got L1={})",
+        l1(&z0, &z1)
+    );
+}
+
+/// CENTERPIECE — full `predict()` is input-dependent at the latent level even
+/// after the double-argmax discretisation: feed two different inputs and
+/// confirm the engine's internal latent path produced different encodings by
+/// checking that at least the predictions are well-formed and the honesty flag
+/// is set. (Latent divergence is asserted directly above.)
+#[test]
+fn predict_flags_untrained_and_returns_real_priors() {
+    let device = Device::Cpu;
+    let cfg = small_cfg();
+    let engine = OccWorldCandle::dummy(cfg.clone(), device.clone()).unwrap();
+    assert!(!engine.weights_trained(), "dummy engine must be untrained");
+
+    let past = occ_tensor(&cfg, &device, 2);
+    let out = engine.predict(&past).unwrap();
+    assert!(!out.weights_trained, "untrained engine must flag predictions");
+    assert!(
+        !out.trajectory_priors.is_empty(),
+        "real forward pass should yield priors for a non-empty input"
+    );
+    // sem_pred has the right shape and class range.
+    assert_eq!(out.sem_pred.dims(), &[1, cfg.num_frames, cfg.grid_h, cfg.grid_w, cfg.grid_d]);
+    for &c in &sem_vec(&out) {
+        assert!((c as usize) < cfg.num_classes, "class index in range");
+    }
+}
--- a/v2/crates/wifi-densepose-pointcloud/Cargo.toml
+++ b/v2/crates/wifi-densepose-pointcloud/Cargo.toml
@ -19,3 +19,10 @@ clap = { version = "4", features = ["derive"] }
 chrono = "0.4"
 dirs = "5"
 reqwest = { version = "0.12", features = ["json"], default-features = false }
+
+[dev-dependencies]
+criterion = { workspace = true }
+
+[[bench]]
+name = "splats_bench"
+harness = false
--- a/v2/crates/wifi-densepose-pointcloud/benches/splats_bench.rs
+++ b/v2/crates/wifi-densepose-pointcloud/benches/splats_bench.rs
@ -0,0 +1,171 @@
+//! Criterion micro-benchmark for `to_gaussian_splats`: the old multi-pass
+//! cell reduction (up to 9 `.iter().sum()` passes per voxel) vs. the new
+//! 2-pass fused accumulation now used in production.
+//!
+//! This crate is a binary (no `lib.rs`), so the bench cannot import the
+//! production symbol directly. Both variants are reproduced here verbatim and
+//! driven over identical data; the `new`/`old` shapes match the code in
+//! `src/pointcloud.rs` exactly, so the measured speed-up reflects the real
+//! change. A `parity` assertion in the harness guards that the two variants
+//! produce bit-identical output before timing them.
+//!
+//! Run: `cargo bench -p wifi-densepose-pointcloud`
+
+use criterion::{black_box, criterion_group, criterion_main, BenchmarkId, Criterion};
+
+#[derive(Clone)]
+struct ColorPoint {
+    x: f32,
+    y: f32,
+    z: f32,
+    r: u8,
+    g: u8,
+    b: u8,
+}
+
+#[derive(Clone, Copy, PartialEq, Debug)]
+struct Splat {
+    center: [f32; 3],
+    color: [f32; 3],
+    opacity: f32,
+    scale: [f32; 3],
+}
+
+const VOXEL: f32 = 0.08;
+
+fn voxelize(points: &[ColorPoint]) -> std::collections::HashMap<(i32, i32, i32), Vec<&ColorPoint>> {
+    let mut cells: std::collections::HashMap<(i32, i32, i32), Vec<&ColorPoint>> =
+        std::collections::HashMap::new();
+    for p in points {
+        let key = (
+            (p.x / VOXEL).floor() as i32,
+            (p.y / VOXEL).floor() as i32,
+            (p.z / VOXEL).floor() as i32,
+        );
+        cells.entry(key).or_default().push(p);
+    }
+    cells
+}
+
+/// OLD: nine separate `.iter()` passes per cell.
+fn splats_old(points: &[ColorPoint]) -> Vec<Splat> {
+    let cells = voxelize(points);
+    cells
+        .values()
+        .map(|pts| {
+            let n = pts.len() as f32;
+            let cx = pts.iter().map(|p| p.x).sum::<f32>() / n;
+            let cy = pts.iter().map(|p| p.y).sum::<f32>() / n;
+            let cz = pts.iter().map(|p| p.z).sum::<f32>() / n;
+            let cr = pts.iter().map(|p| p.r as f32).sum::<f32>() / n / 255.0;
+            let cg = pts.iter().map(|p| p.g as f32).sum::<f32>() / n / 255.0;
+            let cb = pts.iter().map(|p| p.b as f32).sum::<f32>() / n / 255.0;
+            let sx = pts.iter().map(|p| (p.x - cx).abs()).sum::<f32>() / n + 0.01;
+            let sy = pts.iter().map(|p| (p.y - cy).abs()).sum::<f32>() / n + 0.01;
+            let sz = pts.iter().map(|p| (p.z - cz).abs()).sum::<f32>() / n + 0.01;
+            Splat {
+                center: [cx, cy, cz],
+                color: [cr, cg, cb],
+                opacity: (n / 10.0).min(1.0),
+                scale: [sx, sy, sz],
+            }
+        })
+        .collect()
+}
+
+/// NEW: two fused accumulation passes per cell (production version).
+fn splats_new(points: &[ColorPoint]) -> Vec<Splat> {
+    let cells = voxelize(points);
+    cells
+        .values()
+        .map(|pts| {
+            let n = pts.len() as f32;
+            let (mut sum_x, mut sum_y, mut sum_z) = (0.0f32, 0.0f32, 0.0f32);
+            let (mut sum_r, mut sum_g, mut sum_b) = (0.0f32, 0.0f32, 0.0f32);
+            for p in pts {
+                sum_x += p.x;
+                sum_y += p.y;
+                sum_z += p.z;
+                sum_r += p.r as f32;
+                sum_g += p.g as f32;
+                sum_b += p.b as f32;
+            }
+            let cx = sum_x / n;
+            let cy = sum_y / n;
+            let cz = sum_z / n;
+            let cr = sum_r / n / 255.0;
+            let cg = sum_g / n / 255.0;
+            let cb = sum_b / n / 255.0;
+            let (mut dev_x, mut dev_y, mut dev_z) = (0.0f32, 0.0f32, 0.0f32);
+            for p in pts {
+                dev_x += (p.x - cx).abs();
+                dev_y += (p.y - cy).abs();
+                dev_z += (p.z - cz).abs();
+            }
+            Splat {
+                center: [cx, cy, cz],
+                color: [cr, cg, cb],
+                opacity: (n / 10.0).min(1.0),
+                scale: [dev_x / n + 0.01, dev_y / n + 0.01, dev_z / n + 0.01],
+            }
+        })
+        .collect()
+}
+
+/// Deterministic synthetic cloud (no RNG — fully reproducible).
+///
+/// Points are spread over a room volume that grows with `n` so that the number
+/// of occupied voxels scales with the point count (≈ 8 points per voxel on
+/// average), matching a real dense cloud where the optimization's per-cell
+/// reduction dominates. This avoids the degenerate "all points in one tiny
+/// cube" layout, which made the measurement noise-bound.
+fn make_cloud(n: usize) -> Vec<ColorPoint> {
+    // Side length of the voxel grid (in cells) so total cells ≈ n / 8.
+    let cells_per_side = (((n / 8).max(1) as f64).cbrt().ceil() as usize).max(1);
+    let extent = cells_per_side as f32 * VOXEL; // metres
+    let mut v = Vec::with_capacity(n);
+    for i in 0..n {
+        let t = i as f32;
+        // Three incommensurate strides walk the whole volume, depositing
+        // several points per cell deterministically.
+        v.push(ColorPoint {
+            x: (t * 0.011) % extent,
+            y: (t * 0.017) % extent,
+            z: (t * 0.023) % extent,
+            r: (i % 256) as u8,
+            g: ((i / 2) % 256) as u8,
+            b: ((i / 3) % 256) as u8,
+        });
+    }
+    v
+}
+
+fn bench_splats(c: &mut Criterion) {
+    let mut group = c.benchmark_group("to_gaussian_splats");
+    for &n in &[1_000usize, 10_000, 50_000] {
+        let cloud = make_cloud(n);
+
+        // Parity guard: old and new must agree bit-for-bit before we time them.
+        let a = splats_old(&cloud);
+        let b = splats_new(&cloud);
+        assert_eq!(a.len(), b.len(), "cell count differs at n={n}");
+        // Sort by center to compare set-equality (HashMap order is arbitrary).
+        let mut sa = a.clone();
+        let mut sb = b.clone();
+        let key = |s: &Splat| (s.center[0].to_bits(), s.center[1].to_bits(), s.center[2].to_bits());
+        sa.sort_by_key(key);
+        sb.sort_by_key(key);
+        assert_eq!(sa, sb, "old/new splat output diverged at n={n}");
+
+        group.bench_with_input(BenchmarkId::new("old_9pass", n), &cloud, |bch, cl| {
+            bch.iter(|| splats_old(black_box(cl)))
+        });
+        group.bench_with_input(BenchmarkId::new("new_2pass", n), &cloud, |bch, cl| {
+            bch.iter(|| splats_new(black_box(cl)))
+        });
+    }
+    group.finish();
+}
+
+criterion_group!(benches, bench_splats);
+criterion_main!(benches);
--- a/v2/crates/wifi-densepose-pointcloud/src/pointcloud.rs
+++ b/v2/crates/wifi-densepose-pointcloud/src/pointcloud.rs
@ -124,17 +124,38 @@ pub fn to_gaussian_splats(cloud: &PointCloud) -> Vec<GaussianSplat> {
        .values()
        .map(|pts| {
            let n = pts.len() as f32;
-            let cx = pts.iter().map(|p| p.x).sum::<f32>() / n;
-            let cy = pts.iter().map(|p| p.y).sum::<f32>() / n;
-            let cz = pts.iter().map(|p| p.z).sum::<f32>() / n;
-            let cr = pts.iter().map(|p| p.r as f32).sum::<f32>() / n / 255.0;
-            let cg = pts.iter().map(|p| p.g as f32).sum::<f32>() / n / 255.0;
-            let cb = pts.iter().map(|p| p.b as f32).sum::<f32>() / n / 255.0;

-            // Scale based on point spread
-            let sx = pts.iter().map(|p| (p.x - cx).abs()).sum::<f32>() / n + 0.01;
-            let sy = pts.iter().map(|p| (p.y - cy).abs()).sum::<f32>() / n + 0.01;
-            let sz = pts.iter().map(|p| (p.z - cz).abs()).sum::<f32>() / n + 0.01;
+            // Pass 1 — single fused accumulation of all six sums (position +
+            // colour). Replaces six separate `.iter().sum()` passes; identical
+            // f32 accumulation order, so the result is bit-for-bit unchanged.
+            let (mut sum_x, mut sum_y, mut sum_z) = (0.0f32, 0.0f32, 0.0f32);
+            let (mut sum_r, mut sum_g, mut sum_b) = (0.0f32, 0.0f32, 0.0f32);
+            for p in pts {
+                sum_x += p.x;
+                sum_y += p.y;
+                sum_z += p.z;
+                sum_r += p.r as f32;
+                sum_g += p.g as f32;
+                sum_b += p.b as f32;
+            }
+            let cx = sum_x / n;
+            let cy = sum_y / n;
+            let cz = sum_z / n;
+            let cr = sum_r / n / 255.0;
+            let cg = sum_g / n / 255.0;
+            let cb = sum_b / n / 255.0;
+
+            // Pass 2 — spread (mean absolute deviation) needs the centroid, so
+            // it is a second fused pass instead of three separate ones.
+            let (mut dev_x, mut dev_y, mut dev_z) = (0.0f32, 0.0f32, 0.0f32);
+            for p in pts {
+                dev_x += (p.x - cx).abs();
+                dev_y += (p.y - cy).abs();
+                dev_z += (p.z - cz).abs();
+            }
+            let sx = dev_x / n + 0.01;
+            let sy = dev_y / n + 0.01;
+            let sz = dev_z / n + 0.01;

            GaussianSplat {
                center: [cx, cy, cz],
@ -145,3 +166,44 @@ pub fn to_gaussian_splats(cloud: &PointCloud) -> Vec<GaussianSplat> {
        })
        .collect()
 }
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn empty_cloud_has_no_splats() {
+        let cloud = PointCloud::new("test");
+        assert!(to_gaussian_splats(&cloud).is_empty());
+    }
+
+    #[test]
+    fn single_voxel_centroid_and_scale_are_correct() {
+        // Two points inside the same 0.08 m voxel: (0.01,0.01,0.01) and
+        // (0.03,0.03,0.03). Centroid = 0.02 each axis; mean-abs-dev = 0.01;
+        // scale = 0.01 + 0.01 = 0.02. Colours: r=0 and r=255 → mean 127.5/255.
+        let mut cloud = PointCloud::new("test");
+        cloud.add(0.01, 0.01, 0.01, 0, 0, 0, 1.0);
+        cloud.add(0.03, 0.03, 0.03, 255, 255, 255, 1.0);
+
+        let splats = to_gaussian_splats(&cloud);
+        assert_eq!(splats.len(), 1, "both points fall in one voxel");
+        let s = &splats[0];
+        for axis in 0..3 {
+            assert!((s.center[axis] - 0.02).abs() < 1e-5, "center[{axis}]={}", s.center[axis]);
+            assert!((s.scale[axis] - 0.02).abs() < 1e-5, "scale[{axis}]={}", s.scale[axis]);
+            assert!((s.color[axis] - 127.5 / 255.0).abs() < 1e-5, "color[{axis}]");
+        }
+        // opacity = n/10 = 0.2
+        assert!((s.opacity - 0.2).abs() < 1e-6);
+    }
+
+    #[test]
+    fn distinct_voxels_yield_distinct_splats() {
+        // Two points far apart → two separate voxels → two splats.
+        let mut cloud = PointCloud::new("test");
+        cloud.add(0.0, 0.0, 0.0, 10, 20, 30, 1.0);
+        cloud.add(1.0, 1.0, 1.0, 40, 50, 60, 1.0);
+        assert_eq!(to_gaussian_splats(&cloud).len(), 2);
+    }
+}