# ADR-058: Dual-Modal WASM Browser Pose Estimation — Live Video + WiFi CSI Fusion - **Status**: Proposed - **Date**: 2026-03-12 - **Deciders**: ruv - **Tags**: wasm, browser, cnn, pose-estimation, ruvector, video, multimodal, fusion ## Context WiFi-DensePose estimates human poses from WiFi CSI (Channel State Information). The `ruvector-cnn` crate provides a pure Rust CNN (MobileNet-V3) with WASM bindings. Both modalities exist independently — what's missing is **fusing live webcam video with WiFi CSI** in a single browser demo to achieve robust pose estimation that works even when one modality degrades (occlusion, signal noise, poor lighting). Existing assets: 1. **`wifi-densepose-wasm`** — CSI signal processing compiled to WASM 2. **`wifi-densepose-sensing-server`** — Axum server streaming live CSI via WebSocket 3. **`ruvector-cnn`** — Pure Rust CNN with MobileNet-V3 backbones, SIMD, contrastive learning 4. **`ruvector-cnn-wasm`** — wasm-bindgen bindings: `WasmCnnEmbedder`, `SimdOps`, `LayerOps`, contrastive losses 5. **`vendor/ruvector/examples/wasm-vanilla/`** — Reference vanilla JS WASM example Research shows multi-modal fusion (camera + WiFi) significantly outperforms either alone: - Camera fails under occlusion, poor lighting, privacy constraints - WiFi CSI fails with signal noise, multipath, low spatial resolution - Fusion compensates: WiFi provides through-wall coverage, camera provides fine-grained detail ## Decision Build a **dual-modal browser demo** at `examples/wasm-browser-pose/` that: 1. Captures **live webcam video** via `getUserMedia` API 2. Receives **live WiFi CSI** via WebSocket from the sensing server 3. Processes **both streams** through separate CNN pipelines in `ruvector-cnn-wasm` 4. **Fuses embeddings** with learned attention weights for combined pose estimation 5. Renders **video overlay** with skeleton + WiFi confidence heatmap on Canvas 6. Runs entirely in the browser — all inference client-side via WASM ### Architecture ``` ┌──────────────────────────────────────────────────────────────────┐ │ Browser │ │ │ │ ┌────────────┐ ┌────────────────┐ ┌───────────────────┐ │ │ │ getUserMedia│───▶│ Video Frame │───▶│ CNN WASM │ │ │ │ (Webcam) │ │ Capture │ │ (Visual Embedder) │ │ │ └────────────┘ │ 224×224 RGB │ │ → 512-dim │ │ │ └────────────────┘ └────────┬──────────┘ │ │ │ │ │ visual_embedding │ │ │ │ │ ┌──────▼──────┐ │ │ ┌────────────┐ ┌────────────────┐ │ │ │ │ │ WebSocket │───▶│ CSI WASM │ │ Attention │ │ │ │ Client │ │ (densepose- │ │ Fusion │ │ │ │ │ │ wasm) │ │ Module │ │ │ └────────────┘ └───────┬────────┘ │ │ │ │ │ └──────┬──────┘ │ │ ┌───────▼────────┐ │ │ │ │ CNN WASM │ fused_embedding │ │ │ (CSI Embedder) │ │ │ │ │ → 512-dim │ ┌──────▼──────┐ │ │ └───────┬────────┘ │ Pose │ │ │ │ │ Decoder │ │ │ csi_embedding │ → 17 kpts │ │ │ │ └──────┬──────┘ │ │ └──────────────────────┘ │ │ │ │ │ ┌──────────────┐ ┌─────▼──────┐ │ │ │ Video Canvas │◀────────│ Overlay │ │ │ │ + Skeleton │ │ Renderer │ │ │ │ + Heatmap │ └────────────┘ │ │ └──────────────┘ │ │ │ └──────────────────────────────────────────────────────────────────┘ ▲ ▲ │ getUserMedia │ WebSocket │ (camera) │ (ws://host:3030/ws/csi) │ │ ┌────┴────┐ ┌───────┴─────────┐ │ Webcam │ │ Sensing Server │ └─────────┘ └─────────────────┘ ``` ### Dual Pipeline Design Two parallel CNN pipelines run on each frame tick (~30 FPS): | Pipeline | Input | Preprocessing | CNN Config | Output | |----------|-------|---------------|------------|--------| | **Visual** | Webcam frame (640×480) | Resize to 224×224 RGB, ImageNet normalize | MobileNet-V3 Small, 512-dim | Visual embedding | | **CSI** | CSI frame (ADR-018 binary) | Amplitude/phase/delta → 224×224 pseudo-RGB | MobileNet-V3 Small, 512-dim | CSI embedding | Both use the same `WasmCnnEmbedder` but with separate instances and weight sets. ### Fusion Strategy **Learned attention-weighted fusion** combines the two 512-dim embeddings: ```javascript // Attention fusion: learn which modality to trust per-dimension // α ∈ [0,1]^512 — attention weights (shipped as JSON, trained offline) // visual_emb, csi_emb ∈ R^512 function fuseEmbeddings(visual_emb, csi_emb, attention_weights) { const fused = new Float32Array(512); for (let i = 0; i < 512; i++) { const α = attention_weights[i]; fused[i] = α * visual_emb[i] + (1 - α) * csi_emb[i]; } return fused; } ``` **Dynamic confidence gating** adjusts fusion based on signal quality: | Condition | Behavior | |-----------|----------| | Good video + good CSI | Balanced fusion (α ≈ 0.5) | | Poor lighting / occlusion | CSI-dominant (α → 0, WiFi takes over) | | CSI noise / no ESP32 | Video-dominant (α → 1, camera only) | | Video-only mode (no WiFi) | α = 1.0, pure visual CNN pose estimation | | CSI-only mode (no camera) | α = 0.0, pure WiFi pose estimation | Quality detection: - **Video quality**: Frame brightness variance (dark = low quality), motion blur score - **CSI quality**: Signal-to-noise ratio from `wifi-densepose-wasm`, coherence gate output ### CSI-to-Image Encoding CSI data encoded as 3-channel pseudo-image for the CSI CNN pipeline: | Channel | Data | Normalization | |---------|------|---------------| | R | CSI amplitude (subcarrier × time window) | Min-max to [0, 255] | | G | CSI phase (unwrapped, subcarrier × time window) | Min-max to [0, 255] | | B | Temporal difference (frame-to-frame Δ amplitude) | Abs, min-max to [0, 255] | ### Video Processing Webcam frames processed through standard ImageNet pipeline: ```javascript // Capture frame from video element const frame = captureVideoFrame(videoElement, 224, 224); // Returns Uint8Array RGB // ImageNet normalization happens inside WasmCnnEmbedder.extract() const visual_embedding = visual_embedder.extract(frame, 224, 224); ``` ### Pose Keypoint Mapping 17 COCO-format keypoints decoded from the fused 512-dim embedding: ``` 0: nose 1: left_eye 2: right_eye 3: left_ear 4: right_ear 5: left_shoulder 6: right_shoulder 7: left_elbow 8: right_elbow 9: left_wrist 10: right_wrist 11: left_hip 12: right_hip 13: left_knee 14: right_knee 15: left_ankle 16: right_ankle ``` Each keypoint decoded as (x, y, confidence) = 51 values from the 512-dim embedding via a learned linear projection. ### Operating Modes The demo supports three modes, selectable in the UI: | Mode | Video | CSI | Fusion | Use Case | |------|-------|-----|--------|----------| | **Dual (default)** | ✅ | ✅ | Attention-weighted | Best accuracy, full demo | | **Video Only** | ✅ | ❌ | α = 1.0 | No ESP32 available, quick demo | | **CSI Only** | ❌ | ✅ | α = 0.0 | Privacy mode, through-wall sensing | **Video Only mode works without any hardware** — just a webcam — making the demo instantly accessible for anyone wanting to try it. ### File Layout ``` examples/wasm-browser-pose/ ├── index.html # Single-page app (vanilla JS, no bundler) ├── js/ │ ├── app.js # Main entry, mode selection, orchestration │ ├── video-capture.js # getUserMedia, frame extraction, quality detection │ ├── csi-processor.js # WebSocket CSI client, frame parsing, pseudo-image encoding │ ├── fusion.js # Attention-weighted embedding fusion, confidence gating │ ├── pose-decoder.js # Fused embedding → 17 keypoints │ └── canvas-renderer.js # Video overlay, skeleton, CSI heatmap, confidence bars ├── data/ │ ├── visual-weights.json # Visual CNN → embedding projection (placeholder until trained) │ ├── csi-weights.json # CSI CNN → embedding projection (placeholder until trained) │ ├── fusion-weights.json # Attention fusion α weights (512 values) │ └── pose-weights.json # Fused embedding → keypoint projection ├── css/ │ └── style.css # Dark theme UI styling ├── pkg/ # Built WASM packages (gitignored, built by script) │ ├── wifi_densepose_wasm/ │ └── ruvector_cnn_wasm/ ├── build.sh # wasm-pack build script for both packages └── README.md # Setup and usage instructions ``` ### Build Pipeline ```bash #!/bin/bash # build.sh — builds both WASM packages into pkg/ set -e # Build wifi-densepose-wasm (CSI processing) wasm-pack build ../../v2/crates/wifi-densepose-wasm \ --target web --out-dir "$(pwd)/pkg/wifi_densepose_wasm" --no-typescript # Build ruvector-cnn-wasm (CNN inference for both video and CSI) wasm-pack build ../../vendor/ruvector/crates/ruvector-cnn-wasm \ --target web --out-dir "$(pwd)/pkg/ruvector_cnn_wasm" --no-typescript echo "Build complete. Serve with: python3 -m http.server 8080" ``` ### UI Layout ``` ┌─────────────────────────────────────────────────────────┐ │ WiFi-DensePose — Live Dual-Modal Pose Estimation │ │ [Dual Mode ▼] [⚙ Settings] FPS: 28 ◉ Live │ ├───────────────────────────┬─────────────────────────────┤ │ │ │ │ ┌───────────────────┐ │ ┌───────────────────┐ │ │ │ │ │ │ │ │ │ │ Video + Skeleton │ │ │ CSI Heatmap │ │ │ │ Overlay │ │ │ (amplitude × │ │ │ │ (main canvas) │ │ │ subcarrier) │ │ │ │ │ │ │ │ │ │ └───────────────────┘ │ └───────────────────┘ │ │ │ │ ├───────────────────────────┴─────────────────────────────┤ │ Fusion Confidence: ████████░░ 78% │ │ Video: ██████████ 95% │ CSI: ██████░░░░ 61% │ ├─────────────────────────────────────────────────────────┤ │ ┌─────────────────────────────────────────────────┐ │ │ │ Embedding Space (2D projection) │ │ │ │ · · · │ │ │ │ · · · · · · (color = pose cluster) │ │ │ │ · · · · │ │ │ └─────────────────────────────────────────────────┘ │ ├─────────────────────────────────────────────────────────┤ │ Latency: Video 12ms │ CSI 8ms │ Fusion 1ms │ Total 21ms│ │ [▶ Record] [📷 Snapshot] [Confidence: ████ 0.6] │ └─────────────────────────────────────────────────────────┘ ``` ### WASM Module Structure | Package | Source Crate | Provides | Size (est.) | |---------|-------------|----------|-------------| | `wifi_densepose_wasm` | `wifi-densepose-wasm` | CSI frame parsing, signal processing, feature extraction | ~200KB | | `ruvector_cnn_wasm` | `ruvector-cnn-wasm` | `WasmCnnEmbedder` (×2 instances), `SimdOps`, `LayerOps`, contrastive losses | ~150KB | Two `WasmCnnEmbedder` instances are created — one for video frames, one for CSI pseudo-images. They share the same WASM module but have independent state. ### Browser API Requirements | API | Purpose | Required | Fallback | |-----|---------|----------|----------| | `getUserMedia` | Webcam capture | For video mode | CSI-only mode | | WebAssembly | CNN inference | Yes | None (hard requirement) | | WASM SIMD128 | Accelerated inference | No | Scalar fallback (~2× slower) | | WebSocket | CSI data stream | For CSI mode | Video-only mode | | Canvas 2D | Rendering | Yes | None | | `requestAnimationFrame` | Render loop | Yes | `setTimeout` fallback | | ES Modules | Code organization | Yes | None | Target: Chrome 89+, Firefox 89+, Safari 15+, Edge 89+ ### Performance Budget | Stage | Target Latency | Notes | |-------|---------------|-------| | Video frame capture + resize | <3ms | `drawImage` to offscreen canvas | | Video CNN embedding | <15ms | 224×224 RGB → 512-dim | | CSI receive + parse | <2ms | Binary WebSocket message | | CSI pseudo-image encoding | <3ms | Amplitude/phase/delta channels | | CSI CNN embedding | <15ms | 224×224 pseudo-RGB → 512-dim | | Attention fusion | <1ms | Element-wise weighted sum | | Pose decoding | <1ms | Linear projection | | Canvas overlay render | <3ms | Video + skeleton + heatmap | | **Total (dual mode)** | **<33ms** | **30 FPS capable** | | **Total (video only)** | **<22ms** | **45 FPS capable** | Note: Video and CSI CNN pipelines can run in parallel using Web Workers, reducing dual-mode latency to ~max(15, 15) + 5 = ~20ms (50 FPS). ### Contrastive Learning Integration The demo optionally shows real-time contrastive learning in the browser: - **InfoNCE loss** (`WasmInfoNCELoss`): Compare video vs CSI embeddings for the same pose — trains cross-modal alignment - **Triplet loss** (`WasmTripletLoss`): Push apart different poses, pull together same pose across modalities - **SimdOps**: Accelerated dot products for real-time similarity computation - **Embedding space panel**: Live 2D projection shows video and CSI embeddings converging when viewing the same person ### Relationship to Existing Crates | Existing Crate | Role in This Demo | |---------------|-------------------| | `ruvector-cnn-wasm` | CNN inference for **both** video frames and CSI pseudo-images | | `wifi-densepose-wasm` | CSI frame parsing and signal processing | | `wifi-densepose-sensing-server` | WebSocket CSI data source | | `wifi-densepose-core` | ADR-018 frame format definitions | | `ruvector-cnn` | Underlying MobileNet-V3, layers, contrastive learning | No new Rust crates are needed. The example is pure HTML/JS consuming existing WASM packages. ## Consequences ### Positive - **Instant demo**: Video-only mode works with just a webcam — no ESP32 needed - **Multi-modal showcase**: Demonstrates camera + WiFi fusion, the core innovation of the project - **Graceful degradation**: Works with video-only, CSI-only, or both - **Through-wall capability**: CSI mode shows pose estimation where cameras cannot reach - **Zero-install**: Anyone with a browser can try it - **Training data collection**: Can record paired (video, CSI) data for offline model training - **Reusable**: JS modules embed directly in the Tauri desktop app's webview ### Negative - **Model weights**: Requires offline-trained weights for visual CNN, CSI CNN, fusion, and pose decoder (~200KB total JSON) - **WASM size**: Two WASM modules total ~350KB (acceptable) - **No GPU**: CPU-only WASM inference; adequate at 224×224 but limits resolution scaling - **Camera privacy**: Video mode requires camera permission (mitigated: CSI-only mode available) - **Two CNN instances**: Memory footprint doubles vs single-modal (~10MB total, acceptable for desktop browsers) ### Risks - **Cross-modal alignment**: Video and CSI embeddings must be trained jointly for fusion to work; without proper training, fusion may be worse than either modality alone - **Latency on mobile**: Dual CNN on mobile browsers may exceed 33ms; implement automatic quality reduction - **WebSocket drops**: Network jitter → CSI frame gaps; buffer last 3 frames, interpolate missing data ## Implementation Plan 1. **Phase 1 — Scaffold**: File layout, build.sh, index.html shell, mode selector UI 2. **Phase 2 — Video pipeline**: getUserMedia → frame capture → CNN embedding → basic pose display 3. **Phase 3 — CSI pipeline**: WebSocket client → CSI parsing → pseudo-image → CNN embedding 4. **Phase 4 — Fusion**: Attention-weighted combination, confidence gating, mode switching 5. **Phase 5 — Pose decoder**: Linear projection with placeholder weights → 17 keypoints 6. **Phase 6 — Overlay renderer**: Video canvas with skeleton overlay, CSI heatmap panel 7. **Phase 7 — Training**: Use `wifi-densepose-train` to generate real weights for both CNNs + fusion + decoder 8. **Phase 8 — Contrastive demo**: Embedding space visualization, cross-modal similarity display 9. **Phase 9 — Web Workers**: Move CNN inference to workers for parallel video + CSI processing 10. **Phase 10 — Polish**: Recording, snapshots, adaptive quality, mobile optimization ## Alternatives Considered ### 1. CSI-Only (No Video) Rejected: Misses the opportunity to show multi-modal fusion and makes the demo less accessible (requires ESP32 hardware). Video-only mode as a fallback is strictly better. ### 2. Server-Side Video Inference Rejected: Adds latency, requires webcam stream upload (privacy concern), and defeats the WASM-first architecture. All inference must be client-side. ### 3. TensorFlow.js for Video, ruvector-cnn-wasm for CSI Rejected: Would require two different ML frameworks. Using `ruvector-cnn-wasm` for both keeps a single WASM module, unified embedding space, and simpler fusion. ### 4. Pre-recorded Video Demo Rejected: Live webcam input is far more compelling for demonstrations. Pre-recorded mode can be added as a secondary option. ### 5. React/Vue Framework Rejected: Adds build tooling. Vanilla JS + ES modules keeps the demo self-contained. ## References - [ADR-018: Binary CSI Frame Format](ADR-018-binary-csi-frame-format.md) - [ADR-024: Contrastive CSI Embedding / AETHER](ADR-024-contrastive-csi-embedding.md) - [ADR-055: Integrated Sensing Server](ADR-055-integrated-sensing-server.md) - `vendor/ruvector/crates/ruvector-cnn/src/lib.rs` — CNN embedder implementation - `vendor/ruvector/crates/ruvector-cnn-wasm/src/lib.rs` — WASM bindings - `vendor/ruvector/examples/wasm-vanilla/index.html` — Reference vanilla JS WASM pattern - Person-in-WiFi: Fine-grained Person Perception using WiFi (ICCV 2019) — camera+WiFi fusion precedent - WiPose: Multi-Person WiFi Pose Estimation (TMC 2022) — cross-modal embedding approach