diff --git a/docs/adr/ADR-081-gesture-controlled-visualization.md b/docs/adr/ADR-081-gesture-controlled-visualization.md
index e34f2f24..6f4a00c5 100644
--- a/docs/adr/ADR-081-gesture-controlled-visualization.md
+++ b/docs/adr/ADR-081-gesture-controlled-visualization.md
@@ -439,9 +439,107 @@ node scripts/train-gesture-model.js \
 **Training data per gesture:** ~20 examples × 11 gestures = 220 labeled samples.
 With augmentation (time warp, amplitude noise): ~1,000 effective samples.
 
+### Optimization: ruvector-cnn Spectrogram Gesture Classification
+
+Replace DTW template matching with a CNN operating on CSI spectrograms via the
+`ruvector-cnn` WASM package (ADR-076). This treats each gesture as an image
+classification problem on the CSI time-frequency representation.
+
+#### Why CNN Over DTW
+
+| | DTW (current, ADR-029) | CNN Spectrogram (proposed) |
+|---|---|---|
+| Input | 1D keypoint trajectories | 2D CSI spectrogram image |
+| Features | Hand-crafted (wrist velocity, elbow angle) | Learned end-to-end |
+| Robustness | Sensitive to speed variation | Warp-tolerant (pooling layers) |
+| Multi-scale | Single scale | Hierarchical (dilated convolutions) |
+| Training | Template recording + DTW distance | Supervised from camera labels |
+| New gestures | Record new template | Retrain (or few-shot with embedding) |
+| Accuracy | ~85% (DTW literature) | ~95%+ (CNN on spectrograms, literature) |
+
+#### Pipeline
+
+```
+CSI [N_subcarriers, T=30] (1-second window)
+    ↓
+Spectrogram transform: STFT per subcarrier
+    → [N_sub, F_bins, T_bins] ≈ [35, 16, 15]
+    ↓
+Reshape to grayscale image: [35×16, 15] = [560, 15]
+    → Resize to [64, 64] (bilinear)
+    ↓
+ruvector-cnn CnnEmbedder (WASM-accelerated)
+    → 128-dim gesture embedding
+    ↓
+Classifier head: Linear(128 → 18 gestures) + softmax
+    → gesture_id + confidence
+```
+
+#### ruvector-cnn Integration
+
+The `@ruvector/cnn` WASM package provides:
+
+```javascript
+const { init, CnnEmbedder, InfoNCELoss } = require('@ruvector/cnn');
+await init(); // needs an async context: CommonJS has no top-level await
+
+// Create embedder for 64x64 CSI spectrogram "images"
+const embedder = new CnnEmbedder({
+  inputSize: 64,
+  embeddingDim: 128,
+  normalize: true,
+});
+
+// Extract embedding from CSI spectrogram
+const spectrogram = csiToSpectrogram(csiWindow); // [64, 64] Uint8Array
+const embedding = embedder.extract(spectrogram, 64, 64);
+
+// Classify gesture via nearest-neighbor to trained templates
+const gesture = classifyGesture(embedding, gestureTemplates);
+```
+
+#### Training with Contrastive + Classification
+
+Two-phase training using ruvector-cnn's built-in losses:
+
+**Phase 1: Contrastive embedding (unsupervised)**
+```javascript
+const loss = new InfoNCELoss(0.07);
+// Same gesture performed at different speeds → positive pairs
+// Different gestures → negative pairs
+// Train CnnEmbedder to cluster same-gesture spectrograms
+```
+
+**Phase 2: Gesture classification (supervised)**
+```javascript
+// Linear classifier on frozen embeddings
+// 18 gestures × 20 examples each = 360 labeled samples
+// Camera auto-labels: MediaPipe Hands detects gesture type
+```
+
+#### Dual-Path Architecture
+
+Run both CNN and DTW in parallel for maximum robustness:
+
+```
+CSI input ──┬──→ WiFlow → keypoints → DTW templates → gesture_A (conf_A)
+            │
+            └──→ Spectrogram → ruvector-cnn → embedding → classifier → gesture_B (conf_B)
+
+Fusion: if gesture_A == gesture_B → conf = min(1.0, max(conf_A, conf_B) + 0.15)
+        if conflict → pick higher confidence
+        if only one detects → use it at 0.8× confidence
+```
+
+This dual-path approach provides:
+- **DTW** catches gestures the CNN might miss (novel variations)
+- **CNN** provides higher accuracy for trained gesture types
+- **Fusion** reduces false positives (both paths must agree for a high-confidence result)
+
 ### Optimization: Temporal Gesture Encoding
 
-Instead of classifying single frames, encode gesture trajectories:
+Alternative lightweight path for when ruvector-cnn WASM overhead matters
+(e.g., ESP32 edge deployment):
 
 ```
 Keypoint sequence [T=30 frames, 1 second]: