diff --git a/docs/adr/ADR-081-gesture-controlled-visualization.md b/docs/adr/ADR-081-gesture-controlled-visualization.md new file mode 100644 index 00000000..2fd7621f --- /dev/null +++ b/docs/adr/ADR-081-gesture-controlled-visualization.md @@ -0,0 +1,433 @@ +# ADR-081: Gesture-Controlled Data Visualization + +- **Status**: Proposed +- **Date**: 2026-04-07 +- **Deciders**: ruv +- **Relates to**: ADR-079 (Camera Ground-Truth Training), ADR-029 (RuvSense Gesture Recognition), ADR-072 (WiFlow Architecture), ADR-076 (CNN Spectrogram Embeddings) + +## Context + +RuView can now track 17 COCO keypoints at 92.9% PCK@20 (ADR-079) and detect gestures +via DTW template matching (ADR-029). These capabilities exist independently — pose +estimation produces skeleton coordinates, and the UI displays static charts. There is no +system that connects hand/arm movements to interactive data exploration. + +Gesture-controlled visualization would let users manipulate charts and graphs by waving +their hands in front of the ESP32 sensing zone — no mouse, no touchscreen, no wearable. +This is particularly valuable for: + +- **Lab/cleanroom** — gloved hands can't use touchscreens +- **Kitchen/workshop** — dirty or wet hands +- **Presentations** — stand back and gesture at projected dashboards +- **Accessibility** — motor impairments that make mouse use difficult +- **Digital signage** — public displays without touch hardware + +### Why Camera + CSI Fusion + +Camera alone can do gesture control (e.g., Leap Motion, MediaPipe Hands). CSI alone can +detect coarse gestures (ADR-029). The fusion provides: + +| Modality | Strengths | Weaknesses | +|----------|-----------|-----------| +| Camera (MediaPipe Hands) | 21 hand landmarks, finger-level precision, 30fps | Requires line of sight, lighting dependent, privacy concern | +| CSI (ESP32) | Through-wall, works in dark, privacy-preserving, $9 | Coarse spatial resolution, no finger tracking | +| **Fusion** | **Finger precision near camera + coarse tracking everywhere** | Requires both sensors during training | + +The fusion model trains on camera + CSI pairs (like ADR-079), then deploys in two modes: +1. **Camera-assisted** — full precision when camera is available +2. **CSI-only** — reduced but functional gesture control without camera + +## Decision + +Build a gesture-to-visualization control system that maps hand/arm movements to chart +interactions using fused camera + CSI input. + +### Gesture Vocabulary + +#### Navigation Gestures (arm-level, CSI-detectable) + +| Gesture | Motion | Chart Action | CSI Feasibility | +|---------|--------|-------------|-----------------| +| **Swipe left** | Open hand sweeps left | Pan chart left / previous dataset | High — clear directional motion | +| **Swipe right** | Open hand sweeps right | Pan chart right / next dataset | High | +| **Swipe up** | Open hand sweeps up | Scroll up / zoom out | High | +| **Swipe down** | Open hand sweeps down | Scroll down / zoom in | High | +| **Push forward** | Palm pushes toward screen | Select / drill into data point | Medium — depth motion harder | +| **Pull back** | Hand pulls away from screen | Back / zoom out | Medium | +| **Circular CW** | Hand circles clockwise | Increase value / rotate view | Medium — temporal pattern | +| **Circular CCW** | Hand circles counter-clockwise | Decrease value / rotate back | Medium | +| **Hold still** | Hand stationary 2+ seconds | Hover / show tooltip | High — absence of motion | +| **Both hands apart** | Arms spread outward | Expand / zoom into selection | High — bilateral motion | +| **Both hands together** | Arms move inward | Collapse / zoom out | High | + +#### Precision Gestures (finger-level, camera-required) + +| Gesture | Motion | Chart Action | Sensor | +|---------|--------|-------------|--------| +| **Pinch zoom** | Thumb + index spread/close | Continuous zoom | Camera only | +| **Point** | Index finger extended | Cursor position on chart | Camera only | +| **Grab** | Close fist | Grab and drag data point | Camera only | +| **Thumb up** | Thumbs up | Confirm / approve | Camera only | +| **Thumb down** | Thumbs down | Reject / undo | Camera only | +| **Two-finger rotate** | Two fingers twist | Rotate 3D visualization | Camera only | +| **Finger slider** | Index finger moves along axis | Adjust parameter value | Camera only | + +### Architecture + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ Input Layer │ +│ │ +│ ESP32 CSI (UDP 5005) ──→ CSI Gesture Detector (DTW + WiFlow) │ +│ ↓ │ +│ Webcam (MediaPipe Hands) ──→ Hand Landmark Tracker (21 joints) │ +│ ↓ │ +│ Gesture Fusion Engine │ +│ ├── CSI coarse: swipe/circle/hold │ +│ ├── Camera fine: pinch/point/grab │ +│ └── Confidence weighting by modality │ +└──────────────────────────────────────────────────────────────────┘ + ↓ +┌──────────────────────────────────────────────────────────────────┐ +│ Gesture Interpreter │ +│ │ +│ Raw gestures ──→ State Machine ──→ Chart Commands │ +│ │ +│ States: │ +│ IDLE ──(motion detected)──→ TRACKING │ +│ TRACKING ──(gesture matched)──→ ACTING │ +│ ACTING ──(gesture complete)──→ COOLDOWN │ +│ COOLDOWN ──(500ms)──→ IDLE │ +│ │ +│ Debounce: 200ms minimum gesture duration │ +│ Cooldown: 500ms between consecutive gestures │ +│ Confidence threshold: 0.7 for CSI, 0.9 for camera │ +└──────────────────────────────────────────────────────────────────┘ + ↓ +┌──────────────────────────────────────────────────────────────────┐ +│ Visualization Controller │ +│ │ +│ Chart Commands ──→ WebSocket ──→ UI │ +│ │ +│ Commands: │ +│ { type: "pan", dx: -0.1, dy: 0 } │ +│ { type: "zoom", factor: 1.2, center: [0.5, 0.5] } │ +│ { type: "select", x: 0.45, y: 0.62 } │ +│ { type: "rotate", angle: 15 } │ +│ { type: "slider", axis: "x", value: 0.73 } │ +│ { type: "hover", x: 0.45, y: 0.62 } │ +│ { type: "back" } │ +│ { type: "confirm" } │ +│ { type: "reject" } │ +└──────────────────────────────────────────────────────────────────┘ + ↓ +┌──────────────────────────────────────────────────────────────────┐ +│ Visualization UI │ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Line Chart │ │ Bar Chart │ │ 3D Scatter │ │ +│ │ (time │ │ (category │ │ (spatial │ │ +│ │ series) │ │ compare) │ │ data) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Heatmap │ │ Gauge │ │ Spectrogram │ │ +│ │ (CSI grid) │ │ (vitals) │ │ (frequency) │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +│ │ +│ Visual feedback: gesture cursor overlay + action indicator │ +│ Framework: D3.js / Observable Plot in existing UI │ +└──────────────────────────────────────────────────────────────────┘ +``` + +### Gesture Detection Pipeline + +#### CSI Gesture Detection (arm-level) + +Extends the existing DTW gesture classifier (ADR-029) with WiFlow pose input: + +``` +CSI [35, 20] ──→ WiFlow lite ──→ 17 keypoints ──→ Extract arm features: + - Wrist velocity (dx/dt, dy/dt) + - Elbow angle (shoulder-elbow-wrist) + - Bilateral symmetry (left vs right) + - Motion energy (frame differencing) + ↓ + DTW template matching: + - 11 gesture templates + - Sliding window (1s) + - Top match + confidence +``` + +#### Camera Gesture Detection (finger-level) + +Uses MediaPipe Hands (21 landmarks per hand, 30fps): + +``` +Webcam ──→ MediaPipe Hands ──→ 21 landmarks × 2 hands ──→ Extract: + - Finger states (extended/curled) + - Pinch distance (thumb-index) + - Grab state (all fingers curled) + - Point direction (index ray) + - Hand center velocity + ↓ + Rule-based classifier: + - Pinch: thumb-index < 0.05 + - Point: only index extended + - Grab: all fingers curled + - Thumbs up/down: thumb angle +``` + +#### Fusion Strategy + +``` +CSI confidence ──┐ + ├──→ Weighted fusion ──→ Final gesture + confidence +Camera conf ──┘ + +Rules: + - If both agree: confidence = max(csi_conf, cam_conf) + 0.1 * min(csi_conf, cam_conf) + - If only CSI: use CSI gesture, confidence *= 0.8 + - If only camera: use camera gesture, confidence *= 0.95 + - If conflict: prefer camera for fine gestures, CSI for coarse gestures + - Minimum confidence for action: 0.6 +``` + +### Chart Interaction Mapping + +#### Line Chart (Time Series) + +| Gesture | Action | Parameters | +|---------|--------|-----------| +| Swipe left/right | Pan time axis | dx proportional to swipe speed | +| Pinch zoom | Zoom time axis | Continuous, centered on hand position | +| Both hands apart/together | Zoom (CSI-only alternative) | Binary zoom in/out | +| Point | Show tooltip at nearest data point | x from index finger position | +| Hold still | Sticky tooltip | Duration-based activation | +| Swipe up/down | Switch dataset / Y-axis scale | Discrete steps | + +#### Bar Chart (Category Comparison) + +| Gesture | Action | Parameters | +|---------|--------|-----------| +| Swipe left/right | Navigate categories | One category per swipe | +| Point | Highlight bar | Nearest bar to finger X position | +| Push forward | Select bar for drill-down | Depth gesture | +| Grab + drag | Reorder bars | Camera-only | +| Circular | Sort ascending/descending | Direction determines order | + +#### 3D Scatter Plot + +| Gesture | Action | Parameters | +|---------|--------|-----------| +| Swipe left/right | Rotate Y axis | Angle proportional to speed | +| Swipe up/down | Rotate X axis | Angle proportional to speed | +| Two-finger rotate | Rotate Z axis | Camera-only | +| Pinch zoom | Zoom | Camera-only | +| Both hands apart | Zoom in (CSI alternative) | Binary | +| Point | Highlight nearest point | Ray-cast from finger direction | + +#### Heatmap (CSI Grid) + +| Gesture | Action | Parameters | +|---------|--------|-----------| +| Swipe | Pan view | dx, dy | +| Pinch | Zoom region | Center + scale | +| Hold | Show cell value | Position-based | +| Circular | Adjust color scale range | CW = expand, CCW = contract | + +#### Gauge (Vital Signs) + +| Gesture | Action | Parameters | +|---------|--------|-----------| +| Swipe left/right | Switch vital (HR → BR → SpO2) | Discrete | +| Circular CW | Set high alert threshold | Continuous | +| Circular CCW | Set low alert threshold | Continuous | +| Thumb up | Acknowledge alert | Binary | + +### Visual Feedback + +The UI provides real-time feedback so users know the system is tracking them: + +``` +┌────────────────────────────────────────────────┐ +│ │ +│ ┌──────────────────────────────────────┐ │ +│ │ │ │ +│ │ Chart (D3.js) │ │ +│ │ │ │ +│ │ ◉ ← gesture cursor │ │ +│ │ │ │ +│ └──────────────────────────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────┐ │ +│ │ 🖐 Tracking │ ← Swipe Right │ 0.87 │ │ +│ └──────────────────────────────────────┘ │ +│ Gesture bar: state | detected gesture | conf │ +│ │ +│ ┌────┐ ┌────┐ ┌────┐ ┌────┐ │ +│ │ CSI│ │ CAM│ │ FPS│ │MODE│ │ +│ │ ● │ │ ● │ │ 28 │ │FUSE│ │ +│ └────┘ └────┘ └────┘ └────┘ │ +│ Status: sensor health, framerate, mode │ +└────────────────────────────────────────────────┘ +``` + +**Cursor types:** +- **Open hand** (🖐) — tracking, ready for gesture +- **Pointing** (👆) — hovering over data point +- **Grabbing** (✊) — dragging / selected +- **Pinching** (🤏) — zooming +- **Ghost cursor** — CSI-only mode (lower opacity, larger radius) + +### Data Flow Protocol + +WebSocket messages from gesture engine to UI: + +```typescript +interface GestureEvent { + type: 'gesture'; + gesture: 'swipe_left' | 'swipe_right' | 'swipe_up' | 'swipe_down' + | 'pinch_zoom' | 'point' | 'grab' | 'hold' | 'circle_cw' + | 'circle_ccw' | 'push' | 'pull' | 'spread' | 'contract' + | 'thumb_up' | 'thumb_down'; + confidence: number; // 0-1 + source: 'csi' | 'camera' | 'fusion'; + position?: [number, number]; // Normalized [0,1] hand position + velocity?: [number, number]; // Hand velocity for proportional control + param?: number; // Gesture-specific parameter (pinch distance, rotation angle) +} + +interface CursorEvent { + type: 'cursor'; + x: number; // 0-1 normalized + y: number; // 0-1 normalized + state: 'tracking' | 'pointing' | 'grabbing' | 'pinching' | 'idle'; + hands: number; // 0, 1, or 2 +} + +interface StatusEvent { + type: 'status'; + csi_active: boolean; + camera_active: boolean; + mode: 'fusion' | 'csi_only' | 'camera_only'; + fps: number; + gesture_count: number; // Total gestures detected this session +} +``` + +### Training the CSI Gesture Model + +Extends ADR-079's camera ground-truth pipeline: + +```bash +# 1. Collect gesture training data (camera + CSI, 10 min) +# Perform each gesture 20+ times with natural variation +python scripts/collect-gesture-gt.py --duration 600 --gestures all --preview + +# 2. Label gesture segments (auto-detected from camera) +node scripts/label-gestures.js \ + --gt data/ground-truth/gestures-*.jsonl \ + --csi data/recordings/csi-*.jsonl + +# 3. Train gesture classifier +node scripts/train-gesture-model.js \ + --data data/gestures/labeled-*.jsonl \ + --scale lite + +# 4. Deploy +# CSI-only mode: gestures detected from WiFlow keypoint motion +# Fusion mode: camera adds finger-level precision +``` + +**Training data per gesture:** ~20 examples × 11 gestures = 220 labeled samples. +With augmentation (time warp, amplitude noise): ~1,000 effective samples. + +### Optimization: Temporal Gesture Encoding + +Instead of classifying single frames, encode gesture trajectories: + +``` +Keypoint sequence [T=30 frames, 1 second]: + wrist_x[0..29], wrist_y[0..29], + elbow_angle[0..29], + hand_velocity[0..29] + ↓ +1D CNN (k=5, d=[1,2,4]) → 64-dim gesture embedding + ↓ +Nearest-neighbor to gesture templates (cosine distance) + ↓ +Top gesture + confidence +``` + +This is lighter than DTW for real-time use and can be trained end-to-end with +the WiFlow backbone (shared TCN features). + +## File Structure + +``` +scripts/ + collect-gesture-gt.py # Camera + CSI gesture data collection + label-gestures.js # Auto-label gesture segments from camera + train-gesture-model.js # Train CSI gesture classifier + gesture-server.js # WebSocket gesture detection server + +ui/ + components/ + GestureOverlay.js # Cursor + feedback overlay + GestureChart.js # Gesture-controlled chart wrapper + GestureStatus.js # Sensor health bar + services/ + gesture.service.js # WebSocket client for gesture events +``` + +## Consequences + +### Positive + +- **Hands-free data exploration** — manipulate charts without touching anything +- **Works in dark/dirty/gloved conditions** — CSI-only mode needs no camera +- **Natural interaction** — swipe, pinch, point are intuitive +- **Builds on existing infrastructure** — WiFlow + DTW + MediaPipe all exist +- **Dual-mode deployment** — degrade gracefully from fusion to CSI-only +- **Low latency** — WiFlow inference is 0.79ms, gesture detection adds ~5ms + +### Negative + +- **Learning curve** — users must learn gesture vocabulary +- **False positives** — normal movement may trigger gestures (mitigated by state machine + cooldown) +- **CSI-only precision** — coarse gestures only without camera +- **Single-user** — multi-user gesture disambiguation is hard + +### Risks + +| Risk | Probability | Impact | Mitigation | +|------|-------------|--------|------------| +| Gesture false positives from normal movement | Medium | High | State machine with IDLE→TRACKING threshold, 200ms debounce, 0.7 confidence gate | +| CSI gestures too coarse for chart control | Medium | Medium | Camera fallback for precision; CSI handles navigation-level gestures only | +| Latency > 100ms feels unresponsive | Low | High | WiFlow 0.79ms + gesture 5ms + WebSocket <10ms = ~16ms total | +| User fatigue ("gorilla arm") | Medium | Medium | Support seated gestures; small wrist movements, not full arm sweeps | +| MediaPipe Hands not detecting in low light | Medium | Low | CSI-only fallback; works in complete darkness | + +## Implementation Plan + +| Phase | Task | Effort | Dependencies | +|-------|------|--------|-------------| +| P1 | `gesture-server.js` — WebSocket server with camera hand tracking | 3 hrs | MediaPipe Hands model | +| P2 | Camera gesture classifier (rule-based from hand landmarks) | 2 hrs | P1 | +| P3 | CSI gesture classifier (WiFlow keypoints → DTW templates) | 3 hrs | WiFlow model (ADR-079) | +| P4 | Fusion engine (confidence-weighted merge) | 2 hrs | P2 + P3 | +| P5 | `GestureOverlay.js` — cursor + feedback UI component | 2 hrs | P1 | +| P6 | `GestureChart.js` — gesture-controlled D3 chart wrapper | 4 hrs | P4 + P5 | +| P7 | Gesture training data collection + model training | 2 hrs | P3 | +| P8 | Integration with existing sensing UI | 2 hrs | P6 | +| **Total** | | **~20 hrs** | | + +## References + +- MediaPipe Hands — Google's 21-landmark hand tracking (30fps, CPU) +- ADR-029 — RuvSense DTW gesture recognition +- ADR-079 — Camera ground-truth training pipeline (92.9% PCK@20) +- Leap Motion — commercial gesture controller (comparison point) +- SolidJS/D3 gesture interaction patterns +- "GestureWiFi" (IEEE 2023) — WiFi gesture recognition survey