diff --git a/docs/adr/ADR-070-self-supervised-pretraining.md b/docs/adr/ADR-070-self-supervised-pretraining.md
new file mode 100644
index 00000000..0653dcd3
--- /dev/null
+++ b/docs/adr/ADR-070-self-supervised-pretraining.md
@@ -0,0 +1,203 @@
+# ADR-070: Self-Supervised Pretraining from Live ESP32 CSI + Cognitum Seed
+
+| Field | Value |
+|------------|----------------------------------------------------------|
+| Status | Accepted |
+| Date | 2026-04-02 |
+| Authors | rUv, claude-flow |
+| Drivers | README limitation "No pre-trained model weights provided"|
+| Related | ADR-069 (Cognitum Seed pipeline), ADR-027 (MERIDIAN), ADR-024 (AETHER contrastive), ADR-015 (MM-Fi dataset) |
+
+## Context
+
+The README lists "No pre-trained model weights are provided; training from scratch is required" as a known limitation. Users must collect their own CSI dataset and train from scratch, which is a significant barrier to adoption.
+
+We now have the infrastructure to generate pre-trained weights directly from live hardware:
+
+- **2 ESP32-S3 nodes** (COM8 node_id=2 at 192.168.1.104, COM9 node_id=1 at 192.168.1.105) streaming CSI + vitals + 8-dim feature vectors at 1 Hz each
+- **Cognitum Seed** (Pi Zero 2 W) with RVF vector store, kNN search, witness chain, and environmental sensors (BME280, PIR, vibration)
+- **Recording API** in sensing-server (`POST /api/v1/recording/start`) that saves CSI frames to `.csi.jsonl`
+- **Self-supervised training** via `rapid_adapt.rs` (contrastive TTT + entropy minimization)
+- **AETHER contrastive embeddings** (ADR-024) for environment-independent representations
+
+### Why Self-Supervised?
+
+No cameras or per-frame labels are needed. The system learns from:
+
+1. **Temporal coherence**: Frames close in time should have similar embeddings (positive pairs); frames far apart should differ (negative pairs)
+2. **Multi-node consistency**: The same person seen from 2 nodes should produce correlated features; different people should produce decorrelated features
+3. **Cognitum Seed ground truth**: PIR sensor, BME280 environment changes, and kNN cluster transitions provide weak supervision without human labeling
+4. **Physical constraints**: Breathing 6-30 BPM, heart rate 40-150 BPM, person count 0-4, RSSI physics
+
+## Decision
+
+Implement a 4-phase pretraining pipeline that collects CSI from 2 ESP32 nodes, stores feature vectors in the Cognitum Seed, and produces distributable pre-trained weights.
+
+### Phase 1: Data Collection (30 min)
+
+Capture labeled scenarios using the sensing-server recording API and Cognitum Seed:
+
+| Scenario | Duration | Label | Activity |
+|----------|----------|-------|----------|
+| Empty room | 5 min | `empty` | No one present, establish baseline |
+| 1 person stationary | 5 min | `1p-still` | Sit at desk, normal breathing |
+| 1 person walking | 5 min | `1p-walk` | Walk around room, varied paths |
+| 1 person varied | 5 min | `1p-varied` | Stand, sit, wave arms, turn |
+| 2 people | 5 min | `2p` | Both moving in room |
+| Transitions | 5 min | `transitions` | Enter/exit room, appear/disappear |
+
+**Data rate per scenario:**
+- 2 nodes × 100 Hz CSI = 200 frames/sec = 60,000 frames per 5 min
+- 2 nodes × 1 Hz features = 2 vectors/sec = 600 vectors per 5 min
+- Total: 360,000 CSI frames + 3,600 feature vectors per collection run
+
+**Cognitum Seed role:**
+- Stores all feature vectors with witness chain attestation
+- PIR sensor provides binary presence ground truth
+- BME280 tracks environmental conditions during collection
+- kNN graph clusters naturally emerge from the vector distribution
+
+### Phase 2: Contrastive Pretraining
+
+Train a contrastive encoder on the collected CSI data:
+
+```
+Input: Raw CSI frame (128 subcarriers × 2 I/Q = 256 features)
+  ↓
+  TCN temporal encoder (3 layers, kernel=7)
+  ↓
+  Projection head → 128-dim embedding
+  ↓
+  Contrastive loss (InfoNCE):
+    positive: frames within 0.5s window from same node
+    negative: frames >5s apart or from different scenario
+    cross-node positive: same timestamp, different node
+```
+
+**Self-supervised signals:**
+- Temporal adjacency (frames within 500ms = positive pair)
+- Cross-node agreement (same person seen from 2 viewpoints)
+- PIR consistency (embedding should cluster by PIR state)
+- Scenario boundary (embeddings should shift at label transitions)
+
+### Phase 3: Downstream Head Training
+
+Attach lightweight heads for each task:
+
+| Head | Architecture | Output | Supervision |
+|------|-------------|--------|-------------|
+| Presence | Linear(128→1) + sigmoid | 0.0-1.0 | PIR sensor (free) |
+| Person count | Linear(128→4) + softmax | 0-3 people | Scenario labels |
+| Activity | Linear(128→4) + softmax | still/walk/varied/empty | Scenario labels |
+| Vital signs | Linear(128→2) | BR, HR (BPM) | ESP32 edge vitals |
+
+### Phase 4: Package & Distribute
+
+Produce distributable artifacts:
+
+| Artifact | Format | Size | Description |
+|----------|--------|------|-------------|
+| `pretrained-encoder.onnx` | ONNX | ~2 MB | Contrastive encoder (TCN backbone) |
+| `pretrained-heads.onnx` | ONNX | ~100 KB | Task-specific heads |
+| `pretrained.rvf` | RVF | ~500 KB | RuVector format with metadata |
+| `room-profiles.json` | JSON | ~10 KB | Environment calibration profiles |
+| `collection-witness.json` | JSON | ~5 KB | Seed witness chain attestation proving data provenance |
+
+Include in GitHub release alongside firmware binaries. Users download and run:
+
+```bash
+# Use pre-trained model (no training needed)
+cargo run -p wifi-densepose-sensing-server -- --model pretrained.rvf --http-port 3000
+```
+
+## Hardware Setup
+
+```
+                        192.168.1.20 (Host laptop)
+                    ┌──────────────────────────┐
+                    │  sensing-server          │
+                    │  Recording API           │
+                    │  Training pipeline       │
+                    │                          │
+                    │  seed_csi_bridge.py      │
+                    │  Feature → Seed ingest   │
+                    └────┬──────────┬───────────┘
+                         │          │
+               UDP:5006  │          │  HTTPS:8443
+     ┌───────────────────┤          ├───────────────┐
+     │                   │          │               │
+     ▼                   ▼          ▼               │
+┌──────────┐       ┌──────────┐ ┌──────────────┐    │
+│ ESP32 #1 │       │ ESP32 #2 │ │Cognitum Seed │◄───┘
+│ COM9     │       │ COM8     │ │ Pi Zero 2W   │
+│ node=1   │       │ node=2   │ │ USB          │
+│ .1.105   │       │ .1.104   │ │ .42.1/8443   │
+│ v0.5.4   │       │ v0.5.4   │ │ v0.8.1       │
+└──────────┘       └──────────┘ │ PIR, BME280  │
+                                │ RVF store    │
+                                │ Witness chain│
+                                └──────────────┘
+```
+
+## Data Collection Protocol
+
+### Step 1: Start Seed ingest (background)
+
+```bash
+export SEED_TOKEN="your-token"
+python scripts/seed_csi_bridge.py \
+  --seed-url https://169.254.42.1:8443 --token "$SEED_TOKEN" \
+  --udp-port 5006 --batch-size 10 --validate &
+```
+
+### Step 2: Start sensing-server with recording
+
+```bash
+cargo run -p wifi-densepose-sensing-server -- \
+  --source esp32 --udp-port 5006 --http-port 3000
+```
+
+### Step 3: Record each scenario
+
+```bash
+# Empty room (leave room for 5 min)
+curl -X POST http://localhost:3000/api/v1/recording/start \
+  -H 'Content-Type: application/json' \
+  -d '{"session_name":"pretrain-empty","label":"empty","duration_secs":300}'
+
+# 1 person stationary (sit at desk for 5 min)
+curl -X POST http://localhost:3000/api/v1/recording/start \
+  -H 'Content-Type: application/json' -d '{"session_name":"pretrain-1p-still","label":"1p-still","duration_secs":300}'
+
+# ... repeat for each scenario
+```
+
+### Step 4: Verify with Seed
+
+```bash
+python scripts/seed_csi_bridge.py --token "$SEED_TOKEN" --stats
+# Should show 3,600+ vectors from the collection run
+```
+
+## Risks
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| 2 nodes insufficient for spatial diversity | Medium | Lower pretraining quality | Place nodes 3-5m apart at different heights |
+| PIR sensor has limited range | Low | Weak presence labels | BME280 temp changes + kNN clusters as backup |
+| Contrastive pretraining collapses | Low | Useless embeddings | Temperature scheduling, hard negative mining |
+| Model too large for ESP32 inference | N/A | N/A | Inference on host/Seed, not on ESP32 |
+| Room-specific overfitting | Medium | Poor generalization | MERIDIAN domain randomization (ADR-027), LoRA adaptation |
+
+## Consequences
+
+### Positive
+- Users get a working model out of the box, with no training needed
+- Witness chain proves data provenance (when/where/which hardware)
+- Pre-trained encoder transfers to new environments via LoRA fine-tuning
+- Removes the #1 adoption barrier from the README
+
+### Negative
+- 30 min of manual data collection per pretraining run
+- Pre-trained weights are room-specific without adaptation
+- ONNX runtime dependency for inference
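
Review note on Phase 2: the pair-selection rules and the InfoNCE objective listed there can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from `rapid_adapt.rs`. The names `Frame`, `pair_label`, and `info_nce` are hypothetical, as are the 10 ms cross-node sync tolerance (one frame period at 100 Hz), the temperature of 0.1, and the use of cosine similarity; the ADR does not specify any of them.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Frame:
    node_id: int    # ESP32 node that captured the frame
    t: float        # capture time in seconds
    scenario: str   # recording label, e.g. "empty" or "1p-walk"


def pair_label(a: Frame, b: Frame, sync_tol: float = 0.01) -> Optional[str]:
    """Classify a frame pair with the Phase 2 rules; None means the pair is unused."""
    if a.scenario != b.scenario:
        return "negative"          # different scenario
    dt = abs(a.t - b.t)
    if a.node_id == b.node_id and dt <= 0.5:
        return "positive"          # temporal adjacency on the same node
    if a.node_id != b.node_id and dt <= sync_tol:
        return "positive"          # cross-node, same timestamp (assumed tolerance)
    if dt > 5.0:
        return "negative"          # far apart in time
    return None                    # between 0.5 s and 5 s: ambiguous, skipped


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: Sequence[Sequence[float]],
             temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor: negative log-softmax of the positive's score."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_denominator - logits[0]
```

Pairs that fall between 0.5 s and 5 s apart are left unlabeled in this sketch; the ADR does not say whether the training pipeline samples from that gap.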