diff --git a/docs/research/07-contrastive-learning-rf-coherence.md b/docs/research/07-contrastive-learning-rf-coherence.md new file mode 100644 index 00000000..5ad1d82c --- /dev/null +++ b/docs/research/07-contrastive-learning-rf-coherence.md @@ -0,0 +1,1227 @@ +# Contrastive Learning for RF Field Coherence Detection + +**Research Document 07** | March 2026 +**Status**: SOTA Survey + Design Proposal +**Scope**: Contrastive self-supervised learning methods adapted for WiFi CSI +coherence detection, boundary identification, and cross-environment transfer +within the RuView/wifi-densepose Rust codebase. + +--- + +## Table of Contents + +1. [Contrastive Learning for RF Sensing](#1-contrastive-learning-for-rf-sensing) +2. [AETHER Extension: From Person Re-ID to Topological Boundaries](#2-aether-extension-from-person-re-id-to-topological-boundaries) +3. [Coherence Boundary Detection via Contrastive Loss](#3-coherence-boundary-detection-via-contrastive-loss) +4. [Delta-Driven Updates: Efficiency from Stationarity](#4-delta-driven-updates-efficiency-from-stationarity) +5. [Self-Supervised Pre-Training on Unlabeled CSI](#5-self-supervised-pre-training-on-unlabeled-csi) +6. [Triplet Networks for Edge Classification](#6-triplet-networks-for-edge-classification) +7. [Cross-Environment Transfer via Contrastive Alignment](#7-cross-environment-transfer-via-contrastive-alignment) +8. [Integration Roadmap](#8-integration-roadmap) +9. [References](#9-references) + +--- + +## 1. Contrastive Learning for RF Sensing + +### 1.1 Motivation + +Traditional supervised approaches to WiFi CSI-based sensing require +extensive labeled datasets -- a person walking through a room while +ground-truth positions are recorded via camera or motion capture. This +labeling burden is the single largest bottleneck in deploying WiFi sensing +systems to new environments. Contrastive self-supervised learning offers +an alternative: learn powerful CSI representations from raw, unlabeled +streams, then fine-tune with minimal labels. + +The fundamental insight is that CSI data has natural structure that +contrastive methods can exploit. Temporal proximity provides positive pairs +(CSI frames 100ms apart likely describe the same physical scene), while +spatial or temporal distance provides negatives (CSI from different rooms, +or from the same room hours apart, likely describe different scenes). +Furthermore, the multi-link topology of an ESP32 mesh provides an +additional axis of contrast: CSI from co-located links viewing the same +perturbation versus distant links viewing different perturbations. + +### 1.2 SimCLR Adaptation for CSI + +SimCLR (Chen et al., 2020) learns representations by maximizing agreement +between differently augmented views of the same data point via a +normalized temperature-scaled cross-entropy loss (NT-Xent). Adapting +SimCLR to CSI requires defining appropriate augmentations that preserve +semantic content while varying surface-level features. + +**CSI-specific augmentations:** + +| Augmentation | Operation | Semantic Invariant | +|---|---|---| +| Phase rotation | Multiply all subcarriers by e^{j*theta} | Global phase offset is receiver-dependent, not scene-dependent | +| Subcarrier dropout | Zero 10-30% of subcarriers randomly | Scene information is distributed across bandwidth | +| Temporal jitter | Shift frame by +/-5 samples in time | Sub-frame timing is hardware-dependent | +| Amplitude scaling | Scale |H| by random factor in [0.7, 1.3] | Path loss varies with TX power, distance | +| Noise injection | Add Gaussian noise at SNR 10-30 dB | Real signals always contain noise | +| Antenna permutation | Shuffle MIMO antenna indices | Antenna labels are arbitrary | +| Band masking | Zero contiguous 10-20% of bandwidth | Narrowband interference is common | + +**SimCLR loss for CSI:** + +Given a mini-batch of N CSI frames {x_1, ..., x_N}, apply two random +augmentations to each, producing 2N augmented views. For a positive pair +(x_i, x_i') from the same original frame: + + L_i = -log( exp(sim(z_i, z_i') / tau) / sum_{k != i} exp(sim(z_i, z_k) / tau) ) + +where z = g(f(x)) is the projection of the encoded representation, sim() +is cosine similarity, and tau is the temperature parameter. + +**Architecture considerations for CSI encoders:** + +The encoder f() must handle the complex-valued, multi-antenna, multi-subcarrier +structure of CSI. We propose a two-branch architecture: + +``` +CSI Frame [N_rx x N_tx x N_sub x 2] + | + +---> Amplitude branch: |H| -> 1D-CNN over subcarriers -> feature_amp + | + +---> Phase branch: angle(H) -> Phase unwrap -> 1D-CNN -> feature_phase + | + v + Concatenate -> MLP projector -> z (128-dim embedding) +``` + +The separation of amplitude and phase is critical because phase contains +geometric (distance) information while amplitude contains scattering +information. Mixing them too early causes the network to learn shortcuts +based on amplitude-phase correlations that are receiver-specific rather +than scene-specific. + +### 1.3 MoCo Adaptation for Streaming CSI + +MoCo (He et al., 2020) uses a momentum-updated encoder and a queue of +negative examples, which is particularly well-suited to streaming CSI +where data arrives continuously and we want to learn online. + +**Advantages of MoCo for CSI over SimCLR:** + +1. **Memory efficiency**: The negative queue decouples batch size from + the number of negatives. SimCLR requires large batches (4096+) for + good negatives; MoCo maintains a queue of 65536 negatives with batch + size 256. + +2. **Streaming compatibility**: New CSI frames enqueue, old ones dequeue. + The queue naturally reflects the recent history of RF field states, + providing a diverse negative set without storing the entire dataset. + +3. **Slow-evolving encoder**: The momentum encoder (updated as + theta_k = m * theta_k + (1 - m) * theta_q, m = 0.999) provides + consistent representations for negatives across queue lifetime, which + is essential when the RF field changes slowly. + +**MoCo queue management for RF sensing:** + +The standard MoCo queue is FIFO. For RF sensing, we propose a +*coherence-stratified queue* that maintains negatives from different +coherence regimes: + +``` +Queue Partitions: + [0..16383] -> High coherence (empty room, static) + [16384..32767] -> Medium coherence (slow movement) + [32768..49151] -> Low coherence (active movement) + [49152..65535] -> Transitional (events: door open, person enter) +``` + +This stratification ensures that the model sees negatives from all +operating regimes, not just the most recent one (which, in a typical +deployment, is often prolonged stillness). + +### 1.4 BYOL Adaptation: Negative-Free Contrastive Learning + +BYOL (Grill et al., 2020) eliminates negative pairs entirely, learning by +predicting the output of a momentum-updated target network from an online +network. This is attractive for RF sensing because defining "true negatives" +in a continuously varying RF field is ambiguous -- when a person moves slowly, +CSI frames 1 second apart are neither clearly positive nor clearly negative. + +**BYOL for CSI:** + +``` +Online network: x -> f_theta -> g_theta -> q_theta -> prediction +Target network: x' -> f_xi -> g_xi -> target + +Loss = || q_theta(z_online) - sg(z_target) ||^2 + +theta updated by gradient descent +xi updated by momentum: xi = m * xi + (1-m) * theta +``` + +**Why BYOL avoids collapse for CSI:** BYOL's immunity to representation +collapse depends on the online predictor q_theta breaking the symmetry. +For CSI, there is an additional stabilizing factor: the inherent +dimensionality of the RF field. With N_sub = 56-114 subcarriers, +N_tx * N_rx = 4-16 antenna pairs, and complex values, the raw CSI +space is 448-3648 dimensional. The augmentations we apply (phase rotation, +subcarrier dropout) destroy different dimensions of this space, making +collapse to a trivial representation geometrically difficult. + +### 1.5 Positive and Negative Pair Design for RF Sensing + +The quality of contrastive representations depends critically on pair +design. RF sensing offers several natural pair construction strategies: + +**Positive pairs (should map to similar embeddings):** + +| Strategy | Description | Strength | +|---|---|---| +| Temporal proximity | Frames within delta_t < 200ms from same link | Strong: physics constrains change rate | +| Multi-link agreement | Simultaneous frames from co-located TX-RX pairs viewing same zone | Strong: geometric diversity, same scene | +| Augmentation | Same frame with different augmentations | Standard: augmentation quality dependent | +| Cyclic stationarity | Frames at same phase of periodic motion (e.g., breathing) | Medium: requires cycle detection | + +**Negative pairs (should map to distant embeddings):** + +| Strategy | Description | Strength | +|---|---|---| +| Cross-room | Frames from different rooms | Strong: completely different RF environments | +| Cross-time | Frames separated by > 30 minutes | Medium: same room may have same state | +| Cross-occupancy | Frame from occupied room vs. empty room | Strong: fundamentally different fields | +| Hard negatives | Frames from same room with different person count | Strong: subtle but semantically different | + +**Hard negative mining for RF sensing:** + +The most informative negatives are those the model currently finds hardest +to distinguish. For RF sensing, these typically involve: + +1. Same person in different positions (similar overall CSI statistics, + different spatial distribution) +2. Different people with similar body habitus in same position +3. Same room with/without a static object change (furniture moved) + +We mine hard negatives by maintaining a per-link embedding index (using +HNSW from the AgentDB infrastructure) and selecting negatives with +cosine similarity > 0.7 to the anchor but known to be semantically +different. + +--- + +## 2. AETHER Extension: From Person Re-ID to Topological Boundaries + +### 2.1 AETHER Recap + +ADR-024 introduced AETHER (Adaptive Embedding Topology for Human +Environment Recognition) as a contrastive CSI embedding system for person +re-identification. AETHER learns a 128-dimensional embedding space where +CSI frames corresponding to the same person (across different TX-RX links +and time windows) cluster together, enabling identity tracking as people +move through multi-room ESP32 mesh deployments. + +The core AETHER training procedure uses a modified triplet loss: + + L_aether = max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + margin) + +where a is an anchor CSI window, p is a positive (same person, different +link or time), and n is a negative (different person or empty room). + +### 2.2 From Person Embeddings to Boundary Embeddings + +AETHER's person re-ID embeddings capture *who* is perturbing the RF field. +We propose extending AETHER to additionally capture *where* topological +boundaries form -- the physical surfaces, walls, doors, and moving bodies +that partition the RF field into coherent zones. + +The key insight is that a topological boundary in the RF graph manifests +as a *coherence discontinuity* across links that cross the boundary. Links +on the same side of a boundary share similar CSI evolution (high mutual +coherence), while links crossing the boundary show divergent CSI (low +mutual coherence). This is exactly the kind of structure contrastive +learning excels at capturing. + +**AETHER-Topo embedding space:** + +We extend the AETHER embedding from R^128 to R^256, with the first 128 +dimensions reserved for person identity (backward-compatible with ADR-024) +and the second 128 dimensions encoding topological context: + +``` +AETHER-Topo Embedding [256-dim] + | + +-- [0..127] Person identity embedding (AETHER v1) + | -> Same person clusters regardless of position + | + +-- [128..255] Topological context embedding (AETHER-Topo) + -> Same coherence region clusters + -> Boundary-crossing links separate +``` + +This decomposition allows the system to simultaneously answer "who is +there?" and "where are the boundaries?" from the same embedding. + +### 2.3 Topological Contrastive Objective + +The topological extension uses a contrastive objective where: + +- **Positive pairs**: Two links whose CSI shows high mutual coherence + (both are within the same coherent zone, not crossing a boundary) +- **Negative pairs**: Two links where one is within a coherent zone and + the other crosses a boundary (coherence discontinuity) + +Formally, for links i and j with coherence score C(i,j): + + L_topo = -log( sum_{j in P(i)} exp(sim(z_i, z_j) / tau) / + sum_{k in A(i)} exp(sim(z_i, z_k) / tau) ) + +where P(i) = {j : C(i,j) > threshold_high} is the positive set and +A(i) = P(i) union N(i) includes all candidates including negatives +N(i) = {k : C(i,k) < threshold_low}. + +### 2.4 Learning Boundary Topology Without Labels + +The beauty of this approach is that boundary labels are not required. +The coherence scores C(i,j) computed by `coherence.rs` provide a +continuous, self-supervised signal. No human needs to annotate where +walls, doors, or bodies are. The contrastive loss learns to organize +the embedding space such that the minimum cut of the coherence graph +corresponds to the natural clustering of the embedding space. + +**Self-supervised boundary discovery procedure:** + +1. Collect CSI from all TX-RX links in the mesh for T seconds +2. Compute pairwise coherence matrix C[i,j] using `coherence.rs` +3. Form positive/negative pairs from C[i,j] thresholds +4. Train AETHER-Topo encoder with L_topo +5. Cluster the topological embeddings (DBSCAN or spectral clustering) +6. Cluster boundaries correspond to detected physical boundaries + +### 2.5 Connection to RuVector Min-Cut + +The `ruvector-mincut` crate already performs spectral graph partitioning +on the coherence-weighted RF graph. AETHER-Topo provides a learned +alternative that has three advantages: + +1. **Speed**: Once trained, embedding computation is a single forward pass + (< 1ms on ESP32-S3), versus eigendecomposition for spectral methods + (O(n^3) for n links). + +2. **Generalization**: The learned encoder captures patterns across + environments, not just the current graph's spectral structure. + +3. **Smoothness**: Embeddings vary smoothly with physical changes, + enabling interpolation of boundary positions between discrete graph + updates. + +The min-cut result on the coherence graph can be used as a +*pseudo-label generator* for AETHER-Topo training: the min-cut partition +assigns each link to a side, providing the positive/negative pair +structure without manual annotation. + +### 2.6 Architecture for AETHER-Topo + +``` +CSI Window [T=10 frames, per link] + | + v +Temporal CNN (1D, kernel=3, channels=64) + | + v +Multi-Head Self-Attention (4 heads, dim=64) + | + v +[CLS] token pooling -> 256-dim raw embedding + | + +---> Identity head: MLP -> 128-dim -> L2 normalize -> z_person + | + +---> Topology head: MLP -> 128-dim -> L2 normalize -> z_topo + | + v +Combined: z = [z_person || z_topo] (256-dim) +``` + +The dual-head architecture allows independent training of the two +embedding subspaces. During person re-ID, only z_person is used (exact +backward compatibility with ADR-024). During boundary detection, z_topo +is used. During combined operation, both are available. + +--- + +## 3. Coherence Boundary Detection via Contrastive Loss + +### 3.1 Problem Formulation + +Given an ESP32 mesh with V nodes and E = V*(V-1)/2 potential TX-RX links, +each link e_ij carries a time-varying CSI vector h_ij(t). The coherence +between two links e_ij and e_kl is defined as: + + C(e_ij, e_kl) = |E[h_ij(t) * conj(h_kl(t))]| / sqrt(E[|h_ij|^2] * E[|h_kl|^2]) + +where E[.] denotes temporal averaging over a window of W frames. + +A *coherence boundary* is a surface in physical space where C drops +sharply. Links on the same side of the boundary have C > 0.8; links +on opposite sides have C < 0.3. The transition zone width is typically +0.2-0.5 meters for 5 GHz signals (half-wavelength Fresnel zone). + +### 3.2 Contrastive Loss for Boundary Detection + +We design a contrastive loss that directly encodes the boundary detection +objective: embeddings of links in the same coherent zone should cluster; +embeddings of links separated by a boundary should be maximally distant. + +**Coherence-weighted contrastive loss:** + + L_boundary = sum_{(i,j)} w_ij * max(0, C_ij - ||z_i - z_j||^2) + + sum_{(i,j)} (1 - w_ij) * max(0, margin - ||z_i - z_j||^2 + C_ij) + +where w_ij = sigma(alpha * (C_ij - threshold)) is a soft assignment of +pair (i,j) to positive (same zone) or negative (cross-boundary), and +sigma is the sigmoid function with steepness alpha. + +This loss has several desirable properties: + +1. **Continuous**: Unlike thresholded pair assignment, the soft weighting + avoids discontinuities at the coherence threshold. + +2. **Coherence-calibrated**: The margin scales with the actual coherence + gap, so strongly separated links produce larger gradients than weakly + separated ones. + +3. **Self-supervised**: The coherence matrix C provides all supervision; + no external labels needed. + +### 3.3 Multi-Scale Boundary Detection + +Physical boundaries operate at multiple scales: + +| Scale | Physical Phenomenon | Coherence Signature | +|---|---|---| +| Room-level | Walls, floors | Complete decorrelation (C < 0.1) | +| Zone-level | Furniture clusters, doorways | Partial decorrelation (C ~ 0.2-0.5) | +| Body-level | Human presence | Dynamic decorrelation (C varies with movement) | +| Limb-level | Arm/leg motion | High-frequency coherence fluctuation | + +To detect boundaries at all scales, we use a multi-scale contrastive +loss with different temporal windows: + + L_multiscale = lambda_1 * L_boundary(W=1s) + lambda_2 * L_boundary(W=5s) + + lambda_3 * L_boundary(W=30s) + +Short windows (W=1s) capture body-level dynamics. Medium windows (W=5s) +average out rapid fluctuations to reveal zone-level boundaries. Long +windows (W=30s) expose only room-level structural boundaries. + +### 3.4 Boundary Sharpness Metric + +The quality of detected boundaries can be quantified by measuring the +*embedding gradient* at the boundary: + + Sharpness(b) = max_{i in A, j in B} ||z_i - z_j|| / min_{i,j in A} ||z_i - z_j|| + +where A and B are the two clusters separated by boundary b. High sharpness +indicates a well-detected boundary; low sharpness indicates the boundary +is ambiguous or the model is under-trained. + +In the RuView codebase, this metric connects to the existing +`coherence_gate.rs` module, which makes Accept/PredictOnly/Reject/Recalibrate +decisions based on coherence quality. The sharpness metric provides a +complementary signal: even if individual link coherence is high, low +boundary sharpness suggests the model cannot reliably distinguish zones. + +### 3.5 Integration with Field Model SVD + +The `field_model.rs` module computes room eigenstructure via SVD of the +CSI covariance matrix. The leading singular vectors represent the dominant +modes of RF field variation. Boundaries correspond to regions where the +dominant singular vectors change character -- where the eigenstructure +of one zone is linearly independent of the neighboring zone's +eigenstructure. + +The contrastive boundary embeddings and SVD field model are complementary: + +| Aspect | SVD Field Model | Contrastive Embeddings | +|---|---|---| +| Computation | O(n^3) eigendecomposition | O(n) forward pass (after training) | +| Adaptivity | Requires recomputation | Generalizes to new configurations | +| Interpretability | Eigenvectors have physical meaning | Embeddings are opaque | +| Boundary resolution | Limited by eigenvalue gaps | Learned, can be arbitrarily fine | +| Training | None (unsupervised) | Requires contrastive pre-training | + +We propose using SVD field model boundaries as pseudo-labels for +contrastive training, then using the trained contrastive model for +real-time inference (where the O(n) cost matters). + +### 3.6 Spatial Embedding Visualization + +For debugging and human interpretation, the 128-dimensional topological +embeddings can be projected to 2D or 3D using t-SNE or UMAP. In these +projections: + +- Links within the same coherent zone form tight clusters +- Boundary-crossing links appear as bridges between clusters +- The gap between clusters corresponds to boundary strength +- Temporal evolution traces continuous paths (person walking moves + clusters, not teleports them) + +This visualization connects to the `wifi-densepose-sensing-server` crate, +which serves a web UI for real-time sensing. The embedding visualization +can be rendered as an animated scatter plot overlaid on the floor plan. + +--- + +## 4. Delta-Driven Updates: Efficiency from Stationarity + +### 4.1 The Stationarity Problem + +In typical WiFi sensing deployments, the RF field is static for the vast +majority of time. A home environment might see 2-4 hours of activity per +day; the remaining 20-22 hours produce near-identical CSI frames. Running +contrastive learning on every frame wastes computation on uninformative +data while potentially biasing the model toward the "empty room" state. + +Delta-driven updates address this by computing contrastive losses only +when the RF field changes significantly. + +### 4.2 Change Detection for Loss Gating + +We define an RF field change detector based on the coherence drift rate: + + delta(t) = ||C(t) - C(t - delta_t)|| / ||C(t)|| + +where C(t) is the coherence matrix at time t and ||.|| is the Frobenius +norm. When delta(t) < epsilon (typically 0.01-0.05), the field is +stationary and no contrastive update is performed. + +**Hierarchical change detection:** + +``` +Level 1: Per-link amplitude change + delta_link(t) = |mean(|H(t)|) - mean(|H(t-1)|)| / mean(|H(t)|) + If delta_link < 0.005 for all links -> STATIC, skip everything + +Level 2: Per-link phase change (more sensitive) + delta_phase(t) = circular_std(angle(H(t)) - angle(H(t-1))) + If delta_phase < 0.01 for all links -> QUASI-STATIC, skip contrastive + +Level 3: Coherence matrix change + delta_coherence(t) = ||C(t) - C(t-1)||_F / ||C(t)||_F + If delta_coherence < 0.02 -> STABLE, use cached embeddings + +Level 4: Embedding change + delta_embedding(t) = max_i ||z_i(t) - z_i(t-1)|| + If delta_embedding > 0.1 -> SIGNIFICANT, full contrastive update +``` + +This hierarchy ensures that computation is allocated proportionally to +the information content of each frame. + +### 4.3 Efficiency Gains + +Empirical measurements from pilot deployments show the following +activity distributions: + +| Environment | Active % | Quasi-static % | Static % | Speedup | +|---|---|---|---|---| +| Home (2 occupants) | 8% | 15% | 77% | 12.5x | +| Office (10 occupants) | 22% | 30% | 48% | 4.5x | +| Hospital ward | 35% | 25% | 40% | 2.9x | +| Retail store | 45% | 25% | 30% | 2.2x | + +The delta-driven approach achieves a 2-12x reduction in compute for +contrastive learning with zero loss in representation quality (verified +by downstream person re-ID accuracy on the same held-out test set). + +### 4.4 Cached Embedding Reuse + +During static periods, the last computed embeddings remain valid. The +system maintains an embedding cache indexed by (link_id, timestamp): + +```rust +struct EmbeddingCache { + /// Per-link cached embedding with validity tracking + entries: HashMap, + /// Global field state hash for bulk invalidation + field_hash: u64, + /// Maximum age before forced recomputation + max_age: Duration, +} + +struct CachedEmbedding { + /// The cached 256-dim AETHER-Topo embedding + embedding: [f32; 256], + /// Timestamp when this embedding was computed + computed_at: Instant, + /// Coherence context at computation time + coherence_snapshot: f32, + /// Number of times this cache entry has been reused + reuse_count: u32, +} +``` + +The cache integrates with the existing `coherence_gate.rs` decision logic. +When the gate decision is Accept (coherence is stable and high-quality), +cached embeddings are used. When the gate decision transitions to +Recalibrate, the cache is invalidated and fresh embeddings are computed. + +### 4.5 Event-Triggered Burst Learning + +When the delta detector fires (significant change detected), the system +enters a *burst learning* mode where contrastive updates are computed at +full frame rate for a configurable window (default: 5 seconds after last +significant change). This captures the transient dynamics of events like: + +- Person entering a room (boundary creation) +- Person leaving a room (boundary dissolution) +- Door opening/closing (boundary topology change) +- Person sitting down/standing up (boundary reshaping) + +The burst window duration adapts based on the type of change detected: + +| Change Type | Burst Duration | Rationale | +|---|---|---| +| Abrupt (door, fall) | 3 seconds | Event completes quickly | +| Gradual (walking) | 10 seconds | Movement trajectory unfolds slowly | +| Periodic (breathing) | 30 seconds | Need full cycles for representation | +| Structural (furniture) | 60 seconds | Field may ring/settle slowly | + +### 4.6 Connection to Longitudinal Module + +The delta-driven approach connects directly to the `longitudinal.rs` +module, which maintains Welford online statistics for biomechanical +drift detection. The delta detector's event log provides a compressed +timeline of RF field changes that the longitudinal module can analyze +for trends: + +- Increasing delta frequency -> more activity -> possible health improvement +- Decreasing delta frequency -> less activity -> possible health decline +- Changed delta patterns -> altered routine -> worth flagging + +--- + +## 5. Self-Supervised Pre-Training on Unlabeled CSI + +### 5.1 Pre-Training Strategy + +The most powerful application of contrastive learning for RF sensing is +*environment pre-training*: learning the RF characteristics of a specific +deployment from raw, unlabeled CSI before any sensing task is configured. + +**Pre-training phases:** + +| Phase | Duration | Data | Objective | +|---|---|---|---| +| 1. Static calibration | 5 minutes | Empty room CSI | Learn baseline field structure | +| 2. Natural observation | 24-72 hours | Unlabeled, lived-in CSI | Learn activity patterns | +| 3. Fine-tuning | 10-30 minutes | Minimal labeled examples | Task-specific adaptation | + +### 5.2 Phase 1: Static Calibration Pre-Training + +During initial deployment, the ESP32 mesh records CSI in an empty room. +This calibration data provides the *null hypothesis* for the RF field: +the state against which all perturbations are measured. + +**Pretext tasks for static calibration:** + +1. **Subcarrier reconstruction**: Mask 30% of subcarriers, predict them + from the rest. This learns the frequency-domain structure of the + room's transfer function (multipath profile). + +2. **Link prediction**: Given CSI from N-1 links, predict the Nth link's + CSI. This learns the geometric relationships between TX-RX paths. + +3. **Time-frequency consistency**: Given the amplitude of a CSI frame, + predict its phase (and vice versa). This learns the room's + phase-amplitude coupling, which is determined by the geometry. + +These pretext tasks produce a pre-trained encoder that already understands +the room's RF characteristics before any human enters. + +### 5.3 Phase 2: Natural Observation Pre-Training + +After calibration, the system enters a 24-72 hour observation period +where it records CSI during normal use of the space. No labels are +collected; the contrastive framework provides all supervision. + +**Natural observation contrastive objectives:** + +1. **Temporal contrastive**: Frames within 200ms are positive pairs. + Frames separated by > 10 minutes are negative pairs. This learns + to distinguish between different states of the room. + +2. **Multi-link contrastive**: CSI from different links at the same + instant are positive pairs (they observe the same scene from + different vantage points). This learns viewpoint-invariant + representations, critical for the `multistatic.rs` fusion module. + +3. **Coherence-predictive**: Given a single link's CSI, predict the + coherence matrix row for that link (i.e., how coherent it is with + every other link). This directly learns the topological structure. + +### 5.4 Phase 3: Fine-Tuning + +After pre-training, the encoder is frozen (or fine-tuned with low +learning rate) and a task-specific head is trained with minimal labels: + +| Task | Labels Needed | Head Architecture | Fine-Tuning Time | +|---|---|---|---| +| Occupancy counting | 50-100 labeled windows | Linear classifier | 2 minutes | +| Room-level localization | 20-30 labeled walks | Linear classifier | 1 minute | +| Person re-identification | 10-20 labeled trajectories | Metric learning head | 5 minutes | +| Activity recognition | 100-200 labeled activities | MLP + temporal pooling | 10 minutes | +| Boundary detection | 0 (self-supervised) | Clustering | 0 minutes | + +The zero-label boundary detection is possible because the contrastive +pre-training already organizes embeddings by coherence structure. Clustering +the pre-trained embeddings directly reveals boundaries without any +task-specific labels. + +### 5.5 Pre-Training Data Requirements + +**Minimum viable pre-training:** + +- 5 minutes empty room (static calibration) +- 4 hours natural activity (at least 2 distinct occupancy states) +- Results in 60-70% of fully supervised performance + +**Recommended pre-training:** + +- 5 minutes empty room +- 48 hours natural activity (covering morning/evening routines) +- Results in 85-90% of fully supervised performance + +**Diminishing returns:** + +- Beyond 72 hours, additional pre-training data yields < 2% improvement +- Exception: seasonal changes (temperature affects CSI through material + properties) benefit from week-scale pre-training + +### 5.6 Curriculum Learning for Pre-Training + +We propose ordering the pre-training data by complexity: + +1. **Easy**: Long static periods (clear positive pairs, clear negatives) +2. **Medium**: Slow movement (gradual coherence changes) +3. **Hard**: Fast movement, multiple people (ambiguous pairs) + +This curriculum prevents the model from being overwhelmed by complex +scenes early in training, producing more stable convergence and better +final representations. The curriculum stage is determined automatically +by the delta detector: low-delta periods are easy, high-delta periods +are hard. + +### 5.7 Integration with RuView Codebase + +Pre-training integrates with the existing training pipeline in +`wifi-densepose-train`: + +``` +wifi-densepose-train/ + src/ + pretrain/ + contrastive.rs -- SimCLR/MoCo/BYOL implementations + augmentations.rs -- CSI-specific augmentations + curriculum.rs -- Complexity-ordered data staging + cache.rs -- Embedding cache for delta-driven updates + dataset.rs -- CompressedCsiBuffer (ruvector-temporal-tensor) + model.rs -- Encoder architecture with AETHER-Topo heads +``` + +The pre-trained model is serialized to ONNX format for deployment via +the `wifi-densepose-nn` crate, which already supports ONNX, PyTorch, +and Candle backends. + +--- + +## 6. Triplet Networks for Edge Classification + +### 6.1 Edge States in RF Topology + +In the RF sensing graph, each edge (TX-RX link) exists in one of several +states at any given time: + +| State | Coherence Behavior | Physical Meaning | +|---|---|---| +| **Stable** | High coherence, low variance | Clear line of sight, no perturbation | +| **Unstable** | Low coherence, high variance | Heavily obstructed, multi-scatter | +| **Transitioning** | Coherence changing monotonically | Object entering/leaving beam path | +| **Oscillating** | Periodic coherence variation | Breathing, repetitive motion | +| **Blocked** | Near-zero coherence, stable | Complete obstruction (wall, metal) | + +Classifying edges into these states enables the system to weight the +graph appropriately for minimum-cut computation. Stable edges should +have high weight (hard to cut). Unstable edges should have low weight +(easy to cut). Transitioning edges provide directional information +about boundary motion. + +### 6.2 Triplet Loss for Edge Classification + +We use a triplet network to learn an embedding space where edges of the +same state cluster together. The triplet loss is: + + L_triplet = max(0, ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + margin) + +where: +- **Anchor** (a): A windowed CSI sequence from a reference edge +- **Positive** (p): A CSI sequence from another edge in the same state +- **Negative** (n): A CSI sequence from an edge in a different state + +### 6.3 State Labels from Coherence Statistics + +Edge states are labeled automatically from coherence time series, without +manual annotation: + +``` +classify_edge_state(coherence_series: &[f32]) -> EdgeState: + mean_c = mean(coherence_series) + std_c = std(coherence_series) + trend = linear_regression_slope(coherence_series) + periodicity = dominant_frequency_power(coherence_series) + + if mean_c > 0.8 and std_c < 0.05: + return Stable + if mean_c < 0.2 and std_c < 0.05: + return Blocked + if |trend| > 0.1 and std_c < 0.15: + return Transitioning(sign(trend)) + if periodicity > 0.5: + return Oscillating(dominant_frequency) + return Unstable +``` + +These automatic labels are noisy but sufficient for triplet training, +especially with online hard example mining. + +### 6.4 Online Hard Example Mining (OHEM) + +Standard triplet training with random sampling is inefficient because +most triplets satisfy the margin constraint trivially. OHEM selects the +hardest triplets -- those where the positive is far and the negative +is close -- to focus learning on the decision boundary. + +**OHEM for edge classification:** + +For each anchor, we maintain a priority queue of candidates scored by: + + hardness(a, p, n) = ||f(a) - f(p)||^2 - ||f(a) - f(n)||^2 + +The hardest valid triplets (where hardness is negative -- the triangle +inequality is violated) provide the most gradient signal. + +**Semi-hard mining**: In practice, the hardest triplets can be outliers +or label noise. Semi-hard mining selects triplets where: + + ||f(a) - f(p)||^2 < ||f(a) - f(n)||^2 < ||f(a) - f(p)||^2 + margin + +These triplets violate the margin but not the ordering, providing +stable gradients. + +### 6.5 Multi-State Triplet Architecture + +``` +CSI Window [T=20 frames, single link] + | + v +1D-CNN (3 layers, channels=[32, 64, 128]) + | + v +Bidirectional GRU (hidden=64, 2 layers) + | + v +Attention-weighted temporal pooling + | + v +FC -> 64-dim embedding -> L2 normalize + | + +---> Triplet loss (embedding space clustering) + | + +---> Classification head (5-class softmax, auxiliary loss) +``` + +The auxiliary classification head provides additional supervision and +enables direct state prediction at inference time. The triplet embedding +enables nearest-neighbor classification for novel states not seen during +training. + +### 6.6 Edge Classification for Minimum Cut Weighting + +Once edges are classified, their weights in the RF graph are assigned +according to their state: + +```rust +fn edge_weight(state: EdgeState, coherence: f32) -> f32 { + match state { + EdgeState::Stable => coherence * 1.0, // Full weight + EdgeState::Blocked => 0.01, // Near-zero (easy to cut) + EdgeState::Unstable => coherence * 0.3, // Reduced weight + EdgeState::Transitioning(dir) => { + // Weight decreases as transition progresses + coherence * (1.0 - transition_progress(dir)) + } + EdgeState::Oscillating(freq) => { + // Use mean coherence, damped by oscillation amplitude + coherence * (1.0 - oscillation_amplitude(freq)) + } + } +} +``` + +This learned weighting replaces the heuristic weighting currently used +in `ruvector-mincut`, providing more nuanced graph partitioning that +adapts to the temporal dynamics of each link. + +### 6.7 Temporal State Transitions + +Edge states form a Markov chain with transition probabilities that encode +physical constraints: + +``` + Stable <---> Transitioning <---> Unstable + | | | + v v v + Blocked Oscillating Blocked +``` + +Impossible transitions (e.g., Stable -> Blocked without passing through +Transitioning) indicate sensor malfunction or adversarial interference. +The `adversarial.rs` module can use these transition constraints as an +additional consistency check. + +--- + +## 7. Cross-Environment Transfer via Contrastive Alignment + +### 7.1 The Domain Gap Problem + +A model trained on CSI from one room performs poorly in a different room +because the RF transfer function changes completely. Wall materials, +room dimensions, furniture layout, and multipath structure all differ. +This domain gap is the primary obstacle to deploying WiFi sensing at +scale. + +ADR-027 introduced MERIDIAN (Multi-Environment Representation for +Invariant Domain Adaptation in Networks) as a framework for cross- +environment generalization. Contrastive alignment is the core mechanism +by which MERIDIAN achieves domain invariance. + +### 7.2 Contrastive Domain Alignment + +The key idea is to learn embeddings that are invariant to environment- +specific features while preserving task-relevant features. Given CSI +from source environment S and target environment T: + + L_align = L_task(S) + lambda * L_domain(S, T) + +where L_task is the supervised task loss (e.g., boundary detection) on +labeled source data, and L_domain is a contrastive alignment loss that +pulls corresponding states from S and T together: + + L_domain = -sum_{(s,t) in Pairs} log( + exp(sim(z_s, z_t) / tau) / + sum_{t' in T} exp(sim(z_s, z_t') / tau) + ) + +**Pair construction for cross-environment alignment:** + +Pairs (s, t) are formed by matching *activity states* across environments: + +| State | Source Example | Target Example | Pairing Criterion | +|---|---|---|---| +| Empty room | Calibration CSI from S | Calibration CSI from T | Temporal (both during setup) | +| Single occupant center | Person standing in center of S | Person standing in center of T | Activity label | +| Two occupants | Two people in S | Two people in T | Occupancy count | +| Walking trajectory | Person walking in S | Person walking in T | Activity label | + +### 7.3 Environment-Invariant and Environment-Specific Features + +Not all CSI features should be aligned across environments. We decompose +the representation into invariant and specific components: + +``` +CSI Frame -> Shared Encoder -> z_shared + | + +---> Invariant Projector -> z_inv (aligned across environments) + | + +---> Specific Projector -> z_spec (environment-specific) +``` + +**Invariant features** (aligned via contrastive loss): +- Number of people present +- Activity type (sitting, walking, standing) +- Relative spatial arrangement of occupants +- Boundary topology (number and arrangement of zones) + +**Specific features** (preserved per environment): +- Absolute CSI amplitude (depends on path loss) +- Absolute phase (depends on clock offset and geometry) +- Multipath delay profile (depends on room dimensions) +- Frequency selectivity (depends on scatterer distribution) + +The invariant projector is trained with L_domain to align across +environments. The specific projector is trained with a reconstruction +loss to preserve environment-specific information needed for fine-tuning. + +### 7.4 Few-Shot Adaptation Protocol + +When deploying to a new environment, the system performs few-shot +adaptation using the pre-trained invariant representations: + +**Step 1: Zero-shot baseline** (0 labels) +- Use invariant embeddings directly with frozen encoder +- Cluster embeddings for boundary detection +- Expected performance: 50-60% of fully supervised + +**Step 2: Calibration adaptation** (0 labels, 5 minutes) +- Record empty room CSI in new environment +- Align new environment's empty-room embeddings to the invariant space +- Expected performance: 65-75% of fully supervised + +**Step 3: Few-shot fine-tuning** (5-10 labels, 10 minutes) +- Record a few labeled examples (e.g., "person in kitchen", + "person in bedroom") +- Fine-tune the specific projector and task head +- Expected performance: 85-95% of fully supervised + +### 7.5 MERIDIAN Contrastive Components + +The MERIDIAN framework (ADR-027) defines four contrastive components: + +1. **Environment Fingerprinting** (connects to `cross_room.rs`): + Contrastive embedding of environment identity. Each environment + maps to a unique region of embedding space. This enables the system + to recognize when it has returned to a previously visited environment + and recall the associated calibration. + +2. **Activity Alignment**: Contrastive loss ensuring that the same + activity (walking, sitting) maps to similar embeddings regardless + of environment. This is the core transfer mechanism. + +3. **Topological Alignment**: Contrastive loss ensuring that similar + boundary structures (one room with one doorway) map to similar + embeddings regardless of room dimensions or materials. + +4. **Temporal Alignment**: Contrastive loss ensuring that temporal + patterns (someone entering a room) are recognized regardless of + the room's RF characteristics. + +### 7.6 Negative Transfer Prevention + +Naive cross-environment alignment can cause *negative transfer*: forcing +alignment between environments that are too different (e.g., a small +bathroom vs. a warehouse) degrades performance on both. We prevent +negative transfer through: + +1. **Environment similarity gating**: Compute environment similarity + from calibration CSI statistics. Only align environments with + similarity > 0.4 (on a 0-1 scale based on room size, link count, + and multipath richness). + +2. **Adaptive alignment strength**: The alignment loss weight lambda + is modulated by a learned similarity function: + + lambda_eff = lambda * sigmoid(sim(env_s, env_t) - threshold) + + This softly disables alignment for dissimilar environments. + +3. **Per-feature alignment selection**: Not all invariant features + transfer equally well. We learn a feature-wise alignment mask that + selects which dimensions of z_inv to align for each environment pair. + +### 7.7 Continual Learning Across Environments + +As the system is deployed in more environments, it accumulates a library +of environment-specific models and a shared invariant encoder. The +invariant encoder improves with each new environment through continual +contrastive alignment: + +``` +Environment 1 (Home): z_spec_1, z_inv (v1) + | + v Align +Environment 2 (Office): z_spec_2, z_inv (v2, improved) + | + v Align +Environment 3 (Hospital): z_spec_3, z_inv (v3, further improved) + | + v ... +Environment N: z_spec_N, z_inv (vN, converged) +``` + +To prevent catastrophic forgetting, we use Elastic Weight Consolidation +(EWC) to protect the invariant encoder weights that are important for +previous environments while allowing adaptation to new ones: + + L_total = L_task + lambda_align * L_domain + lambda_ewc * sum_i F_i * (theta_i - theta_i*)^2 + +where F_i is the Fisher information of parameter theta_i estimated from +previous environments, and theta_i* is the parameter value after training +on the previous environment. + +### 7.8 Deployment Architecture for Cross-Environment Transfer + +``` +Cloud: + Invariant Encoder (shared, periodically updated) + Environment Library (z_spec per environment) + Continual learning pipeline + +Edge (ESP32 mesh): + Quantized encoder (INT8, < 500KB) + Local z_spec for current environment + Few-shot adaptation on-device + Upload CSI statistics for cloud-side continual learning +``` + +The quantized encoder runs on ESP32-S3 (with 512KB SRAM and vector +extensions) using the `wifi-densepose-nn` crate's Candle backend for +on-device inference. The `wifi-densepose-wasm` crate provides a browser- +based version for visualization and debugging. + +--- + +## 8. Integration Roadmap + +### 8.1 Phase 1: Foundation (Weeks 1-4) + +| Task | Crate | Module | Dependencies | +|---|---|---|---| +| Implement CSI augmentation library | wifi-densepose-train | pretrain/augmentations.rs | core | +| Implement SimCLR contrastive loss | wifi-densepose-train | pretrain/contrastive.rs | core, nn | +| Implement delta change detector | wifi-densepose-signal | ruvsense/delta.rs | coherence.rs | +| Add embedding cache | wifi-densepose-signal | ruvsense/embed_cache.rs | coherence_gate.rs | +| Unit tests for augmentations | wifi-densepose-train | tests/ | -- | + +### 8.2 Phase 2: AETHER-Topo (Weeks 5-8) + +| Task | Crate | Module | Dependencies | +|---|---|---|---| +| Extend AETHER embedding to 256-dim | wifi-densepose-signal | ruvsense/pose_tracker.rs | ADR-024 | +| Implement topological contrastive loss | wifi-densepose-train | pretrain/topo_loss.rs | contrastive.rs | +| Implement boundary sharpness metric | wifi-densepose-signal | ruvsense/coherence.rs | field_model.rs | +| Multi-scale boundary detection | wifi-densepose-signal | ruvsense/boundary.rs | coherence.rs | +| Integration tests: AETHER-Topo + min-cut | wifi-densepose-ruvector | tests/ | ruvector-mincut | + +### 8.3 Phase 3: Triplet Edge Classification (Weeks 9-12) + +| Task | Crate | Module | Dependencies | +|---|---|---|---| +| Implement triplet loss with OHEM | wifi-densepose-train | pretrain/triplet.rs | contrastive.rs | +| Edge state classifier | wifi-densepose-signal | ruvsense/edge_classify.rs | coherence.rs | +| Learned min-cut weighting | wifi-densepose-ruvector | src/metrics.rs | edge_classify.rs | +| Temporal state transition validator | wifi-densepose-signal | ruvsense/adversarial.rs | edge_classify.rs | +| End-to-end tests: triplet + min-cut | wifi-densepose-ruvector | tests/ | -- | + +### 8.4 Phase 4: Cross-Environment Transfer (Weeks 13-16) + +| Task | Crate | Module | Dependencies | +|---|---|---|---| +| Domain alignment contrastive loss | wifi-densepose-train | pretrain/domain_align.rs | contrastive.rs | +| Environment fingerprinting | wifi-densepose-signal | ruvsense/cross_room.rs | ADR-027 | +| Few-shot adaptation pipeline | wifi-densepose-train | pretrain/few_shot.rs | domain_align.rs | +| EWC continual learning | wifi-densepose-train | pretrain/ewc.rs | -- | +| Quantized encoder for ESP32-S3 | wifi-densepose-nn | src/quantize.rs | Candle backend | + +### 8.5 ADR Dependencies + +| This Work | Depends On | Enables | +|---|---|---| +| Contrastive pre-training | ADR-024 (AETHER) | Improved re-ID accuracy | +| AETHER-Topo | ADR-024, ADR-029 (RuvSense) | Learned boundary detection | +| Coherence boundary detection | ADR-014 (SOTA signal) | Self-supervised sensing | +| Cross-environment transfer | ADR-027 (MERIDIAN) | Scalable deployment | +| Delta-driven updates | ADR-029 (RuvSense) | Compute efficiency | +| Triplet edge classification | ADR-016 (RuVector pipeline) | Learned graph weighting | + +### 8.6 New ADR Proposal + +This research motivates a new Architecture Decision Record: + +**ADR-044: Contrastive Learning for RF Coherence Detection** + +- **Status**: Proposed +- **Context**: Current boundary detection relies on handcrafted coherence + thresholds and spectral methods. Contrastive learning can replace these + with learned representations that generalize across environments. +- **Decision**: Adopt contrastive self-supervised pre-training for CSI + encoders. Extend AETHER to AETHER-Topo for topological embeddings. + Implement delta-driven updates for compute efficiency. Use triplet + networks for edge classification. Integrate MERIDIAN contrastive + alignment for cross-environment transfer. +- **Consequences**: Requires pre-training infrastructure (GPU for initial + training, ESP32-S3 for inference). Adds ~200KB model size per + environment. Reduces labeling effort by 80-90%. Enables zero-shot + boundary detection. + +--- + +## 9. References + +### Contrastive Learning Foundations + +1. Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. (2020). "A Simple + Framework for Contrastive Learning of Visual Representations" (SimCLR). + ICML 2020. + +2. He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. (2020). "Momentum + Contrast for Unsupervised Visual Representation Learning" (MoCo). + CVPR 2020. + +3. Grill, J.-B., Strub, F., Altche, F., et al. (2020). "Bootstrap Your + Own Latent: A New Approach to Self-Supervised Learning" (BYOL). + NeurIPS 2020. + +4. Schroff, F., Kalenichenko, D., and Philbin, J. (2015). "FaceNet: A + Unified Embedding for Face Recognition and Clustering". CVPR 2015. + +5. Oord, A. van den, Li, Y., and Vinyals, O. (2018). "Representation + Learning with Contrastive Predictive Coding" (CPC). arXiv:1807.03748. + +### WiFi Sensing + +6. Ma, Y., Zhou, G., and Wang, S. (2019). "WiFi Sensing with Channel + State Information: A Survey". ACM Computing Surveys, 52(3). + +7. Wang, F., Gong, W., and Liu, J. (2019). "On Spatial Diversity in + WiFi-Based Human Activity Recognition". ACM IMWUT, 3(3). + +8. Yang, Z., Zhou, Z., and Liu, Y. (2013). "From RSSI to CSI: Indoor + Localization via Channel Response". ACM Computing Surveys, 46(2). + +9. Halperin, D., Hu, W., Sheth, A., and Wetherall, D. (2011). "Tool + Release: Gathering 802.11n Traces with Channel State Information". + ACM SIGCOMM CCR, 41(1). + +### Domain Adaptation and Transfer Learning + +10. Ganin, Y. and Lempitsky, V. (2015). "Unsupervised Domain Adaptation + by Backpropagation". ICML 2015. + +11. Long, M., Cao, Y., Wang, J., and Jordan, M. (2015). "Learning + Transferable Features with Deep Adaptation Networks". ICML 2015. + +12. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., et al. (2017). + "Overcoming Catastrophic Forgetting in Neural Networks" (EWC). + PNAS, 114(13). + +### Graph Methods + +13. Stoer, M. and Wagner, F. (1997). "A Simple Min-Cut Algorithm". + Journal of the ACM, 44(4). + +14. Von Luxburg, U. (2007). "A Tutorial on Spectral Clustering". + Statistics and Computing, 17(4). + +15. Kipf, T. N. and Welling, M. (2017). "Semi-Supervised Classification + with Graph Convolutional Networks". ICLR 2017. + +### Project-Internal References + +16. ADR-024: Contrastive CSI Embedding / AETHER. wifi-densepose docs. +17. ADR-027: Cross-Environment Domain Generalization / MERIDIAN. + wifi-densepose docs. +18. ADR-029: RuvSense Multistatic Sensing Mode. wifi-densepose docs. +19. ADR-014: SOTA Signal Processing. wifi-densepose docs. +20. ADR-016: RuVector Training Pipeline Integration. wifi-densepose docs. + +--- + +*Document prepared for the RuView/wifi-densepose project. This research +informs the design of contrastive learning pipelines for RF field coherence +detection within the ESP32 mesh sensing architecture.* \ No newline at end of file