# SOTA WiFi Sensing for Edge Pose Estimation (2024-2026 Update)

**Date:** 2026-04-02
**Focus:** New architectures, lightweight models, edge deployment, ESP32+Pi Zero inference
**Complements:** `wifi-sensing-ruvector-sota-2026.md` (February 2026 survey)

---

## 1. New Architectures Since Last Survey

### 1.1 WiFlow: Lightweight Continuous Pose Estimation (February 2026)

**Paper:** WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling ([arXiv:2602.08661](https://arxiv.org/html/2602.08661))

WiFlow is the most directly relevant architecture for our ESP32 + Pi Zero deployment target.

#### Architecture

Three-stage encoder-decoder with spatio-temporal decoupling:

**Stage 1: Temporal Encoder (TCN)**
- Dilated causal convolution with exponentially growing dilation factors (1, 2, 4, 8)
- Input: 540x20 tensor (18 antenna links x 30 subcarriers = 540 features, 20 time steps)
- Progressive channel compression: 540 -> 440 -> 340 -> 240
- Preserves temporal causality while achieving full receptive field coverage
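
Whether the dilated stack really spans the whole 20-step window depends on the kernel size, which this survey does not record. A minimal sketch, assuming one causal convolution per dilation level and a hypothetical kernel size `k`, computes the receptive field as 1 + (k - 1) * sum(dilations):

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field (in time steps) of stacked dilated causal convolutions.

    Assumes one convolution per dilation level. The kernel size is an
    assumption, since the survey does not state it.
    """
    return 1 + (kernel_size - 1) * sum(dilations)


DILATIONS = (1, 2, 4, 8)   # dilation factors listed above
WINDOW = 20                # time steps in the 540x20 input

for k in (2, 3):
    rf = tcn_receptive_field(k, DILATIONS)
    print(f"kernel={k}: receptive field={rf}, covers window={rf >= WINDOW}")
```

Under these assumptions a kernel size of 3 gives a 31-step receptive field and covers the window, while a kernel size of 2 stops at 16 steps.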

**Stage 2: Spatial Encoder (Asymmetric Convolution)**
- 1xk kernels operating only in the subcarrier dimension
- 4 residual blocks: 8 -> 16 -> 32 -> 64 channels
- Subcarrier compression: 240 -> 120 -> 60 -> 30 -> 15
- Stride (1,2) downsampling -- no pooling layers

**Stage 3: Axial Self-Attention**
- Two-stage axial attention reduces complexity from O(H^2 W^2) to O(H^2 W + HW^2)
- Stage one: width direction (temporal axis), 8 groups
- Stage two: height direction (keypoint axis)
- Input reshaped to (B x K) x C x T for first stage
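
To make the complexity reduction concrete, the sketch below counts pairwise attention interactions for full 2D attention and for the two-pass axial scheme. The sizes are illustrative assumptions (H = 15 rows, W = 20 time steps), not values confirmed by the paper:

```python
def full_attention_cost(h, w):
    """Pairwise interactions for self-attention over all h*w positions."""
    return (h * w) ** 2


def axial_attention_cost(h, w):
    """Pairwise interactions when attending along width, then along height."""
    width_pass = h * w * w    # each of the h rows attends over w positions
    height_pass = w * h * h   # each of the w columns attends over h positions
    return width_pass + height_pass


# Illustrative sizes only: 15 rows (assumed) and 20 time steps.
H, W = 15, 20
full = full_attention_cost(H, W)
axial = axial_attention_cost(H, W)
print(full, axial, round(full / axial, 2))
```

For these sizes the count drops from 90,000 to 10,500 interactions, roughly an 8.6x reduction. The gap widens as either axis grows.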

**Decoder:**
- Adaptive average pooling instead of fully connected layers
- Direct coordinate regression to 2D keypoint positions

#### Key Metrics

| Metric | WiFlow | WPformer | WiSPPN |
|--------|--------|----------|--------|
| Parameters | **4.82M** | 10.04M | 121.5M |
| FLOPs | **0.47B** | 35.00B | 338.45B |
| PCK@20 (random split) | **97.00%** | 70.02% | 85.87% |
| MPJPE (random split) | **0.008m** | 0.028m | 0.016m |
| PCK@20 (cross-subject) | **86.89%** | -- | -- |
| Training time (5-fold) | **18.17h** | 137.5h | -- |

**Critical observations for our project:**
- 4.82M parameters at INT8 quantization = ~4.8 MB model size -- fits in Pi Zero 2 W RAM (512 MB)
- 0.47B FLOPs suggests ~50ms inference on Cortex-A53 with NEON SIMD (estimated)
- Only uses amplitude, discards phase (phase is "heavily corrupted by CFO and SFO in commercial WiFi devices")
- ESP32-S3 CSI has similar CFO/SFO issues, so amplitude-only approach is pragmatic
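
The size and latency figures above are back-of-envelope numbers, and the arithmetic behind them can be sketched as follows. The sustained throughput values are hypothetical, and the sketch treats the reported FLOP count as the operation count regardless of numeric precision, which is a simplification:

```python
def int8_model_size_mb(params):
    """Weight storage at one byte per parameter, in megabytes (10**6 bytes)."""
    return params / 1e6


def latency_ms(flops, sustained_gflops):
    """Latency if the CPU sustains the given throughput (GFLOP/s)."""
    return flops / (sustained_gflops * 1e9) * 1e3


WIFLOW_PARAMS = 4.82e6
WIFLOW_FLOPS = 0.47e9

print(f"INT8 weights: {int8_model_size_mb(WIFLOW_PARAMS):.2f} MB")

# Throughput that the ~50 ms estimate implicitly requires.
required_gflops = WIFLOW_FLOPS / 0.050 / 1e9
print(f"required sustained throughput: {required_gflops:.1f} GFLOP/s")

# Hypothetical sustained rates, to show how sensitive the estimate is.
for gflops in (2.0, 5.0, 10.0):
    print(f"{gflops:>4} GFLOP/s -> {latency_ms(WIFLOW_FLOPS, gflops):.0f} ms")
```

The ~50 ms figure corresponds to about 9.4 GFLOP/s sustained. If the Cortex-A53 sustains less than that on this model, latency scales up proportionally, so the estimate should be validated on hardware.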

**Loss function:**
```
L = L_H + lambda * L_B
L_H = SmoothL1(predicted_keypoints, ground_truth, beta=0.1)
L_B = sum of bone length constraint violations across 14 bone connections
lambda = 0.2
```

The bone constraint loss is particularly important for edge deployment where noisy predictions need physical plausibility enforcement.
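
The loss above can be sketched in plain Python. The exact form of `L_B` is not given in this survey, so the version below assumes it is the summed absolute difference between predicted and ground-truth bone lengths. The helper names and the toy skeleton are illustrative only:

```python
import math


def smooth_l1(pred, target, beta=0.1):
    """Mean SmoothL1 over all coordinates of all keypoints."""
    total, count = 0.0, 0
    for p, t in zip(pred, target):
        for a, b in zip(p, t):
            d = abs(a - b)
            total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
            count += 1
    return total / count


def bone_length(points, bone):
    """Euclidean length of the bone joining keypoints bone[0] and bone[1]."""
    i, j = bone
    return math.dist(points[i], points[j])


def bone_loss(pred, target, bones):
    """Assumed form of L_B: summed absolute bone-length error."""
    return sum(abs(bone_length(pred, b) - bone_length(target, b)) for b in bones)


def total_loss(pred, target, bones, lam=0.2, beta=0.1):
    """L = L_H + lambda * L_B with the hyperparameters listed above."""
    return smooth_l1(pred, target, beta) + lam * bone_loss(pred, target, bones)


# Toy 3-keypoint chain with 2 bones (the paper uses 14 bone connections).
target = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)]
pred = [(0.0, 0.0), (0.0, 1.0), (0.0, 2.5)]
bones = [(0, 1), (1, 2)]
print(total_loss(pred, target, bones))
```

In the toy case one keypoint is displaced along its bone, so both terms fire: the coordinate term penalizes the position error and the bone term penalizes the stretched limb.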

#### Adaptation for ESP32 + Pi Zero

WiFlow's architecture maps well to our hardware:
- TCN runs on ESP32 (temporal feature extraction from raw CSI stream)
- Asymmetric conv + axial attention runs on Pi Zero (spatial encoding + pose regression)
- The 540x20 input assumes an Intel 5300 NIC (18 links x 30 subcarriers); for ESP32-S3 with 1 TX x 1 RX and 52 subcarriers, the input is a 52x20 tensor (1,040 values versus 10,800), roughly a tenth of the size

### 1.2 MultiFormer: Multi-Person WiFi Pose (May 2025)

**Paper:** MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism ([arXiv:2505.22555](https://arxiv.org/html/2505.22555v1))

#### Architecture

Teacher-student framework with OpenPose teacher providing ground truth labels.

**Time-Frequency Dual-Dimensional Tokenization (TFDDT):**
- Input: CSI matrix from 1 TX, 3 RX, 30 subcarriers
- Upsampled via zero-insertion + low-pass filtering to 64x3x64
- Two parallel token streams:
  - Frequency tokens F_j: N_S tokens of length M x N_R (subcarrier-centric view)
  - Temporal tokens T_i: M tokens of length N_S x N_R (time-centric view)
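
The two token streams can be illustrated with a small sketch. It assumes the 64x3x64 block means M = 64 time steps, N_R = 3 receivers, and N_S = 64 subcarriers. The flattening order inside each token is also an assumption, since the survey does not specify it:

```python
def make_tokens(csi):
    """Split a CSI block into frequency and temporal token streams.

    csi[t][s][r] is the amplitude at time step t, subcarrier s, receiver r,
    so the block has shape M x N_S x N_R.
    """
    m, n_s, n_r = len(csi), len(csi[0]), len(csi[0][0])
    # One token per subcarrier: all time steps and receivers for that subcarrier.
    freq_tokens = [
        [csi[t][s][r] for t in range(m) for r in range(n_r)]
        for s in range(n_s)
    ]
    # One token per time step: all subcarriers and receivers at that instant.
    temporal_tokens = [
        [csi[t][s][r] for s in range(n_s) for r in range(n_r)]
        for t in range(m)
    ]
    return freq_tokens, temporal_tokens


# Dummy block with the assumed dimensions (values encode their own indices).
M, N_S, N_R = 64, 64, 3
csi = [[[float(t * 1000 + s * 10 + r) for r in range(N_R)]
        for s in range(N_S)] for t in range(M)]
freq, temp = make_tokens(csi)
print(len(freq), len(freq[0]))
print(len(temp), len(temp[0]))
```

With these dimensions both streams contain 64 tokens of length 192, but each token groups the same measurements along a different axis, which is what lets the two encoder branches specialize.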

**Dual Transformer Encoder:**
- 8 layers per branch (frequency and temporal)
- Multi-head self-attention: MSA(X) = (1/H) * sum(Softmax(QK^T / sqrt(d_k)) V)
- Each branch followed by FFN with ReLU, dropout, residual connections

**Multi-Stage Pose Estimation:**
- Part Confidence Maps (PCM): 19x36x36 heatmaps (18 keypoints + average)
- Part Affinity Fields (PAF): 38x36x36 directional fields for 19 limb connections
- Pose-Attentive Perception Module (PAPM): channel + spatial attention on PCM/PAF
- Multi-person assignment via Hungarian algorithm on PAF integrals

#### Model Variants

| Variant | Encoder Layers | Input | Parameters |
|---------|---------------|-------|------------|
| MultiFormer | 8 | 64x1296 | 11.93M |
| MultiFormer-24 | 8 | 64x576 | 4.05M |
| MultiFormer-18 | 6 | 64x324 | **2.80M** |

**Key result on MM-Fi dataset:** MultiFormer achieves PCK@20 of 0.7225, outperforming CSI2Pose (0.6841). The compact MultiFormer-18 at 2.80M parameters is edge-deployable.

#### Relevance to Our Project

MultiFormer's dual-token approach is valuable because:
1. It explicitly separates temporal and frequency information (like WiFlow's decoupling)
2. The PAF-based multi-person assignment using Hungarian algorithm can run on Pi Zero
3. The 2.80M parameter variant (MultiFormer-18) at INT8 = ~2.8 MB, well within Pi Zero constraints

### 1.3 Person-in-WiFi 3D (CVPR 2024)

**Paper:** Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi (CVPR 2024)

First multi-person 3D WiFi pose estimation.

**Key results:**
- Single person MPJPE: 91.7mm
- Two persons: 108.1mm
- Three persons: 125.3mm
- Dataset: 97K frames, 4m x 3.5m area, 7 volunteers
- Transformer-based end-to-end architecture

**Relevance:** Establishes the accuracy ceiling for WiFi 3D pose. Our ESP32+Pi system should target comparable single-person performance (sub-100mm MPJPE) as a milestone.
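
To track the sub-100mm milestone, MPJPE can be computed as the mean Euclidean distance between predicted and ground-truth joints. The sketch below is a plain version without root alignment or Procrustes alignment. Whether the cited paper applies either is not stated in this survey, and the joint coordinates are toy values:

```python
import math


def mpjpe_mm(pred, target):
    """Mean per-joint position error in the units of the inputs (mm here)."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of joints")
    return sum(math.dist(p, t) for p, t in zip(pred, target)) / len(pred)


# Toy 3-joint example in millimetres (not real data).
target = [(0.0, 0.0, 0.0), (0.0, 500.0, 0.0), (0.0, 1000.0, 0.0)]
pred = [(30.0, 40.0, 0.0), (0.0, 500.0, 100.0), (0.0, 1000.0, 0.0)]

error = mpjpe_mm(pred, target)
print(f"MPJPE = {error:.1f} mm, meets sub-100 mm target: {error < 100.0}")
```

The per-joint errors in the toy case are 50 mm, 100 mm, and 0 mm, so the mean is 50 mm.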

### 1.4 Spatio-Temporal 3D Point Clouds from WiFi-CSI (October 2024)

**Paper:** [arXiv:2410.16303](https://arxiv.org/html/2410.16303v1)

Novel approach: generates 3D point clouds from WiFi CSI data using transformer networks.

**Key innovation:** Positional encoding with learned embeddings for antennas and subcarriers, followed by multi-head attention over antenna-subcarrier pairs. This captures both spatial (antenna geometry) and spectral (subcarrier frequency response) dependencies.

**Relevance:** Point cloud output is a richer representation than keypoints alone, enabling:
- Silhouette estimation for activity recognition
- Body volume estimation for person identification
- Occlusion reasoning when fused with multiple viewpoints

### 1.5 Graph-Based 3D Human Pose from WiFi (November 2025)

**Paper:** Graph-based 3D Human Pose Estimation using WiFi Signals ([arXiv:2511.19105](https://arxiv.org/html/2511.19105))

Uses graph neural networks where nodes represent keypoints and edges represent skeletal connections. CSI features are injected as node/edge attributes.

**Relevance:** Graph structure naturally maps to our RuvSense pose_tracker which already maintains a 17-keypoint skeleton with Kalman filtering. Adding graph-based message passing between keypoints could improve joint prediction coherence.

## 2. Edge Deployment Landscape

### 2.1 CSI-Sense-Zero: ESP32 + Pi Zero Reference Implementation

**Repository:** [github.com/winwinashwin/CSI-Sense-Zero](https://github.com/winwinashwin/CSI-Sense-Zero)

The most directly relevant prior art for our hardware target.

**Architecture:**
- Two ESP32-WROOM-32: one TX, one RX (captures CSI)
- Pi Zero: inference node
- Communication: USB serial at 921,600 baud
- Buffer: 235KB FIFO at `/tmp/csififo` (~256 CSI records)
- Inference rate: 2 Hz (configurable)
- WebSocket output for real-time visualization

**Data flow:**
```
ESP32 TX -> WiFi signal -> ESP32 RX -> Serial (921.6 kbaud) -> Pi Zero FIFO -> Model -> WebSocket
```
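
The serial link and FIFO figures above imply a rough upper bound on CSI record throughput. The sketch below works through that arithmetic under two assumptions that the repository description does not confirm: 8N1 framing (10 bits on the wire per payload byte) and "235KB" meaning 235,000 bytes:

```python
def serial_bytes_per_second(baud, bits_per_byte=10):
    """Payload rate of a UART link; 10 bits/byte assumes 8N1 framing."""
    return baud / bits_per_byte


def record_size_bytes(fifo_bytes, records):
    """Average record size implied by the FIFO capacity."""
    return fifo_bytes / records


BAUD = 921_600
FIFO_BYTES = 235 * 1000   # assumes "235KB" means 235,000 bytes
RECORDS = 256

rate = serial_bytes_per_second(BAUD)
size = record_size_bytes(FIFO_BYTES, RECORDS)
print(f"link payload rate: {rate:.0f} B/s")
print(f"average record size: {size:.0f} B")
print(f"max records per second: {rate / size:.0f}")
print(f"FIFO holds about {RECORDS / (rate / size):.1f} s of CSI at that rate")
```

Under these assumptions the link tops out near 100 records per second and the FIFO buffers roughly 2.5 seconds of data, which leaves little headroom for adding nodes on a single serial port.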

**Limitations:**
- Original Pi Zero (single-core ARM11) -- very slow inference
- Activity recognition only (not pose estimation)
- Python inference (not optimized for ARM)

**What we improve:**
- Pi Zero 2 W has quad-core Cortex-A53 -- roughly 5-10x faster than Pi Zero
- Rust inference (ONNX/Candle) vs Python -- 3-10x faster
- ESP32-S3 vs ESP32-WROOM-32 -- better CSI quality, more subcarriers
- Pose estimation instead of just activity classification
- UDP transport instead of USB serial -- supports multi-node mesh

### 2.2 OnnxStream: Lightweight ONNX on Pi Zero 2 W

**Repository:** [github.com/vitoplantamura/OnnxStream](https://github.com/vitoplantamura/OnnxStream)

Runs Stable Diffusion XL on Pi Zero 2 W in 298 MB RAM. Key features:
- C++ implementation, XNNPACK acceleration
- ARM NEON SIMD optimization
- Memory-efficient streaming execution (processes one operator at a time)
- Supports INT8 quantization

**Benchmark estimates for our model sizes:**

| Model | Parameters | INT8 Size | Est. Pi Zero 2 Latency |
|-------|-----------|-----------|----------------------|
| MultiFormer-18 | 2.80M | ~2.8 MB | ~30-50ms |
| WiFlow | 4.82M | ~4.8 MB | ~50-80ms |
| MultiFormer | 11.93M | ~11.9 MB | ~120-200ms |
| DensePose-WiFi | ~25M (est.) | ~25 MB | ~300-500ms |

These estimates assume XNNPACK-accelerated INT8 inference on Cortex-A53 @ 1 GHz. They correspond to roughly 20-33 Hz for MultiFormer-18 and 12.5-20 Hz for WiFlow, so MultiFormer-18 meets our 20 Hz TDMA cycle target across its estimated range, while WiFlow meets it only at the optimistic end.
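
The latency estimates convert to inference rates as follows. This is a quick check against the 20 Hz target that assumes back-to-back inference with no other per-frame overhead:

```python
TARGET_HZ = 20.0

# (best-case ms, worst-case ms) from the estimate table above.
LATENCY_MS = {
    "MultiFormer-18": (30.0, 50.0),
    "WiFlow": (50.0, 80.0),
    "MultiFormer": (120.0, 200.0),
}


def rate_hz(latency_ms):
    """Back-to-back inference rate for a given per-frame latency."""
    return 1000.0 / latency_ms


for name, (best, worst) in LATENCY_MS.items():
    lo, hi = rate_hz(worst), rate_hz(best)
    if lo >= TARGET_HZ:
        verdict = "meets target across the range"
    elif hi >= TARGET_HZ:
        verdict = "meets target only at the best case"
    else:
        verdict = "misses target"
    print(f"{name}: {lo:.1f}-{hi:.1f} Hz, {verdict}")
```

Any preprocessing, transport, or tracking work on the Pi Zero 2 W reduces these rates further, so the margins should be read as upper bounds.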

### 2.3 ONNX Runtime on ARM

ONNX Runtime officially supports Raspberry Pi deployment with:
- NEON-optimized kernels in the default CPU execution provider
- INT8 quantization support
- Python and C++ APIs
- Model optimization tools (graph optimization, operator fusion)

For Rust integration, the `ort` crate (ONNX Runtime Rust bindings) supports cross-compilation to the `aarch64-unknown-linux-gnu` target.

### 2.4 EfficientFi: CSI Compression for Edge

**Paper:** EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression ([arXiv:2204.04138](https://arxiv.org/pdf/2204.04138))

Proposes compressing CSI data on the sensing device before transmission to the inference node. Key idea: train a CSI autoencoder where the encoder runs on the constrained device and the decoder runs on the more powerful inference node.

**Relevance:** For our ESP32 -> Pi Zero pipeline, CSI compression on ESP32 reduces:
- UDP packet size (lower bandwidth, less packet loss)
- Pi Zero preprocessing time (compressed features are more compact)
- Effective latency (less data to transmit per frame)

## 3. Comparative Analysis: Architecture Selection for ESP32 + Pi Zero

### 3.1 Decision Matrix

| Criterion | WiFlow | MultiFormer-18 | DensePose-WiFi | Graph-3D |
|-----------|--------|----------------|----------------|----------|
| Parameters | 4.82M | 2.80M | ~25M | ~8M (est.) |
| FLOPs | 0.47B | ~0.3B (est.) | ~5B (est.) | ~1B (est.) |
| Multi-person | No | Yes (PAF+Hungarian) | Yes (RCNN-based) | No |
| 3D output | No (2D) | No (2D) | No (UV map) | Yes (3D) |
| Amplitude-only | Yes | Yes | No (amp+phase) | Unknown |
| Edge-viable | Yes | Yes | No | Marginal |
| Open source | Not yet | Not yet | Limited | Not yet |

### 3.2 Recommended Architecture: Hybrid WiFlow + MultiFormer

For the ESP32 + Pi Zero deployment, we recommend a hybrid architecture:

1. **WiFlow's TCN temporal encoder** on ESP32 -- extract temporal features from raw CSI
2. **MultiFormer's dual-token approach** on Pi Zero -- process both frequency and temporal views
3. **WiFlow's bone constraint loss** during training -- enforce physical skeleton plausibility
4. **RuvSense coherence gating** before inference -- reject low-quality CSI frames

This hybrid achieves:
- ~3-5M parameters (between WiFlow and MultiFormer-18)
- Amplitude-only input (robust to ESP32 CFO/SFO)
- Sub-100ms inference on Pi Zero 2 W
- Optional multi-person support via PAF module

### 3.3 Training Data Strategy

Based on the surveyed papers:

| Dataset | Subjects | Frames | Hardware | Availability |
|---------|----------|--------|----------|--------------|
| CMU DensePose-WiFi | 8 | ~250K | Intel 5300 | Limited |
| Person-in-WiFi 3D | 7 | 97K | Custom WiFi | GitHub |
| MM-Fi | Multiple | Large | WiFi + mmWave | Public |
| Wi-Pose | Multiple | Large | Intel 5300 | Public |

**Our approach:**
1. Pre-train on the public MM-Fi and Wi-Pose datasets (Wi-Pose uses the Intel 5300 CSI format)
2. Apply domain adaptation for ESP32-S3 CSI format (different subcarrier count, CFO characteristics)
3. Fine-tune on self-collected ESP32-S3 data in target environments
4. Augment with synthetic CSI from ray-tracing forward model (Arena Physica insight)

## 4. Gap Analysis: Current wifi-densepose vs SOTA

### 4.1 What We Have

| Capability | Status | Module |
|-----------|--------|--------|
| ESP32 CSI capture | Production | `wifi-densepose-hardware` |
| Multi-node fusion | Production | `ruvsense/multistatic.rs` |
| Phase alignment | Production | `ruvsense/phase_align.rs` |
| Coherence gating | Production | `ruvsense/coherence_gate.rs` |
| 17-keypoint tracking | Production | `ruvsense/pose_tracker.rs` |
| ONNX inference engine | Production | `wifi-densepose-nn` |
| Modality translator | Production | `wifi-densepose-nn/translator.rs` |
| Training pipeline | Production | `wifi-densepose-train` |
| Subcarrier interpolation | Production | `wifi-densepose-train/subcarrier.rs` |

### 4.2 What We Are Missing

| Gap | Required For | Priority |
|-----|-------------|----------|
| **Pi Zero deployment target** | Edge inference node | Critical |
| **Lightweight model architecture** | Sub-100ms inference on Cortex-A53 | Critical |
| **Temporal causal convolution** | Real-time streaming inference | High |
| **Axial attention module** | Efficient spatial encoding | High |
| **Bone constraint loss** | Physical plausibility | High |
| **CSI compression on ESP32** | Bandwidth reduction | Medium |
| **INT8 quantization pipeline** | Model size reduction | Medium |
| **Cross-environment adaptation** | Deployment generalization | Medium |
| **Multi-person PAF decoding** | Multiple subject support | Low (Phase 2) |
| **3D pose lifting** | Z-axis estimation | Low (Phase 3) |
| **Diffusion-based pose refinement** | Uncertainty quantification | Research |

### 4.3 Architecture Gaps in Detail

**1. No lightweight inference path.** The current `wifi-densepose-nn` crate assumes GPU or high-end CPU inference. We need an `EdgeInferenceEngine` optimized for:
- INT8 ONNX models
- ARM NEON SIMD via XNNPACK
- Streaming inference (process CSI frames as they arrive, not in batches)
- Memory-mapped model loading (avoid loading entire model into RAM)

**2. No ESP32 -> Pi Zero communication protocol.** The `wifi-densepose-hardware` crate handles ESP32 CSI capture and UDP aggregation to a server, but has no lightweight protocol for ESP32 -> Pi Zero direct communication. We need:
- Compact binary frame format (not the full ADR-018 format)
- Optional CSI compression (autoencoder on ESP32 or simple PCA)
- Heartbeat and synchronization for multi-ESP32 setups
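
As a starting point for discussion, a compact binary frame could look like the sketch below. The layout, field names, and magic value are entirely hypothetical and are not taken from ADR-018 or from any existing crate. The sketch assumes amplitudes are already quantized to one unsigned byte per subcarrier:

```python
import struct

# Hypothetical layout, little-endian:
#   magic (2 bytes), node id (u8), subcarrier count (u8),
#   sequence number (u32), timestamp in microseconds (u32),
#   then one unsigned byte of quantized amplitude per subcarrier.
HEADER = struct.Struct("<2sBBII")
MAGIC = b"CS"


def pack_frame(node_id, seq, timestamp_us, amplitudes):
    """Serialize one CSI frame; amplitudes must already be in 0..255."""
    header = HEADER.pack(MAGIC, node_id, len(amplitudes), seq, timestamp_us)
    return header + bytes(amplitudes)


def unpack_frame(data):
    """Parse a frame produced by pack_frame, validating magic and length."""
    magic, node_id, count, seq, timestamp_us = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    payload = data[HEADER.size:HEADER.size + count]
    if len(payload) != count:
        raise ValueError("truncated frame")
    return node_id, seq, timestamp_us, list(payload)


frame = pack_frame(node_id=3, seq=42, timestamp_us=1_000_000, amplitudes=[10] * 52)
print(len(frame))
print(unpack_frame(frame)[:3])
```

With a 12-byte header and 52 subcarriers the frame is 64 bytes, so a 20 Hz stream from one node costs about 1.3 KB/s before UDP/IP overhead. The sequence number gives the receiver a cheap way to detect dropped packets.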

**3. No temporal convolution module.** The existing signal processing pipeline uses frame-by-frame processing. WiFlow and MultiFormer both show that temporal context (20 frames for WiFlow, 64 frames for MultiFormer) significantly improves accuracy. We need a ring buffer + TCN module in the inference path.
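
The ring buffer half of that requirement can be sketched as below. The class name is hypothetical, and the window length of 20 follows WiFlow's input. The TCN itself is out of scope here:

```python
from collections import deque


class CsiWindow:
    """Fixed-length sliding window over the most recent CSI frames."""

    def __init__(self, window=20):
        self.frames = deque(maxlen=window)   # oldest frame drops automatically

    def push(self, frame):
        """Add one frame; return the full window once enough frames arrived."""
        self.frames.append(list(frame))
        if len(self.frames) < self.frames.maxlen:
            return None                      # still warming up
        return list(self.frames)             # oldest first, newest last


# Feed 25 dummy frames of 52 amplitudes each (52 = ESP32-S3 subcarriers above).
window = CsiWindow(window=20)
ready = 0
for i in range(25):
    out = window.push([float(i)] * 52)
    if out is not None:
        ready += 1
print(ready, out[0][0], out[-1][0])
```

The window yields nothing for the first 19 frames and then produces one overlapping 20-frame block per incoming frame, which matches the streaming inference mode described in gap 1.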

**4. No bone/skeleton constraint enforcement at training time.** The `pose_tracker.rs` module applies Kalman filtering and skeleton constraints, but only as post-hoc corrections at inference time. WiFlow shows that baking bone constraints into the loss function during training produces better models that need less post-processing.

## 5. References

1. DensePose From WiFi, Geng et al., arXiv:2301.00250, 2023
2. Person-in-WiFi 3D, Yan et al., CVPR 2024
3. WiFlow, arXiv:2602.08661, 2026
4. MultiFormer, arXiv:2505.22555, 2025
5. CSI-Channel Spatial Decomposition, MDPI Electronics 14(4), 2025
6. CSI-Former, MDPI Entropy 25(1), 2023
7. Spatio-Temporal 3D Point Clouds from WiFi-CSI, arXiv:2410.16303, 2024
8. Graph-based 3D Human Pose from WiFi, arXiv:2511.19105, 2025
9. EfficientFi, arXiv:2204.04138, 2022
10. CSI-Sense-Zero, github.com/winwinashwin/CSI-Sense-Zero
11. OnnxStream, github.com/vitoplantamura/OnnxStream
12. Arena Physica, arenaphysica.com (Atlas RF Studio, Heaviside-0/Marconi-0)
13. Tools and Methods for WiFi Sensing in Embedded Devices, MDPI Sensors 25(19), 2025
14. Real-Time HAR using WiFi CSI and LSTM on Edge Devices, SASI-ITE 2025