SOTA WiFi Sensing for Edge Pose Estimation (2024-2026 Update)

Date: 2026-04-02
Focus: New architectures, lightweight models, edge deployment, ESP32+Pi Zero inference
Complements: wifi-sensing-ruvector-sota-2026.md (February 2026 survey)


1. New Architectures Since Last Survey

1.1 WiFlow: Lightweight Continuous Pose Estimation (February 2026)

Paper: WiFlow: A Lightweight WiFi-based Continuous Human Pose Estimation Network with Spatio-Temporal Feature Decoupling (arXiv:2602.08661)

WiFlow is the most directly relevant architecture for our ESP32 + Pi Zero deployment target.

Architecture

Three-stage encoder-decoder with spatio-temporal decoupling:

Stage 1: Temporal Encoder (TCN)

  • Dilated causal convolution with exponentially growing dilation factors (1, 2, 4, 8)
  • Input: 540x20 tensor (18 antenna links x 30 subcarriers = 540 features, 20 time steps)
  • Progressive channel compression: 540 -> 440 -> 340 -> 240
  • Preserves temporal causality while achieving full receptive field coverage
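The causal, dilated convolution behind Stage 1 can be sketched in a few lines. This is a minimal illustration, not WiFlow's implementation: the kernel size of 3 is an assumption (this survey does not state it), and the receptive-field formula assumes one convolution per dilation level.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """Causal 1-D convolution: output[t] uses only x[t], x[t-d], x[t-2d], ..."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t - j * dilation   # look back only, never forward
            if idx >= 0:             # samples before t=0 act as zero padding
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Time steps visible to one output, assuming one conv per dilation level."""
    return 1 + (kernel_size - 1) * sum(dilations)


DILATIONS = (1, 2, 4, 8)   # from the description above
KERNEL_SIZE = 3            # assumption: not stated in this survey

rf = receptive_field(KERNEL_SIZE, DILATIONS)
covers_window = rf >= 20   # does the stack see the whole 20-step window?
```

Under these assumptions a kernel size of 2 would see only 16 steps, so full coverage of the 20-step window needs a kernel of at least 3 or more than one convolution per level.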

Stage 2: Spatial Encoder (Asymmetric Convolution)

  • 1xk kernels operating only in the subcarrier dimension
  • 4 residual blocks: 8 -> 16 -> 32 -> 64 channels
  • Subcarrier compression: 240 -> 120 -> 60 -> 30 -> 15
  • Stride (1,2) downsampling -- no pooling layers

Stage 3: Axial Self-Attention

  • Two-stage axial attention reduces complexity from O(H^2 W^2) to O(H^2 W + HW^2)
  • Stage one: width direction (temporal axis), 8 groups
  • Stage two: height direction (keypoint axis)
  • Input reshaped to (B x K) x C x T for first stage
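The complexity reduction can be checked with simple arithmetic. The grid size below (H = 15, W = 20) is purely illustrative and is not taken from the paper.

```python
def full_attention_cost(h, w):
    """Pairwise interactions when every one of the h*w positions attends to all others."""
    return (h * w) ** 2


def axial_attention_cost(h, w):
    """Attention along one axis at a time: h*w*w for the width pass, h*h*w for the height pass."""
    return h * w * w + h * h * w


H, W = 15, 20   # illustrative grid size only (assumption)

full = full_attention_cost(H, W)
axial = axial_attention_cost(H, W)
savings = full / axial   # equals H*W / (H + W)
```

The saving grows with the grid, since the ratio simplifies to HW / (H + W).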

Decoder:

  • Adaptive average pooling instead of fully connected layers
  • Direct coordinate regression to 2D keypoint positions

Key Metrics

| Metric | WiFlow | WPformer | WiSPPN |
|---|---|---|---|
| Parameters | 4.82M | 10.04M | 121.5M |
| FLOPs | 0.47B | 35.00B | 338.45B |
| PCK@20 (random split) | 97.00% | 70.02% | 85.87% |
| MPJPE (random split) | 0.008m | 0.028m | 0.016m |
| PCK@20 (cross-subject) | 86.89% | -- | -- |
| Training time (5-fold) | 18.17h | 137.5h | -- |
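For readers unfamiliar with the two accuracy metrics, a minimal sketch of how they are commonly computed follows. The normalization length used for PCK differs between papers (torso length and bounding-box size are both used), so `ref_length` here is an assumption rather than the definition any surveyed paper uses.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def pck(pred, gt, ref_length, alpha=0.2):
    """Fraction of joints whose error is within alpha * ref_length (PCK@20 -> alpha=0.2)."""
    threshold = alpha * ref_length
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) <= threshold)
    return hits / len(gt)


# Toy example with four 2D joints; coordinates are made up for illustration.
gt = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
pred = [(0.0, 0.0), (1.0, 0.1), (0.0, 1.3), (1.0, 1.0)]

error = mpjpe(pred, gt)
score = pck(pred, gt, ref_length=1.0)
```

In the toy example one joint is off by 0.3 and misses the 0.2 threshold, so three of four joints count as correct.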

Critical observations for our project:

  • 4.82M parameters at INT8 quantization = ~4.8 MB model size -- fits in Pi Zero 2 W RAM (512 MB)
  • 0.47B FLOPs suggests ~50-80ms inference on Cortex-A53 with NEON SIMD (our estimate, consistent with Section 2.2; not a measurement)
  • Only uses amplitude, discards phase (phase is "heavily corrupted by CFO and SFO in commercial WiFi devices")
  • ESP32-S3 CSI has similar CFO/SFO issues, so amplitude-only approach is pragmatic
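The model-size figure follows directly from the parameter count and the bytes per weight. The sketch below covers weight storage only; activations, runtime buffers, and any layers left unquantized are ignored, so real memory use will be higher.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def model_size_mb(num_params, dtype="int8"):
    """Weight storage in MB (10^6 bytes); ignores activations and runtime overhead."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


WIFLOW_PARAMS = 4.82e6   # from the table above
PI_ZERO_2W_RAM_MB = 512

wiflow_int8 = model_size_mb(WIFLOW_PARAMS, "int8")
wiflow_fp32 = model_size_mb(WIFLOW_PARAMS, "fp32")
ram_share = wiflow_int8 / PI_ZERO_2W_RAM_MB   # fraction of total RAM used by weights
```

Even the unquantized FP32 weights, at roughly 19 MB, are a small share of the 512 MB available, so INT8 matters here more for speed than for fitting in memory.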

Loss function:

L = L_H + lambda * L_B
L_H = SmoothL1(predicted_keypoints, ground_truth, beta=0.1)
L_B = sum of bone length constraint violations across 14 bone connections
lambda = 0.2

The bone constraint loss is particularly important for edge deployment where noisy predictions need physical plausibility enforcement.
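A minimal sketch of this loss in plain Python is given below. The Smooth L1 term uses the standard definition with mean reduction. The exact form of the bone term is not spelled out in this survey, so the version here (absolute difference between predicted and ground-truth bone lengths) is an assumption, as is the bone list in the example.

```python
import math


def smooth_l1(diff, beta=0.1):
    """Quadratic for |diff| < beta, linear beyond it."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def keypoint_loss(pred, gt, beta=0.1):
    """L_H: mean Smooth L1 over every coordinate of every keypoint."""
    terms = [smooth_l1(p - g, beta)
             for pk, gk in zip(pred, gt) for p, g in zip(pk, gk)]
    return sum(terms) / len(terms)


def bone_loss(pred, gt, bones):
    """L_B (assumed form): summed absolute bone-length error over the bone list."""
    return sum(abs(math.dist(pred[i], pred[j]) - math.dist(gt[i], gt[j]))
               for i, j in bones)


def total_loss(pred, gt, bones, lam=0.2, beta=0.1):
    """L = L_H + lambda * L_B"""
    return keypoint_loss(pred, gt, beta) + lam * bone_loss(pred, gt, bones)


# Toy two-joint skeleton with a single bone (hypothetical example data).
gt = [(0.0, 0.0), (0.0, 1.0)]
pred = [(0.0, 0.0), (0.0, 1.5)]
loss = total_loss(pred, gt, bones=[(0, 1)])
```

In the toy case the stretched bone adds 0.2 x 0.5 = 0.1 on top of the coordinate error, which is how the bone term penalizes implausible limb lengths.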

Adaptation for ESP32 + Pi Zero

WiFlow's architecture maps well to our hardware:

  • TCN runs on ESP32 (temporal feature extraction from raw CSI stream)
  • Asymmetric conv + axial attention runs on Pi Zero (spatial encoding + pose regression)
  • The 540-dimensional input assumes Intel 5300 NIC (18 links x 30 subcarriers); for ESP32-S3 with 1 TX x 1 RX and 52 subcarriers, the input is 52x20 (1,040 values per window versus 540x20 = 10,800), roughly 10x smaller
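As a sketch of the amplitude-only front end, the snippet below converts one raw CSI buffer into per-subcarrier amplitudes and compares input sizes. It assumes the buffer holds interleaved signed 8-bit pairs, one pair per subcarrier. The pair order (imaginary then real) and which subcarriers the buffer contains depend on the firmware configuration and should be checked against the ESP-IDF documentation. The amplitude itself does not depend on the pair order.

```python
import math
import struct


def csi_amplitudes(raw, n_subcarriers=52):
    """Turn interleaved signed 8-bit I/Q pairs into per-subcarrier amplitudes."""
    values = struct.unpack(f"{2 * n_subcarriers}b", raw[: 2 * n_subcarriers])
    # Assumption: each pair is (imaginary, real); hypot gives the same result either way.
    return [math.hypot(values[2 * k], values[2 * k + 1])
            for k in range(n_subcarriers)]


WINDOW = 20                       # time steps per inference window
wiflow_values = 540 * WINDOW      # Intel 5300 layout used by WiFlow
esp32_values = 52 * WINDOW        # single-link ESP32-S3 layout

# Synthetic buffer for illustration: every subcarrier is (-3, 4).
raw = struct.pack("104b", *([-3, 4] * 52))
amps = csi_amplitudes(raw)
```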

1.2 MultiFormer: Multi-Person WiFi Pose (May 2025)

Paper: MultiFormer: A Multi-Person Pose Estimation System Based on CSI and Attention Mechanism (arXiv:2505.22555)

Architecture

Teacher-student framework with OpenPose teacher providing ground truth labels.

Time-Frequency Dual-Dimensional Tokenization (TFDDT):

  • Input: CSI matrix from 1 TX, 3 RX, 30 subcarriers
  • Upsampled via zero-insertion + low-pass filtering to 64x3x64
  • Two parallel token streams:
    • Frequency tokens F_j: N_S tokens of length M x N_R (subcarrier-centric view)
    • Temporal tokens T_i: M tokens of length N_S x N_R (time-centric view)
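The two token streams are two flattenings of the same CSI tensor. The sketch below shows the idea on nested lists; the ordering of elements inside each token is an assumption, and the function name is hypothetical.

```python
def dual_tokens(csi):
    """csi[m][s][r]: M time samples x N_S subcarriers x N_R receive antennas.

    Returns (frequency_tokens, temporal_tokens):
      frequency_tokens: N_S tokens, each of length M * N_R
      temporal_tokens:  M tokens, each of length N_S * N_R
    """
    M, NS, NR = len(csi), len(csi[0]), len(csi[0][0])
    frequency = [[csi[m][s][r] for m in range(M) for r in range(NR)]
                 for s in range(NS)]
    temporal = [[csi[m][s][r] for s in range(NS) for r in range(NR)]
                for m in range(M)]
    return frequency, temporal


# Small synthetic tensor: value encodes its own (m, s, r) index.
M, NS, NR = 4, 3, 2
csi = [[[100 * m + 10 * s + r for r in range(NR)] for s in range(NS)]
       for m in range(M)]
freq_tokens, time_tokens = dual_tokens(csi)
```

With the 64x3x64 upsampled input described above, both streams would contain 64 tokens of length 192.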

Dual Transformer Encoder:

  • 8 layers per branch (frequency and temporal)
  • Multi-head self-attention, averaged over H heads: MSA(X) = (1/H) * sum_{h=1..H} Softmax(Q_h K_h^T / sqrt(d_k)) V_h
  • Each branch followed by FFN with ReLU, dropout, residual connections

Multi-Stage Pose Estimation:

  • Part Confidence Maps (PCM): 19x36x36 heatmaps (18 keypoints + average)
  • Part Affinity Fields (PAF): 38x36x36 directional fields for 19 limb connections
  • Pose-Attentive Perception Module (PAPM): channel + spatial attention on PCM/PAF
  • Multi-person assignment via Hungarian algorithm on PAF integrals
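To illustrate why an optimal assignment is preferred over greedy matching, the sketch below pairs joint candidates by maximizing the total PAF score. It uses brute-force search over permutations, not the Hungarian algorithm. The two give the same optimum, but brute force is only practical for the handful of candidates expected per frame. The score values are made up.

```python
from itertools import permutations


def best_assignment(scores):
    """scores[i][j]: PAF integral between candidate i of one joint type and
    candidate j of the connected joint type. Returns (pairs, total_score)."""
    n_a, n_b = len(scores), len(scores[0])
    if n_a > n_b:
        raise ValueError("expects rows <= columns; transpose the matrix first")
    best, best_total = None, float("-inf")
    for perm in permutations(range(n_b), n_a):
        total = sum(scores[i][perm[i]] for i in range(n_a))
        if total > best_total:
            best, best_total = perm, total
    return list(enumerate(best)), best_total


# Two people, two candidates per joint type (hypothetical scores).
scores = [[0.90, 0.80],
          [0.85, 0.10]]
pairs, total = best_assignment(scores)
```

A greedy matcher would take the 0.90 entry first and be left with 0.10, for a total of 1.00; the optimal assignment instead pairs (0, 1) and (1, 0).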

Model Variants

| Variant | Encoder Layers | Input | Parameters |
|---|---|---|---|
| MultiFormer | 8 | 64x1296 | 11.93M |
| MultiFormer-24 | 8 | 64x576 | 4.05M |
| MultiFormer-18 | 6 | 64x324 | 2.80M |

Key result on MM-Fi dataset: MultiFormer achieves PCK@20 of 0.7225 (72.25%), outperforming CSI2Pose (0.6841). The compact MultiFormer-18, at 2.80M parameters, is small enough to be a candidate for edge deployment.

Relevance to Our Project

MultiFormer's dual-token approach is valuable because:

  1. It explicitly separates temporal and frequency information (like WiFlow's decoupling)
  2. The PAF-based multi-person assignment using Hungarian algorithm can run on Pi Zero
  3. The 2.80M parameter variant (MultiFormer-18) at INT8 = ~2.8 MB, well within Pi Zero constraints

1.3 Person-in-WiFi 3D (CVPR 2024)

Paper: Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi (CVPR 2024)

Presented as the first system for multi-person 3D pose estimation from WiFi.

Key results:

  • Single person MPJPE: 91.7mm
  • Two persons: 108.1mm
  • Three persons: 125.3mm
  • Dataset: 97K frames, 4m x 3.5m area, 7 volunteers
  • Transformer-based end-to-end architecture

Relevance: Establishes the accuracy ceiling for WiFi 3D pose. Our ESP32+Pi system should target comparable single-person performance (sub-100mm MPJPE) as a milestone.

1.4 Spatio-Temporal 3D Point Clouds from WiFi-CSI (October 2024)

Paper: arXiv:2410.16303

Novel approach: generates 3D point clouds from WiFi CSI data using transformer networks.

Key innovation: Positional encoding with learned embeddings for antennas and subcarriers, followed by multi-head attention over antenna-subcarrier pairs. This captures both spatial (antenna geometry) and spectral (subcarrier frequency response) dependencies.

Relevance: Point cloud output is a richer representation than keypoints alone, enabling:

  • Silhouette estimation for activity recognition
  • Body volume estimation for person identification
  • Occlusion reasoning when fused with multiple viewpoints

1.5 Graph-Based 3D Human Pose from WiFi (November 2025)

Paper: Graph-based 3D Human Pose Estimation using WiFi Signals (arXiv:2511.19105)

Uses graph neural networks where nodes represent keypoints and edges represent skeletal connections. CSI features are injected as node/edge attributes.

Relevance: Graph structure naturally maps to our RuvSense pose_tracker which already maintains a 17-keypoint skeleton with Kalman filtering. Adding graph-based message passing between keypoints could improve joint prediction coherence.

2. Edge Deployment Landscape

2.1 CSI-Sense-Zero: ESP32 + Pi Zero Reference Implementation

Repository: github.com/winwinashwin/CSI-Sense-Zero

The most directly relevant prior art for our hardware target.

Architecture:

  • Two ESP32-WROOM-32: one TX, one RX (captures CSI)
  • Pi Zero: inference node
  • Communication: USB serial at 921,600 baud
  • Buffer: 235KB FIFO at /tmp/csififo (~256 CSI records)
  • Inference rate: 2 Hz (configurable)
  • WebSocket output for real-time visualization

Data flow:

ESP32 TX -> WiFi signal -> ESP32 RX -> Serial (921.6 kbaud) -> Pi Zero FIFO -> Model -> WebSocket
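A back-of-the-envelope check of the serial link shows how the figures above relate. Two assumptions are made here that the repository description does not confirm: 8N1 framing (10 bits on the wire per byte) and KB meaning 1000 bytes.

```python
BAUD = 921_600
BITS_PER_BYTE_ON_WIRE = 10    # assumption: 8N1 framing (start + 8 data + stop)
FIFO_BYTES = 235 * 1000       # assumption: KB = 1000 bytes
FIFO_RECORDS = 256

bytes_per_second = BAUD / BITS_PER_BYTE_ON_WIRE
bytes_per_record = FIFO_BYTES / FIFO_RECORDS
max_records_per_second = bytes_per_second / bytes_per_record
fifo_seconds = FIFO_RECORDS / max_records_per_second   # history held by a full FIFO
```

Under these assumptions each record is a little over 900 bytes, the link tops out near 100 CSI records per second, and a full FIFO holds about 2.5 seconds of data.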

Limitations:

  • Original Pi Zero (single-core ARM11) -- very slow inference
  • Activity recognition only (not pose estimation)
  • Python inference (not optimized for ARM)

What we improve:

  • Pi Zero 2 W has quad-core Cortex-A53 -- roughly 5-10x faster than Pi Zero
  • Rust inference (ONNX/Candle) vs Python -- 3-10x faster
  • ESP32-S3 vs ESP32-WROOM-32 -- better CSI quality, more subcarriers
  • Pose estimation instead of just activity classification
  • UDP transport instead of USB serial -- supports multi-node mesh

2.2 OnnxStream: Lightweight ONNX on Pi Zero 2 W

Repository: github.com/vitoplantamura/OnnxStream

Runs Stable Diffusion XL on Pi Zero 2 W in 298 MB RAM. Key features:

  • C++ implementation, XNNPACK acceleration
  • ARM NEON SIMD optimization
  • Memory-efficient streaming execution (processes one operator at a time)
  • Supports INT8 quantization

Benchmark estimates for our model sizes:

| Model | Parameters | INT8 Size | Est. Pi Zero 2 Latency |
|---|---|---|---|
| MultiFormer-18 | 2.80M | ~2.8 MB | ~30-50ms |
| WiFlow | 4.82M | ~4.8 MB | ~50-80ms |
| MultiFormer | 11.93M | ~11.9 MB | ~120-200ms |
| DensePose-WiFi | ~25M (est.) | ~25 MB | ~300-500ms |

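These estimates assume XNNPACK-accelerated INT8 inference on Cortex-A53 @ 1 GHz and are not measured. At these latencies WiFlow would run at roughly 12-20 Hz and MultiFormer-18 at roughly 20-33 Hz, so MultiFormer-18 fits within our 20 Hz TDMA cycle target (50 ms per frame), while WiFlow reaches it only at the optimistic end of its range.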
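The conversion from estimated latency to achievable frame rate, and the comparison against the 20 Hz cycle, can be written out explicitly. The latency ranges are the estimates from the table above, not measurements.

```python
# (fast_ms, slow_ms) latency estimates from the table above.
LATENCY_MS = {
    "MultiFormer-18": (30, 50),
    "WiFlow": (50, 80),
    "MultiFormer": (120, 200),
    "DensePose-WiFi": (300, 500),
}

TARGET_HZ = 20
BUDGET_MS = 1000.0 / TARGET_HZ   # time available per frame at the target rate


def rate_range_hz(fast_ms, slow_ms):
    """Convert a latency range into a (low, high) frame-rate range."""
    return (1000.0 / slow_ms, 1000.0 / fast_ms)


summary = {}
for name, (fast_ms, slow_ms) in LATENCY_MS.items():
    summary[name] = {
        "hz": rate_range_hz(fast_ms, slow_ms),
        "always_meets_target": slow_ms <= BUDGET_MS,   # even at the slow end
        "can_meet_target": fast_ms <= BUDGET_MS,       # only at the fast end
    }
```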

2.3 ONNX Runtime on ARM

ONNX Runtime officially supports Raspberry Pi deployment with:

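  • CPU execution provider with ARM NEON-optimized kernels (XNNPACK, ACL, and Arm NN execution providers are optional alternatives)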
  • INT8 quantization support
  • Python and C++ APIs
  • Model optimization tools (graph optimization, operator fusion)

For Rust integration, the ort crate (ONNX Runtime Rust bindings) can be cross-compiled for the aarch64-unknown-linux-gnu target.

2.4 EfficientFi: CSI Compression for Edge

Paper: EfficientFi: Towards Large-Scale Lightweight WiFi Sensing via CSI Compression (arXiv:2204.04138)

Proposes compressing CSI data on the sensing device before transmission to the inference node. Key idea: train a CSI autoencoder where the encoder runs on the constrained device and the decoder runs on the more powerful inference node.

Relevance: For our ESP32 -> Pi Zero pipeline, CSI compression on ESP32 reduces:

  • UDP packet size (lower bandwidth, less packet loss)
  • Pi Zero preprocessing time (compressed features are more compact)
  • Effective latency (less data to transmit per frame)

3. Comparative Analysis: Architecture Selection for ESP32 + Pi Zero

3.1 Decision Matrix

| Criterion | WiFlow | MultiFormer-18 | DensePose-WiFi | Graph-3D |
|---|---|---|---|---|
| Parameters | 4.82M | 2.80M | ~25M | ~8M (est.) |
| FLOPs | 0.47B | ~0.3B (est.) | ~5B (est.) | ~1B (est.) |
| Multi-person | No | Yes (PAF+Hungarian) | Yes (RCNN-based) | No |
| 3D output | No (2D) | No (2D) | No (UV map) | Yes (3D) |
| Amplitude-only | Yes | Yes | No (amp+phase) | Unknown |
| Edge-viable | Yes | Yes | No | Marginal |
| Open source | Not yet | Not yet | Limited | Not yet |

For the ESP32 + Pi Zero deployment, we recommend a hybrid architecture:

  1. WiFlow's TCN temporal encoder on ESP32 -- extract temporal features from raw CSI
  2. MultiFormer's dual-token approach on Pi Zero -- process both frequency and temporal views
  3. WiFlow's bone constraint loss during training -- enforce physical skeleton plausibility
  4. RuvSense coherence gating before inference -- reject low-quality CSI frames

This hybrid is designed to provide (targets, not measured results):

  • ~3-5M parameters (between WiFlow and MultiFormer-18)
  • Amplitude-only input (robust to ESP32 CFO/SFO)
  • Sub-100ms inference on Pi Zero 2 W
  • Optional multi-person support via PAF module

3.3 Training Data Strategy

Based on the surveyed papers:

| Dataset | Subjects | Frames | Hardware | Availability |
|---|---|---|---|---|
| CMU DensePose-WiFi | 8 | ~250K | Intel 5300 | Limited |
| Person-in-WiFi 3D | 7 | 97K | Custom WiFi | GitHub |
| MM-Fi | Multiple | Large | WiFi + mmWave | Public |
| Wi-Pose | Multiple | Large | Intel 5300 | Public |

Our approach:

  1. Pre-train on MM-Fi/Wi-Pose public datasets (Intel 5300 CSI format)
  2. Apply domain adaptation for ESP32-S3 CSI format (different subcarrier count, CFO characteristics)
  3. Fine-tune on self-collected ESP32-S3 data in target environments
  4. Augment with synthetic CSI from ray-tracing forward model (Arena Physica insight)

4. Gap Analysis: Current wifi-densepose vs SOTA

4.1 What We Have

| Capability | Status | Module |
|---|---|---|
| ESP32 CSI capture | Production | wifi-densepose-hardware |
| Multi-node fusion | Production | ruvsense/multistatic.rs |
| Phase alignment | Production | ruvsense/phase_align.rs |
| Coherence gating | Production | ruvsense/coherence_gate.rs |
| 17-keypoint tracking | Production | ruvsense/pose_tracker.rs |
| ONNX inference engine | Production | wifi-densepose-nn |
| Modality translator | Production | wifi-densepose-nn/translator.rs |
| Training pipeline | Production | wifi-densepose-train |
| Subcarrier interpolation | Production | wifi-densepose-train/subcarrier.rs |

4.2 What We Are Missing

| Gap | Required For | Priority |
|---|---|---|
| Pi Zero deployment target | Edge inference node | Critical |
| Lightweight model architecture | Sub-100ms inference on Cortex-A53 | Critical |
| Temporal causal convolution | Real-time streaming inference | High |
| Axial attention module | Efficient spatial encoding | High |
| Bone constraint loss | Physical plausibility | High |
| CSI compression on ESP32 | Bandwidth reduction | Medium |
| INT8 quantization pipeline | Model size reduction | Medium |
| Cross-environment adaptation | Deployment generalization | Medium |
| Multi-person PAF decoding | Multiple subject support | Low (Phase 2) |
| 3D pose lifting | Z-axis estimation | Low (Phase 3) |
| Diffusion-based pose refinement | Uncertainty quantification | Research |

4.3 Architecture Gaps in Detail

1. No lightweight inference path. The current wifi-densepose-nn crate assumes GPU or high-end CPU inference. We need an EdgeInferenceEngine optimized for:

  • INT8 ONNX models
  • ARM NEON SIMD via XNNPACK
  • Streaming inference (process CSI frames as they arrive, not in batches)
  • Memory-mapped model loading (avoid loading entire model into RAM)

2. No ESP32 -> Pi Zero communication protocol. The wifi-densepose-hardware crate handles ESP32 CSI capture and UDP aggregation to a server, but has no lightweight protocol for ESP32 -> Pi Zero direct communication. We need:

  • Compact binary frame format (not the full ADR-018 format)
  • Optional CSI compression (autoencoder on ESP32 or simple PCA)
  • Heartbeat and synchronization for multi-ESP32 setups
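One possible shape for such a compact frame is sketched below, written in Python for illustration even though the production path would be Rust and C. Everything in it is hypothetical: the field layout, the magic value, the frame-type codes, and the choice to send 8-bit amplitudes are illustrative assumptions, not an existing format in wifi-densepose-hardware.

```python
import struct

MAGIC = 0xC51F                      # hypothetical frame marker
FRAME_CSI, FRAME_HEARTBEAT = 0, 1   # hypothetical type codes
# Little-endian header: magic, node_id, frame_type, sequence, timestamp_us, n_values
HEADER = struct.Struct("<HBBIQH")


def pack_frame(node_id, seq, timestamp_us, amplitudes=()):
    """Build a frame; with no amplitudes it doubles as a heartbeat."""
    frame_type = FRAME_CSI if amplitudes else FRAME_HEARTBEAT
    # Amplitudes are rounded and clamped to one unsigned byte each.
    payload = bytes(min(255, max(0, round(a))) for a in amplitudes)
    header = HEADER.pack(MAGIC, node_id, frame_type, seq, timestamp_us, len(payload))
    return header + payload


def unpack_frame(data):
    """Parse a frame, rejecting bad markers and truncated payloads."""
    magic, node_id, frame_type, seq, timestamp_us, n = HEADER.unpack_from(data)
    if magic != MAGIC:
        raise ValueError("bad magic")
    payload = data[HEADER.size:HEADER.size + n]
    if len(payload) != n:
        raise ValueError("truncated frame")
    return {"node_id": node_id, "type": frame_type, "seq": seq,
            "timestamp_us": timestamp_us, "amplitudes": list(payload)}


frame = pack_frame(node_id=3, seq=42, timestamp_us=1_000_000, amplitudes=[5.0] * 52)
heartbeat = pack_frame(node_id=3, seq=43, timestamp_us=1_050_000)
decoded = unpack_frame(frame)
```

With this layout a 52-subcarrier frame is 70 bytes and a heartbeat is 18 bytes. The sequence number and timestamp give the receiver what it needs to detect loss and align frames from several ESP32 nodes.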

3. No temporal convolution module. The existing signal processing pipeline uses frame-by-frame processing. WiFlow and MultiFormer both show that temporal context (20 frames for WiFlow, 64 frames for MultiFormer) significantly improves accuracy. We need a ring buffer + TCN module in the inference path.
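The ring-buffer half of that requirement is small. A minimal sketch follows, using a fixed-length deque that always holds the most recent frames; the class name and interface are hypothetical, and the 20-frame default mirrors WiFlow's window.

```python
from collections import deque


class CsiWindow:
    """Keeps the most recent n_frames CSI amplitude vectors for streaming inference."""

    def __init__(self, n_frames=20):
        self._frames = deque(maxlen=n_frames)   # oldest frame is dropped automatically

    def push(self, amplitudes):
        """Add one frame; returns True once the window is full."""
        self._frames.append(list(amplitudes))
        return self.ready

    @property
    def ready(self):
        return len(self._frames) == self._frames.maxlen

    def as_matrix(self):
        """Return the window as subcarriers x time, oldest time step first."""
        return [list(column) for column in zip(*self._frames)]


window = CsiWindow(n_frames=20)
for t in range(25):                       # synthetic 3-subcarrier frames
    is_ready = window.push([t, t + 100, t + 200])
matrix = window.as_matrix()
```

Once the window is full, each new frame yields a fresh model input, so inference can run per frame without waiting for a whole new batch.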

4. No bone/skeleton constraint enforcement at training time. The pose_tracker.rs has Kalman filtering and skeleton constraints, but these are post-hoc corrections applied at inference. WiFlow shows that baking bone constraints into the loss function during training produces better models that need less post-processing.

5. References

  1. DensePose From WiFi, Geng et al., arXiv:2301.00250, 2023
  2. Person-in-WiFi 3D, Yan et al., CVPR 2024
  3. WiFlow, arXiv:2602.08661, 2026
  4. MultiFormer, arXiv:2505.22555, 2025
  5. CSI-Channel Spatial Decomposition, MDPI Electronics 14(4), 2025
  6. CSI-Former, MDPI Entropy 25(1), 2023
  7. Spatio-Temporal 3D Point Clouds from WiFi-CSI, arXiv:2410.16303, 2024
  8. Graph-based 3D Human Pose from WiFi, arXiv:2511.19105, 2025
  9. EfficientFi, arXiv:2204.04138, 2022
  10. CSI-Sense-Zero, github.com/winwinashwin/CSI-Sense-Zero
  11. OnnxStream, github.com/vitoplantamura/OnnxStream
  12. Arena Physica, arenaphysica.com (Atlas RF Studio, Heaviside-0/Marconi-0)
  13. Tools and Methods for WiFi Sensing in Embedded Devices, MDPI Sensors 25(19), 2025
  14. Real-Time HAR using WiFi CSI and LSTM on Edge Devices, SASI-ITE 2025