wifi-densepose/docs/research/architecture/implementation-plan.md

# GOAP Implementation Plan: ESP32-S3 + Pi Zero 2 W WiFi Pose Estimation

**Date:** 2026-04-02
**Version:** 1.0
**Status:** Proposed
**Depends on:** ADR-029, ADR-068, SOTA survey (sota-wifi-sensing-2025.md)

---

## 1. Goal State Definition

### 1.1 Terminal Goal

A production-ready WiFi-based human pose estimation system where:
- **ESP32-S3** nodes capture WiFi CSI at 100 Hz, perform temporal feature extraction, and transmit compressed features via UDP
- **Raspberry Pi Zero 2 W** receives features from 1-4 ESP32 nodes, runs neural inference, and outputs 17-keypoint COCO poses at >= 10 Hz
- **Single-person MPJPE** < 100mm in trained environments
- **End-to-end latency** < 150ms (CSI capture to pose output)
- **Total BOM cost** < $30 per sensing zone (1x Pi Zero + 2x ESP32)

### 1.2 World State Variables

```
current_state:
  esp32_csi_capture:           true    # Already implemented
  multi_node_aggregation:      true    # ADR-018 UDP aggregator
  phase_alignment:             true    # ruvsense/phase_align.rs
  coherence_gating:            true    # ruvsense/coherence_gate.rs
  multistatic_fusion:          true    # ruvsense/multistatic.rs
  kalman_pose_tracking:        true    # ruvsense/pose_tracker.rs
  onnx_inference_engine:       true    # wifi-densepose-nn
  modality_translator:         true    # wifi-densepose-nn/translator.rs
  training_pipeline:           true    # wifi-densepose-train
  pi_zero_deployment:          false   # No Pi Zero target
  lightweight_model:           false   # No edge-optimized model
  temporal_conv_module:        false   # No TCN in inference path
  csi_compression:             false   # No ESP32-side compression
  int8_quantization:           false   # No quantization pipeline
  bone_constraint_loss:        false   # No skeleton physics in loss
  esp32_pi_protocol:           false   # No lightweight protocol
  edge_inference_engine:       false   # No ARM-optimized inference
  cross_env_adaptation:        false   # No domain adaptation
  multi_person_paf:            false   # No PAF-based multi-person
  3d_pose_lifting:             false   # No Z-axis estimation

goal_state:
  esp32_csi_capture:           true
  multi_node_aggregation:      true
  phase_alignment:             true
  coherence_gating:            true
  multistatic_fusion:          true
  kalman_pose_tracking:        true
  onnx_inference_engine:       true
  modality_translator:         true
  training_pipeline:           true
  pi_zero_deployment:          true    # TARGET
  lightweight_model:           true    # TARGET
  temporal_conv_module:        true    # TARGET
  csi_compression:             true    # TARGET
  int8_quantization:           true    # TARGET
  bone_constraint_loss:        true    # TARGET
  esp32_pi_protocol:           true    # TARGET
  edge_inference_engine:       true    # TARGET
  cross_env_adaptation:        true    # TARGET (Phase 2)
  multi_person_paf:            true    # TARGET (Phase 2)
  3d_pose_lifting:             true    # TARGET (Phase 3)
```

## 2. Action Definitions

Each action has preconditions, effects, estimated cost (developer-days), and priority.
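
The action schema can be checked mechanically. The following is a minimal sketch, assuming world-state variables are tracked as a set of names that are currently true. `Action`, `applicable`, `plan`, and `core_actions` are hypothetical names, and the greedy loop is a simplification of a real GOAP search, which would typically run A* over costs.

```rust
use std::collections::BTreeSet;

/// One GOAP action. Names mirror the `name:` fields in this section.
#[derive(Debug, Clone)]
pub struct Action {
    pub name: &'static str,
    pub cost_days: u32,
    pub preconditions: Vec<&'static str>,
    pub effects: Vec<&'static str>,
}

/// An action is applicable when every precondition is already true.
pub fn applicable(state: &BTreeSet<&'static str>, action: &Action) -> bool {
    action.preconditions.iter().all(|p| state.contains(p))
}

/// Greedy forward chaining (a simplification, not a full A* GOAP search):
/// repeatedly apply the cheapest applicable action that adds a new fact.
/// Returns the action order, or None if the goal cannot be reached.
pub fn plan(
    start: &BTreeSet<&'static str>,
    goal: &[&'static str],
    actions: &[Action],
) -> Option<Vec<&'static str>> {
    let mut state = start.clone();
    let mut order = Vec::new();
    while !goal.iter().all(|g| state.contains(g)) {
        let next = actions
            .iter()
            .filter(|a| applicable(&state, a))
            .filter(|a| a.effects.iter().any(|e| !state.contains(e)))
            .min_by_key(|a| a.cost_days)?;
        state.extend(next.effects.iter().copied());
        order.push(next.name);
    }
    Some(order)
}

/// Actions 1, 2, 4 and 5 with the costs and preconditions listed in this plan.
pub fn core_actions() -> Vec<Action> {
    vec![
        Action {
            name: "define_esp32_pi_protocol",
            cost_days: 3,
            preconditions: vec!["esp32_csi_capture"],
            effects: vec!["esp32_pi_protocol"],
        },
        Action {
            name: "implement_lightweight_model",
            cost_days: 10,
            preconditions: vec!["training_pipeline", "onnx_inference_engine"],
            effects: vec!["lightweight_model", "temporal_conv_module"],
        },
        Action {
            name: "implement_int8_quantization",
            cost_days: 5,
            preconditions: vec!["lightweight_model", "training_pipeline"],
            effects: vec!["int8_quantization"],
        },
        Action {
            name: "build_edge_inference_engine",
            cost_days: 8,
            preconditions: vec!["lightweight_model", "int8_quantization", "esp32_pi_protocol"],
            effects: vec!["edge_inference_engine", "pi_zero_deployment"],
        },
    ]
}
```

Calling `plan` with the three already-true variables `esp32_csi_capture`, `training_pipeline`, and `onnx_inference_engine` and the goal `pi_zero_deployment` yields one valid ordering of Actions 1, 2, 4, and 5.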

### Action 1: Define ESP32-Pi Communication Protocol (ADR-069)

```
name:           define_esp32_pi_protocol
cost:           3 days
priority:       CRITICAL (blocks all Pi Zero work)
preconditions:  [esp32_csi_capture]
effects:        [esp32_pi_protocol := true]
```

**Description:** Design a lightweight binary protocol for ESP32 -> Pi Zero communication over UDP (WiFi) or UART (wired fallback).

**Protocol specification:**

```
Frame Header (8 bytes):
  [0:1]   magic:         0xCF01 (CSI Frame v1)
  [2]     node_id:       u8 (0-255, identifies ESP32 node)
  [3]     frame_type:    u8 (0=raw_csi, 1=compressed_features, 2=heartbeat)
  [4:5]   sequence:      u16 (monotonic frame counter, wraps at 65535)
  [6:7]   payload_len:   u16 (bytes following header)

Raw CSI Payload (frame_type=0):
  [0:3]   timestamp_us:  u32 (microseconds since boot, wraps at ~71 minutes)
  [4]     channel:       u8 (WiFi channel 1-13)
  [5]     bandwidth:     u8 (0=20MHz, 1=40MHz)
  [6]     rssi:          i8 (dBm)
  [7]     noise_floor:   i8 (dBm)
  [8:9]   num_sc:        u16 (number of subcarriers, typically 52 or 114)
  [10..]  csi_data:      [i16; num_sc * 2] (interleaved I/Q, little-endian)

Compressed Feature Payload (frame_type=1):
  [0:3]   timestamp_us:  u32
  [4]     compression:   u8 (0=none, 1=pca_16, 2=pca_32, 3=autoencoder)
  [5]     num_features:  u8 (number of feature dimensions)
  [6..]   features:      [f16; num_features] (half-precision floats)

Heartbeat Payload (frame_type=2):
  [0:3]   uptime_s:      u32
  [4:7]   frames_sent:   u32
  [8:9]   free_heap:     u16 (KB)
  [10]    wifi_rssi:     i8 (connection to AP)
  [11]    battery_pct:   u8 (0-100, 0xFF if wired)
```

**Implementation locations:**
- ESP32 firmware: `firmware/esp32-csi-node/main/protocol_v2.h`
- Rust parser: `wifi-densepose-hardware/src/protocol_v2.rs`

**Design rationale:**
- Fixed 8-byte header with magic number for frame synchronization
- Half-precision (f16) for compressed features saves 50% bandwidth vs f32
- Heartbeat enables Pi Zero to detect node failures and rebalance
- Raw CSI mode for debugging; compressed mode for production
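
As an illustration of the header layout, here is a minimal parser sketch. The specification above only states little-endian byte order for `csi_data`, so treating the multi-byte header fields as little-endian is an assumption, and the type and function names (`FrameHeader`, `parse_frame`, `frames_lost`) are hypothetical rather than the contents of `protocol_v2.rs`.

```rust
/// Magic value identifying a CSI Frame v1 header.
pub const MAGIC: u16 = 0xCF01;
/// Fixed header length in bytes.
pub const HEADER_LEN: usize = 8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FrameType {
    RawCsi,
    CompressedFeatures,
    Heartbeat,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct FrameHeader {
    pub node_id: u8,
    pub frame_type: FrameType,
    pub sequence: u16,
    pub payload_len: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ParseError {
    TooShort,
    BadMagic,
    UnknownFrameType(u8),
    Truncated,
}

/// Parse the 8-byte header and return it together with the payload slice.
/// ASSUMPTION: multi-byte header fields are little-endian.
pub fn parse_frame(buf: &[u8]) -> Result<(FrameHeader, &[u8]), ParseError> {
    if buf.len() < HEADER_LEN {
        return Err(ParseError::TooShort);
    }
    if u16::from_le_bytes([buf[0], buf[1]]) != MAGIC {
        return Err(ParseError::BadMagic);
    }
    let frame_type = match buf[3] {
        0 => FrameType::RawCsi,
        1 => FrameType::CompressedFeatures,
        2 => FrameType::Heartbeat,
        other => return Err(ParseError::UnknownFrameType(other)),
    };
    let sequence = u16::from_le_bytes([buf[4], buf[5]]);
    let payload_len = u16::from_le_bytes([buf[6], buf[7]]);
    let end = HEADER_LEN + payload_len as usize;
    let payload = buf.get(HEADER_LEN..end).ok_or(ParseError::Truncated)?;
    let header = FrameHeader { node_id: buf[2], frame_type, sequence, payload_len };
    Ok((header, payload))
}

/// Frames lost between two consecutive received sequence numbers,
/// tolerating the wrap at 65535. Assumes `current` is newer than `prev`.
pub fn frames_lost(prev: u16, current: u16) -> u16 {
    current.wrapping_sub(prev).wrapping_sub(1)
}
```

Wrapping subtraction keeps the gap calculation correct across the sequence wrap, so a receiver can count dropped UDP frames per node without special-casing 65535 -> 0.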

### Action 2: Implement Lightweight Model Architecture

```
name:           implement_lightweight_model
cost:           10 days
priority:       CRITICAL (core inference capability)
preconditions:  [training_pipeline, onnx_inference_engine]
effects:        [lightweight_model := true, temporal_conv_module := true]
```

**Architecture: WiFlowPose (hybrid WiFlow + MultiFormer)**

Based on SOTA analysis, we define a custom architecture combining the best elements:

```
Input: CSI amplitude tensor [B, T, S]
  B = batch size
  T = temporal window (20 frames at 20 Hz = 1 second context)
  S = subcarriers (52 for ESP32-S3 20MHz, 114 for 40MHz)

Stage 1: Temporal Encoder (runs on ESP32 optionally, or Pi Zero)
  TCN with 4 layers, dilation [1, 2, 4, 8]
  Input:  [B, T, S] = [B, 20, 52]
  Output: [B, T', C_t] = [B, 20, 64] (temporal features)

Stage 2: Spatial Encoder (runs on Pi Zero)
  Asymmetric convolution blocks (1xk kernels on subcarrier dimension)
  4 residual blocks: 64 -> 128 -> 128 -> 64 channels
  Subcarrier compression: 52 -> 26 -> 13 -> 7
  Output: [B, 64, 7]

Stage 3: Keypoint Decoder (runs on Pi Zero)
  Axial self-attention (2-stage, 4 heads)
  Flatten + linear projection to [B, 17, 64] (17 keypoints x 64 features;
    a plain reshape of 64 x 7 = 448 values cannot yield 17 x 64 = 1088)
  Linear projection: 64 -> 2 (x, y coordinates)
  Output: [B, 17, 2] (17 COCO keypoints, normalized 0-1)

Optional Stage 4: Multi-person (Phase 2)
  PAF branch: predict 19 limb affinity fields
  Hungarian assignment for person grouping
```

**Estimated model size:**
- Temporal encoder: ~0.5M params
- Spatial encoder: ~1.2M params
- Keypoint decoder: ~0.8M params
- Total: ~2.5M params
- INT8 size: ~2.5 MB
- FP16 size: ~5 MB
- Estimated Pi Zero 2 W inference: 30-60ms per frame
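
The size estimates follow directly from the parameter count, and the temporal coverage of the TCN can be checked the same way. The sketch below is illustrative arithmetic only: it ignores ONNX graph overhead, and the receptive-field helper assumes one convolution per TCN layer with a kernel size that this plan does not specify (3 is used in the usage note purely as an example).

```rust
/// Storage size in MB (10^6 bytes) for `params` weights stored at
/// `bytes_per_param` bytes each. Ignores ONNX graph metadata.
pub fn model_size_mb(params: u64, bytes_per_param: u64) -> f64 {
    (params * bytes_per_param) as f64 / 1e6
}

/// Receptive field (in frames) of a stack of dilated 1D convolutions.
/// ASSUMPTION: one convolution per layer, all with the same `kernel_size`
/// (must be >= 1). The kernel size is not specified in this plan.
pub fn tcn_receptive_field(kernel_size: usize, dilations: &[usize]) -> usize {
    1 + (kernel_size - 1) * dilations.iter().sum::<usize>()
}
```

`model_size_mb(2_500_000, 1)` and `model_size_mb(2_500_000, 2)` reproduce the INT8 and FP16 figures above. With the illustrative kernel size of 3, `tcn_receptive_field(3, &[1, 2, 4, 8])` gives 31 frames, which would cover the whole 20-frame window.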

**Rust implementation location:** New module in `wifi-densepose-nn/src/wiflow_pose.rs`

```rust
/// WiFlowPose: Lightweight WiFi CSI to pose estimation model
///
/// Hybrid architecture combining WiFlow's TCN temporal encoder
/// with MultiFormer's dual-token spatial processing and
/// axial self-attention for keypoint decoding.
pub struct WiFlowPoseConfig {
    /// Number of input subcarriers (52 for ESP32 20MHz, 114 for 40MHz)
    pub num_subcarriers: usize,
    /// Temporal window size in frames (default: 20)
    pub temporal_window: usize,
    /// TCN dilation factors (default: [1, 2, 4, 8])
    pub tcn_dilations: Vec<usize>,
    /// Number of output keypoints (default: 17, COCO format)
    pub num_keypoints: usize,
    /// Hidden dimension for spatial encoder (default: 64)
    pub hidden_dim: usize,
    /// Number of attention heads in axial attention (default: 4)
    pub num_attention_heads: usize,
    /// Enable multi-person PAF branch (default: false)
    pub multi_person: bool,
}

impl Default for WiFlowPoseConfig {
    fn default() -> Self {
        Self {
            num_subcarriers: 52,
            temporal_window: 20,
            tcn_dilations: vec![1, 2, 4, 8],
            num_keypoints: 17,
            hidden_dim: 64,
            num_attention_heads: 4,
            multi_person: false,
        }
    }
}
```

### Action 3: Implement Bone Constraint Loss

```
name:           implement_bone_constraint_loss
cost:           2 days
priority:       HIGH
preconditions:  [training_pipeline, lightweight_model]
effects:        [bone_constraint_loss := true]
```

**Loss function following WiFlow:**

```
L_total = L_keypoint + lambda_bone * L_bone + lambda_physics * L_physics

L_keypoint = SmoothL1(pred, gt, beta=0.1)

L_bone = (1/|B|) * sum_{(i,j) in bones} | ||pred_i - pred_j|| - bone_length_{ij} |

L_physics = (1/N) * sum_t max(0, ||pred_t - pred_{t-1}|| - v_max * dt)
```

Where:
- `bones` = 14 COCO bone connections (e.g., left_shoulder-left_elbow)
- `bone_length_{ij}` = average human bone length ratios (normalized to torso length)
- `v_max` = maximum physiologically plausible keypoint velocity (2 m/s for walking, 10 m/s for fast gestures)
- `lambda_bone = 0.2`, `lambda_physics = 0.1`

**Bone length ratios (normalized to torso = shoulder_center to hip_center = 1.0):**

| Bone | Ratio |
|------|-------|
| shoulder-elbow | 0.55 |
| elbow-wrist | 0.50 |
| hip-knee | 0.85 |
| knee-ankle | 0.80 |
| shoulder-hip | 1.00 |
| neck-nose | 0.30 |
| nose-eye | 0.08 |
| eye-ear | 0.12 |

**Implementation location:** `wifi-densepose-train/src/losses.rs` (add `BoneConstraintLoss`)
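
The loss terms can be sketched on plain slices, independent of any tensor backend. This is a minimal sketch rather than the `BoneConstraintLoss` implementation: the function names are hypothetical, the physics term is averaged over frame-to-frame transitions (one reading of the `1/N` factor), and the caller is assumed to supply target bone lengths in the same units as the predictions.

```rust
/// 2D keypoint position.
pub type Point = [f32; 2];

fn dist(a: Point, b: Point) -> f32 {
    ((a[0] - b[0]).powi(2) + (a[1] - b[1]).powi(2)).sqrt()
}

/// Smooth L1 on a single residual `x` with transition point `beta`.
pub fn smooth_l1(x: f32, beta: f32) -> f32 {
    let ax = x.abs();
    if ax < beta { 0.5 * ax * ax / beta } else { ax - 0.5 * beta }
}

/// L_bone: mean absolute deviation of predicted bone lengths from targets.
/// Each bone is (keypoint index i, keypoint index j, target length).
pub fn bone_loss(pred: &[Point], bones: &[(usize, usize, f32)]) -> f32 {
    if bones.is_empty() {
        return 0.0;
    }
    let sum: f32 = bones
        .iter()
        .map(|&(i, j, len)| (dist(pred[i], pred[j]) - len).abs())
        .sum();
    sum / bones.len() as f32
}

/// L_physics: mean hinge penalty on per-frame displacement above v_max * dt.
/// `track` is one keypoint's position over consecutive frames.
/// ASSUMPTION: the mean is taken over the number of transitions.
pub fn physics_loss(track: &[Point], v_max: f32, dt: f32) -> f32 {
    if track.len() < 2 {
        return 0.0;
    }
    let sum: f32 = track
        .windows(2)
        .map(|w| (dist(w[1], w[0]) - v_max * dt).max(0.0))
        .sum();
    sum / (track.len() - 1) as f32
}

/// L_total with lambda_bone = 0.2 and lambda_physics = 0.1 as given above.
pub fn total_loss(l_keypoint: f32, l_bone: f32, l_physics: f32) -> f32 {
    l_keypoint + 0.2 * l_bone + 0.1 * l_physics
}
```

Both penalty terms are zero for a skeleton that already has the target proportions and moves slower than `v_max`, so they only shape the gradient when a prediction is physically implausible.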

### Action 4: Implement INT8 Quantization Pipeline

```
name:           implement_int8_quantization
cost:           5 days
priority:       HIGH
preconditions:  [lightweight_model, training_pipeline]
effects:        [int8_quantization := true]
```

**Approach: Post-Training Quantization (PTQ) with calibration**

1. Train model in FP32 using standard pipeline
2. Export to ONNX format
3. Run ONNX Runtime quantization tool with calibration dataset:
   - Collect 1000 representative CSI frames across multiple environments
   - Run calibration to determine per-layer quantization ranges
   - Apply symmetric INT8 quantization for weights, asymmetric for activations
4. Validate quantized model accuracy (target: <2% PCK@20 degradation)

**Per-layer precision considerations:**
- TCN layers: quantize per-channel (dilated convolutions are sensitive to quantization)
- Attention layers: keep attention logits in FP16 (softmax is numerically sensitive)
- Output layer: keep in FP32 (final coordinate regression needs precision)
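
To make the symmetric/asymmetric distinction from step 3 concrete, the sketch below quantizes a weight slice symmetrically to `i8` and an activation slice asymmetrically to `u8`. It illustrates the arithmetic only; it is not how ONNX Runtime implements calibration, and the function names are hypothetical.

```rust
/// Symmetric INT8: zero maps to 0 and the scale is max|w| / 127.
/// Returns the quantized values and the scale.
pub fn quantize_symmetric(weights: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let q: Vec<i8> = weights
        .iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Asymmetric UINT8 over a calibrated range [min, max]. The range is widened
/// to include 0.0 so that zero is exactly representable.
/// Returns the quantized values, the scale, and the zero point.
pub fn quantize_asymmetric(values: &[f32], min: f32, max: f32) -> (Vec<u8>, f32, u8) {
    let (lo, hi) = (min.min(0.0), max.max(0.0));
    let scale = if hi > lo { (hi - lo) / 255.0 } else { 1.0 };
    let zero_point = (-lo / scale).round().clamp(0.0, 255.0) as u8;
    let q: Vec<u8> = values
        .iter()
        .map(|v| ((v / scale).round() + zero_point as f32).clamp(0.0, 255.0) as u8)
        .collect();
    (q, scale, zero_point)
}

/// Map a quantized activation back to a float.
pub fn dequantize_asymmetric(q: u8, scale: f32, zero_point: u8) -> f32 {
    (q as f32 - zero_point as f32) * scale
}
```

The asymmetric form spends all 256 levels on the observed activation range, which matters for one-sided ranges such as post-ReLU outputs. The symmetric form keeps weight arithmetic free of zero-point corrections.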

**Rust implementation:**
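```rust
// In wifi-densepose-nn/src/quantize.rs
use std::collections::HashMap;
use std::path::PathBuf;

pub struct QuantizationConfig {
    /// Quantization method
    pub method: QuantMethod,
    /// Per-layer precision overrides
    pub layer_overrides: HashMap<String, Precision>,
    /// Calibration dataset path
    pub calibration_data: PathBuf,
    /// Number of calibration samples
    pub num_calibration_samples: usize,
    /// Target accuracy degradation threshold
    pub max_accuracy_loss: f32,
}

/// Supported quantization methods
pub enum QuantMethod {
    /// Post-training quantization with calibration
    PTQ,
    /// Quantization-aware training
    QAT,
    /// Dynamic quantization (activation ranges computed at runtime)
    Dynamic,
}

pub enum Precision {
    INT8,
    FP16,
    FP32,
}
```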

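**ONNX quantization script (for build pipeline):**
```python
# quantize_int8.py: ONNX Runtime exposes static quantization as a Python API
# rather than a `python -m` command, so the build pipeline runs a short script.
from onnxruntime.quantization import QuantFormat, QuantType, quantize_static

# CsiCalibrationReader is the project's CalibrationDataReader subclass;
# the module path below is a placeholder.
from csi_calibration import CsiCalibrationReader

quantize_static(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    calibration_data_reader=CsiCalibrationReader(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```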

### Action 5: Build Edge Inference Engine for Pi Zero

```
name:           build_edge_inference_engine
cost:           8 days
priority:       CRITICAL
preconditions:  [lightweight_model, int8_quantization, esp32_pi_protocol]
effects:        [edge_inference_engine := true, pi_zero_deployment := true]
```

**Architecture: Streaming inference with ring buffer**

```
                    UDP/UART
ESP32-S3 ---------> Pi Zero 2 W
                    |
                    v
            +-- RingBuffer<CsiFrame> --+
            |  (capacity: 64 frames)   |
            +------ |  | -------------+
                    v  v
            +-- TemporalWindow --------+
            |  (20 frames, sliding)    |
            +------ | ----------------+
                    v
            +-- WiFlowPose ONNX ------+
            |  (INT8, XNNPACK accel)  |
            +------ | ----------------+
                    v
            +-- PoseTracker -----------+
            |  (Kalman + skeleton)    |
            +------ | ----------------+
                    v
              PoseEstimate output
              (17 keypoints + confidence)
```
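
The ring buffer and sliding window stages can be sketched with a `VecDeque`. This is a minimal sketch, not the `edge_infer.rs` implementation: `FrameRing` is a hypothetical name, and it is generic over the frame type so it does not depend on the real `CsiFrame` definition.

```rust
use std::collections::VecDeque;

/// Fixed-capacity frame buffer that drops the oldest frame when full and
/// exposes the most recent `window` frames for inference.
pub struct FrameRing<T> {
    frames: VecDeque<T>,
    capacity: usize,
}

impl<T: Clone> FrameRing<T> {
    /// Create a ring holding at most `capacity` frames (must be > 0).
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be positive");
        Self { frames: VecDeque::with_capacity(capacity), capacity }
    }

    /// Append a frame, evicting the oldest one if the ring is full.
    pub fn push(&mut self, frame: T) {
        if self.frames.len() == self.capacity {
            self.frames.pop_front();
        }
        self.frames.push_back(frame);
    }

    /// Latest `window` frames in arrival order, or None until enough arrive.
    pub fn latest_window(&self, window: usize) -> Option<Vec<T>> {
        if window == 0 || self.frames.len() < window {
            return None;
        }
        let skip = self.frames.len() - window;
        Some(self.frames.iter().skip(skip).cloned().collect())
    }

    /// Number of frames currently buffered.
    pub fn len(&self) -> usize {
        self.frames.len()
    }
}
```

With the values in the diagram, the daemon would hold a `FrameRing` of capacity 64 per node and call `latest_window(20)` after each push; inference runs only when that call returns `Some`.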

**New Rust binary:** `wifi-densepose-cli/src/bin/edge_infer.rs`

```rust
use std::net::SocketAddr;
use std::path::PathBuf;

/// Edge inference daemon for Raspberry Pi Zero 2 W
///
/// Receives CSI frames from ESP32 nodes via UDP, maintains a temporal
/// sliding window, runs INT8 ONNX inference, and outputs pose estimates.
///
/// Usage:
///   wifi-densepose edge-infer \
///     --model model_int8.onnx \
///     --listen 0.0.0.0:5555 \
///     --output-port 5556 \
///     --window-size 20 \
///     --max-nodes 4
struct EdgeInferConfig {
    /// Path to INT8 ONNX model
    model_path: PathBuf,
    /// UDP listen address for CSI frames
    listen_addr: SocketAddr,
    /// UDP output address for pose results
    output_addr: Option<SocketAddr>,
    /// Temporal window size
    window_size: usize,
    /// Maximum ESP32 nodes to accept
    max_nodes: usize,
    /// Inference thread count (1-4 on Pi Zero 2 W)
    num_threads: usize,
    /// Enable XNNPACK acceleration
    use_xnnpack: bool,
}
```

**Cross-compilation for Pi Zero 2 W:**

```bash
# Install cross-compilation toolchain
rustup target add aarch64-unknown-linux-gnu
sudo apt install gcc-aarch64-linux-gnu

# Build for Pi Zero 2 W (64-bit Raspberry Pi OS)
# `cross` is installed with `cargo install cross` and requires Docker or Podman
cross build --target aarch64-unknown-linux-gnu \
  --release \
  -p wifi-densepose-cli \
  --features edge-inference \
  --no-default-features

# Or for 32-bit Raspberry Pi OS:
# rustup target add armv7-unknown-linux-gnueabihf
# cross build --target armv7-unknown-linux-gnueabihf ...
```

**ONNX Runtime linking for ARM:**
- Use `ort` crate with `download-binaries` feature for automatic aarch64 binary download
- Alternative: build OnnxStream from source for minimal binary size (~2 MB vs ~30 MB for full ONNX Runtime)

### Action 6: Implement CSI Compression on ESP32

```
name:           implement_csi_compression
cost:           5 days
priority:       MEDIUM
preconditions:  [esp32_csi_capture, esp32_pi_protocol]
effects:        [csi_compression := true]
```

**Three compression tiers:**

**Tier 0: No compression (raw CSI)**
- Payload: 52 subcarriers x 2 (I/Q) x 2 bytes = 208 bytes per frame
- Use case: debugging, maximum fidelity

**Tier 1: PCA-16 (run on ESP32)**
- Pre-computed PCA projection matrix (52 -> 16 dimensions)
- Stored in NVS flash during provisioning
- Payload: 16 features x 2 bytes (f16) = 32 bytes per frame
- Compression: 6.5x
- Compute: ~0.1ms on ESP32-S3 (matrix-vector multiply, SIMD)

**Tier 2: PCA-32 (higher fidelity)**
- 52 -> 32 dimensions
- Payload: 32 x 2 = 64 bytes
- Compression: 3.25x

**Tier 3: Learned autoencoder (future)**
- ESP32-S3 has enough compute for a small encoder (~10K params)
- Requires quantized encoder weights in flash
- Most bandwidth-efficient but requires training
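
The payload figures above count only the CSI or feature bytes. On the wire, each frame also carries the 8-byte header and the per-payload metadata defined in Action 1 (10 bytes before `csi_data`, 6 bytes before `features`). The sketch below computes both views; the function names are hypothetical.

```rust
/// Bytes of raw I/Q data: num_subcarriers x 2 (I/Q) x 2 bytes (i16).
pub fn raw_csi_bytes(num_subcarriers: usize) -> usize {
    num_subcarriers * 2 * 2
}

/// Bytes of compressed features: num_features x 2 bytes (f16).
pub fn feature_bytes(num_features: usize) -> usize {
    num_features * 2
}

/// Payload-only compression ratio, as quoted for each tier.
pub fn compression_ratio(num_subcarriers: usize, num_features: usize) -> f64 {
    raw_csi_bytes(num_subcarriers) as f64 / feature_bytes(num_features) as f64
}

/// Full frame size for frame_type=0: 8-byte header + 10 bytes of
/// payload metadata + raw I/Q data.
pub fn raw_frame_bytes(num_subcarriers: usize) -> usize {
    8 + 10 + raw_csi_bytes(num_subcarriers)
}

/// Full frame size for frame_type=1: 8-byte header + 6 bytes of
/// payload metadata + f16 features.
pub fn compressed_frame_bytes(num_features: usize) -> usize {
    8 + 6 + feature_bytes(num_features)
}
```

For 52 subcarriers the full raw frame is 226 bytes, against 46 bytes for PCA-16 and 78 bytes for PCA-32, so the on-wire saving is somewhat smaller than the payload-only ratios suggest.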

**PCA computation (offline, during provisioning):**

```rust
// wifi-densepose-train/src/compression.rs

/// Compute PCA projection matrix from calibration CSI data
pub fn compute_pca_projection(
    calibration_data: &[CsiFrame],
    target_dims: usize,
) -> PcaProjection {
    // 1. Stack all CSI amplitude vectors into matrix [N, S]
    // 2. Center (subtract mean)
    // 3. Compute covariance matrix [S, S]
    // 4. Eigendecomposition, take top `target_dims` eigenvectors
    // 5. Return projection matrix [S, target_dims] and mean vector [S]
    todo!("implement steps 1-5 above")
}

pub struct PcaProjection {
    /// Projection matrix [num_subcarriers, target_dims]
    pub matrix: Vec<f32>,
    /// Mean vector for centering [num_subcarriers]
    pub mean: Vec<f32>,
    /// Number of input subcarriers
    pub input_dims: usize,
    /// Number of output features
    pub output_dims: usize,
}
```
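
The five steps listed in `compute_pca_projection` can be sketched without a linear-algebra dependency by using power iteration with deflation in place of a full eigendecomposition. This is a swapped-in technique chosen for brevity, not necessarily what the project will use. The `Pca` struct and `fit_pca` name are hypothetical, amplitude vectors are passed as plain `Vec<f32>` rather than `CsiFrame`, and the fixed starting vector could in principle be orthogonal to an eigenvector for unusual data.

```rust
pub struct Pca {
    /// Mean vector used for centering (length = input dims).
    pub mean: Vec<f32>,
    /// `components[k]` is the k-th principal axis (length = input dims).
    pub components: Vec<Vec<f32>>,
}

fn mat_vec(m: &[Vec<f32>], v: &[f32]) -> Vec<f32> {
    m.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum::<f32>())
        .collect()
}

/// Fit `target_dims` principal components by power iteration + deflation.
/// `samples` must be non-empty and all rows must have the same length.
pub fn fit_pca(samples: &[Vec<f32>], target_dims: usize, iters: usize) -> Pca {
    let (n, s) = (samples.len(), samples[0].len());
    // Steps 1-2: stack samples and compute the mean for centering.
    let mut mean = vec![0.0f32; s];
    for x in samples {
        for (m, v) in mean.iter_mut().zip(x) {
            *m += v / n as f32;
        }
    }
    // Step 3: covariance matrix [S, S].
    let mut cov = vec![vec![0.0f32; s]; s];
    for x in samples {
        for i in 0..s {
            for j in 0..s {
                cov[i][j] += (x[i] - mean[i]) * (x[j] - mean[j]) / n as f32;
            }
        }
    }
    // Step 4: top eigenvectors, one at a time.
    let mut components = Vec::with_capacity(target_dims);
    for _ in 0..target_dims {
        // Fixed, slightly asymmetric start vector (an arbitrary choice).
        let mut v: Vec<f32> = (0..s).map(|i| 1.0 + 0.01 * i as f32).collect();
        for _ in 0..iters {
            let w = mat_vec(&cov, &v);
            let norm = w.iter().map(|a| a * a).sum::<f32>().sqrt();
            if norm <= f32::EPSILON {
                break; // no variance left in the deflated matrix
            }
            v = w.iter().map(|a| a / norm).collect();
        }
        // Deflate: remove the found direction so the next pass finds the next one.
        let lambda: f32 = mat_vec(&cov, &v).iter().zip(&v).map(|(a, b)| a * b).sum();
        for i in 0..s {
            for j in 0..s {
                cov[i][j] -= lambda * v[i] * v[j];
            }
        }
        components.push(v);
    }
    // Step 5: return the projection and the mean.
    Pca { mean, components }
}

impl Pca {
    /// Project one amplitude vector onto the fitted components.
    pub fn project(&self, x: &[f32]) -> Vec<f32> {
        self.components
            .iter()
            .map(|c| {
                c.iter()
                    .zip(x.iter().zip(&self.mean))
                    .map(|(w, (xi, m))| w * (xi - m))
                    .sum::<f32>()
            })
            .collect()
    }
}
```

`project` is the same centered matrix-vector multiply that the ESP32 would run per frame, so the sketch doubles as a reference for validating the firmware output against the workstation result.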

**ESP32 firmware integration:**
- Store PCA matrix in NVS partition (32x52x4 = 6.5 KB for PCA-32)
- Apply projection in CSI callback before UDP transmission
- Selectable via provisioning command

### Action 7: Implement Cross-Environment Adaptation

```
name:           implement_cross_env_adaptation
cost:           8 days
priority:       MEDIUM (Phase 2)
preconditions:  [lightweight_model, training_pipeline, pi_zero_deployment]
effects:        [cross_env_adaptation := true]
```

**Approach: Rapid environment calibration with few-shot adaptation**

Inspired by Arena Physica's template-based design space and MERIDIAN (ADR-027):

1. **Environment fingerprinting (on Pi Zero, at deployment time):**
   - Collect 60 seconds of "empty room" CSI
   - Compute room signature: mean amplitude profile, delay spread, K-factor
   - Match to nearest room template (corridor, office, bedroom, etc.)
   - Load template-specific model weights

2. **Few-shot fine-tuning (optional, on workstation):**
   - Collect 5 minutes of calibration data with known poses
   - Fine-tune last 2 layers of the model (~50K params)
   - Transfer updated model back to Pi Zero

3. **Online adaptation (continuous, on Pi Zero):**
   - Track CSI statistics over time (sliding window mean/variance)
   - Detect distribution shift (KL divergence exceeds threshold)
   - Apply batch normalization statistics update (no gradient computation needed)
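
One way to realize the shift detector in step 3 is to fit a univariate Gaussian to a reference window and to the current window of a CSI statistic, then compare them with the closed-form KL divergence. This is a sketch under that assumption; the plan does not specify which distribution model is used, and `Gaussian`, `fit`, `kl_divergence`, and `shifted` are hypothetical names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Gaussian {
    pub mean: f32,
    pub var: f32,
}

/// Fit mean and (population) variance to a window of samples.
/// The variance is floored so the KL formula never divides by zero.
pub fn fit(window: &[f32]) -> Gaussian {
    let n = window.len().max(1) as f32;
    let mean = window.iter().sum::<f32>() / n;
    let var = window.iter().map(|x| (x - mean).powi(2)).sum::<f32>() / n;
    Gaussian { mean, var: var.max(1e-6) }
}

/// KL(p || q) for univariate Gaussians:
/// 0.5 * (ln(var_q / var_p) + (var_p + (mu_p - mu_q)^2) / var_q - 1)
pub fn kl_divergence(p: &Gaussian, q: &Gaussian) -> f32 {
    0.5 * ((q.var / p.var).ln() + (p.var + (p.mean - q.mean).powi(2)) / q.var - 1.0)
}

/// True when the current window has drifted from the reference window
/// by more than `threshold` (in nats).
pub fn shifted(reference: &[f32], current: &[f32], threshold: f32) -> bool {
    kl_divergence(&fit(current), &fit(reference)) > threshold
}
```

When `shifted` returns true, the daemon would refresh the batch normalization running statistics from recent frames, which needs only forward passes.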

**Implementation location:** `wifi-densepose-train/src/rapid_adapt.rs` (extend existing module)

### Action 8: Implement Multi-Person PAF Decoding

```
name:           implement_multi_person_paf
cost:           6 days
priority:       LOW (Phase 2)
preconditions:  [lightweight_model, bone_constraint_loss]
effects:        [multi_person_paf := true]
```

**Architecture (following MultiFormer):**

Add a PAF branch to the WiFlowPose model:

```
Stage 3 features [B, 64, 7]
  |
  +--> Keypoint head: [B, 17, 2] (single-person keypoints)
  |
  +--> PAF head: [B, 38, H, W] (19 limb affinity fields)
  |
  +--> Confidence head: [B, 19, H, W] (part confidence maps)
```

**Multi-person assignment on Pi Zero:**
1. Extract candidate keypoints from confidence maps via NMS
2. Compute PAF integral scores between candidate pairs
3. Solve bipartite matching with Hungarian algorithm
4. Group keypoints into person instances
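
Step 2 can be sketched as a sampled line integral: the PAF is read at evenly spaced points between two candidate keypoints and projected onto the unit vector joining them. The sketch below uses nearest-cell lookup on `[row][col]` grids and assumes `paf_x` and `paf_y` have the same shape; `paf_score` is a hypothetical name, and the Hungarian matching in step 3 is not shown.

```rust
/// Average alignment between a limb's affinity field and the segment a -> b.
/// `paf_x[row][col]` and `paf_y[row][col]` hold the field's x and y parts;
/// `a` and `b` are (x, y) positions in grid coordinates.
/// ASSUMPTION: both grids have identical dimensions.
pub fn paf_score(
    paf_x: &[Vec<f32>],
    paf_y: &[Vec<f32>],
    a: (f32, f32),
    b: (f32, f32),
    samples: usize,
) -> f32 {
    let (dx, dy) = (b.0 - a.0, b.1 - a.1);
    let len = (dx * dx + dy * dy).sqrt();
    if len == 0.0 || samples == 0 || paf_x.is_empty() || paf_x[0].is_empty() {
        return 0.0;
    }
    let (ux, uy) = (dx / len, dy / len);
    let (h, w) = (paf_x.len(), paf_x[0].len());
    let mut total = 0.0f32;
    for k in 0..samples {
        // Evenly spaced parameter in [0, 1]; a single sample uses the midpoint.
        let t = if samples == 1 { 0.5 } else { k as f32 / (samples - 1) as f32 };
        // Nearest grid cell, clamped to the field bounds.
        let col = ((a.0 + t * dx).round().max(0.0) as usize).min(w - 1);
        let row = ((a.1 + t * dy).round().max(0.0) as usize).min(h - 1);
        total += paf_x[row][col] * ux + paf_y[row][col] * uy;
    }
    total / samples as f32
}
```

Scores near 1 mean the field points along the candidate limb, scores near 0 mean it is unrelated, and negative scores mean the pair is connected in the wrong direction. These scores form the cost matrix for the bipartite matching step.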

**Estimated additional cost:** ~1M parameters, ~10ms additional inference time

### Action 9: Implement 3D Pose Lifting

```
name:           implement_3d_pose_lifting
cost:           5 days
priority:       LOW (Phase 3)
preconditions:  [lightweight_model, multi_person_paf, multistatic_fusion]
effects:        [3d_pose_lifting := true]
```

**Approach: Multi-view triangulation + learned depth prior**

With 2+ ESP32 nodes at known positions, compute 3D pose via:

1. Each node pair provides a different viewing angle of the WiFi field
2. 2D pose from each viewpoint is estimated independently
3. Epipolar geometry constrains 3D position from 2D observations
4. Learned depth prior resolves ambiguities (front/back confusion)

This builds on the existing `viewpoint/geometry.rs` module in wifi-densepose-ruvector, which already computes `GeometricDiversityIndex` and Fisher Information for multi-node configurations.

## 3. Hardware Architecture

### 3.1 System Topology

```
                    WiFi AP (existing home router)
                    /         |          \
                   /          |           \
            ESP32-S3 #1   ESP32-S3 #2   ESP32-S3 #3
            (CSI node)    (CSI node)    (CSI node, optional)
                |             |              |
                +------+------+------+-------+
                       | UDP (WiFi)  |
                       v             v
                  Raspberry Pi Zero 2 W
                  (edge inference node)
                       |
                       v
                  Pose output (UDP/MQTT/WebSocket)
                  to display / home automation / API
```

### 3.2 Data Flow Timing

```
T=0ms     ESP32 #1 captures CSI frame (channel 1)
T=2ms     ESP32 #1 applies PCA compression (0.1ms compute)
T=3ms     ESP32 #1 sends UDP packet to Pi Zero (64 bytes)
T=5ms     ESP32 #2 captures CSI frame (channel 6, TDMA slot)
T=7ms     ESP32 #2 sends UDP packet to Pi Zero
T=10ms    Pi Zero receives both frames, adds to ring buffer
T=10ms    Pi Zero checks temporal window (20 frames accumulated?)
          If yes: run inference
T=15ms    Temporal encoder processes 20-frame window (5ms)
T=35ms    Spatial encoder + attention (20ms)
T=45ms    Keypoint decoder (10ms)
T=48ms    Kalman filter update + skeleton constraints (3ms)
T=50ms    Pose estimate emitted (17 keypoints + confidence)
```

- **Total latency: ~50ms** (well under 150ms target)
- **Throughput: 20 Hz** (matching TDMA cycle)
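
The latency and throughput figures are the sum of the stage durations in the timing listing. The sketch below restates that arithmetic; the stage grouping and names are an illustrative reading of the listing, and the rate calculation assumes stages run back to back without pipelining.

```rust
/// Stage durations in milliseconds, read off the timing listing above.
pub const STAGES_MS: [(&str, u32); 6] = [
    ("capture, compress, send, receive (both nodes)", 10),
    ("temporal encoder", 5),
    ("spatial encoder + attention", 20),
    ("keypoint decoder", 10),
    ("Kalman update + skeleton constraints", 3),
    ("emit pose estimate", 2),
];

/// Sum of all stage durations.
pub fn total_latency_ms(stages: &[(&str, u32)]) -> u32 {
    stages.iter().map(|(_, ms)| ms).sum()
}

/// Highest sustainable rate if one cycle must finish before the next starts.
pub fn max_rate_hz(latency_ms: u32) -> f64 {
    1000.0 / latency_ms as f64
}

/// True when the total stays under the end-to-end target.
pub fn meets_target(stages: &[(&str, u32)], target_ms: u32) -> bool {
    total_latency_ms(stages) < target_ms
}
```

Keeping the budget in one table makes it easy to re-check the 150ms target when a measured stage time replaces an estimate.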

### 3.3 Hardware Bill of Materials

| Component | Unit Cost | Quantity | Total |
|-----------|----------|----------|-------|
| ESP32-S3 DevKit (8MB) | $9 | 2 | $18 |
| Raspberry Pi Zero 2 W | $15 | 1 | $15 |
| MicroSD card (16GB) | $5 | 1 | $5 |
| Micro-USB power supply (5V/2A, for Pi Zero 2 W) | $5 | 1 | $5 |
| **Total** | | | **$43** |

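With ESP32-S3 SuperMini boards ($6 each), the total drops to **$37**, which is still above the $30 target in Section 1.1.

A minimum viable setup (1 ESP32-S3 DevKit + 1 Pi Zero 2 W, excluding the microSD card and power supply) costs **$24**.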

### 3.4 Pi Zero 2 W Specifications

| Parameter | Value |
|-----------|-------|
| SoC | BCM2710A1 (quad-core Cortex-A53 @ 1 GHz) |
| RAM | 512 MB LPDDR2 |
| WiFi | 802.11b/g/n (2.4 GHz only) |
| Bluetooth | 4.2 with BLE |
| GPIO | 40-pin header (UART, SPI, I2C) |
| Power | 5V/2A USB micro-B |
| OS | Raspberry Pi OS Lite (64-bit, headless) |

**Memory budget for inference:**

| Component | Memory |
|-----------|--------|
| OS + services | ~100 MB |
| WiFlowPose INT8 model | ~3 MB |
| ONNX Runtime / OnnxStream | ~10-30 MB |
| Ring buffer (64 frames x 4 nodes) | ~1 MB |
| Inference workspace | ~20 MB |
| **Total** | ~134-154 MB |
| **Available** | ~358-378 MB headroom |

Comfortable fit within 512 MB RAM.

## 4. Rust Crate Modifications

### 4.1 Modified Crates

#### wifi-densepose-hardware

**New files:**
- `src/protocol_v2.rs` -- Lightweight ESP32-Pi binary protocol parser/serializer
- `src/pi_zero.rs` -- Pi Zero UDP receiver with ring buffer management

**Modified files:**
- `src/lib.rs` -- Add `pub mod protocol_v2; pub mod pi_zero;`
- `src/aggregator/mod.rs` -- Add support for protocol_v2 frame format

#### wifi-densepose-nn

**New files:**
- `src/wiflow_pose.rs` -- WiFlowPose model definition (TCN + asymmetric conv + axial attention)
- `src/edge_engine.rs` -- Edge-optimized inference engine (streaming, ARM NEON)
- `src/quantize.rs` -- INT8 quantization configuration and validation

**Modified files:**
- `src/lib.rs` -- Add new module exports
- `src/onnx.rs` -- Add XNNPACK execution provider option, INT8 model loading
- `src/translator.rs` -- Add WiFlowPose-compatible input format

#### wifi-densepose-train

**New files:**
- `src/wiflow_pose_trainer.rs` -- Training loop for WiFlowPose architecture
- `src/compression.rs` -- PCA computation for ESP32 CSI compression
- `src/bone_loss.rs` -- Bone constraint and physics consistency losses

**Modified files:**
- `src/losses.rs` -- Add `BoneConstraintLoss`, `PhysicsConsistencyLoss`
- `src/config.rs` -- Add WiFlowPose training configuration options
- `src/dataset.rs` -- Add ESP32-S3 CSI format support (52/114 subcarriers)
- `src/rapid_adapt.rs` -- Add few-shot environment calibration

#### wifi-densepose-signal

**New files:**
- `src/ruvsense/temporal_encoder.rs` -- TCN temporal feature extraction (shared code for ESP32 and Pi)

**Modified files:**
- `src/ruvsense/mod.rs` -- Add `pub mod temporal_encoder;`

#### wifi-densepose-cli

**New files:**
- `src/bin/edge_infer.rs` -- Pi Zero edge inference daemon
- `src/bin/calibrate.rs` -- Environment calibration tool (PCA computation, room fingerprinting)

#### wifi-densepose-core

**Modified files:**
- `src/types.rs` -- Add `CompressedCsiFrame`, `EdgePoseEstimate` types

### 4.2 New Feature Flags

```toml
# wifi-densepose-nn/Cargo.toml
[features]
default = ["onnx"]
onnx = ["ort"]
edge-inference = ["onnx", "ort/xnnpack"]  # NEW: ARM NEON + XNNPACK (enables the `xnnpack` feature of the `ort` crate)
candle = ["candle-core", "candle-nn"]
tch-backend = ["tch"]

# wifi-densepose-cli/Cargo.toml
[features]
default = ["full"]
full = ["wifi-densepose-nn/onnx", "wifi-densepose-train/tch-backend"]
edge-inference = ["wifi-densepose-nn/edge-inference"]  # NEW: minimal binary for Pi
```

### 4.3 Cross-Compilation Configuration

```toml
# .cargo/config.toml (add section)
[target.aarch64-unknown-linux-gnu]
linker = "aarch64-linux-gnu-gcc"
rustflags = ["-C", "target-cpu=cortex-a53", "-C", "target-feature=+neon"]
```

## 5. ESP32 Firmware Modifications

### 5.1 New Files

- `firmware/esp32-csi-node/main/protocol_v2.h` -- Protocol v2 frame packing
- `firmware/esp32-csi-node/main/pca_compress.h` -- PCA compression for CSI
- `firmware/esp32-csi-node/main/pca_compress.c` -- PCA implementation with ESP32 SIMD
- `firmware/esp32-csi-node/main/pi_zero_mode.c` -- Pi Zero communication mode (lighter than full server mode)

### 5.2 Modified Files

- `firmware/esp32-csi-node/main/csi_handler.c` -- Add compression step in CSI callback
- `firmware/esp32-csi-node/main/nvs_config.c` -- Store PCA matrix in NVS
- `firmware/esp32-csi-node/main/Kconfig.projbuild` -- Add CONFIG_PI_ZERO_MODE, CONFIG_CSI_COMPRESSION options

### 5.3 Provisioning Updates

```bash
# Provision for Pi Zero mode with PCA-16 compression
# (--target-ip is the Pi Zero's address)
python firmware/esp32-csi-node/provision.py \
  --port COM7 \
  --ssid "MyWiFi" \
  --password "secret" \
  --target-ip 192.168.1.50 \
  --target-port 5555 \
  --compression pca-16 \
  --pca-matrix pca_matrix_16.bin
```

## 6. Training Pipeline

### 6.1 Training Workflow

```
Phase 1: Pre-train on public datasets (GPU workstation)
  Dataset: MM-Fi + Wi-Pose (Intel 5300 format, 30 subcarriers)
  Model: WiFlowPose with 30 subcarriers
  Loss: L_keypoint + 0.2 * L_bone + 0.1 * L_physics
  Duration: ~20 hours on single A100

Phase 2: Domain adaptation for ESP32 CSI (GPU workstation)
  Dataset: Self-collected ESP32-S3 data (52 subcarriers)
  Method: Fine-tune all layers with lower learning rate (1e-4)
  Subcarrier interpolation: 30 -> 52 using existing interpolate_subcarriers()
  Duration: ~4 hours

Phase 3: Quantization (CPU workstation)
  Method: Post-training quantization with 1000 calibration samples
  Format: ONNX INT8 (QDQ format)
  Validation: PCK@20 degradation < 2%

Phase 4: Environment calibration (on Pi Zero)
  Method: 60-second empty-room CSI collection
  Output: Room fingerprint + PCA matrix
  Duration: ~2 minutes total
```

### 6.2 Dataset Collection Protocol

For self-collected ESP32 training data:

1. **Setup:** 2 ESP32-S3 nodes at opposite corners of 4x4m room, Pi Zero receiving
2. **Ground truth:** Smartphone camera running MediaPipe Pose (30 FPS), synchronized via NTP
3. **Activities:** Standing, walking, sitting, waving, falling, idle (2 minutes each)
4. **Subjects:** 5+ volunteers with varying body types
5. **Environments:** 3+ rooms (bedroom, office, corridor) for generalization
6. **Total target:** ~100K synchronized CSI-pose frame pairs

**Synchronization approach:**
- ESP32 and Pi Zero synchronized via NTP (< 10ms accuracy on LAN)
- Camera frames timestamped with system clock
- Offline alignment via cross-correlation of movement signals
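
The offline alignment step can be sketched as a brute-force search for the lag that maximizes the cross-correlation between two movement signals, for example CSI amplitude variance and mean keypoint speed resampled to a common rate. `best_lag` is a hypothetical name, and the sketch assumes both signals have already been mean-removed and resampled.

```rust
/// Lag (in samples) within -max_lag..=max_lag at which `b` best matches `a`.
/// A positive result means events appear later in `b` than in `a`.
/// ASSUMPTION: both signals share one sample rate and are mean-removed.
pub fn best_lag(a: &[f32], b: &[f32], max_lag: isize) -> isize {
    let mut best = (0isize, f32::NEG_INFINITY);
    for lag in -max_lag..=max_lag {
        let mut score = 0.0f32;
        for (i, &x) in a.iter().enumerate() {
            let j = i as isize + lag;
            if j >= 0 && (j as usize) < b.len() {
                score += x * b[j as usize];
            }
        }
        // Strict comparison keeps the first (most negative) lag on ties.
        if score > best.1 {
            best = (lag, score);
        }
    }
    best.0
}
```

Dividing the returned lag by the common sample rate gives the residual clock offset, which can then be subtracted from the camera timestamps before pairing frames.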

### 6.3 Transfer Learning Strategy

Following the teacher-student feature-transfer approach of DensePose-WiFi:

```
L_total = lambda_pose * L_pose
        + lambda_bone * L_bone
        + lambda_transfer * L_transfer
        + lambda_physics * L_physics

L_transfer = MSE(features_student, features_teacher)
```

Where `features_teacher` come from a pre-trained image-based pose model (HRNet or ViTPose) and `features_student` come from the WiFi CSI model at corresponding intermediate layers.

**Lambda schedule:**
- Epochs 1-20: lambda_transfer = 0.5 (heavy transfer guidance)
- Epochs 21-50: lambda_transfer = 0.2 (moderate guidance)
- Epochs 51-100: lambda_transfer = 0.05 (fine-tuning freedom)
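
The schedule is a step function of the epoch number. In the sketch below, boundary epochs 20 and 50 are assigned to the earlier stage, which is an assumption, and `lambda_transfer` is a hypothetical function name.

```rust
/// Transfer-loss weight for a 1-based epoch number.
/// ASSUMPTION: epochs 20 and 50 belong to the earlier stage.
pub fn lambda_transfer(epoch: u32) -> f32 {
    match epoch {
        0..=20 => 0.5,   // heavy transfer guidance
        21..=50 => 0.2,  // moderate guidance
        _ => 0.05,       // fine-tuning freedom
    }
}
```

The training loop would call this once per epoch and pass the result in as `lambda_transfer` when computing `L_total`.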

## 7. Timeline and Milestones

### Phase 1: Foundation (Weeks 1-4)

| Week | Actions | Deliverable |
|------|---------|-------------|
| 1 | Action 1 (protocol), ADR-069 draft | Protocol spec + parser tests |
| 2 | Action 2 (model architecture, begin) | WiFlowPose model definition in Rust |
| 2 | Action 3 (bone loss) | Loss functions implemented and tested |
| 3 | Action 2 (model architecture, complete) | Full model with ONNX export |
| 4 | Action 4 (quantization) | INT8 model, accuracy validated |

**Milestone M1:** WiFlowPose model trained on MM-Fi, exported to INT8 ONNX, PCK@20 >= 85% on validation set.

### Phase 2: Edge Deployment (Weeks 5-8)

| Week | Actions | Deliverable |
|------|---------|-------------|
| 5 | Action 5 (edge engine, begin) | Cross-compilation working, model loads on Pi |
| 6 | Action 5 (edge engine, complete) | Streaming inference at >= 10 Hz on Pi Zero |
| 6 | Action 6 (CSI compression) | PCA compression on ESP32, verified bandwidth reduction |
| 7 | Integration testing | ESP32 -> Pi Zero full pipeline working |
| 8 | Performance optimization | Latency < 100ms, memory < 200 MB |

**Milestone M2:** End-to-end demo: ESP32 captures CSI, Pi Zero outputs pose at 10+ Hz.

### Phase 3: Accuracy and Adaptation (Weeks 9-12)

| Week | Actions | Deliverable |
|------|---------|-------------|
| 9 | Data collection (ESP32-S3 training data) | 50K+ synchronized CSI-pose frames |
| 10 | Domain adaptation training | ESP32-specific model, MPJPE < 120mm |
| 11 | Action 7 (cross-env adaptation) | Room calibration working |
| 12 | Validation and documentation | ADR-069 finalized, witness bundle |

**Milestone M3:** Single-person MPJPE < 100mm in calibrated environment, cross-environment deployment working with 60-second calibration.

### Phase 4: Multi-Person and 3D (Weeks 13-20)

| Week | Actions | Deliverable |
|------|---------|-------------|
| 13-14 | Action 8 (multi-person PAF) | 2-person pose separation working |
| 15-16 | Action 9 (3D lifting) | Z-axis estimation from multi-node |
| 17-18 | Advanced optimization | Model distillation, QAT |
| 19-20 | Production hardening | OTA updates, monitoring, alerting |

**Milestone M4:** Multi-person 3D pose at 10 Hz on Pi Zero 2 W.

## 8. Risk Analysis

### 8.1 Technical Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| Pi Zero 2 W inference too slow (> 100ms) | Medium | High | Fall back to activity recognition (smaller model); use Pi 4 instead |
| ESP32-S3 CSI quality insufficient for pose | Low | Critical | Already validated in ADR-028; add directional antennas if needed |
| INT8 quantization degrades accuracy > 5% | Medium | Medium | Use FP16 instead (2x size, ~1.5x slower); apply QAT |
| Cross-environment generalization poor | High | High | Room calibration (Action 7); template-based models; continuous adaptation |
| WiFi interference degrades CSI | Medium | Medium | Coherence gating (already implemented); channel hopping within 2.4 GHz (the ESP32-S3 and Pi Zero 2 W radios are 2.4 GHz only) |
| ONNX Runtime binary too large for Pi Zero | Low | Medium | Use OnnxStream (2 MB) instead of full ONNX Runtime (30 MB) |
| Multi-person association errors | High | Medium | Limit to 2 persons initially; use PAF + Hungarian; AETHER re-ID |

### 8.2 Hardware Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| Pi Zero 2 W supply shortage | Medium | Medium | Design also works with Pi 3A+ or Pi 4 |
| ESP32-S3 firmware instability | Low | Medium | Existing firmware battle-tested; OTA rollback |
| WiFi AP interference with CSI | Low | Low | Dedicated 2.4 GHz channel; ESP32 channel hopping |
| Power supply issues (brownout) | Low | Medium | Proper power supply; ESP32 brownout detection |

### 8.3 Research Risks

| Risk | Probability | Impact | Mitigation |
|------|------------|--------|------------|
| WiFlow results don't reproduce | Medium | High | Fall back to CSI-Former or MultiFormer architecture |
| ESP32 CSI fundamentally different from Intel 5300 | Medium | High | Collect ESP32-specific training data; subcarrier interpolation |
| Bone constraint loss doesn't improve edge accuracy | Low | Low | Remove if no benefit; constraint is simple and cheap |
| PCA compression loses critical CSI information | Low | Medium | Validate with ablation study; fall back to raw CSI if needed |

## 9. Dependency Graph (Action Ordering)

```
                    [esp32_csi_capture] (DONE)
                    /                    \
                   v                      v
    [Action 1: Protocol]          [training_pipeline] (DONE)
           |                      /        |        \
           v                     v         v         v
    [Action 6: Compression] [Action 2: Model] [Action 3: Bone Loss]
           |                     |              |
           |                     +------+-------+
           |                            v
           |                   [Action 4: Quantization]
           |                            |
           +---------------+------------+
                           v
                  [Action 5: Edge Engine]
                           |
                           v
                  [Action 7: Cross-Env] (Phase 2)
                           |
                           v
                  [Action 8: Multi-Person] (Phase 2)
                           |
                           v
                  [Action 9: 3D Lifting] (Phase 3)
```

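- **Critical path:** Action 2 -> Action 4 -> Action 5 (10 + 5 + 8 = 23 developer-days). Action 1 (3 days) is not a precondition of Action 2, but it must finish before Action 5 starts.
- **Parallel path:** Action 3 can proceed concurrently with Action 2
- **Parallel path:** Action 6 can proceed concurrently with Actions 2-4 once Action 1 is complete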
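
The ordering can be checked by computing each action's earliest finish time from its cost and prerequisites. The sketch below takes the prerequisites from the `preconditions` fields in Section 2 (so Action 3 waits for `lightweight_model`) and assumes enough developers to run independent actions in parallel. `Task`, `earliest_finish`, and `core_tasks` are hypothetical names.

```rust
/// (name, cost in developer-days, indices of prerequisite tasks).
/// Tasks must be listed so that prerequisites come first.
pub type Task = (&'static str, u32, Vec<usize>);

/// Earliest finish day of each task, assuming unlimited parallelism.
pub fn earliest_finish(tasks: &[Task]) -> Vec<u32> {
    let mut finish: Vec<u32> = Vec::with_capacity(tasks.len());
    for (_, cost, deps) in tasks {
        // A task starts when its slowest prerequisite has finished.
        let start = deps.iter().map(|&d| finish[d]).max().unwrap_or(0);
        finish.push(start + *cost);
    }
    finish
}

/// Actions 1-6 with costs and prerequisites taken from Section 2.
pub fn core_tasks() -> Vec<Task> {
    vec![
        ("Action 1: Protocol", 3, vec![]),           // index 0
        ("Action 2: Model", 10, vec![]),             // index 1
        ("Action 3: Bone Loss", 2, vec![1]),         // index 2
        ("Action 4: Quantization", 5, vec![1]),      // index 3
        ("Action 5: Edge Engine", 8, vec![0, 1, 3]), // index 4
        ("Action 6: Compression", 5, vec![0]),       // index 5
    ]
}
```

`earliest_finish(&core_tasks())` puts Action 5 at day 23, driven by the Action 2 -> Action 4 -> Action 5 chain, while Actions 1 and 6 finish by day 8 and leave slack.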

## 10. Success Criteria

### Phase 1 Exit Criteria

- [ ] WiFlowPose model trains to convergence on MM-Fi dataset
- [ ] PCK@20 >= 85% on MM-Fi validation set
- [ ] INT8 ONNX model size < 5 MB
- [ ] Bone constraint loss reduces physically implausible predictions by > 50%

### Phase 2 Exit Criteria

- [ ] edge_infer binary cross-compiles for aarch64 and runs on Pi Zero 2 W
- [ ] End-to-end latency < 150ms (CSI capture to pose output)
- [ ] Inference rate >= 10 Hz sustained
- [ ] PCA compression reduces bandwidth by >= 3x without > 5% accuracy loss
- [ ] Multi-node support (2 ESP32 nodes + 1 Pi Zero) working

### Phase 3 Exit Criteria

- [ ] Single-person MPJPE < 100mm in calibrated environment
- [ ] Cross-environment deployment works with 60-second calibration
- [ ] System runs continuously for 24 hours without crashes
- [ ] ESP32 OTA firmware update working for CSI compression parameters

### Phase 4 Exit Criteria

- [ ] 2-person pose separation working (MPJPE < 150mm per person)
- [ ] 3D pose estimation from 2+ nodes (Z-axis error < 200mm)
- [ ] Production monitoring and alerting operational

## 11. Relationship to Existing ADRs

| ADR | Relationship |
|-----|-------------|
| ADR-018 | Protocol v2 (Action 1) extends ADR-018 binary frame format |
| ADR-024 | AETHER re-ID embeddings used in multi-person tracking (Action 8) |
| ADR-027 | MERIDIAN cross-env generalization informs Action 7 |
| ADR-028 | ESP32 capability audit validates CSI quality assumptions |
| ADR-029 | RuvSense pipeline stages feed into edge inference (Action 5) |
| ADR-068 | Per-node state pipeline directly used by multi-node inference |

## 12. New ADR Required

**ADR-069: Edge Inference on Raspberry Pi Zero 2 W**

This implementation plan should be formalized as ADR-069 covering:
- Protocol v2 specification
- WiFlowPose architecture selection rationale
- Pi Zero deployment constraints and optimizations
- INT8 quantization strategy
- Cross-compilation approach
- Environment calibration protocol

Status: Proposed, pending this plan's approval.