From 1a3c6b4d11aefbc5cd735f0805edcb9cf5bd3988 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 8 Mar 2026 20:05:55 +0000 Subject: [PATCH] Add attention mechanisms for RF sensing research GOAP Agent 3 output: 1,110-line document covering GAT for RF graphs, self-attention for CSI sequences, cross-attention multi-link fusion, attention-weighted differentiable mincut, spatial node attention, antenna-level subcarrier attention, and efficient attention variants (linear, sparse, LSH, S4/Mamba). 8 ASCII architecture diagrams. Part of RF Topological Sensing research swarm (10 agents). https://claude.ai/code/session_01DGUAowNScGVp88bK2eiuRv --- .../03-attention-mechanisms-rf-sensing.md | 1110 +++++++++++++++++ 1 file changed, 1110 insertions(+) create mode 100644 docs/research/03-attention-mechanisms-rf-sensing.md diff --git a/docs/research/03-attention-mechanisms-rf-sensing.md b/docs/research/03-attention-mechanisms-rf-sensing.md new file mode 100644 index 00000000..95beecff --- /dev/null +++ b/docs/research/03-attention-mechanisms-rf-sensing.md @@ -0,0 +1,1110 @@ +# Attention Mechanisms for RF Topological Sensing + +## A Comprehensive Survey for WiFi-DensePose / RuView + +**Document**: 03-attention-mechanisms-rf-sensing +**Date**: 2026-03-08 +**Status**: Research Reference +**Scope**: Attention architectures for graph-based RF sensing where ESP32 nodes +form a dynamic signal topology and minimum cut partitioning detects human +presence, pose, and activity. + +--- + +## Table of Contents + +1. [Introduction and Problem Setting](#1-introduction-and-problem-setting) +2. [Graph Attention Networks for RF Sensing Graphs](#2-graph-attention-networks-for-rf-sensing-graphs) +3. [Self-Attention for CSI Sequences](#3-self-attention-for-csi-sequences) +4. [Cross-Attention for Multi-Link Fusion](#4-cross-attention-for-multi-link-fusion) +5. [Attention-Weighted Minimum Cut](#5-attention-weighted-minimum-cut) +6. 
[Spatial Attention for Node Importance](#6-spatial-attention-for-node-importance) +7. [Antenna-Level Attention](#7-antenna-level-attention) +8. [Efficient Attention for Resource-Constrained Deployment](#8-efficient-attention-for-resource-constrained-deployment) +9. [Unified Architecture](#9-unified-architecture) +10. [References and Further Reading](#10-references-and-further-reading) + +--- + +## 1. Introduction and Problem Setting + +### 1.1 RF Topological Sensing Model + +RF topological sensing models a physical space as a dynamic signal graph +G = (V, E, W) where: + +- **Vertices V**: ESP32 nodes placed in the environment (typically 4-8 nodes) +- **Edges E**: Bidirectional TX-RX links between node pairs +- **Weights W**: Signal coherence metrics derived from Channel State Information (CSI) + +A person moving through the space perturbs the RF field, causing coherence +drops along links whose Fresnel zones intersect the person's body. Minimum +cut partitioning of this weighted graph identifies the boundary between +perturbed and unperturbed subgraphs, localizing the person. + +``` + RF Topological Sensing — Conceptual Model + ========================================== + + Physical Space Signal Graph G = (V, E, W) + +-----------------------+ + | | N1 ----0.92---- N2 + | [N1] [N2] | / \ / \ + | \ / | 0.31 0.87 0.45 0.91 + | \ P / | / \ / \ + | \../ | N4 --0.28-- N5 --0.89-- N3 + | [N4]...[P]....[N3] | \ / + | / \ | 0.93 ------ 0.90 + | / \ | + | [N5] [N6] | Low weights (0.28, 0.31, 0.45) indicate + | | links crossing the person P's position. + +-----------------------+ Mincut separates {N4,N5} from {N1,N2,N3,N6}. +``` + +### 1.2 Why Attention Mechanisms + +Traditional RF sensing uses hand-crafted features: amplitude variance, +phase difference, subcarrier correlation. These have three fundamental +limitations: + +1. **Static edge weighting**: Fixed formulas cannot adapt to environment + changes (furniture moved, temperature drift, multipath evolution). +2. 
**Uniform link treatment**: All TX-RX pairs contribute equally regardless + of geometric information content. +3. **No temporal context**: Each CSI frame is processed independently, + ignoring the sequential structure of human motion. + +Attention mechanisms address all three by learning to weight information +sources (subcarriers, time steps, links, and nodes) according to their +relevance for the downstream task. + +### 1.3 Notation + +| Symbol | Meaning | +|--------|---------| +| N | Number of ESP32 nodes | +| L = N(N-1)/2 | Number of bidirectional links | +| S | Number of OFDM subcarriers (typically 52 or 114) | +| T | Number of time steps in a CSI window | +| H_l(t) in C^S | CSI vector for link l at time t (written H_ij(t) for the link between nodes i and j) | +| d_k | Attention key/query dimension | +| h | Number of attention heads | + +--- + +## 2. Graph Attention Networks for RF Sensing Graphs + +### 2.1 From Static Weights to Learned Attention + +In a standard graph formulation, the adjacency matrix A has entries a_ij +representing signal coherence between nodes i and j. Graph Attention Networks +(GATs) replace these fixed weights with learned attention coefficients that +adapt based on the node features. + +Given node feature vectors x_i in R^F for each ESP32 node i, GAT computes +attention coefficients: + +``` + e_ij = LeakyReLU(a^T [W x_i || W x_j]) + + alpha_ij = softmax_j(e_ij) = exp(e_ij) / sum_k(exp(e_ik)) +``` + +where: +- W in R^{F' x F} is a learnable weight matrix +- a in R^{2F'} is a learnable attention vector +- || denotes concatenation +- The softmax normalizes over all neighbors j of node i + +The updated node representation becomes: + +``` + x_i' = sigma( sum_j alpha_ij W x_j ) +``` + +### 2.2 Node Features from CSI + +For RF sensing, node features are not given directly. Each ESP32 node +participates in multiple links, and each link produces CSI streams. 
We +construct node features by aggregating incoming link information: + +``` + x_i = AGG({ f(H_ij(t)) : j in N(i), t in [T] }) +``` + +where f is a feature extractor (e.g., amplitude statistics, phase slope) +and AGG is mean or max pooling over neighbors and time. + +``` + Node Feature Construction + ========================= + + Links to Node N1: Feature Extraction: Node Feature: + + N2->N1: H_21(1..T) ---> f(H_21) = [amp_var, \ + N3->N1: H_31(1..T) ---> f(H_31) = phase_slope, > AGG --> x_1 in R^F + N4->N1: H_41(1..T) ---> f(H_41) = corr, ...] / + N5->N1: H_51(1..T) ---> f(H_51) / +``` + +### 2.3 Multi-Head Attention for RF Graphs + +Single-head attention captures one notion of relevance. Multi-head attention +runs h independent attention computations and concatenates or averages: + +``` + x_i' = ||_{k=1}^{h} sigma( sum_j alpha_ij^(k) W^(k) x_j ) +``` + +For RF sensing, different heads can specialize in different phenomena: + +| Head | Learned Specialization | +|------|----------------------| +| Head 1 | Line-of-sight path quality | +| Head 2 | Multipath richness (scattering) | +| Head 3 | Temporal stability (static vs dynamic) | +| Head 4 | Frequency selectivity (subcarrier variance) | + +### 2.4 Edge-Featured GAT for RF Links + +Standard GAT only uses node features to compute attention. In RF sensing, +edges carry rich information (the CSI itself). Edge-featured GAT +incorporates edge attributes e_ij directly: + +``` + e_ij = LeakyReLU(a^T [W_n x_i || W_n x_j || W_e e_ij]) +``` + +where e_ij in R^E contains link-level features: +- Mean amplitude across subcarriers +- Phase coherence (circular variance) +- Doppler shift estimate +- Signal-to-noise ratio +- Fresnel zone geometry (distance, angle) + +``` + Edge-Featured GAT — RF Sensing + ================================ + + x_i x_j + | | + v v + [W_n x_i] [W_n x_j] + | | + +--- CONCAT ---+--- CONCAT ---+ + | | + [W_e e_ij] | + | | + [ a^T [...] 
] | + | | + LeakyReLU | + | | + alpha_ij | + | | + alpha_ij * W x_j ---+---> contribution to x_i' +``` + +### 2.5 GATv2: Dynamic Attention + +The original GAT has a "static attention" limitation: the ranking of +attention coefficients over keys (neighbors) is the same for every query +node, so it cannot depend on the query. GATv2 fixes this by applying the +nonlinearity after the linear map of the concatenated features but before +the dot product with a: + +``` + e_ij = a^T LeakyReLU(W [x_i || x_j]) +``` + +This is strictly more expressive and matters for RF sensing, where each +node should rank its neighbors according to its own state, a dynamic +property essential for tracking moving targets. + +--- + +## 3. Self-Attention for CSI Sequences + +### 3.1 Temporal Structure of CSI + +CSI measurements arrive as time series at 100-1000 Hz. Human motion creates +characteristic temporal patterns: periodic breathing modulates amplitude +at 0.2-0.5 Hz, walking creates 1-2 Hz Doppler signatures, and gestures +produce transient bursts. Self-attention over CSI sequences identifies +which time steps carry the most information for graph weight updates. + +### 3.2 Transformer Self-Attention on CSI + +Given a CSI sequence H = [h_1, h_2, ..., h_T] where h_t in R^S is the +CSI vector at time t, self-attention computes: + +``` + Q = H W_Q, K = H W_K, V = H W_V + + Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V +``` + +The attention matrix A in R^{T x T} has entry A_st representing how much +time step t attends to time step s. This captures: + +- **Periodic structure**: Breathing cycles create diagonal band patterns +- **Motion onset**: Sudden movements create high attention to transition frames +- **Static periods**: Uniformly low attention during no-activity intervals + +``` + Self-Attention on CSI Time Series + ================================== + + Input: T time steps of S-dimensional CSI vectors + + h_1 h_2 h_3 ... 
h_T Time steps + | | | | + v v v v + [ Linear Projections Q, K, V ] + | | | | + v v v v + [ Scaled Dot-Product Attention ] + | | | | + v v v v + z_1 z_2 z_3 ... z_T Contextualized representations + + Attention Pattern (breathing example): + + t1 t2 t3 t4 t5 t6 t7 t8 + t1 [ .9 .3 .1 .0 .7 .2 .1 .0 ] <-- attends to t1, t5 + t2 [ .3 .9 .3 .1 .2 .7 .3 .1 ] (same phase of + t3 [ .1 .3 .9 .3 .1 .2 .7 .3 ] breathing cycle) + t4 [ .0 .1 .3 .9 .0 .1 .3 .8 ] + ... + Diagonal bands indicate periodic self-similarity. +``` + +### 3.3 Positional Encoding for CSI + +CSI time series require positional encoding to preserve temporal ordering. +Sinusoidal positional encodings work well, but learnable encodings tuned +to the CSI sampling rate can capture hardware-specific timing patterns: + +``` + PE(t, 2i) = sin(t / 10000^{2i/d}) + PE(t, 2i+1) = cos(t / 10000^{2i/d}) +``` + +For 100 Hz CSI with T=128 window, the positional encoding must resolve +10 ms differences. An alternative is relative positional encoding (RPE) +which encodes the time difference (t - s) rather than absolute position, +making the model invariant to window start time. + +### 3.4 Causal vs. Bidirectional Attention + +For real-time sensing, causal (masked) attention is necessary — time step t +can only attend to steps 1..t: + +``` + Mask_st = { 0 if s <= t + { -inf if s > t + + A = softmax((Q K^T + Mask) / sqrt(d_k)) +``` + +For offline analysis (e.g., training data labeling), bidirectional attention +provides richer context by allowing each step to attend to the full window. + +### 3.5 Temporal Attention Pooling for Edge Weights + +The key application is collapsing the time dimension into a single edge +weight for graph construction. Attention-weighted temporal pooling: + +``` + w_ij = sum_t alpha_t * g(z_t^{ij}) + + where alpha_t = softmax(v^T tanh(W_a z_t^{ij})) +``` + +Here z_t^{ij} is the contextualized CSI representation for link (i,j) +at time t, and g maps to a scalar coherence score. 
The attention weights +alpha_t learn to focus on the most informative moments — for example, +the peak of a Doppler burst during a gesture. + +--- + +## 4. Cross-Attention for Multi-Link Fusion + +### 4.1 Inter-Link Dependencies + +In a multistatic RF sensing setup, links are not independent. A person +walking between nodes N1 and N3 simultaneously affects links (N1,N3), +(N2,N3), and (N1,N4) to varying degrees. Cross-attention captures these +correlations by allowing each link's representation to attend to all +other links. + +### 4.2 Formulation + +Let Z^{ij} in R^{T x d} be the temporal CSI embedding for link (i,j) +after self-attention. Cross-attention between link (i,j) and all other +links: + +``` + Q = Z^{ij} W_Q (query from target link) + K = [Z^{kl}] W_K (keys from all links, stacked) + V = [Z^{kl}] W_V (values from all links, stacked) + + CrossAttn(ij) = softmax(Q K^T / sqrt(d_k)) V +``` + +### 4.3 Architecture + +``` + Cross-Attention for Multi-Link Fusion + ====================================== + + Link (1,2) Link (1,3) Link (2,3) Link (2,4) ... + | | | | + [Self-Attn] [Self-Attn] [Self-Attn] [Self-Attn] + | | | | + v v v v + Z^12 Z^13 Z^23 Z^24 + | | | | + +------+-------+------+------+------+------+ + | | | + [Cross-Attn] [Cross-Attn] [Cross-Attn] ... + | | | + v v v + C^12 C^13 C^23 + | | | + [Edge Score] [Edge Score] [Edge Score] + | | | + v v v + w_12 w_13 w_23 + + Each link attends to all other links to capture + spatial correlations from shared human targets. +``` + +### 4.4 Geometric Bias in Cross-Attention + +Links that are physically close or share a node should have baseline +higher attention. 
We introduce a geometric bias G_bias: + +``` + A = softmax((Q K^T + G_bias) / sqrt(d_k)) V +``` + +where G_bias_mn encodes the geometric relationship between link m and +link n: + +``` + G_bias_mn = -beta * d_Fresnel(m, n) + gamma * shared_node(m, n) +``` + +- d_Fresnel: distance between Fresnel zone centers +- shared_node: 1 if links share an endpoint, 0 otherwise +- beta, gamma: learnable parameters + +This is the concept implemented in RuVector's `CrossViewpointAttention` +with `GeometricBias`: the attention mechanism is biased toward +geometrically meaningful link combinations while still allowing the model +to discover non-obvious correlations. + +### 4.5 Hierarchical Cross-Attention + +For N nodes with L = N(N-1)/2 links, full cross-attention is O(L^2). +A hierarchical approach reduces this: + +1. **Node-local fusion**: Each node aggregates its incident links (O(N) links per node) +2. **Node-to-node attention**: Cross-attention between node representations (O(N^2)) +3. **Back-projection**: Node attention weights propagate back to link scores + +``` + Level 1 (Link -> Node): Links incident to Ni --> aggregate --> n_i + Level 2 (Node -> Node): {n_1, ..., n_N} --> Cross-Attn --> {n_1', ..., n_N'} + Level 3 (Node -> Link): n_i', n_j' --> project --> w_ij +``` + +This reduces complexity from O(L^2) = O(N^4) to O(N^2), critical for +dense meshes with 6-8 nodes (15-28 links). + +--- + +## 5. Attention-Weighted Minimum Cut + +### 5.1 Classical Minimum Cut + +Given graph G = (V, E, W), the minimum s-t cut partitions V into S and T +such that s in S, t in T, and the cut weight is minimized: + +``` + mincut(S, T) = sum_{(i,j): i in S, j in T} w_ij +``` + +For RF sensing, we seek the normalized cut (Ncut) which balances partition +sizes: + +``` + Ncut(S, T) = cut(S,T)/assoc(S,V) + cut(S,T)/assoc(T,V) +``` + +where assoc(S,V) = sum of all edge weights incident to S. + +### 5.2 Differentiable Relaxation + +Minimizing Ncut exactly over discrete partitions is NP-hard (the plain +s-t mincut, by contrast, is solvable in polynomial time). 
The spectral relaxation uses the +graph Laplacian L = D - W (D is the degree matrix): + +``` + min_y y^T L y / y^T D y subject to y in {-1, +1}^N + + Relaxed: min_y y^T L y / y^T D y, y in R^N, y^T D 1 = 0 +``` + +The constraint y^T D 1 = 0 excludes the trivial constant solution. The +minimizer is the generalized eigenvector of L y = lambda D y with the +smallest nonzero eigenvalue, i.e. the Fiedler vector of the normalized +Laplacian up to a D^{-1/2} rescaling. + +### 5.3 Attention as Edge Scoring for MinCut + +The key insight: replace fixed edge weights with attention-computed scores +that are differentiable end-to-end. Given raw CSI features, attention +produces edge weights, which feed into a differentiable mincut layer: + +``` + Attention-Weighted Differentiable MinCut Pipeline + ================================================== + + Raw CSI Frames Differentiable MinCut + per link (i,j) + + H_12 --+ W = {w_ij} + H_13 --+--> [Attention ] --> | + H_23 --+ [ Modules ] [Build Laplacian L = D - W] + H_24 --+ [Sec 2,3,4,7 ] | + H_34 --+ [Soft assignment S = softmax(X)] + ... --+ | + [MinCut loss: Tr(S^T L S) / Tr(S^T D S)] + | + [Backprop through attention weights] +``` + +### 5.4 Soft MinCut Assignment + +Instead of hard cluster assignments, use a soft assignment matrix +S in R^{N x K} where K is the number of clusters: + +``` + S = softmax(MLP(X)) where X = GNN(node_features, W) + + L_cut = -Tr(S^T A S) / Tr(S^T D S) (MinCut loss) + L_orth = || S^T S / ||S^T S||_F - I/sqrt(K) ||_F (Orthogonality) + + L_total = L_cut + lambda * L_orth +``` + +The attention-computed edge weights W flow into A (adjacency), D (degree), +and through the GNN into S. The entire pipeline is differentiable, allowing +the attention mechanism to learn edge weights that produce meaningful cuts. + +### 5.5 Mincut Attention Loss + +The training signal for attention comes from two sources: + +1. **Supervised**: Ground-truth person location determines which links + should have low weights (those crossing the person's body). + +2. 
**Self-supervised**: The mincut objective itself provides a training + signal — attention weights that produce cleaner cuts (lower Ncut value + with balanced partitions) are reinforced. + +``` + L_attention = L_supervised + alpha * L_mincut + beta * L_regularization + + L_supervised = BCE(w_ij, y_ij) (y_ij = 1 if link unobstructed) + L_mincut = Ncut(S*, T*) (quality of resulting partition) + L_regularization = sum_ij |alpha_ij| * H(alpha_ij) (attention entropy) +``` + +The entropy regularization H(alpha) prevents attention collapse (all weight +on one link) or uniform attention (no discrimination). + +--- + +## 6. Spatial Attention for Node Importance + +### 6.1 Motivation + +Not all ESP32 nodes contribute equally. A node in a corner has fewer +intersecting Fresnel zones than a central node. A node with hardware +degradation may produce noisy CSI. Spatial attention learns to weight +nodes by their information contribution. + +### 6.2 Node Importance Scoring + +For each node i, compute an importance score: + +``` + s_i = sigma(w^T [x_i || g_i || q_i]) +``` + +where: +- x_i: node feature vector (from CSI aggregation) +- g_i: geometric feature (position, angle coverage, Fresnel density) +- q_i: quality feature (SNR, packet loss rate, timing jitter) + +The importance score gates the node's contribution: + +``` + x_i_gated = s_i * x_i +``` + +### 6.3 Squeeze-and-Excitation for Node Graphs + +Adapted from channel attention in CNNs, Squeeze-and-Excitation (SE) +for node graphs: + +``` + 1. Squeeze: z = (1/N) sum_i x_i (global node pooling) + 2. Excite: s = sigma(W_2 ReLU(W_1 z)) (per-node importance) + 3. 
Scale: x_i' = s_i * x_i (reweight nodes) +``` + +``` + Squeeze-and-Excitation for ESP32 Node Graph + ============================================= + + Node features: x_1 x_2 x_3 x_4 x_5 x_6 + | | | | | | + +--+--+--+--+--+--+--+--+--+--+ + | + [Global Pool z] + | + [FC -> ReLU -> FC -> Sigmoid] + | + s_1 s_2 s_3 s_4 s_5 s_6 + | | | | | | + * * * * * * + | | | | | | + x_1' x_2' x_3' x_4' x_5' x_6' + + Example: Node 3 (occluded corner) gets s_3 = 0.2 + Node 5 (central, clear LoS) gets s_5 = 0.9 +``` + +### 6.4 Fisher Information-Based Attention + +From estimation theory, the Fisher Information quantifies how much a +measurement contributes to parameter estimation. For node i observing +target at position theta: + +``` + FI_i(theta) = E[ (d/d_theta log p(H_i | theta))^2 ] +``` + +Nodes with higher Fisher Information provide more localization accuracy. +This can be computed analytically for simple signal models or approximated +via the Cramer-Rao bound. The Geometric Diversity Index from RuVector's +`geometry.rs` module implements a related concept. + +### 6.5 Dynamic Node Dropout + +Spatial attention naturally enables dynamic node dropout — nodes with +importance below a threshold are excluded from graph construction: + +``` + V_active = { i in V : s_i > tau } + E_active = { (i,j) in E : i in V_active AND j in V_active } +``` + +This provides robustness to node failures and reduces computation when +some nodes are uninformative (e.g., all links from a node are in deep +shadow). + +--- + +## 7. Antenna-Level Attention + +### 7.1 Subcarrier-Level CSI Features + +Each CSI measurement contains S subcarriers (52 for 20 MHz, 114 for 40 MHz +802.11n). 
Not all subcarriers are equally informative: + +- Subcarriers near null frequencies carry noise +- Subcarriers in frequency-selective fading notches are unreliable +- Subcarriers near the band edges have lower SNR +- Different subcarriers have different sensitivity to motion at different + distances (wavelength-dependent Fresnel zone widths) + +### 7.2 Antenna Attention Mechanism + +RuVector's `apply_antenna_attention` concept applies attention at the +subcarrier level before any graph construction. For a CSI vector +h in C^S: + +``` + h_real = [Re(h) || Im(h)] in R^{2S} + a = softmax(W_2 ReLU(W_1 h_real + b_1) + b_2) in R^S + h_attended = a odot h in C^S +``` + +where odot is element-wise multiplication (the attention weights are +real-valued but applied to complex CSI). + +``` + Antenna-Level Attention (Before Graph Construction) + ==================================================== + + Raw CSI: h = [h_1, h_2, ..., h_S] (S complex subcarriers) + | | | + [Re/Im decompose + concat] + | + [FC -> ReLU -> FC -> Softmax] + | + Attention: a = [a_1, a_2, ..., a_S] (S real weights, sum = 1) + | | | + * * * (element-wise) + | | | + Attended: h' = [a_1*h_1, a_2*h_2, ..., a_S*h_S] + | + [Feature extraction] + | + [Graph edge weight w_ij] + + Subcarrier attention map (example, 52 subcarriers): + + Attention ^ + weight | ** ** + | * * ***** * * + | * * * * * * + | * * * * * * + |*** ****** ********* *** + +-------------------------------------------------> + 10 20 30 40 50 + Subcarrier index + + Peaks at subcarriers most affected by target motion. + Nulls at subcarriers dominated by static multipath. +``` + +### 7.3 Multi-Antenna Attention + +With multiple antennas (MIMO), attention operates across both antenna +and subcarrier dimensions. 
For an A-antenna, S-subcarrier system, +the CSI tensor H in C^{A x S}: + +``` + Antenna attention: a_ant in R^A (which antennas matter) + Subcarrier attention: a_sub in R^S (which frequencies matter) + + Joint attention: A_joint = a_ant * a_sub^T in R^{A x S} + Attended CSI: H' = A_joint odot H in C^{A x S} +``` + +This factored attention (rank-1) is parameter-efficient. A full attention +matrix A in R^{A*S x A*S} is more expressive but requires A*S times more +computation. + +### 7.4 Temporal-Spectral Attention + +Combining subcarrier attention with temporal attention creates a 2D +attention map over the time-frequency representation of CSI: + +``` + Time-Frequency Attention Map + ============================= + + Subcarrier ^ + (freq) | . . . . . . . . . . . . + 52 | . . . . . . . . . . . . + | . . . . # # . . . . . . + 40 | . . . # # # # . . . . . + | . . . # # # # . . . . . + 30 | . . # # # # # # . . . . + | . . . # # # # . . . . . + 20 | . . . . # # . . . . . . + | . . . . . . . . . . . . + 10 | . . . . . . . . . . . . + | . . . . . . . . . . . . + 1 | . . . . . . . . . . . . + +---+---+---+---+---+---+---+---+---+---> + 20 40 60 80 100 120 140 160 180 + Time step + + '#' = high attention (motion event at t=60-120, f=20-45) + '.' = low attention (static or noise) +``` + +This is essentially a learned spectrogram filter that isolates the +time-frequency regions containing target motion signatures. + +### 7.5 Connection to Sparse Subcarrier Selection + +RuVector's `subcarrier_selection.rs` uses mincut-based selection to reduce +114 subcarriers to 56 for efficiency. Antenna-level attention provides a +soft version of this: instead of hard selection, it continuously weights +subcarriers. The hard selection can be derived from attention weights: + +``` + selected_subcarriers = top_k(a, k=56) +``` + +Or using Gumbel-Softmax for differentiable discrete selection during +training. + +--- + +## 8. 
Efficient Attention for Resource-Constrained Deployment + +### 8.1 The Quadratic Bottleneck + +Standard self-attention has O(T^2) time and memory complexity. For +CSI sequences with T=512 at 100 Hz (5.12 seconds), the attention matrix +has 262,144 entries per head. On ESP32 with 520 KB SRAM, this is +prohibitive. + +### 8.2 Linear Attention + +Linear attention replaces the softmax with kernel decomposition: + +``` + Standard: Attn(Q,K,V) = softmax(QK^T/sqrt(d)) V O(T^2 d) + + Linear: Attn(Q,K,V) = phi(Q) (phi(K)^T V) O(T d^2) +``` + +where phi is a feature map (e.g., elu(x) + 1, or random Fourier features). +The key insight is associativity: computing (K^T V) first yields a +d x d matrix, then multiplying by Q is O(T d^2), which is linear in T +when d << T. + +For CSI with d_k = 64 and T = 512, this reduces computation by 8x. + +``` + Standard vs Linear Attention + ============================= + + Standard (O(T^2 d)): Linear (O(T d^2)): + + Q [T x d] phi(Q) [T x d'] + \ \ + * K^T [d x T] * (phi(K)^T V) [d' x d] + \ \ + [T x T] (large!) [T x d] (small!) + \ | + * V [T x d] | (done) + \ | + [T x d] [T x d] +``` + +### 8.3 Sparse Attention Patterns + +Instead of full T x T attention, use structured sparsity: + +**Local Window Attention**: Each position attends to a window of w neighbors: + +``` + A_st = { QK^T/sqrt(d) if |s - t| <= w/2 + { -inf otherwise +``` + +Complexity: O(T * w) with w << T. For CSI at 100 Hz, w = 32 covers +320 ms — sufficient for most motion events. + +**Dilated Attention**: Attend to positions at exponentially increasing gaps: + +``` + Attend to: t-1, t-2, t-4, t-8, t-16, t-32, ... +``` + +This provides O(T log T) complexity while maintaining long-range context. + +**Strided Attention**: Combine local and strided patterns (as in Longformer): + +``` + Attention Pattern (T=16, window=3, stride=4): + + 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 + 1 [ x x . x . . . . x . . . x . . . ] + 2 [ x x x . x . . . . x . . . x . . ] + 3 [ . x x x . x . . . . 
x . . . x . ] + 4 [ x . x x x . x . . . . x . . . x ] + ... + x = attends, . = masked + Local window (3) + every 4th position for global context +``` + +### 8.4 Locality-Sensitive Hashing (LSH) Attention + +LSH attention (from Reformer) groups similar queries and keys into buckets, +computing attention only within buckets: + +``` + 1. Hash Q and K into b buckets using LSH + 2. Sort by bucket assignment + 3. Compute attention within each bucket + + Complexity: O((T/b)^2) per bucket (about T/b items each), O(T^2 / b) total + With b = sqrt(T): O(T * sqrt(T)) +``` + +For RF sensing, LSH naturally groups similar CSI patterns: time steps +with similar signal characteristics attend to each other, which is +physically meaningful (similar body poses produce similar CSI). + +### 8.5 Quantized Attention for ESP32 + +For edge deployment on ESP32: + +``` + INT8 Quantized Attention: + + Q_int8 = clamp(round(Q / scale_Q), -128, 127) + K_int8 = clamp(round(K / scale_K), -128, 127) + + Scores_int32 = Q_int8 * K_int8^T (INT8 matmul, INT32 accumulation; + INT16 overflows for d_k = 64) + A = softmax(dequantize(Scores_int32)) (back to FP32 for softmax) + + Memory: Q,K in INT8 uses 1/4 the SRAM of FP32 + Compute: INT8 matmul is 2-4x faster on ESP32-S3 +``` + +### 8.6 Attention-Free Alternatives + +For the most constrained scenarios, attention-free architectures can +approximate attention behavior: + +**Gated Linear Units (GLU)**: +``` + y = (X W_1 + b_1) odot sigma(X W_2 + b_2) +``` + +**State Space Models (S4/Mamba)**: +``` + x_t = A x_{t-1} + B u_t + y_t = C x_t + D u_t + + With structured A matrix: O(T log T) via FFT +``` + +S4 models are particularly promising for CSI sequences because: +- O(T) inference (vs O(T^2) for attention) +- Natural handling of continuous-time signals +- Long-range dependency capture through structured state matrices +- Efficient on sequential hardware (no parallel attention needed) + +### 8.7 Deployment Decision Matrix + +``` + +--------------------+--------+---------+--------+----------+ + | Method | Memory | 
Compute | Range | Platform | + +--------------------+--------+---------+--------+----------+ + | Full Attention | O(T^2) | O(T^2d) | Global | Server | + | Linear Attention | O(Td) | O(Td^2) | Global | Edge GPU | + | Window Attention | O(Tw) | O(Twd) | Local | RPi/Jetson| + | Dilated Attention | O(TlgT)| O(TlgTd)| Global | RPi | + | LSH Attention | O(TsqT)| O(TsqTd)| Global | Edge GPU | + | INT8 Quantized | O(T^2) | O(T^2d) | Global | ESP32-S3 | + | GLU (no attention) | O(Td) | O(Td) | Local | ESP32 | + | S4/Mamba | O(d^2) | O(Td) | Global | ESP32 | + +--------------------+--------+---------+--------+----------+ + + T = sequence length, d = model dimension, w = window size +``` + +--- + +## 9. Unified Architecture + +### 9.1 Full Pipeline + +Combining all attention mechanisms into a unified RF sensing pipeline: + +``` + Unified Attention Architecture for RF Topological Sensing + ========================================================== + + LAYER 0: RAW CSI ACQUISITION + +-----------------------------------------------------------+ + | ESP32 Node i <---> ESP32 Node j | + | H_ij in C^{A x S x T} (antennas x subcarriers x time) | + +-----------------------------------------------------------+ + | + v + LAYER 1: ANTENNA-LEVEL ATTENTION (Section 7) + +-----------------------------------------------------------+ + | Per-link subcarrier weighting | + | a_sub = SoftAttn(H_ij) in R^S | + | H_ij' = a_sub odot H_ij | + | Reduces noise, emphasizes motion-sensitive subcarriers | + +-----------------------------------------------------------+ + | + v + LAYER 2: TEMPORAL SELF-ATTENTION (Section 3) + +-----------------------------------------------------------+ + | Per-link temporal context | + | Z_ij = SelfAttn(H_ij'[t=1..T]) | + | Captures breathing, gait, gesture patterns | + | Uses efficient attention (Section 8) for long sequences | + +-----------------------------------------------------------+ + | + v + LAYER 3: CROSS-LINK ATTENTION (Section 4) + 
+-----------------------------------------------------------+ + | Inter-link dependency modeling | + | C_ij = CrossAttn(Z_ij, {Z_kl : all links}) | + | With geometric bias G_bias from node positions | + | Captures multi-link correlations from shared targets | + +-----------------------------------------------------------+ + | + v + LAYER 4: EDGE WEIGHT COMPUTATION + +-----------------------------------------------------------+ + | w_ij = MLP(TemporalPool(C_ij)) | + | Temporal pooling with attention (Section 3.5) | + | Produces scalar edge weight per link | + +-----------------------------------------------------------+ + | + v + LAYER 5: GRAPH ATTENTION NETWORK (Section 2) + +-----------------------------------------------------------+ + | Multi-head GAT with edge features | + | x_i' = GAT(x_i, {x_j, w_ij, e_ij}) | + | Refines node representations using graph structure | + +-----------------------------------------------------------+ + | + v + LAYER 6: SPATIAL NODE ATTENTION (Section 6) + +-----------------------------------------------------------+ + | Node importance weighting | + | s_i = SE_Block(x_i') | + | Suppresses noisy or uninformative nodes | + +-----------------------------------------------------------+ + | + v + LAYER 7: DIFFERENTIABLE MINCUT (Section 5) + +-----------------------------------------------------------+ + | Soft cluster assignment with attention-weighted edges | + | S = softmax(MLP(x')) | + | L = L_cut + L_orth + L_supervised | + | Partitions graph at human body boundaries | + +-----------------------------------------------------------+ + | + v + OUTPUT: Person detection, localization, pose estimation +``` + +### 9.2 Training Strategy + +**Stage 1: Pretrain antenna attention** (Section 7) on single-link CSI +with signal quality labels. This bootstraps meaningful subcarrier +weighting before full pipeline training. + +**Stage 2: Train temporal + cross-link attention** (Sections 3-4) with +link-level activity labels. 
The model learns to identify active links. + +**Stage 3: End-to-end fine-tuning** with mincut loss (Section 5) and +person location supervision. All attention mechanisms adapt jointly. + +**Stage 4: Distillation for edge deployment** — train efficient variants +(Section 8) to match the full model's attention patterns using KL +divergence between attention distributions. + +### 9.3 Computational Budget + +For a 6-node mesh (15 links, 52 subcarriers, T=128 time steps): + +``` + Component | FLOPs/frame | Parameters | Memory + -----------------------+---------------+------------+--------- + Antenna attention (x15)| 15 * 5K | 5K | 15 KB + Temporal self-attn | 15 * 1M | 50K | 200 KB + Cross-link attention | 15^2 * 100K | 100K | 500 KB + GAT (2 layers) | 6 * 50K | 30K | 50 KB + Spatial attention | 6 * 1K | 2K | 5 KB + MinCut MLP | 6 * 10K | 10K | 10 KB + -----------------------+---------------+------------+--------- + Total | ~40M | ~200K | ~800 KB +``` + +This fits within a Raspberry Pi 4 (1 GB RAM, 4-core ARM Cortex-A72) for +real-time inference at 10 Hz. For ESP32 deployment, the efficient variants +from Section 8 reduce this by 10-50x. + +### 9.4 Relation to RuView Codebase + +The unified architecture maps directly to existing RuView modules: + +| Architecture Layer | RuView Module | File | +|---|---|---| +| Antenna Attention | ruvector-attn-mincut | `model.rs` (apply_antenna_attention) | +| Temporal Self-Attention | ruvsense | `gesture.rs`, `intention.rs` | +| Cross-Link Attention | ruvector viewpoint | `attention.rs` (CrossViewpointAttention) | +| Geometric Bias | ruvector viewpoint | `geometry.rs` (GeometricDiversityIndex) | +| Edge Weight Computation | ruvsense | `coherence.rs`, `coherence_gate.rs` | +| Graph Attention | ruvector-mincut | `metrics.rs` (DynamicPersonMatcher) | +| Spatial Node Attention | ruvsense | `multistatic.rs` (attention-weighted fusion) | +| Differentiable MinCut | ruvector-mincut | core mincut algorithm | + +--- + +## 10. 
References and Further Reading + +### Foundational Attention Papers + +1. Vaswani et al., "Attention Is All You Need," NeurIPS 2017. + - Original transformer self-attention mechanism. + +2. Velickovic et al., "Graph Attention Networks," ICLR 2018. + - GAT: attention-based message passing on graphs. + +3. Brody et al., "How Attentive are Graph Attention Networks?" ICLR 2022. + - GATv2: dynamic attention fixing GAT's static limitation. + +### Efficient Attention + +4. Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive + Transformers with Linear Attention," ICML 2020. + - Linear attention via kernel feature maps. + +5. Kitaev et al., "Reformer: The Efficient Transformer," ICLR 2020. + - LSH attention for subquadratic complexity. + +6. Beltagy et al., "Longformer: The Long-Document Transformer," 2020. + - Windowed + global attention patterns. + +7. Gu et al., "Efficiently Modeling Long Sequences with Structured State + Spaces (S4)," ICLR 2022. + - State space models as attention alternatives. + +8. Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State + Spaces," 2023. + - Selective SSM with input-dependent gating. + +### WiFi Sensing + +9. Wang et al., "Wi-Pose: WiFi-based Multi-Person Pose Estimation," 2021. + - WiFi CSI for human pose estimation. + +10. Yang et al., "MM-Fi: Multi-Modal Non-Intrusive 4D Human Dataset," NeurIPS 2023. + - Large-scale WiFi sensing dataset with multi-modal ground truth. + +11. Wang et al., "Person-in-WiFi: Fine-Grained Person Perception Using + WiFi," ICCV 2019. + - Body segmentation and pose estimation from WiFi signals. + +### Graph Partitioning + +12. Bianchi et al., "Spectral Clustering with Graph Neural Networks for + Graph Pooling," ICML 2020. + - Differentiable mincut pooling with GNNs. + +13. Stoer and Wagner, "A Simple Min-Cut Algorithm," JACM 1997. + - Classical efficient mincut algorithm. + +### RF Sensing Theory + +14. Adib and Katabi, "See Through Walls with WiFi!" SIGCOMM 2013.
+ - Foundational work on WiFi-based sensing. + +15. Wang et al., "Placement Matters: Understanding the Effects of Device + Placement for WiFi Sensing," 2022. + - Fresnel zone analysis for optimal node placement. + +--- + +*End of document. This research reference supports the attention mechanism +design choices in the RuView/WiFi-DensePose RF topological sensing system.*