# BFLD Benchmarks and Evaluation Strategy

## 1. Datasets

### 1.1 BFId Dataset (Primary)

**Reference**: Todt, Morsbach, Strufe; KIT. ACM CCS 2025.
https://dl.acm.org/doi/10.1145/3719027.3765062
https://ps.tm.kit.edu/english/bfid-dataset/index.php

The dataset covers 197 individuals, with BFI and CSI recorded simultaneously across
multiple sessions and multiple AP angles. It is available to researchers for
non-commercial use on request from KIT.

**Use in BFLD evaluation**: The BFId dataset provides the ground-truth identity labels
needed to calibrate `identity_risk_score`. Given BFId's known re-ID accuracy as a
function of time window, `identity_risk_score` should correlate with BFId's success
rate: high-risk frames (score > 0.7) should fall in windows where BFId achieves > 80%
accuracy, and low-risk frames (score < 0.2) should fall in windows where BFId accuracy
approaches chance.
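
The two threshold bands can be checked mechanically once per-window results exist. The following is a minimal sketch, not the project's calibration code: the `WindowResult` struct, the `band_violations` function, and the `chance_margin` parameter (how close to 1/N counts as "approaching chance") are all assumptions introduced for illustration.

```rust
/// One evaluation window: BFLD's risk score and the BFId accuracy measured on it.
/// Both field names are hypothetical, chosen for this sketch.
#[derive(Debug, Clone, Copy)]
pub struct WindowResult {
    pub risk_score: f64,    // BFLD identity_risk_score, assumed to lie in [0, 1]
    pub bfid_accuracy: f64, // BFId re-ID accuracy on the same window, in [0, 1]
}

/// Returns the windows that violate either band expectation:
/// score > 0.7 should imply accuracy > 0.8, and score < 0.2 should imply
/// accuracy no higher than chance (1 / n_subjects) plus `chance_margin`.
pub fn band_violations(
    windows: &[WindowResult],
    n_subjects: usize,
    chance_margin: f64,
) -> Vec<WindowResult> {
    let chance = 1.0 / n_subjects as f64;
    windows
        .iter()
        .copied()
        .filter(|w| {
            let high_risk_miss = w.risk_score > 0.7 && w.bfid_accuracy <= 0.8;
            let low_risk_miss =
                w.risk_score < 0.2 && w.bfid_accuracy > chance + chance_margin;
            high_risk_miss || low_risk_miss
        })
        .collect()
}
```

An empty result means every high-risk and low-risk window landed in its expected accuracy band; windows with scores between 0.2 and 0.7 are not constrained by this check.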

### 1.2 Wi-Pose and MM-Fi (Context)

**MM-Fi**: Multi-modal WiFi sensing dataset used by this project (ADR-015). Contains
synchronized WiFi CSI, mmWave, and camera pose data. Does not contain BFI separately,
but can be used to validate BFLD's CSI-optional path (AC7).

**Wi-Pose**: Academic benchmark for WiFi pose estimation. CSI only; used for
person_count and motion accuracy baselines.

### 1.3 Proposed In-House Multi-Site Capture Protocol

**Purpose**: Validate cross-site isolation (Invariant 3) and daily rotation.

**Setup**:
- Site A: ruvultra (RTX 5080 workstation, Tailscale 100.104.125.72) with USB WiFi
  adapter in monitor mode.
- Site B: cognitum-v0 (Pi 5, Tailscale 100.77.59.83) with Nexmon monitor mode.
- Subject pool: 5–10 volunteers.
- Protocol: Each subject walks a fixed path at each site on 3 consecutive days.
  BFI captured simultaneously at both sites using Wi-BFI.

**Analysis**:
1. Can the BFId classifier re-identify subjects within a site? (Baseline — should
   confirm BFId's published results.)
2. Can any classifier re-identify subjects across sites using BFLD's
   rf_signature_hash? (Should fail — cross-site isolation test.)
3. Can any classifier re-identify across days using BFLD's rf_signature_hash? (Should
   fail — daily rotation test.)

---

## 2. Metrics

### 2.1 Presence Detection

| Metric | Definition | Target |
|--------|-----------|--------|
| Latency p50 | Median time from the first non-empty BFI frame to the first `presence=true` event | < 500 ms |
| Latency p95 | 95th percentile of the same latency | < 1000 ms (AC2) |
| False positive rate | Fraction of confirmed-empty observations reported as `presence=true` | < 5% |
| False negative rate | Fraction of confirmed-occupied observations reported as `presence=false` | < 2% |

Measurement method: camera ground-truth (ruvultra webcam via MediaPipe Pose, same
as ADR-079 collection protocol) for empty/occupied labels.
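
As an illustration of how these four numbers could be computed from logged samples, here is a minimal sketch. It assumes latencies are collected as milliseconds in an `f64` slice and that each observation is a `(predicted, ground_truth)` pair of booleans; the function names and the nearest-rank percentile convention are choices made for this example, not something the BFLD crate is known to use.

```rust
/// Nearest-rank percentile of latency samples in milliseconds.
/// Returns None for an empty slice or a percentile outside (0, 100].
pub fn percentile_ms(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() || !(pct > 0.0 && pct <= 100.0) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest-rank: the smallest sample with at least pct% of samples at or below it.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// False positive and false negative rates from (predicted, ground_truth) labels.
/// FPR = FP / (FP + TN) and FNR = FN / (FN + TP); a rate is None when its
/// denominator is zero (no empty or no occupied observations).
pub fn presence_error_rates(labels: &[(bool, bool)]) -> (Option<f64>, Option<f64>) {
    let (mut tp, mut fp, mut tn, mut fn_) = (0u32, 0u32, 0u32, 0u32);
    for &(predicted, truth) in labels {
        match (predicted, truth) {
            (true, true) => tp += 1,
            (true, false) => fp += 1,
            (false, false) => tn += 1,
            (false, true) => fn_ += 1,
        }
    }
    let rate = |num: u32, den: u32| {
        if den == 0 { None } else { Some(num as f64 / den as f64) }
    };
    (rate(fp, fp + tn), rate(fn_, fn_ + tp))
}
```

Comparing `percentile_ms(&latencies, 50.0)` and `percentile_ms(&latencies, 95.0)` against 500 ms and 1000 ms, and the two rates against 0.05 and 0.02, gives a pass/fail result for the table above.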

### 2.2 Motion Score

| Metric | Definition | Target |
|--------|-----------|--------|
| MAE vs ground truth | Mean absolute error of motion score vs camera-derived motion magnitude | < 0.1 |
| Hz at sustained operation | Events published per second on `motion/state` | >= 1 Hz (AC3) |
| Latency p95 | Time from motion onset (camera) to motion event | < 750 ms |
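
The MAE and publish-rate rows reduce to two small computations. The sketch below assumes the motion score and the camera-derived magnitude have already been aligned and normalised to the same scale, and that event timestamps are in seconds and ascending; both function names are hypothetical.

```rust
/// Mean absolute error between paired predictions and reference values.
/// Returns None if the slices are empty or differ in length.
pub fn mean_absolute_error(predicted: &[f64], reference: &[f64]) -> Option<f64> {
    if predicted.is_empty() || predicted.len() != reference.len() {
        return None;
    }
    let total: f64 = predicted
        .iter()
        .zip(reference)
        .map(|(p, r)| (p - r).abs())
        .sum();
    Some(total / predicted.len() as f64)
}

/// Sustained event rate in Hz, computed as (events - 1) / elapsed seconds.
/// Returns None for fewer than two events or a non-positive time span.
pub fn event_rate_hz(timestamps_s: &[f64]) -> Option<f64> {
    let (first, last) = (timestamps_s.first()?, timestamps_s.last()?);
    let span = last - first;
    if span <= 0.0 {
        return None;
    }
    Some((timestamps_s.len() - 1) as f64 / span)
}
```

For example, five events spread evenly over two seconds give a rate of 2 Hz, which would satisfy the AC3 floor of 1 Hz.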

### 2.3 Person Count

| Metric | Definition | Target |
|--------|-----------|--------|
| Count accuracy | Fraction of windows where BFLD `person_count` equals the camera count | > 85% for 1–3 persons |
| Count MAE | Mean absolute error of `person_count` vs camera count | < 0.5 for counts 1–4 |

Person count is harder than presence. The target is expected to be achievable with
MinCut separation (`ruvector-mincut`), but counts of 4 or more persons require
multi-AP coverage.
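
Because the two targets apply to different count ranges (1–3 for accuracy, 1–4 for MAE), it helps to compute both metrics over an explicit range of ground-truth counts. This is a minimal sketch under the assumption that each window yields a `(predicted, camera)` pair of `u32` counts; `count_metrics` is a hypothetical helper, not part of the crate.

```rust
use std::ops::RangeInclusive;

/// Exact-match accuracy and mean absolute error over the windows whose camera
/// (ground-truth) count lies in `range`. Each pair is (predicted, camera).
/// Returns None if no window falls inside the range.
pub fn count_metrics(
    pairs: &[(u32, u32)],
    range: RangeInclusive<u32>,
) -> Option<(f64, f64)> {
    let selected: Vec<(u32, u32)> = pairs
        .iter()
        .copied()
        .filter(|(_, truth)| range.contains(truth))
        .collect();
    if selected.is_empty() {
        return None;
    }
    let n = selected.len() as f64;
    let exact = selected.iter().filter(|(pred, truth)| pred == truth).count() as f64;
    let abs_err: f64 = selected
        .iter()
        .map(|(pred, truth)| pred.abs_diff(*truth) as f64)
        .sum();
    Some((exact / n, abs_err / n))
}
```

Calling it once with `1..=3` and once with `1..=4` yields the accuracy figure for the first row and the MAE figure for the second.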

### 2.4 Identity Risk Calibration

This is BFLD's novel evaluation dimension: to our knowledge, no prior system has
explicitly quantified it.

**Calibration definition**: Let `r(t)` = BFLD's identity_risk_score at time t.
Let `acc(t)` = BFId classifier's re-identification accuracy when trained on frames
around time t. The identity_risk_score is *calibrated* if:

    E[acc(t) | r(t) = v] is monotonically increasing in v

In other words: higher risk scores should correspond to frames where identity inference
is genuinely easier.

**Evaluation protocol**:
1. Run BFId classifier in sliding 5-second windows on the BFId dataset.
2. Record per-window BFId accuracy (using leave-one-out cross-validation).
3. Run BFLD's identity_risk_score computation on the same windows.
4. Compute Spearman correlation between risk scores and BFId accuracy.
5. Target: Spearman rho > 0.5 (positive monotonic correlation).
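
Steps 4 and 5 can be sketched without external crates. Spearman's rho is the Pearson correlation of the two rank vectors, with tied values sharing the average of their ranks. The code below is an illustrative implementation under that standard definition; the function names are assumptions, and a production evaluation might prefer an established statistics library.

```rust
/// Average ranks (1-based); tied values share the mean of their positions.
fn average_ranks(values: &[f64]) -> Vec<f64> {
    let mut order: Vec<usize> = (0..values.len()).collect();
    order.sort_by(|&a, &b| values[a].total_cmp(&values[b]));
    let mut ranks = vec![0.0; values.len()];
    let mut i = 0;
    while i < order.len() {
        // Extend j to the end of the run of values equal to values[order[i]].
        let mut j = i;
        while j + 1 < order.len() && values[order[j + 1]] == values[order[i]] {
            j += 1;
        }
        let shared = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            ranks[idx] = shared;
        }
        i = j + 1;
    }
    ranks
}

/// Spearman's rho between risk scores and per-window accuracies.
/// Returns None for mismatched lengths, fewer than two points, or when either
/// input is constant (zero rank variance).
pub fn spearman_rho(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let (rx, ry) = (average_ranks(x), average_ranks(y));
    // The mean of average ranks is (n + 1) / 2, with or without ties.
    let mean = (x.len() as f64 + 1.0) / 2.0;
    let (mut cov, mut var_x, mut var_y) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in rx.iter().zip(&ry) {
        cov += (a - mean) * (b - mean);
        var_x += (a - mean).powi(2);
        var_y += (b - mean).powi(2);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}
```

The calibration target is then `spearman_rho(&risk_scores, &bfid_accuracies) > Some(0.5)` in spirit. Note that a high rho supports monotonicity on average but does not prove the conditional expectation is monotone at every score value.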

### 2.5 Privacy-Mode False Positive Rate

When `privacy_mode` is enabled (privacy_class = 3), all identity-correlated fields
should be suppressed. The false positive rate is the fraction of outbound events
that inadvertently include an identity-correlated field despite privacy_mode being
active.

**Target**: 0% (this is a hard correctness requirement, not a statistical target).
Verified by the AC5 fuzz test in `acceptance.rs`.
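
To make the metric concrete, the sketch below shows the shape such a check could take. It is not the AC5 test from `acceptance.rs`: the `OutboundEvent` layout, the `gate` stand-in, the choice of `rf_signature_hash` as the only identity-correlated field, and the small xorshift generator are all assumptions made so the example is self-contained.

```rust
/// Hypothetical outbound event with one identity-correlated field.
#[derive(Debug, Clone)]
pub struct OutboundEvent {
    pub presence: bool,
    pub rf_signature_hash: Option<[u8; 32]>, // treated as identity-correlated here
}

/// Stand-in gate: clears identity-correlated fields when privacy mode is active.
pub fn gate(mut event: OutboundEvent, privacy_mode: bool) -> OutboundEvent {
    if privacy_mode {
        event.rf_signature_hash = None;
    }
    event
}

/// Fraction of events that still carry an identity-correlated field.
pub fn leak_rate(events: &[OutboundEvent]) -> f64 {
    if events.is_empty() {
        return 0.0;
    }
    let leaked = events.iter().filter(|e| e.rf_signature_hash.is_some()).count();
    leaked as f64 / events.len() as f64
}

/// Minimal xorshift64 generator so fuzz inputs are reproducible without crates.
pub struct XorShift64(pub u64);

impl XorShift64 {
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Generates `n` pseudo-random ungated events from `seed`.
pub fn fuzz_events(seed: u64, n: usize) -> Vec<OutboundEvent> {
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    (0..n)
        .map(|_| {
            let bits = rng.next_u64();
            let mut hash = [0u8; 32];
            for chunk in hash.chunks_mut(8) {
                chunk.copy_from_slice(&rng.next_u64().to_le_bytes());
            }
            OutboundEvent {
                presence: bits & 1 == 1,
                rf_signature_hash: if bits & 2 == 2 { Some(hash) } else { None },
            }
        })
        .collect()
}
```

A test built on this would generate events, pass each through the gate with privacy mode on, and require `leak_rate` to be exactly 0.0, since the target is a correctness requirement and any non-zero value is a failure.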

---

## 3. Red-Team Protocol

### 3.1 Hash Re-identification Attack

**Question**: Can an attacker re-identify a person across rotated hashes?

**Setup**:
- Run BFLD pipeline for person X across 3 days.
- Collect `rf_signature_hash` values for each day: H_1, H_2, H_3.
- Adversary has access to H_1, H_2, H_3 and knows they are from the same site.
- Adversary attempts to confirm H_1, H_2, H_3 are from the same person.

**Success condition**: the adversary's confirmation rate exceeds chance (1/N for N
subjects).

**Expected result**: The attack fails, by construction of the hash rotation with
`site_salt`. Because `day_epoch` changes daily and `site_salt` is fixed but unknown to
the adversary, the hash behaves as a keyed PRF from the adversary's point of view. The
adversary holds three random-looking 32-byte values with no structural relationship,
so the success rate should be indistinguishable from random guessing.

**Quantitative target**: success rate <= 1/N + 0.05 (within 5 percentage points of chance).
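
The bound is an absolute margin added to the chance level, which a small helper makes explicit. This is an illustrative sketch; `within_chance_margin` and its parameters are hypothetical names, and it does not account for sampling noise when the number of trials is small.

```rust
/// Checks an attack's observed success rate against the bound 1/N + margin.
/// Returns None when the inputs do not describe a valid experiment.
pub fn within_chance_margin(
    successes: usize,
    trials: usize,
    n_subjects: usize,
    margin: f64,
) -> Option<bool> {
    if trials == 0 || n_subjects == 0 || successes > trials {
        return None;
    }
    let rate = successes as f64 / trials as f64;
    let chance = 1.0 / n_subjects as f64;
    Some(rate <= chance + margin)
}
```

With N = 10 subjects the chance level is 0.10 and the bound is 0.15, so 14 successes in 100 trials passes while 16 in 100 does not.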

### 3.2 Cross-Site Re-identification Attack

**Question**: Can an attacker confirm person X visited both site A and site B?

**Setup**: Same as Section 1.3 in-house protocol. Adversary has BFLD event streams
from both sites.

**Method**: Attempt to match rf_signature_hash values from site A and site B on the
same day. Alternatively, train a classifier on BFI features (using the raw angle
sequences from the captured data) and attempt cross-site re-ID.

**Expected result**: Hash-based matching fails by construction. Classifier-based
re-ID may succeed if the adversary holds raw angle data, which BFLD does not publish,
but should not succeed from BFLD's published output alone.

**Quantitative target**: hash-based cross-site match rate <= 1/N + 0.05 (the attack
counts as successful above this bound).
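
The hash-based part of the attack amounts to looking for identical 32-byte values in the two event streams. A minimal sketch of that matching step follows, assuming each site's same-day hashes have been collected into slices of `[u8; 32]`; the function name is hypothetical.

```rust
use std::collections::HashSet;

/// Fraction of site-A hashes that also appear, byte for byte, among site-B hashes.
/// Returns None if site A contributed no hashes.
pub fn cross_site_match_rate(site_a: &[[u8; 32]], site_b: &[[u8; 32]]) -> Option<f64> {
    if site_a.is_empty() {
        return None;
    }
    let seen_at_b: HashSet<&[u8; 32]> = site_b.iter().collect();
    let matches = site_a.iter().filter(|h| seen_at_b.contains(*h)).count();
    Some(matches as f64 / site_a.len() as f64)
}
```

If cross-site isolation holds, exact matches should be essentially absent, so the measured rate is expected to sit well below the 1/N + 0.05 bound. The classifier-based variant on raw angle data needs a separate evaluation and is not covered by this check.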

### 3.3 Timing Side-Channel Attack

**Question**: Can an attacker infer a person's schedule by monitoring
identity_risk_score over time?

**Method**: Record identity_risk_score time series. Correlate with known schedule
(person X leaves at 8am, returns at 6pm). Compute mutual information between
schedule and risk score time series.
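
One simple way to estimate that mutual information is a plug-in estimate over discretised values. The sketch below assumes the schedule is sampled as a boolean per time step (true when person X is home), that the risk score lies in [0, 1], and that equal-width binning is acceptable; the function name and the binning choice are illustrative, and plug-in estimates are biased upward on short series.

```rust
use std::collections::HashMap;

/// Plug-in mutual information, in bits, between a binary schedule and the
/// risk score quantised into `bins` equal-width bins over [0, 1].
/// Returns None for empty input, mismatched lengths, or zero bins.
pub fn mutual_information_bits(schedule: &[bool], risk: &[f64], bins: usize) -> Option<f64> {
    if schedule.is_empty() || schedule.len() != risk.len() || bins == 0 {
        return None;
    }
    let mut joint: HashMap<(bool, usize), usize> = HashMap::new();
    let mut by_state: HashMap<bool, usize> = HashMap::new();
    let mut by_bin: HashMap<usize, usize> = HashMap::new();
    for (&present, &score) in schedule.iter().zip(risk) {
        // Scores are clamped to [0, 1]; a score of exactly 1.0 goes in the top bin.
        let bin = ((score.clamp(0.0, 1.0) * bins as f64) as usize).min(bins - 1);
        *joint.entry((present, bin)).or_insert(0) += 1;
        *by_state.entry(present).or_insert(0) += 1;
        *by_bin.entry(bin).or_insert(0) += 1;
    }
    let n = schedule.len() as f64;
    // I(S; R) = sum over observed cells of p(s, r) * log2(p(s, r) / (p(s) * p(r))).
    let mi: f64 = joint
        .iter()
        .map(|(&(present, bin), &count)| {
            let p_joint = count as f64 / n;
            let p_state = by_state[&present] as f64 / n;
            let p_bin = by_bin[&bin] as f64 / n;
            p_joint * (p_joint / (p_state * p_bin)).log2()
        })
        .sum();
    Some(mi)
}
```

For a balanced binary schedule the value is bounded by 1 bit. Comparing it with the mutual information between the schedule and the published presence flag shows whether the risk score reveals anything beyond presence.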

**Expected result**: Some correlation exists (risk score rises when person enters),
but the attacker learns "someone is present" — equivalent to the presence sensor —
not identity. This is acceptable: presence information is already published at
class 2.

---

## 4. Comparison Baselines

| Baseline | Description | Presence F1 | Motion MAE | Identity leak |
|----------|-------------|------------|-----------|--------------|
| Raw CSI pipeline | Existing wifi-densepose pipeline (no BFLD) | ~0.95 (est.) | ~0.08 (est.) | Unquantified — no risk gating |
| BFI-only (no BFLD) | Wi-BFI + threshold presence | ~0.82 (from LeakyBeam) | N/A | Angle matrices published |
| BFI+CSI fusion (no BFLD) | Combined pipeline, ungated | ~0.97 (est.) | ~0.06 (est.) | Unquantified |
| **BFLD (BFI+CSI, class 2)** | Full BFLD with anonymous privacy class | target 0.93 | target 0.10 | 0% (class 2 gate) |
| BFLD (BFI-only, class 2) | BFLD without CSI input (AC7) | target 0.85 | target 0.12 | 0% (class 2 gate) |

The BFLD privacy-class gate is expected to cost a small amount of raw sensing accuracy
relative to an ungated BFI+CSI pipeline (target F1 0.93 vs estimated 0.97). This is the
explicit trade-off: identity safety for a modest utility cost.
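
Since the table reports presence as F1 while Section 2.1 sets false positive and false negative targets, it can help to compute F1 from the same confusion counts. The helper below is a minimal sketch using the standard definition F1 = 2TP / (2TP + FP + FN); the function name is an assumption.

```rust
/// F1 score from true positives, false positives, and false negatives.
/// Returns None when all three counts are zero.
pub fn f1_score(tp: u64, fp: u64, fn_: u64) -> Option<f64> {
    let denom = 2 * tp + fp + fn_;
    if denom == 0 {
        None
    } else {
        Some(2.0 * tp as f64 / denom as f64)
    }
}
```

As a worked instance, 93 true positives with 7 false positives and 7 false negatives gives 186 / 200 = 0.93, the BFLD (BFI+CSI) target.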

---

## 5. Continuous Evaluation in CI

Three tests run on every PR that touches the BFLD crate:

1. **Deterministic hash test** (AC6): same input → same output across platforms.
2. **Privacy-mode field suppression fuzz** (AC5): 1,000 random inputs → no identity
   fields in class-2 output.
3. **Latency smoke test** (AC2): 100-frame replay → first presence event < 200 ms
   (tighter than the 1s AC target, to keep CI fast).