feat(bench): int8 quantization of WiFlow-STD half pose model — MEASURED trade-off (ADR-175, honest negative) (#1095)

Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the 843,834-param "half" WiFlow-STD pose model (half_best.pth) to int8 two ways and MEASURES the accuracy/size trade-off vs fp32 under ONE locked normalization (ADR-173 torso-diameter PCK, upstream calculate_pck use_torso_norm=True), on the same seed-42 file-level 70/15/15 test split that produced the fp32 sweep numbers.

MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):

  fp32             96.62% pck@20  99.47% pck@50  0.008981 mpjpe  3.351 MB
  int8 PTQ static  40.98% pck@20  94.98% pck@50  0.038262 mpjpe  1.046 MB  (-55.64pp)
  int8 QAT (3 ep)  67.48% pck@20  98.69% pck@50  0.026548 mpjpe  1.043 MB  (-29.15pp)

Verdict (honest no): int8 is NOT a win at the strict PCK@20 edge target. Static PTQ collapses; QAT recovers a large share but still loses 29 pp @20 for a 3.2x size win, so keep fp32/fp16 on the edge.

Disclosed: QAT fake-quant val pck@20 was 83.45% but converted int8 scores 67.48% (~16pp convert_fx gap, reported honestly).

Deliverables:
- v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py (reproducible: header carries the exact ssh command + run date; QAT primary, static PTQ fallback)
- docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md (MEASURED table, locked normalization, QAT-vs-PTQ labeling, verdict, reproduction, limitations)
- CHANGELOG [Unreleased] ### Added entry

No production Rust or signal-pipeline change. Python deterministic proof unchanged (f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a, bit-exact).
2026-06-15 09:16:22 -04:00 · 2026-06-15 09:16:22 -04:00 · 0f64d23516
parent b209b8b778
commit 0f64d23516
3 changed files with 467 additions and 0 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. 
NaN-index poisoning (the calibration/vitals/geo class) — **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0` — all return `Ok(0)`, **no panic**. Fail-closed write path — a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path).

 ### Added
+- **ADR-175: int8 quantization of the WiFlow-STD "half" pose model, MEASURED fp32-vs-int8 accuracy/size trade-off (honest negative).** Sub-deliverable 8.2 of the benchmark/optimization milestone, and the measurement of the SOTA brief's "one untested edge lever" (QAT-int8 on the 843,834-param half model that strictly dominates the published 2.23M model). A new committed script `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py` quantizes `half_best.pth` to int8 two ways and scores both with the **same** upstream `calculate_pck`/`calculate_mpjpe` that produced the fp32 sweep numbers, under **one locked normalization** (ADR-173 torso-diameter PCK: neck idx2→pelvis idx12, `use_torso_norm=True`, the standard MM-Fi/GraphPose-Fi convention), on the **same** seed-42 file-level 70/15/15 test split (52,560 NaN-free / 54,000 full windows). **MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):** fp32 = 96.62% PCK@20 / 99.47% PCK@50 / 0.008981 MPJPE / 3.351 MB (fp32-CPU reproduces fp32-GPU to 4 dp, so the int8 deltas are pure quantization, not CPU/GPU drift); **int8 static PTQ = 40.98% PCK@20 (−55.64 pp), 1.046 MB**: naive static QDQ **collapses** on this model (the brief's 2.23M "sweet spot" does NOT transfer to the 843k half model at the tight @20 threshold); **int8 QAT (3-epoch FX fake-quant fine-tune from half_best) = 67.48% PCK@20 (−29.15 pp) / 98.69% PCK@50 (−0.78 pp), 1.043 MB.** **Verdict (honest no):** int8 is **not a win** at the strict PCK@20 edge target. QAT recovers a large share of the PTQ collapse and is near-lossless at the loose PCK@50 (coarse localization survives int8, fine does not), but a **3.2× size win at −29 pp PCK@20** is a bad trade when the half model already fits edge flash at fp32 → **keep fp32/fp16 on the edge for now.** **Disclosed gap:** the QAT *fake-quant* val PCK@20 reached 83.45% but the *converted* int8 model scores 67.48%, a real ~16 pp `convert_fx` gap (fbgemm int8 kernels ≠ straight-through estimate, esp. the axial-attention einsum/softmax); we report the converted-int8 number, not the fake-quant proxy. **MEASURED:** every table number + the PTQ collapse + the QAT partial recovery + the conversion gap. **CLAIMED/not done:** ONNX/TFLite export, on-edge-SoC latency/energy (int8 measured on x86 fbgemm; size transfers, latency does NOT), mixed-precision keeping attention fp32, longer/better-tuned QAT. **Honest limitations:** single in-domain eval split (no cross-environment split), x86-int8 not edge-SoC-int8, lightly-tuned QAT. Additive only: no production Rust or signal-pipeline change; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact, off the signal proof path).
 - **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). 
**This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path).
 - **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). 
**HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). 
The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
 - **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream <RuView-URL>`: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). **Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — <upstream>` / `DISCONNECTED — <upstream> unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.
--- a/docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md
+++ b/docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md
@@ -0,0 +1,172 @@
+# ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
+
+| Field | Value |
+|-------|-------|
+| **Status** | Accepted — MEASURED, reproducible (honest negative) |
+| **Date** | 2026-06-15 |
+| **Deciders** | ruv |
+| **Codename** | **EDGE-INT8** |
+| **Sub-deliverable** | 8.2 of the benchmark/optimization milestone |
+| **Metric lock** | ADR-173 (one declared PCK normalization for every reported number) |
+| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (§edge int8) |
+
+## Context
+
+The SOTA brief described the int8 edge story for the WiFlow-STD pose net as
+"fully characterized" for PTQ on the **published 2.23M** model (static QDQ
+conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named
+**QAT-int8 on the strictly-dominating 843,834-param "half" model** as "the one
+untested edge lever." This ADR takes the reading on that lever: a MEASURED
+fp32-vs-int8 trade-off for the half model, not a claim.
+
+The half model (`half_best.pth`, 843,834 params) is the efficiency-sweep winner
+from ADR-152 (`run_sweep.py` VARIANTS[0]: `tcn=[270,220,170,120]`,
+`conv=[4,8,16,32]`, `attn_groups=4`). Its fp32 accuracy was recorded in the sweep;
+this ADR re-measures it under the locked normalization and quantizes it.
+
+**The whole point of this deliverable is reproducibility.** Every number below was
+produced by running `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
+on host `ruvultra` (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
+the real seed-42 test split. The script, the exact command, and the recorded
+stdout together **are** the proof artifact. Nothing here is estimated.
+
+## Decision
+
+Quantize the half model to int8 with **both** levers and report both honestly:
+
+1. **QAT (primary target)** — FX graph-mode quantization-aware training, fbgemm
+   backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5,
+   the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
+2. **PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback)** —
+   FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
+
+### Locked normalization (ADR-173)
+
+**Torso-diameter PCK** — neck (keypoint idx 2) → pelvis (idx 12) distance — the
+standard MM-Fi/GraphPose-Fi convention. This is exactly the default
+`use_torso_norm=True` path of the upstream harness's `utils/metrics.calculate_pck`.
+The **same** `calculate_pck`/`calculate_mpjpe` that produced the sweep's fp32
+numbers scores **both** fp32 and int8 here, so the comparison is metric-locked: no
+normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
+`half` test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
+the same one.
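
The locked normalization above can be pictured with a minimal pure-Python sketch. It is not the upstream `calculate_pck`: the function name `torso_pck`, the pooled averaging over all keypoints, and the skipping of zero-torso frames are assumptions made for illustration. Only the neck (idx 2) and pelvis (idx 12) indices and the idea of a threshold expressed as a fraction of torso length come from the text.

```python
import math

NECK, PELVIS = 2, 12  # keypoint indices named in this ADR


def torso_pck(pred, gt, threshold):
    """Fraction of keypoints whose error is within threshold * torso length.

    pred, gt: lists of frames; each frame is a list of (x, y) keypoints.
    Hypothetical helper for illustration, not the upstream implementation.
    """
    hits = total = 0
    for pred_frame, gt_frame in zip(pred, gt):
        torso = math.dist(gt_frame[NECK], gt_frame[PELVIS])
        if torso == 0.0:
            continue  # assumption: a degenerate torso contributes nothing
        for p, g in zip(pred_frame, gt_frame):
            hits += math.dist(p, g) <= threshold * torso
            total += 1
    return hits / total if total else 0.0


# One frame, 13 keypoints, torso length 1.0 (neck at origin, pelvis at (0, 1)).
gt = [[(0.0, 0.0)] * 12 + [(0.0, 1.0)]]
pred = [list(gt[0])]
pred[0][0] = (0.3, 0.0)  # off by 0.3 torso lengths
pred[0][1] = (0.1, 0.0)  # off by 0.1 torso lengths

pck20 = torso_pck(pred, gt, 0.2)  # keypoint 0 misses, the other 12 hit
pck50 = torso_pck(pred, gt, 0.5)  # every keypoint hits
```

The same predictions score differently at the two thresholds, which is the pattern the results show: errors that are invisible at PCK@50 can still count as misses at PCK@20.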
+
+### Device note (why int8 is CPU)
+
+PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA. So int8 eval
+is CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32
+with CPU-vs-GPU), the script measures an **fp32-CPU** baseline too. fp32-CPU and
+fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs 0.96623), so CPU/GPU
+introduces no drift — the int8 deltas below are pure quantization effect.
+
+## MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
+
+Source: stdout of the run below + `~/wiflow-std-bench/sweep/int8/int8_results.json`.
+
+| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
+|-------|-------|-----------|--------|--------|-------|----------|----------|----------|
+| **fp32** (cpu) | — | **3.351** | **96.62%** | **99.47%** | **0.008981** | — | — | 1.00× |
+| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | **−55.64 pp** | −4.49 pp | 3.20× smaller |
+| int8 QAT (3 ep) | **QAT** | 1.043 | 67.48% | 98.69% | 0.026548 | **−29.15 pp** | −0.78 pp | 3.21× smaller |
+
+Full-test-set (54,000 windows incl. NaN-zero-filled files 487–499) tracks the
+clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape,
+recorded in the JSON.
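
The Δ and size-win columns can be re-derived from the rounded table entries. The snippet below is a sketch for checking the arithmetic only; the dictionary layout and variable names are illustrative and are not the schema of `int8_results.json`. Recomputing from the rounded percentages gives 29.14 pp for QAT where the table reports 29.15 pp; the difference is presumably because the table was computed from unrounded values (an assumption, not verified here).

```python
# (size MB, PCK@20 %, PCK@50 %) copied from the table above.
rows = {
    "fp32": (3.351, 96.62, 99.47),
    "int8_ptq": (1.046, 40.98, 94.98),
    "int8_qat": (1.043, 67.48, 98.69),
}

base_size, base_pck20, base_pck50 = rows["fp32"]
summary = {}
for name, (size, pck20, pck50) in rows.items():
    summary[name] = {
        "d_pck20": round(pck20 - base_pck20, 2),   # percentage points vs fp32
        "d_pck50": round(pck50 - base_pck50, 2),
        "size_win": round(base_size / size, 2),    # how many times smaller
    }

# Share of the PTQ loss at PCK@20 that QAT wins back.
recovered = (rows["int8_qat"][1] - rows["int8_ptq"][1]) / (
    base_pck20 - rows["int8_ptq"][1]
)

for name, s in summary.items():
    print(name, s)
print(f"QAT recovers {recovered:.0%} of the PTQ drop at PCK@20")
```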
+
+### Verdict
+
+**int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.**
+
+- **PTQ static collapses** (−55.64 pp PCK@20). Naive static QDQ destroys the half
+  model. The "sweet spot" characterization from the brief does not transfer from
+  the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
+- **QAT recovers about half of the PTQ loss** (PTQ 40.98% → QAT 67.48%, i.e.
+  26.50 pp of the 55.64 pp drop) but still **loses 29.15 pp** at PCK@20 for a
+  3.21× size reduction. At the loose PCK@50 threshold QAT is nearly lossless
+  (−0.78 pp), i.e. coarse localization survives int8 but fine localization does
+  not.
+- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but
+  **3.2× compression at −29 pp PCK@20 is a bad trade** when the half model already
+  fits comfortably in edge flash at fp32. Recommendation: **keep fp32 (or fp16)
+  for the half model on the edge**; do not ship this int8 variant as-is.
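
A back-of-envelope check of the size win, assuming 4 bytes per fp32 parameter and 1 byte per int8 parameter, and using the same 1024 × 1024 bytes-per-MB convention as the script's `file_size_mb`. The 843,834 parameter count is from this ADR; everything else in the sketch is illustrative arithmetic. It suggests why the measured ratio is about 3.2× rather than the ideal 4×: the int8 checkpoint sits further above its 1-byte-per-parameter ideal than the fp32 checkpoint sits above its 4-byte ideal. The sketch does not identify what those extra bytes are.

```python
PARAMS = 843_834   # parameter count of the half model (from this ADR)
MIB = 1024 * 1024  # same MB convention as file_size_mb in the script

ideal_fp32_mb = PARAMS * 4 / MIB  # assumption: 4 bytes per fp32 weight
ideal_int8_mb = PARAMS * 1 / MIB  # assumption: 1 byte per int8 weight

measured_fp32_mb, measured_int8_mb = 3.351, 1.043  # table values (fp32, int8 QAT)

ideal_ratio = ideal_fp32_mb / ideal_int8_mb
measured_ratio = measured_fp32_mb / measured_int8_mb

print(f"ideal:    {ideal_fp32_mb:.3f} MB -> {ideal_int8_mb:.3f} MB ({ideal_ratio:.2f}x)")
print(f"measured: {measured_fp32_mb:.3f} MB -> {measured_int8_mb:.3f} MB ({measured_ratio:.2f}x)")
print(f"bytes beyond the ideal: fp32 {measured_fp32_mb - ideal_fp32_mb:.3f} MB, "
      f"int8 {measured_int8_mb - ideal_int8_mb:.3f} MB")
```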
+
+### Observed fake-quant → int8 conversion gap (disclosed, not hidden)
+
+During QAT the **fake-quant** model's val PCK@20 reached 83.45% (epoch 3), but the
+**converted int8** model scores 67.48% on test. A ~16 pp drop on `convert_fx` is a
+real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant
+simulation (per-tensor activation quant + the axial-attention `einsum`/softmax path
+quantize worse than the straight-through estimate predicts). This gap is the honest
+reason QAT did not close the loss, and it is exactly the kind of number that would
+be invisible if one only reported the fake-quant proxy. We report the **converted
+int8** number as the deliverable, not the fake-quant proxy.
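
For readers unfamiliar with the term, fake-quant means rounding a float onto an int8-style grid and immediately mapping it back to float, so training sees the rounding error while the arithmetic stays in floating point. The pure-Python sketch below shows the generic per-tensor affine scheme. The function names are hypothetical, and this is not the fbgemm kernel or the PyTorch observer logic used by the script; it only illustrates the round trip that the converted int8 graph replaces with real integer kernels.

```python
def affine_params(lo, hi, qmin=0, qmax=255):
    """Scale and zero-point mapping the float range [lo, hi] onto [qmin, qmax]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)       # the range must contain 0.0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for an all-zero range
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize x to the integer grid, clamp, then dequantize back to float."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


scale, zp = affine_params(-1.0, 1.0)
samples = [-0.9, -0.333, 0.0, 0.25, 0.777]
errors = [abs(fake_quant(x, scale, zp) - x) for x in samples]

# In-range values land within half a quantization step of the original.
print(f"step={scale:.5f} max_error={max(errors):.5f}")
```

The per-value error is bounded by half a step, but the step grows with the tensor's range, so a wide-range activation costs every value in that tensor more precision.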
+
+## Reproduction
+
+```bash
+ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
+  python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
+```
+
+- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
+  (scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
+- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`,
+  `~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split),
+  upstream `models`/`dataset`/`utils/metrics`/`losses` (DY2434/WiFlow @ 06899d29,
+  Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
+- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/` —
+  `half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`,
+  `int8_run.log`. **No existing file under `~/wiflow-std-bench` was modified.**
+- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine,
+  `date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
+
+## What is MEASURED vs CLAIMED
+
+- **MEASURED:** every PCK/MPJPE/size number in the table; the fp32 baseline (which
+  reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial
+  recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
+- **CLAIMED / not done here:** ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo)
+  latency or energy (int8 here is measured on x86 fbgemm, the dev box, **not** an
+  edge SoC — the size number transfers, a latency number does **not**); a
+  per-layer mixed-precision search that might keep the attention block in fp32; QAT
+  beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next
+  levers if int8 is revisited; none is asserted as a result.
+
+## Honest scope / limitations
+
+- **Single eval split** — one seed-42 file-level test partition; no cross-room /
+  cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is
+  a separate, harder split and is not what is measured here).
+- **In-domain only** — these are in-distribution test numbers; they say nothing
+  about the cross-environment robustness gap.
+- **x86 int8, not edge-SoC int8** — the size transfers to an edge int8 runtime;
+  the latency does not (different kernels, different SoC), and accuracy under a
+  different int8 kernel set is not measured here. No latency claim is made.
+- **QAT lightly tuned** — 3 epochs, single LR, default fbgemm qconfig. A longer /
+  better-tuned QAT might narrow the −29 pp, but on the evidence here int8 does not
+  reach fp32 at PCK@20, and that is the reportable result today.
+
+## Consequences
+
+### Positive
+- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The
+  edge int8 question for the half model is answered with reproducible numbers: at
+  the strict PCK@20 target it loses, and we can say so with a committed script.
+- Establishes a reusable, metric-locked quantization+eval harness
+  (`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
+
+### Negative
+- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
+  or signal-pipeline change; Python deterministic proof hash
+  `f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
+
+### Neutral
+- The negative verdict means the half model stays fp32/fp16 on the edge for now.
+  int8 for these compact pose nets is parked pending the next-lever work above.
+
+## Links
+- ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
+- ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep
+  (produced `half_best.pth`)
+- `docs/research/sota-nn-train-benchmark-brief.md` — §edge int8 (the "one untested
+  lever" this ADR measures)
+- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
--- a/v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py
+++ b/v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py
@@ -0,0 +1,294 @@
+#!/usr/bin/env python3
+"""ADR-175: int8 quantization of the WiFlow-STD "half" pose model + MEASURED accuracy/size trade-off.
+
+Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the 843,834-param
+"half" WiFlow-STD pose model to int8 (QAT primary, static-PTQ fallback) and MEASURES the
+accuracy delta against the fp32 baseline under ONE locked PCK normalization.
+
+LOCKED NORMALIZATION (ADR-173): torso-diameter PCK — neck(idx 2)->pelvis(idx 12) distance,
+exactly the default `use_torso_norm=True` path of upstream `utils/metrics.calculate_pck`,
+which is the standard MM-Fi/GraphPose-Fi convention. The SAME `calculate_pck` /
+`calculate_mpjpe` from the upstream harness scores BOTH fp32 and int8 so the comparison is
+metric-locked. The test split is the seed-42 file-level 70/15/15 test partition (54,000
+windows full / 52,560 NaN-free) produced by the SAME loader that produced half_best.pth.
+
+int8 backend: FX graph-mode quantization, fbgemm engine (server x86 int8). Quantized int8
+kernels execute on CPU, so int8 eval is CPU; an fp32-CPU baseline is also measured so the
+accuracy delta is device-matched (CPU fp32 vs CPU int8), and an fp32-GPU number is reported
+for continuity with the sweep's recorded numbers.
+
+REPRODUCE (exact command run for ADR-175, run date 2026-06-15, on host ruvultra / RTX 5080):
+  ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
+    python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
+
+  (the script lives in-repo at v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py;
+   it was scp'd to ~/quantize_half_int8.py on ruvultra and invoked as above. It is read-only
+   to everything under ~/wiflow-std-bench except that it WRITES its int8 artifacts + a JSON
+   results file into ~/wiflow-std-bench/sweep/int8/ — it never modifies half_best.pth or any
+   upstream file.)
+
+Everything this script prints to stdout is MEASURED. Nothing is estimated.
+"""
+import argparse
+import copy
+import json
+import os
+import random
+import sys
+import time
+
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader, Subset
+
+BENCH = os.path.expanduser('~/wiflow-std-bench')
+SWEEP = os.path.join(BENCH, 'sweep')
+OUTDIR = os.path.join(SWEEP, 'int8')
+sys.path.insert(0, os.path.join(BENCH, 'upstream'))
+sys.path.insert(0, SWEEP)
+
+from dataset import (PreprocessedCSIKeypointsDataset,  # noqa: E402
+                     create_preprocessed_train_val_test_loaders)
+from losses.pose_loss import PoseLoss                  # noqa: E402
+from utils.metrics import calculate_pck, calculate_mpjpe  # noqa: E402  LOCKED metric (torso norm)
+from model_compact import CompactWiFlowPoseModel, describe  # noqa: E402
+
+# half variant config — IDENTICAL to sweep/run_sweep.py VARIANTS[0] that produced half_best.pth
+HALF = dict(tcn=[270, 220, 170, 120], conv=[4, 8, 16, 32], attn_groups=4,
+            groups_mode='gcd20', input_pw_groups=1)
+HALF_CKPT = os.path.join(SWEEP, 'half_best.pth')
+CORRUPT_FILE_START = 487   # files 487-499 were zero-filled by clean_nan.py (same as sweep)
+SEED = 42
+THRESHOLDS = (0.1, 0.2, 0.3, 0.4, 0.5)   # PCK@10..50
+
+
+def set_seed(seed=SEED):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
+
+
+def build_half(dropout=0.5):
+    return CompactWiFlowPoseModel(
+        tcn_channels=HALF['tcn'], conv_channels=HALF['conv'],
+        attn_groups=HALF['attn_groups'], groups_mode=HALF['groups_mode'],
+        input_pw_groups=HALF['input_pw_groups'], dropout=dropout)
+
+
+@torch.no_grad()
+def evaluate(model, loader, device):
+    """MEASURED PCK@10..50 + MPJPE under the LOCKED torso-diameter normalization."""
+    model.eval()
+    totals = {t: 0.0 for t in THRESHOLDS}
+    total_mpe, n = 0.0, 0
+    for bx, by in loader:
+        bx, by = bx.to(device), by.to(device)
+        out = model(bx)
+        bs = by.size(0)
+        total_mpe += calculate_mpjpe(out, by) * bs
+        pck = calculate_pck(out, by, thresholds=list(totals))  # use_torso_norm=True default
+        for t in totals:
+            totals[t] += pck[t] * bs
+        n += bs
+    return {'samples': n, 'mpjpe': total_mpe / n,
+            **{f'pck@{int(t * 100)}': totals[t] / n for t in totals}}
+
+
+def file_size_mb(path):
+    return os.path.getsize(path) / (1024 * 1024)
+
+
+def state_dict_size_mb(model, path):
+    """On-disk size of the *quantized* checkpoint (int8 weights are packed by fbgemm)."""
+    torch.save(model.state_dict(), path)
+    return file_size_mb(path)
+
+
+def loaders():
+    set_seed(SEED)
+    data_dir = os.path.join(BENCH, 'preprocessed_csi_data')
+    dataset = PreprocessedCSIKeypointsDataset(data_dir=data_dir, keypoint_scale=1000.0,
+                                              enable_temporal_clean=True)
+    train_loader, val_loader, test_loader = create_preprocessed_train_val_test_loaders(
+        dataset=dataset, batch_size=64, num_workers=2, random_seed=SEED)
+    return dataset, train_loader, val_loader, test_loader
+
+
+def clean_loader_from(dataset, test_loader, bs=256):
+    w2f = dataset.window_to_file
+    clean_idx = [i for i in test_loader.dataset.indices if w2f[i] < CORRUPT_FILE_START]
+    return DataLoader(Subset(dataset, clean_idx), batch_size=bs, shuffle=False, num_workers=2)
+
+
+def eval_loaders(dataset, test_loader, bs=256):
+    full = DataLoader(test_loader.dataset, batch_size=bs, shuffle=False, num_workers=2)
+    clean = clean_loader_from(dataset, test_loader, bs=bs)
+    return full, clean
+
+
+# --------------------------------------------------------------- int8 paths (FX graph mode)
+def ptq_static(fp32_model, train_loader, calib_batches=64):
+    """Static post-training quantization, FX graph mode, fbgemm. CPU int8."""
+    from torch.ao.quantization import get_default_qconfig, QConfigMapping
+    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
+    torch.backends.quantized.engine = 'fbgemm'
+    m = copy.deepcopy(fp32_model).cpu().eval()
+    qconfig = get_default_qconfig('fbgemm')
+    qmap = QConfigMapping().set_global(qconfig)
+    example = torch.randn(1, 540, 20)
+    prepared = prepare_fx(m, qmap, example_inputs=(example,))
+    prepared.eval()
+    with torch.no_grad():
+        for i, (bx, _) in enumerate(train_loader):
+            prepared(bx.cpu())
+            if i + 1 >= calib_batches:
+                break
+    return convert_fx(prepared)
+
+
+def qat(fp32_model, train_loader, val_loader, device, epochs=3, lr=2e-5):
+    """Quantization-aware training, FX graph mode, fbgemm. Fine-tune fake-quant from fp32, convert. CPU int8."""
+    from torch.ao.quantization import get_default_qat_qconfig, QConfigMapping
+    from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx
+    torch.backends.quantized.engine = 'fbgemm'
+    set_seed(SEED)
+    m = copy.deepcopy(fp32_model).to(device).train()
+    qconfig = get_default_qat_qconfig('fbgemm')
+    qmap = QConfigMapping().set_global(qconfig)
+    example = torch.randn(1, 540, 20).to(device)
+    prepared = prepare_qat_fx(m, qmap, example_inputs=(example,))
+    prepared.to(device)
+
+    criterion = PoseLoss(position_weight=1.0, bone_weight=0.2, loss_type='smooth_l1')
+    opt = torch.optim.AdamW(prepared.parameters(), lr=lr, weight_decay=5e-5, betas=(0.9, 0.999))
+
+    best_val = float('inf')
+    best_state = None
+    for ep in range(1, epochs + 1):
+        prepared.train()
+        t0 = time.time()
+        ep_loss, nb = 0.0, 0
+        for bx, by in train_loader:
+            bx, by = bx.to(device), by.to(device)
+            opt.zero_grad(set_to_none=True)
+            out = prepared(bx)
+            loss, _ = criterion(out, by)
+            # Skip batches with a NaN/inf loss so a single bad batch cannot corrupt the weights.
+            if not torch.isfinite(loss):
+                continue
+            loss.backward()
+            opt.step()
+            ep_loss += loss.item()
+            nb += 1
+        # Evaluate the fake-quant model on GPU to pick the best epoch. This is only a proxy
+        # for int8: the convert_fx model is re-measured on CPU afterwards and can score lower.
+        prepared.eval()
+        v = evaluate(prepared, val_loader, device)
+        print(f"[qat] epoch {ep}/{epochs} train_loss={ep_loss / max(nb, 1):.5f} "
+              f"val_mpjpe(fakequant)={v['mpjpe']:.5f} val_pck20={v['pck@20']*100:.2f}% "
+              f"({time.time()-t0:.0f}s)", flush=True)
+        if v['mpjpe'] < best_val:
+            best_val = v['mpjpe']
+            best_state = copy.deepcopy(prepared.state_dict())
+    if best_state is not None:
+        prepared.load_state_dict(best_state)
+    prepared.cpu().eval()
+    return convert_fx(prepared)
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--mode', choices=['ptq', 'qat', 'both'], default='both')
+    ap.add_argument('--qat-epochs', type=int, default=3)
+    ap.add_argument('--calib-batches', type=int, default=64)
+    args = ap.parse_args()
+    os.makedirs(OUTDIR, exist_ok=True)
+
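+    # Fail early with a clear message: the fp32/gpu baseline and the QAT fine-tune both need CUDA.
+    if not torch.cuda.is_available():
+        raise SystemExit('CUDA is required: the fp32/gpu baseline and the QAT fine-tune run on GPU')
+    cuda = torch.device('cuda')
+    cpu = torch.device('cpu')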
+    print(f"torch {torch.__version__} | cuda {torch.cuda.get_device_name(0)} | "
+          f"quantized.engine candidates {torch.backends.quantized.supported_engines}", flush=True)
+
+    dataset, train_loader, val_loader, test_loader = loaders()
+    test_full, test_clean = eval_loaders(dataset, test_loader)
+
+    # ---------- fp32 baseline (loads half_best.pth strict; same arch as sweep) ----------
+    fp32 = build_half().eval()
+    state = torch.load(HALF_CKPT, map_location='cpu', weights_only=True)
+    fp32.load_state_dict(state, strict=True)
+    fp32_size = file_size_mb(HALF_CKPT)
+    params = describe(fp32)['params']
+    print(f"\n=== fp32 baseline: half_best.pth | params={params:,} | "
+          f"on-disk={fp32_size:.3f} MB ===", flush=True)
+
+    results = {
+        'host': os.uname().nodename, 'gpu': torch.cuda.get_device_name(0),
+        'torch': torch.__version__, 'date_utc': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
+        'locked_normalization': 'torso-diameter (neck idx2 -> pelvis idx12), '
+                                'upstream calculate_pck use_torso_norm=True (ADR-173 standard)',
+        'checkpoint': HALF_CKPT, 'params': params, 'fp32_size_mb': fp32_size,
+        'test_split': 'seed-42 file-level 70/15/15 test (full 54000 / clean 52560)',
+        'fp32': {}, 'int8': {},
+    }
+
+    fp32_gpu = build_half().to(cuda).eval()
+    fp32_gpu.load_state_dict(state, strict=True)
+    print('[fp32/gpu] full ...', flush=True)
+    results['fp32']['gpu_full'] = evaluate(fp32_gpu, test_full, cuda)
+    print(json.dumps(results['fp32']['gpu_full']), flush=True)
+    print('[fp32/gpu] clean ...', flush=True)
+    results['fp32']['gpu_clean'] = evaluate(fp32_gpu, test_clean, cuda)
+    print(json.dumps(results['fp32']['gpu_clean']), flush=True)
+
+    print('[fp32/cpu] full (device-matched ref for int8) ...', flush=True)
+    results['fp32']['cpu_full'] = evaluate(fp32.to(cpu), test_full, cpu)
+    print(json.dumps(results['fp32']['cpu_full']), flush=True)
+    print('[fp32/cpu] clean ...', flush=True)
+    results['fp32']['cpu_clean'] = evaluate(fp32.to(cpu), test_clean, cpu)
+    print(json.dumps(results['fp32']['cpu_clean']), flush=True)
+
+    # ---------- int8 ----------
+    def measure_int8(label, qmodel):
+        path = os.path.join(OUTDIR, f'half_int8_{label}.pth')
+        size = state_dict_size_mb(qmodel, path)
+        print(f"[int8/{label}] on-disk={size:.3f} MB | full ...", flush=True)
+        full = evaluate(qmodel, test_full, cpu)
+        print(json.dumps(full), flush=True)
+        print(f"[int8/{label}] clean ...", flush=True)
+        clean = evaluate(qmodel, test_clean, cpu)
+        print(json.dumps(clean), flush=True)
+        results['int8'][label] = {'size_mb': size, 'checkpoint': path,
+                                  'cpu_full': full, 'cpu_clean': clean}
+
+    if args.mode in ('ptq', 'both'):
+        print("\n=== int8 PTQ (static, FX, fbgemm) ===", flush=True)
+        qp = ptq_static(fp32.to(cpu).eval(), train_loader, calib_batches=args.calib_batches)
+        measure_int8('ptq_static', qp)
+
+    if args.mode in ('qat', 'both'):
+        print(f"\n=== int8 QAT (FX, fbgemm, {args.qat_epochs} epochs from half_best) ===", flush=True)
+        qq = qat(fp32, train_loader, val_loader, cuda, epochs=args.qat_epochs)
+        measure_int8('qat', qq)
+
+    out = os.path.join(OUTDIR, 'int8_results.json')
+    with open(out, 'w') as f:
+        json.dump(results, f, indent=2)
+    print('\nwrote', out, flush=True)
+
+    # ---------- comparison table (MEASURED) ----------
+    print("\n================= MEASURED COMPARISON (clean test subset, torso-PCK) =================", flush=True)
+    base = results['fp32']['cpu_clean']
+    print(f"{'model':16s} {'size_MB':>8s} {'pck@20':>8s} {'pck@50':>8s} {'mpjpe':>9s}", flush=True)
+    print(f"{'fp32 (cpu)':16s} {fp32_size:8.3f} {base['pck@20']*100:7.2f}% {base['pck@50']*100:7.2f}% {base['mpjpe']:9.6f}", flush=True)
+    for label, r in results['int8'].items():
+        c = r['cpu_clean']
+        d20 = (c['pck@20'] - base['pck@20']) * 100
+        d50 = (c['pck@50'] - base['pck@50']) * 100
+        print(f"{'int8 '+label:16s} {r['size_mb']:8.3f} {c['pck@20']*100:7.2f}% {c['pck@50']*100:7.2f}% {c['mpjpe']:9.6f}  "
+              f"(d_pck20={d20:+.2f}pp d_pck50={d50:+.2f}pp size={fp32_size/r['size_mb']:.2f}x smaller)", flush=True)
+
+
+if __name__ == '__main__':
+    main()