feat(bench): int8 quantization of WiFlow-STD half pose model — MEASURED trade-off (ADR-175, honest negative) (#1095)

Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the
843,834-param "half" WiFlow-STD pose model (half_best.pth) to int8 two ways and
MEASURES the accuracy/size trade-off vs fp32 under ONE locked normalization
(ADR-173 torso-diameter PCK, upstream calculate_pck use_torso_norm=True), on the
same seed-42 file-level 70/15/15 test split that produced the fp32 sweep numbers.

MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):
  fp32             96.62% pck@20  99.47% pck@50  0.008981 mpjpe  3.351 MB
  int8 PTQ static  40.98% pck@20  94.98% pck@50  0.038262 mpjpe  1.046 MB  (-55.64pp)
  int8 QAT (3 ep)  67.48% pck@20  98.69% pck@50  0.026548 mpjpe  1.043 MB  (-29.15pp)

Verdict (honest no): int8 is NOT a win at the strict PCK@20 edge target. Static
PTQ collapses; QAT recovers a large share but still loses 29 pp @20 for a 3.2x
size win — keep fp32/fp16 on the edge. Disclosed: QAT fake-quant val pck@20 was
83.45% but converted int8 scores 67.48% (~16pp convert_fx gap, reported honestly).

Deliverables:
- v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py (reproducible:
  header carries the exact ssh command + run date; QAT primary, static PTQ fallback)
- docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md (MEASURED table,
  locked normalization, QAT-vs-PTQ labeling, verdict, reproduction, limitations)
- CHANGELOG [Unreleased] ### Added entry

No production Rust or signal-pipeline change. Python deterministic proof unchanged
(f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a, bit-exact).
rUv 2026-06-15 09:16:22 -04:00 committed by GitHub
parent b209b8b778
commit 0f64d23516
3 changed files with 467 additions and 0 deletions


@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. 
NaN-index poisoning (the calibration/vitals/geo class) is **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0`: all return `Ok(0)`, **no panic**. Fail-closed write path: a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path).
### Added
- **ADR-175: int8 quantization of the WiFlow-STD "half" pose model: MEASURED fp32-vs-int8 accuracy/size trade-off (honest negative).** Sub-deliverable 8.2 of the benchmark/optimization milestone, and the reading of the SOTA brief's "one untested edge lever" (QAT-int8 on the 843,834-param half model that strictly dominates the published 2.23M model). A new committed script `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py` quantizes `half_best.pth` to int8 two ways and scores both with the **same** upstream `calculate_pck`/`calculate_mpjpe` that produced the fp32 sweep numbers, under **one locked normalization** (ADR-173 torso-diameter PCK: neck idx2→pelvis idx12, `use_torso_norm=True`, the standard MM-Fi/GraphPose-Fi convention), on the **same** seed-42 file-level 70/15/15 test split (52,560 NaN-free / 54,000 full windows). **MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):** fp32 = 96.62% PCK@20 / 99.47% PCK@50 / 0.008981 MPJPE / 3.351 MB (fp32-CPU reproduces fp32-GPU to 4 dp, so the int8 deltas are pure quantization, not CPU/GPU drift); **int8 static PTQ = 40.98% PCK@20 (-55.64 pp), 1.046 MB**: naive static QDQ **collapses** on this model (the brief's 2.23M "sweet spot" does NOT transfer to the 843k half model at the tight @20 threshold); **int8 QAT (3-epoch FX fake-quant fine-tune from half_best) = 67.48% PCK@20 (-29.15 pp) / 98.69% PCK@50 (-0.78 pp), 1.043 MB.** **Verdict (honest no):** int8 is **not a win** at the strict PCK@20 edge target. QAT recovers a large share of the PTQ collapse and is near-lossless at the loose PCK@50 (coarse localization survives int8, fine does not), but a **3.2× size win for a 29 pp PCK@20 loss** is a bad trade when the half model already fits edge flash at fp32 → **keep fp32/fp16 on the edge for now.** **Disclosed gap:** the QAT *fake-quant* val PCK@20 reached 83.45% but the *converted* int8 model scores 67.48%, a real ~16 pp `convert_fx` gap (fbgemm int8 kernels ≠ straight-through 
estimate, esp. the axial-attention einsum/softmax); we report the converted-int8 number, not the fake-quant proxy. **MEASURED:** every table number + the PTQ collapse + the QAT partial recovery + the conversion gap. **CLAIMED/not done:** ONNX/TFLite export, on-edge-SoC latency/energy (int8 measured on x86 fbgemm — size transfers, latency does NOT), mixed-precision keeping attention fp32, longer/better-tuned QAT. **Honest limitations:** single in-domain eval split (no cross-environment split), x86-int8 not edge-SoC-int8, lightly-tuned QAT. Additive only — no production Rust or signal-pipeline change; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
- **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). 
**This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path).
- **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). 
**HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). 
The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
- **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream <RuView-URL>`: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). **Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — <upstream>` / `DISCONNECTED — <upstream> unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.


@ -0,0 +1,172 @@
# ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
| Field | Value |
|-------|-------|
| **Status** | Accepted — MEASURED, reproducible (honest negative) |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **EDGE-INT8** |
| **Sub-deliverable** | 8.2 of the benchmark/optimization milestone |
| **Metric lock** | ADR-173 (one declared PCK normalization for every reported number) |
| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (§edge int8) |
## Context
The SOTA brief described PTQ int8 on the **published 2.23M** WiFlow-STD pose net as
"fully characterized" (static QDQ conv-only = the sweet spot; dynamic int8 ≈ no-op on
this all-conv net), and named **QAT-int8 on the strictly-dominating 843,834-param
"half" model** as "the one untested edge lever." This ADR takes the reading of that
lever: a MEASURED fp32-vs-int8 trade-off for the half model, not a claim.
The half model (`half_best.pth`, 843,834 params) is the efficiency-sweep winner
from ADR-152 (`run_sweep.py` VARIANTS[0]: `tcn=[270,220,170,120]`,
`conv=[4,8,16,32]`, `attn_groups=4`). Its fp32 accuracy was recorded in the sweep;
this ADR re-measures it under the locked normalization and quantizes it.
**The whole point of this deliverable is reproducibility.** Every number below was
produced by running `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
on host `ruvultra` (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
the real seed-42 test split. The script + the exact command + the recorded stdout
**is** the proof artifact. Nothing here is estimated.
## Decision
Quantize the half model to int8 with **both** levers and report both honestly:
1. **QAT (primary target):** FX graph-mode quantization-aware training, fbgemm
backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5,
the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
2. **PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback):**
FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
### Locked normalization (ADR-173)
**Torso-diameter PCK** — neck (keypoint idx 2) → pelvis (idx 12) distance — the
standard MM-Fi/GraphPose-Fi convention. This is exactly the default
`use_torso_norm=True` path of the upstream harness's `utils/metrics.calculate_pck`.
The **same** `calculate_pck`/`calculate_mpjpe` that produced the sweep's fp32
numbers scores **both** fp32 and int8 here, so the comparison is metric-locked: no
normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
`half` test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
the same one.
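
The locked normalization can be sketched in a few lines. This is an illustrative pure-Python re-statement, not the upstream `calculate_pck`: the function name `torso_pck`, the nested-list input layout, measuring the torso on the ground truth, the `<=` comparison, and skipping zero-length torsos are all assumptions. Only the neck index (2), the pelvis index (12), and the threshold-as-fraction-of-torso idea come from this ADR.

```python
import math

NECK_IDX, PELVIS_IDX = 2, 12  # keypoint indices named in this ADR


def torso_pck(pred, gt, threshold=0.2):
    """Fraction of keypoints whose error is within threshold * torso length.

    pred, gt: lists of frames; each frame is a list of (x, y) keypoints.
    (Assumed layout; the upstream harness works on tensors.)
    """
    hits = total = 0
    for pred_frame, gt_frame in zip(pred, gt):
        # Torso length measured on the ground truth (assumption).
        torso = math.dist(gt_frame[NECK_IDX], gt_frame[PELVIS_IDX])
        if torso <= 0:
            continue  # skip degenerate frames instead of dividing by zero
        for p, g in zip(pred_frame, gt_frame):
            hits += math.dist(p, g) <= threshold * torso
            total += 1
    return hits / total if total else float("nan")


# One hypothetical frame with 13 keypoints and a torso length of exactly 1.0.
gt_frame = [(0.0, 0.0)] * 13
gt_frame[PELVIS_IDX] = (0.0, 1.0)
pred_frame = list(gt_frame)
pred_frame[0] = (0.1, 0.0)  # error 0.1: inside both PCK@20 and PCK@50
pred_frame[1] = (0.3, 0.0)  # error 0.3: outside PCK@20, inside PCK@50

pck20 = torso_pck([pred_frame], [gt_frame], threshold=0.2)
pck50 = torso_pck([pred_frame], [gt_frame], threshold=0.5)
print(pck20, pck50)
```

The toy frame shows the pattern the results table reports at scale: a prediction that misses at the tight 0.2 threshold can still count as a hit at 0.5.
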
### Device note (why int8 is CPU)
PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA. So int8 eval
is CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32
with CPU-vs-GPU), the script measures an **fp32-CPU** baseline too. fp32-CPU and
fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs 0.96623), so CPU/GPU
introduces no drift — the int8 deltas below are pure quantization effect.
## MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
Source: stdout of the run below + `~/wiflow-std-bench/sweep/int8/int8_results.json`.
| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
|-------|-------|-----------|--------|--------|-------|----------|----------|----------|
| **fp32** (cpu) | — | **3.351** | **96.62%** | **99.47%** | **0.008981** | — | — | 1.00× |
| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | **-55.64 pp** | -4.49 pp | 3.20× smaller |
| int8 QAT (3 ep) | **QAT** | 1.043 | 67.48% | 98.69% | 0.026548 | **-29.15 pp** | -0.78 pp | 3.21× smaller |

Full-test-set (54,000 windows incl. NaN-zero-filled files 487-499) tracks the
clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 (same shape,
recorded in the JSON).
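
The delta and size-win columns can be re-derived from the printed table values. The sketch below is a sanity check on the arithmetic only; the names `ROWS` and `trade_off` are hypothetical, and the inputs are the rounded figures from the table, not the unrounded values in `int8_results.json`.

```python
# (size_mb, pck20_pct, pck50_pct) copied from the results table.
ROWS = {
    "fp32": (3.351, 96.62, 99.47),
    "int8 PTQ static": (1.046, 40.98, 94.98),
    "int8 QAT (3 ep)": (1.043, 67.48, 98.69),
}


def trade_off(rows, baseline="fp32"):
    """Return per-variant PCK deltas (percentage points) and size ratio."""
    base_size, base_pck20, base_pck50 = rows[baseline]
    out = {}
    for name, (size, pck20, pck50) in rows.items():
        if name == baseline:
            continue
        out[name] = {
            "d_pck20_pp": round(pck20 - base_pck20, 2),
            "d_pck50_pp": round(pck50 - base_pck50, 2),
            "size_win": round(base_size / size, 2),
        }
    return out


summary = trade_off(ROWS)
for name, row in summary.items():
    print(name, row)
```

Recomputed from the rounded percentages, the QAT PCK@20 delta comes out at -29.14 pp, 0.01 pp away from the reported -29.15 pp. The likely explanation is that the reported delta was taken on unrounded values, but that is an assumption; the other deltas and both size ratios match the table.
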
### Verdict
**int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.**
- **PTQ static collapses** (-55.64 pp PCK@20). Naive static QDQ destroys the half
model. The "sweet spot" characterization from the brief does not transfer from
the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
- **QAT recovers roughly half of the PTQ loss** (PTQ 40.98% → QAT 67.48%, +26.50 pp)
but still **loses 29.15 pp** at PCK@20 for a 3.21× size reduction. At the loose
PCK@50 threshold QAT is nearly lossless (-0.78 pp), i.e. coarse localization
survives int8 but fine localization does not.
- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but
**3.2× compression at 29 pp PCK@20 is a bad trade** when the half model already
fits comfortably in edge flash at fp32. Recommendation: **keep fp32 (or fp16)
for the half model on the edge**; do not ship this int8 variant as-is.
### Observed fake-quant → int8 conversion gap (disclosed, not hidden)
During QAT the **fake-quant** model's val PCK@20 reached 83.45% (epoch 3), but the
**converted int8** model scores 67.48% on test. A ~16 pp drop on `convert_fx` is a
real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant
simulation (per-tensor activation quant + the axial-attention `einsum`/softmax path
quantize worse than the straight-through estimate predicts). This gap is the honest
reason QAT did not close the loss, and it is exactly the kind of number that would
be invisible if one only reported the fake-quant proxy. We report the **converted
int8** number as the deliverable, not the fake-quant proxy.
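
For readers unfamiliar with the term, the fake-quant proxy can be illustrated with a per-tensor affine quantize-then-dequantize round trip. This is a generic textbook sketch, not the fbgemm kernel and not the observer modules PyTorch inserts during QAT; the function name `fake_quant`, the 0..255 integer range, and the sample scale and zero point are assumptions chosen for illustration.

```python
def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Snap x to an integer grid, clamp to the int range, map back to float."""
    q = round(x / scale) + zero_point      # quantize
    q = max(qmin, min(qmax, q))            # clamp to the representable range
    return (q - zero_point) * scale        # dequantize


# Hypothetical per-tensor parameters: grid step 0.05, zero at integer 128.
SCALE, ZERO_POINT = 0.05, 128

snapped = fake_quant(0.12, SCALE, ZERO_POINT)     # lands on the nearest grid point
saturated = fake_quant(10.0, SCALE, ZERO_POINT)   # exceeds the range and clamps
print(snapped, saturated)
```

During QAT the network stays in floating point and only sees this kind of rounding and clamping, while the converted model runs real integer kernels. The sketch does not reproduce the ~16 pp gap; it only shows what the proxy simulates.
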
## Reproduction
```bash
ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
```
- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
(scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`,
`~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split),
upstream `models`/`dataset`/`utils/metrics`/`losses` (DY2434/WiFlow @ 06899d29,
Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/`
`half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`,
`int8_run.log`. **No existing file under `~/wiflow-std-bench` was modified.**
- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine,
`date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
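
A minimal sketch of reading the fp32 entries back out of the results file, assuming the layout the script builds (`results['fp32'][run]` holding `samples`, `mpjpe`, and `pck@NN` fractions). The helper name `fp32_summary` is hypothetical, and the `sample` dictionary is a stand-in assembled from values quoted in this ADR, not a copy of the real `int8_results.json`. The layout of the `int8` section is not shown in this excerpt, so it is left out.

```python
import json


def fp32_summary(results):
    """Return {run_name: (PCK@20 %, PCK@50 %, MPJPE)} for the fp32 entries."""
    return {
        name: (
            round(metrics["pck@20"] * 100, 2),
            round(metrics["pck@50"] * 100, 2),
            metrics["mpjpe"],
        )
        for name, metrics in results["fp32"].items()
    }


# Stand-in for json.load(open(".../sweep/int8/int8_results.json")).
sample = json.loads(
    '{"fp32": {"cpu_clean": {"samples": 52560, "mpjpe": 0.008981,'
    ' "pck@20": 0.96623, "pck@50": 0.9947}}}'
)

rows = fp32_summary(sample)
print(rows)
```

On ruvultra the same helper could be pointed at the real file by replacing `sample` with the parsed contents of `~/wiflow-std-bench/sweep/int8/int8_results.json`.
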
## What is MEASURED vs CLAIMED
- **MEASURED:** every PCK/MPJPE/size number in the table; the fp32 baseline (which
reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial
recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
- **CLAIMED / not done here:** ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo)
latency or energy (int8 here is measured on x86 fbgemm, the dev box, **not** an
edge SoC — the size number transfers, a latency number does **not**); a
per-layer mixed-precision search that might keep the attention block in fp32; QAT
beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next
levers if int8 is revisited; none is asserted as a result.
## Honest scope / limitations
- **Single eval split** — one seed-42 file-level test partition; no cross-room /
cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is
a separate, harder split and is not what is measured here).
- **In-domain only** — these are in-distribution test numbers; they say nothing
about the cross-environment robustness gap.
- **x86 int8, not edge-SoC int8** — accuracy and size transfer to an edge int8
runtime; the runtime/latency does not (different kernels, different SoC). No
latency claim is made.
- **QAT lightly tuned** — 3 epochs, single LR, default fbgemm qconfig. A longer /
better-tuned QAT might narrow the 29 pp, but on the evidence here int8 does not
reach fp32 at PCK@20, and that is the reportable result today.
## Consequences
### Positive
- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The
edge int8 question for the half model is answered with reproducible numbers: at
the strict PCK@20 target it loses, and we can say so with a committed script.
- Establishes a reusable, metric-locked quantization+eval harness
(`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
### Negative
- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
or signal-pipeline change; Python deterministic proof hash
`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
### Neutral
- The negative verdict means the half model stays fp32/fp16 on the edge for now.
int8 for these compact pose nets is parked pending the next-lever work above.
## Links
- ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
- ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep
(produced `half_best.pth`)
- `docs/research/sota-nn-train-benchmark-brief.md` — §edge int8 (the "one untested
lever" this ADR measures)
- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`


@ -0,0 +1,294 @@
#!/usr/bin/env python3
"""ADR-175: int8 quantization of the WiFlow-STD "half" pose model + MEASURED accuracy/size trade-off.

Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the 843,834-param
"half" WiFlow-STD pose model to int8 (QAT primary, static-PTQ fallback) and MEASURES the
accuracy delta against the fp32 baseline under ONE locked PCK normalization.

LOCKED NORMALIZATION (ADR-173): torso-diameter PCK = neck(idx 2)->pelvis(idx 12) distance,
exactly the default `use_torso_norm=True` path of upstream `utils/metrics.calculate_pck`,
which is the standard MM-Fi/GraphPose-Fi convention. The SAME `calculate_pck` /
`calculate_mpjpe` from the upstream harness scores BOTH fp32 and int8 so the comparison is
metric-locked. The test split is the seed-42 file-level 70/15/15 test partition (54,000
windows full / 52,560 NaN-free) produced by the SAME loader that produced half_best.pth.

int8 backend: FX graph-mode quantization, fbgemm engine (server x86 int8). Quantized int8
kernels execute on CPU, so int8 eval is CPU; an fp32-CPU baseline is also measured so the
accuracy delta is device-matched (CPU fp32 vs CPU int8), and an fp32-GPU number is reported
for continuity with the sweep's recorded numbers.

REPRODUCE (exact command run for ADR-175, run date 2026-06-15, on host ruvultra / RTX 5080):
    ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
        python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
(the script lives in-repo at v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py;
it was scp'd to ~/quantize_half_int8.py on ruvultra and invoked as above. It is read-only
to everything under ~/wiflow-std-bench except that it WRITES its int8 artifacts + a JSON
results file into ~/wiflow-std-bench/sweep/int8/; it never modifies half_best.pth or any
upstream file.)

Everything this script prints to stdout is MEASURED. Nothing is estimated.
"""
import argparse
import copy
import json
import os
import random
import sys
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

BENCH = os.path.expanduser('~/wiflow-std-bench')
SWEEP = os.path.join(BENCH, 'sweep')
OUTDIR = os.path.join(SWEEP, 'int8')
sys.path.insert(0, os.path.join(BENCH, 'upstream'))
sys.path.insert(0, SWEEP)

from dataset import (PreprocessedCSIKeypointsDataset,  # noqa: E402
                     create_preprocessed_train_val_test_loaders)
from losses.pose_loss import PoseLoss  # noqa: E402
from utils.metrics import calculate_pck, calculate_mpjpe  # noqa: E402  LOCKED metric (torso norm)
from model_compact import CompactWiFlowPoseModel, describe  # noqa: E402

# half variant config: IDENTICAL to sweep/run_sweep.py VARIANTS[0] that produced half_best.pth
HALF = dict(tcn=[270, 220, 170, 120], conv=[4, 8, 16, 32], attn_groups=4,
            groups_mode='gcd20', input_pw_groups=1)
HALF_CKPT = os.path.join(SWEEP, 'half_best.pth')
CORRUPT_FILE_START = 487  # files 487-499 were zero-filled by clean_nan.py (same as sweep)
SEED = 42
THRESHOLDS = (0.1, 0.2, 0.3, 0.4, 0.5)  # PCK@10..50


def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def build_half(dropout=0.5):
    return CompactWiFlowPoseModel(
        tcn_channels=HALF['tcn'], conv_channels=HALF['conv'],
        attn_groups=HALF['attn_groups'], groups_mode=HALF['groups_mode'],
        input_pw_groups=HALF['input_pw_groups'], dropout=dropout)


@torch.no_grad()
def evaluate(model, loader, device):
    """MEASURED PCK@10..50 + MPJPE under the LOCKED torso-diameter normalization."""
    model.eval()
    totals = {t: 0.0 for t in THRESHOLDS}
    total_mpe, n = 0.0, 0
    for bx, by in loader:
        bx, by = bx.to(device), by.to(device)
        out = model(bx)
        bs = by.size(0)
        # Weight each batch by its size so the final figures are per-sample means.
        total_mpe += calculate_mpjpe(out, by) * bs
        pck = calculate_pck(out, by, thresholds=list(totals))  # use_torso_norm=True default
        for t in totals:
            totals[t] += pck[t] * bs
        n += bs
    return {'samples': n, 'mpjpe': total_mpe / n,
            **{f'pck@{int(t * 100)}': totals[t] / n for t in totals}}


def file_size_mb(path):
    return os.path.getsize(path) / (1024 * 1024)


def state_dict_size_mb(model, path):
    """On-disk size of the *quantized* checkpoint (int8 weights are packed by fbgemm)."""
    torch.save(model.state_dict(), path)
    return file_size_mb(path)


def loaders():
    set_seed(SEED)
    data_dir = os.path.join(BENCH, 'preprocessed_csi_data')
    dataset = PreprocessedCSIKeypointsDataset(data_dir=data_dir, keypoint_scale=1000.0,
                                              enable_temporal_clean=True)
    train_loader, val_loader, test_loader = create_preprocessed_train_val_test_loaders(
        dataset=dataset, batch_size=64, num_workers=2, random_seed=SEED)
    return dataset, train_loader, val_loader, test_loader


def clean_loader_from(dataset, test_loader, bs=256):
    # Keep only test windows that come from files below the zero-filled range.
    w2f = dataset.window_to_file
    clean_idx = [i for i in test_loader.dataset.indices if w2f[i] < CORRUPT_FILE_START]
    return DataLoader(Subset(dataset, clean_idx), batch_size=bs, shuffle=False, num_workers=2)


def eval_loaders(dataset, test_loader, bs=256):
    full = DataLoader(test_loader.dataset, batch_size=bs, shuffle=False, num_workers=2)
    clean = clean_loader_from(dataset, test_loader, bs=bs)
    return full, clean


# --------------------------------------------------------------- int8 paths (FX graph mode)
def ptq_static(fp32_model, train_loader, calib_batches=64):
    """Static post-training quantization, FX graph mode, fbgemm. CPU int8."""
    from torch.ao.quantization import get_default_qconfig, QConfigMapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
    torch.backends.quantized.engine = 'fbgemm'
    m = copy.deepcopy(fp32_model).cpu().eval()
    qconfig = get_default_qconfig('fbgemm')
    qmap = QConfigMapping().set_global(qconfig)
    example = torch.randn(1, 540, 20)
    prepared = prepare_fx(m, qmap, example_inputs=(example,))
    prepared.eval()
    # Calibration pass: run a fixed number of train batches through the observers.
    with torch.no_grad():
        for i, (bx, _) in enumerate(train_loader):
            prepared(bx.cpu())
            if i + 1 >= calib_batches:
                break
    return convert_fx(prepared)


def qat(fp32_model, train_loader, val_loader, device, epochs=3, lr=2e-5):
    """Quantization-aware training, FX graph mode, fbgemm. Fine-tune fake-quant from fp32, convert. CPU int8."""
    from torch.ao.quantization import get_default_qat_qconfig, QConfigMapping
    from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx
    torch.backends.quantized.engine = 'fbgemm'
    set_seed(SEED)
    m = copy.deepcopy(fp32_model).to(device).train()
    qconfig = get_default_qat_qconfig('fbgemm')
    qmap = QConfigMapping().set_global(qconfig)
    example = torch.randn(1, 540, 20).to(device)
    prepared = prepare_qat_fx(m, qmap, example_inputs=(example,))
    prepared.to(device)
    criterion = PoseLoss(position_weight=1.0, bone_weight=0.2, loss_type='smooth_l1')
    opt = torch.optim.AdamW(prepared.parameters(), lr=lr, weight_decay=5e-5, betas=(0.9, 0.999))
    best_val = float('inf')
    best_state = None
    for ep in range(1, epochs + 1):
        prepared.train()
        t0 = time.time()
        ep_loss, nb = 0.0, 0
        for bx, by in train_loader:
            bx, by = bx.to(device), by.to(device)
            opt.zero_grad(set_to_none=True)
            out = prepared(bx)
            loss, _ = criterion(out, by)
            if not torch.isfinite(loss):
                continue  # skip non-finite batches instead of corrupting the weights
            loss.backward()
            opt.step()
            ep_loss += loss.item()
            nb += 1
        # eval the fake-quant model on GPU (proxy for int8) to pick the best epoch
        prepared.eval()
        v = evaluate(prepared, val_loader, device)
        print(f"[qat] epoch {ep}/{epochs} train_loss={ep_loss / max(nb,1):.5f} "
              f"val_mpjpe(fakequant)={v['mpjpe']:.5f} val_pck20={v['pck@20']*100:.2f}% "
              f"({time.time()-t0:.0f}s)", flush=True)
        if v['mpjpe'] < best_val:
            best_val = v['mpjpe']
            best_state = copy.deepcopy(prepared.state_dict())
    if best_state is not None:
        prepared.load_state_dict(best_state)
    prepared.cpu().eval()
    return convert_fx(prepared)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--mode', choices=['ptq', 'qat', 'both'], default='both')
    ap.add_argument('--qat-epochs', type=int, default=3)
    ap.add_argument('--calib-batches', type=int, default=64)
    args = ap.parse_args()
    os.makedirs(OUTDIR, exist_ok=True)
    cuda = torch.device('cuda')
    cpu = torch.device('cpu')
    print(f"torch {torch.__version__} | cuda {torch.cuda.get_device_name(0)} | "
          f"quantized.engine candidates {torch.backends.quantized.supported_engines}", flush=True)
    dataset, train_loader, val_loader, test_loader = loaders()
    test_full, test_clean = eval_loaders(dataset, test_loader)

    # ---------- fp32 baseline (loads half_best.pth strict; same arch as sweep) ----------
    fp32 = build_half().eval()
    state = torch.load(HALF_CKPT, map_location='cpu', weights_only=True)
    fp32.load_state_dict(state, strict=True)
    fp32_size = file_size_mb(HALF_CKPT)
    params = describe(fp32)['params']
    print(f"\n=== fp32 baseline: half_best.pth | params={params:,} | "
          f"on-disk={fp32_size:.3f} MB ===", flush=True)
    results = {
        'host': os.uname().nodename, 'gpu': torch.cuda.get_device_name(0),
        'torch': torch.__version__, 'date_utc': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'locked_normalization': 'torso-diameter (neck idx2 -> pelvis idx12), '
                                'upstream calculate_pck use_torso_norm=True (ADR-173 standard)',
        'checkpoint': HALF_CKPT, 'params': params, 'fp32_size_mb': fp32_size,
        'test_split': 'seed-42 file-level 70/15/15 test (full 54000 / clean 52560)',
        'fp32': {}, 'int8': {},
    }
    fp32_gpu = build_half().to(cuda).eval()
    fp32_gpu.load_state_dict(state, strict=True)
    print('[fp32/gpu] full ...', flush=True)
    results['fp32']['gpu_full'] = evaluate(fp32_gpu, test_full, cuda)
    print(json.dumps(results['fp32']['gpu_full']), flush=True)
    print('[fp32/gpu] clean ...', flush=True)
    results['fp32']['gpu_clean'] = evaluate(fp32_gpu, test_clean, cuda)
    print(json.dumps(results['fp32']['gpu_clean']), flush=True)
    print('[fp32/cpu] full (device-matched ref for int8) ...', flush=True)
    results['fp32']['cpu_full'] = evaluate(fp32.to(cpu), test_full, cpu)
    print(json.dumps(results['fp32']['cpu_full']), flush=True)
    print('[fp32/cpu] clean ...', flush=True)
    results['fp32']['cpu_clean'] = evaluate(fp32.to(cpu), test_clean, cpu)
    print(json.dumps(results['fp32']['cpu_clean']), flush=True)

    # ---------- int8 ----------
    def measure_int8(label, qmodel):
        path = os.path.join(OUTDIR, f'half_int8_{label}.pth')
        size = state_dict_size_mb(qmodel, path)
        print(f"[int8/{label}] on-disk={size:.3f} MB | full ...", flush=True)
        full = evaluate(qmodel, test_full, cpu)
        print(json.dumps(full), flush=True)
        print(f"[int8/{label}] clean ...", flush=True)
        clean = evaluate(qmodel, test_clean, cpu)
        print(json.dumps(clean), flush=True)
        results['int8'][label] = {'size_mb': size, 'checkpoint': path,
                                  'cpu_full': full, 'cpu_clean': clean}

    if args.mode in ('ptq', 'both'):
        print("\n=== int8 PTQ (static, FX, fbgemm) ===", flush=True)
        qp = ptq_static(fp32.to(cpu).eval(), train_loader, calib_batches=args.calib_batches)
        measure_int8('ptq_static', qp)
    if args.mode in ('qat', 'both'):
        print(f"\n=== int8 QAT (FX, fbgemm, {args.qat_epochs} epochs from half_best) ===", flush=True)
        qq = qat(fp32, train_loader, val_loader, cuda, epochs=args.qat_epochs)
        measure_int8('qat', qq)

    out = os.path.join(OUTDIR, 'int8_results.json')
    with open(out, 'w') as f:
        json.dump(results, f, indent=2)
    print('\nwrote', out, flush=True)

    # ---------- comparison table (MEASURED) ----------
    print("\n================= MEASURED COMPARISON (clean test subset, torso-PCK) =================", flush=True)
    base = results['fp32']['cpu_clean']
    print(f"{'model':16s} {'size_MB':>8s} {'pck@20':>8s} {'pck@50':>8s} {'mpjpe':>9s}", flush=True)
    print(f"{'fp32 (cpu)':16s} {fp32_size:8.3f} {base['pck@20']*100:7.2f}% {base['pck@50']*100:7.2f}% {base['mpjpe']:9.6f}", flush=True)
    for label, r in results['int8'].items():
        c = r['cpu_clean']
        d20 = (c['pck@20'] - base['pck@20']) * 100
        d50 = (c['pck@50'] - base['pck@50']) * 100
        print(f"{'int8 '+label:16s} {r['size_mb']:8.3f} {c['pck@20']*100:7.2f}% {c['pck@50']*100:7.2f}% {c['mpjpe']:9.6f} "
              f"(d_pck20={d20:+.2f}pp d_pck50={d50:+.2f}pp size={fp32_size/r['size_mb']:.2f}x smaller)", flush=True)


if __name__ == '__main__':
    main()