From 0f64d23516460a511242f52a851a893ce0fd0fc6 Mon Sep 17 00:00:00 2001
From: rUv
Date: Mon, 15 Jun 2026 09:16:22 -0400
Subject: [PATCH] feat(bench): int8 quantization of WiFlow-STD half pose model — MEASURED trade-off (ADR-175, honest negative) (#1095)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the
843,834-param "half" WiFlow-STD pose model (half_best.pth) to int8 two ways
and MEASURES the accuracy/size trade-off vs fp32 under ONE locked
normalization (ADR-173 torso-diameter PCK, upstream calculate_pck
use_torso_norm=True), on the same seed-42 file-level 70/15/15 test split
that produced the fp32 sweep numbers.

MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):
  fp32             96.62% pck@20  99.47% pck@50  0.008981 mpjpe  3.351 MB
  int8 PTQ static  40.98% pck@20  94.98% pck@50  0.038262 mpjpe  1.046 MB  (-55.64pp)
  int8 QAT (3 ep)  67.48% pck@20  98.69% pck@50  0.026548 mpjpe  1.043 MB  (-29.15pp)

Verdict (honest no): int8 is NOT a win at the strict PCK@20 edge target.
Static PTQ collapses; QAT recovers a large share but still loses 29 pp @20
for a 3.2x size win — keep fp32/fp16 on the edge. Disclosed: QAT fake-quant
val pck@20 was 83.45% but converted int8 scores 67.48% (~16pp convert_fx gap,
reported honestly).

Deliverables:
- v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py (reproducible:
  header carries the exact ssh command + run date; QAT primary, static PTQ
  fallback)
- docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md (MEASURED
  table, locked normalization, QAT-vs-PTQ labeling, verdict, reproduction,
  limitations)
- CHANGELOG [Unreleased] ### Added entry

No production Rust or signal-pipeline change. Python deterministic proof
unchanged (f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a,
bit-exact).
---
 CHANGELOG.md                                  |   1 +
 ...8-quantization-half-pose-model-measured.md | 172 ++++++++++
 .../scripts/quantize_half_int8.py             | 294 ++++++++++++++++++
 3 files changed, 467 insertions(+)
 create mode 100644 docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md
 create mode 100644 v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 25736a07..b16fdc59 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. 
Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. NaN-index poisoning (the calibration/vitals/geo class) — **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0` — all return `Ok(0)`, **no panic**. Fail-closed write path — a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path). 
### Added +- **ADR-175: int8 quantization of the WiFlow-STD "half" pose model — MEASURED fp32-vs-int8 accuracy/size trade-off (honest negative).** Sub-deliverable 8.2 of the benchmark/optimization milestone, and the reading of the SOTA brief's "one untested edge lever" (QAT-int8 on the 843,834-param half model that strictly dominates the published 2.23M model). A new committed script `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py` quantizes `half_best.pth` to int8 two ways and scores both with the **same** upstream `calculate_pck`/`calculate_mpjpe` that produced the fp32 sweep numbers, under **one locked normalization** (ADR-173 torso-diameter PCK — neck idx2→pelvis idx12, `use_torso_norm=True`, the standard MM-Fi/GraphPose-Fi convention), on the **same** seed-42 file-level 70/15/15 test split (52,560 NaN-free / 54,000 full windows). **MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):** fp32 = 96.62% PCK@20 / 99.47% PCK@50 / 0.008981 MPJPE / 3.351 MB (fp32-CPU reproduces fp32-GPU to 4 dp, so the int8 deltas are pure quantization, not CPU/GPU drift); **int8 static PTQ = 40.98% PCK@20 (−55.64 pp), 1.046 MB** — naive static QDQ **collapses** on this model (the brief's 2.23M "sweet spot" does NOT transfer to the 843k half model at the tight @20 threshold); **int8 QAT (3-epoch FX fake-quant fine-tune from half_best) = 67.48% PCK@20 (−29.15 pp) / 98.69% PCK@50 (−0.78 pp), 1.043 MB.** **Verdict (honest no):** int8 is **not a win** at the strict PCK@20 edge target — QAT recovers a large share of the PTQ collapse and is near-lossless at the loose PCK@50 (coarse localization survives int8, fine does not), but a **3.2× size win at −29 pp PCK@20** is a bad trade when the half model already fits edge flash at fp32 → **keep fp32/fp16 on the edge for now.** **Disclosed gap:** the QAT *fake-quant* val PCK@20 reached 83.45% but the *converted* int8 model scores 67.48% — a real ~16 pp `convert_fx` gap (fbgemm int8 kernels ≠ 
straight-through estimate, esp. the axial-attention einsum/softmax); we report the converted-int8 number, not the fake-quant proxy. **MEASURED:** every table number + the PTQ collapse + the QAT partial recovery + the conversion gap. **CLAIMED/not done:** ONNX/TFLite export, on-edge-SoC latency/energy (int8 measured on x86 fbgemm — size transfers, latency does NOT), mixed-precision keeping attention fp32, longer/better-tuned QAT. **Honest limitations:** single in-domain eval split (no cross-environment split), x86-int8 not edge-SoC-int8, lightly-tuned QAT. Additive only — no production Rust or signal-pipeline change; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path). - **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). 
New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). **This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path). - **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. 
`nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). **HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). 
The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path). - **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream `: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). 
**Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — ` / `DISCONNECTED — unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.
diff --git a/docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md b/docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md
new file mode 100644
index 00000000..cc63e33a
--- /dev/null
+++ b/docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md
@@ -0,0 +1,172 @@
+# ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
+
+| Field | Value |
+|-------|-------|
+| **Status** | Accepted — MEASURED, reproducible (honest negative) |
+| **Date** | 2026-06-15 |
+| **Deciders** | ruv |
+| **Codename** | **EDGE-INT8** |
+| **Sub-deliverable** | 8.2 of the benchmark/optimization milestone |
+| **Metric lock** | ADR-173 (one declared PCK normalization for every reported number) |
+| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (§edge int8) |
+
+## Context
+
+The SOTA brief characterized the int8 edge story for the WiFlow-STD pose net as
+"fully characterized" for PTQ on the **published 2.23M** model (static QDQ
+conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named
+**QAT-int8 on the strictly-dominating 843,834-param "half" model** as "the one
+untested edge lever." This ADR is the reading of that lever — a MEASURED
+fp32-vs-int8 trade-off for the half model, not a claim.
+
+The half model (`half_best.pth`, 843,834 params) is the efficiency-sweep winner
+from ADR-152 (`run_sweep.py` VARIANTS[0]: `tcn=[270,220,170,120]`,
+`conv=[4,8,16,32]`, `attn_groups=4`).
Its fp32 accuracy was recorded in the sweep;
+this ADR re-measures it under the locked normalization and quantizes it.
+
+**The whole point of this deliverable is reproducibility.** Every number below was
+produced by running `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
+on host `ruvultra` (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
+the real seed-42 test split. The script + the exact command + the recorded stdout
+**is** the proof artifact. Nothing here is estimated.
+
+## Decision
+
+Quantize the half model to int8 with **both** levers and report both honestly:
+
+1. **QAT (primary target)** — FX graph-mode quantization-aware training, fbgemm
+   backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5,
+   the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
+2. **PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback)** —
+   FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
+
+### Locked normalization (ADR-173)
+
+**Torso-diameter PCK** — neck (keypoint idx 2) → pelvis (idx 12) distance — the
+standard MM-Fi/GraphPose-Fi convention. This is exactly the default
+`use_torso_norm=True` path of the upstream harness's `utils/metrics.calculate_pck`.
+The **same** `calculate_pck`/`calculate_mpjpe` that produced the sweep's fp32
+numbers scores **both** fp32 and int8 here, so the comparison is metric-locked: no
+normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
+`half` test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
+the same one.
+
+### Device note (why int8 is CPU)
+
+PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA. So int8 eval
+is CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32
+with CPU-vs-GPU), the script measures an **fp32-CPU** baseline too.
fp32-CPU and
+fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs 0.96623), so CPU/GPU
+introduces no drift — the int8 deltas below are pure quantization effect.
+
+## MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
+
+Source: stdout of the run below + `~/wiflow-std-bench/sweep/int8/int8_results.json`.
+
+| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
+|-------|-------|-----------|--------|--------|-------|----------|----------|----------|
+| **fp32** (cpu) | — | **3.351** | **96.62%** | **99.47%** | **0.008981** | — | — | 1.00× |
+| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | **−55.64 pp** | −4.49 pp | 3.20× smaller |
+| int8 QAT (3 ep) | **QAT** | 1.043 | 67.48% | 98.69% | 0.026548 | **−29.15 pp** | −0.78 pp | 3.21× smaller |
+
+Full-test-set (54,000 windows incl. NaN-zero-filled files 487–499) tracks the
+clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape,
+recorded in the JSON.
+
+### Verdict
+
+**int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.**
+
+- **PTQ static collapses** (−55.64 pp PCK@20). Naive static QDQ destroys the half
+  model. The "sweet spot" characterization from the brief does not transfer from
+  the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
+- **QAT recovers a large share of the relative gap** (PTQ 40.98% → QAT 67.48%) but
+  still **loses 29.15 pp** at PCK@20 for a 3.21× size reduction. At the loose
+  PCK@50 threshold QAT is nearly lossless (−0.78 pp), i.e. coarse-localization
+  survives int8 but fine-localization does not.
+- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but
+  **3.2× compression at −29 pp PCK@20 is a bad trade** when the half model already
+  fits comfortably in edge flash at fp32. Recommendation: **keep fp32 (or fp16)
+  for the half model on the edge**; do not ship this int8 variant as-is.
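The Δ and size-win columns of the MEASURED table can be re-derived from its other columns. The following is a minimal Python sketch, assuming only the rounded figures printed in that table; the names `FP32`, `INT8`, and `trade_off` are illustrative and are not part of `quantize_half_int8.py`. Because it starts from rounded percentages, it yields −29.14 pp for the QAT Δ PCK@20 rather than the reported −29.15 pp, presumably because the script takes the difference of the unrounded values before formatting.

```python
# Rounded figures copied from the MEASURED table (clean test subset, torso-PCK).
FP32 = {"size_mb": 3.351, "pck20": 96.62, "pck50": 99.47}
INT8 = {
    "ptq_static": {"size_mb": 1.046, "pck20": 40.98, "pck50": 94.98},
    "qat": {"size_mb": 1.043, "pck20": 67.48, "pck50": 98.69},
}


def trade_off(base, quant):
    """Return (delta PCK@20 in pp, delta PCK@50 in pp, size ratio) vs the fp32 baseline."""
    d20 = round(quant["pck20"] - base["pck20"], 2)
    d50 = round(quant["pck50"] - base["pck50"], 2)
    ratio = round(base["size_mb"] / quant["size_mb"], 2)
    return d20, d50, ratio


for label, row in INT8.items():
    d20, d50, ratio = trade_off(FP32, row)
    print(f"{label:10s} d_pck20={d20:+.2f}pp d_pck50={d50:+.2f}pp size={ratio:.2f}x smaller")
```

The same arithmetic applied to the unrounded values in `int8_results.json` would be the authoritative check; this sketch only confirms that the table is internally consistent to within rounding.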
+
+### Observed fake-quant → int8 conversion gap (disclosed, not hidden)
+
+During QAT the **fake-quant** model's val PCK@20 reached 83.45% (epoch 3), but the
+**converted int8** model scores 67.48% on test. A ~16 pp drop on `convert_fx` is a
+real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant
+simulation (per-tensor activation quant + the axial-attention `einsum`/softmax path
+quantize worse than the straight-through estimate predicts). This gap is the honest
+reason QAT did not close the loss, and it is exactly the kind of number that would
+be invisible if one only reported the fake-quant proxy. We report the **converted
+int8** number as the deliverable, not the fake-quant proxy.
+
+## Reproduction
+
+```bash
+ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
+  python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
+```
+
+- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
+  (scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
+- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`,
+  `~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split),
+  upstream `models`/`dataset`/`utils/metrics`/`losses` (DY2434/WiFlow @ 06899d29,
+  Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
+- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/` —
+  `half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`,
+  `int8_run.log`. **No existing file under `~/wiflow-std-bench` was modified.**
+- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine,
+  `date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
+
+## What is MEASURED vs CLAIMED
+
+- **MEASURED:** every PCK/MPJPE/size number in the table; the fp32 baseline (which
+  reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial
+  recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
+- **CLAIMED / not done here:** ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo)
+  latency or energy (int8 here is measured on x86 fbgemm, the dev box, **not** an
+  edge SoC — the size number transfers, a latency number does **not**); a
+  per-layer mixed-precision search that might keep the attention block in fp32; QAT
+  beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next
+  levers if int8 is revisited; none is asserted as a result.
+
+## Honest scope / limitations
+
+- **Single eval split** — one seed-42 file-level test partition; no cross-room /
+  cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is
+  a separate, harder split and is not what is measured here).
+- **In-domain only** — these are in-distribution test numbers; they say nothing
+  about the cross-environment robustness gap.
+- **x86 int8, not edge-SoC int8** — accuracy and size transfer to an edge int8
+  runtime; the runtime/latency does not (different kernels, different SoC). No
+  latency claim is made.
+- **QAT lightly tuned** — 3 epochs, single LR, default fbgemm qconfig. A longer /
+  better-tuned QAT might narrow the −29 pp, but on the evidence here int8 does not
+  reach fp32 at PCK@20, and that is the reportable result today.
+
+## Consequences
+
+### Positive
+- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The
+  edge int8 question for the half model is answered with reproducible numbers: at
+  the strict PCK@20 target it loses, and we can say so with a committed script.
+- Establishes a reusable, metric-locked quantization+eval harness
+  (`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
+
+### Negative
+- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
+  or signal-pipeline change; Python deterministic proof hash
+  `f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
+
+### Neutral
+- The negative verdict means the half model stays fp32/fp16 on the edge for now.
+  int8 for these compact pose nets is parked pending the next-lever work above.
+
+## Links
+- ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
+- ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep
+  (produced `half_best.pth`)
+- `docs/research/sota-nn-train-benchmark-brief.md` — §edge int8 (the "one untested
+  lever" this ADR measures)
+- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
diff --git a/v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py b/v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py
new file mode 100644
index 00000000..0386f687
--- /dev/null
+++ b/v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py
@@ -0,0 +1,294 @@
+#!/usr/bin/env python3
+"""ADR-175: int8 quantization of the WiFlow-STD "half" pose model + MEASURED accuracy/size trade-off.
+
+Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the 843,834-param
+"half" WiFlow-STD pose model to int8 (QAT primary, static-PTQ fallback) and MEASURES the
+accuracy delta against the fp32 baseline under ONE locked PCK normalization.
+
+LOCKED NORMALIZATION (ADR-173): torso-diameter PCK — neck(idx 2)->pelvis(idx 12) distance,
+exactly the default `use_torso_norm=True` path of upstream `utils/metrics.calculate_pck`,
+which is the standard MM-Fi/GraphPose-Fi convention. The SAME `calculate_pck` /
+`calculate_mpjpe` from the upstream harness scores BOTH fp32 and int8 so the comparison is
+metric-locked. The test split is the seed-42 file-level 70/15/15 test partition (54,000
+windows full / 52,560 NaN-free) produced by the SAME loader that produced half_best.pth.
+
+int8 backend: FX graph-mode quantization, fbgemm engine (server x86 int8).
Quantized int8
+kernels execute on CPU, so int8 eval is CPU; an fp32-CPU baseline is also measured so the
+accuracy delta is device-matched (CPU fp32 vs CPU int8), and an fp32-GPU number is reported
+for continuity with the sweep's recorded numbers.
+
+REPRODUCE (exact command run for ADR-175, run date 2026-06-15, on host ruvultra / RTX 5080):
+  ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
+      python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
+
+  (the script lives in-repo at v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py;
+  it was scp'd to ~/quantize_half_int8.py on ruvultra and invoked as above. It is read-only
+  to everything under ~/wiflow-std-bench except that it WRITES its int8 artifacts + a JSON
+  results file into ~/wiflow-std-bench/sweep/int8/ — it never modifies half_best.pth or any
+  upstream file.)
+
+Everything this script prints to stdout is MEASURED. Nothing is estimated.
+"""
+import argparse
+import copy
+import json
+import os
+import random
+import sys
+import time
+
+import numpy as np
+import torch
+import torch.nn as nn
+from torch.utils.data import DataLoader, Subset
+
+BENCH = os.path.expanduser('~/wiflow-std-bench')
+SWEEP = os.path.join(BENCH, 'sweep')
+OUTDIR = os.path.join(SWEEP, 'int8')
+sys.path.insert(0, os.path.join(BENCH, 'upstream'))
+sys.path.insert(0, SWEEP)
+
+from dataset import (PreprocessedCSIKeypointsDataset,  # noqa: E402
+                     create_preprocessed_train_val_test_loaders)
+from losses.pose_loss import PoseLoss  # noqa: E402
+from utils.metrics import calculate_pck, calculate_mpjpe  # noqa: E402 LOCKED metric (torso norm)
+from model_compact import CompactWiFlowPoseModel, describe  # noqa: E402
+
+# half variant config — IDENTICAL to sweep/run_sweep.py VARIANTS[0] that produced half_best.pth
+HALF = dict(tcn=[270, 220, 170, 120], conv=[4, 8, 16, 32], attn_groups=4,
+            groups_mode='gcd20', input_pw_groups=1)
+HALF_CKPT = os.path.join(SWEEP, 'half_best.pth')
+CORRUPT_FILE_START = 487  # 
files 487-499 were zero-filled by clean_nan.py (same as sweep)
+SEED = 42
+THRESHOLDS = (0.1, 0.2, 0.3, 0.4, 0.5)  # PCK@10..50
+
+
+def set_seed(seed=SEED):
+    random.seed(seed)
+    np.random.seed(seed)
+    torch.manual_seed(seed)
+    torch.cuda.manual_seed_all(seed)
+    torch.backends.cudnn.deterministic = True
+    torch.backends.cudnn.benchmark = False
+
+
+def build_half(dropout=0.5):
+    return CompactWiFlowPoseModel(
+        tcn_channels=HALF['tcn'], conv_channels=HALF['conv'],
+        attn_groups=HALF['attn_groups'], groups_mode=HALF['groups_mode'],
+        input_pw_groups=HALF['input_pw_groups'], dropout=dropout)
+
+
+@torch.no_grad()
+def evaluate(model, loader, device):
+    """MEASURED PCK@10..50 + MPJPE under the LOCKED torso-diameter normalization."""
+    model.eval()
+    totals = {t: 0.0 for t in THRESHOLDS}
+    total_mpe, n = 0.0, 0
+    for bx, by in loader:
+        bx, by = bx.to(device), by.to(device)
+        out = model(bx)
+        bs = by.size(0)
+        total_mpe += calculate_mpjpe(out, by) * bs
+        pck = calculate_pck(out, by, thresholds=list(totals))  # use_torso_norm=True default
+        for t in totals:
+            totals[t] += pck[t] * bs
+        n += bs
+    return {'samples': n, 'mpjpe': total_mpe / n,
+            **{f'pck@{int(t * 100)}': totals[t] / n for t in totals}}
+
+
+def file_size_mb(path):
+    return os.path.getsize(path) / (1024 * 1024)
+
+
+def state_dict_size_mb(model, path):
+    """On-disk size of the *quantized* checkpoint (int8 weights are packed by fbgemm)."""
+    torch.save(model.state_dict(), path)
+    return file_size_mb(path)
+
+
+def loaders():
+    set_seed(SEED)
+    data_dir = os.path.join(BENCH, 'preprocessed_csi_data')
+    dataset = PreprocessedCSIKeypointsDataset(data_dir=data_dir, keypoint_scale=1000.0,
+                                              enable_temporal_clean=True)
+    train_loader, val_loader, test_loader = create_preprocessed_train_val_test_loaders(
+        dataset=dataset, batch_size=64, num_workers=2, random_seed=SEED)
+    return dataset, train_loader, val_loader, test_loader
+
+
+def clean_loader_from(dataset, test_loader, bs=256):
+    w2f = dataset.window_to_file
+    clean_idx = [i for i in test_loader.dataset.indices if w2f[i] < CORRUPT_FILE_START]
+    return DataLoader(Subset(dataset, clean_idx), batch_size=bs, shuffle=False, num_workers=2)
+
+
+def eval_loaders(dataset, test_loader, bs=256):
+    full = DataLoader(test_loader.dataset, batch_size=bs, shuffle=False, num_workers=2)
+    clean = clean_loader_from(dataset, test_loader, bs=bs)
+    return full, clean
+
+
+# --------------------------------------------------------------- int8 paths (FX graph mode)
+def ptq_static(fp32_model, train_loader, calib_batches=64):
+    """Static post-training quantization, FX graph mode, fbgemm. CPU int8."""
+    from torch.ao.quantization import get_default_qconfig, QConfigMapping
+    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
+    torch.backends.quantized.engine = 'fbgemm'
+    m = copy.deepcopy(fp32_model).cpu().eval()
+    qconfig = get_default_qconfig('fbgemm')
+    qmap = QConfigMapping().set_global(qconfig)
+    example = torch.randn(1, 540, 20)
+    prepared = prepare_fx(m, qmap, example_inputs=(example,))
+    prepared.eval()
+    with torch.no_grad():
+        for i, (bx, _) in enumerate(train_loader):
+            prepared(bx.cpu())
+            if i + 1 >= calib_batches:
+                break
+    return convert_fx(prepared)
+
+
+def qat(fp32_model, train_loader, val_loader, device, epochs=3, lr=2e-5):
+    """Quantization-aware training, FX graph mode, fbgemm. Fine-tune fake-quant from fp32, convert. 
CPU int8."""
+    from torch.ao.quantization import get_default_qat_qconfig, QConfigMapping
+    from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx
+    torch.backends.quantized.engine = 'fbgemm'
+    set_seed(SEED)
+    m = copy.deepcopy(fp32_model).to(device).train()
+    qconfig = get_default_qat_qconfig('fbgemm')
+    qmap = QConfigMapping().set_global(qconfig)
+    example = torch.randn(1, 540, 20).to(device)
+    prepared = prepare_qat_fx(m, qmap, example_inputs=(example,))
+    prepared.to(device)
+
+    criterion = PoseLoss(position_weight=1.0, bone_weight=0.2, loss_type='smooth_l1')
+    opt = torch.optim.AdamW(prepared.parameters(), lr=lr, weight_decay=5e-5, betas=(0.9, 0.999))
+
+    best_val = float('inf')
+    best_state = None
+    for ep in range(1, epochs + 1):
+        prepared.train()
+        t0 = time.time()
+        ep_loss, nb = 0.0, 0
+        for bx, by in train_loader:
+            bx, by = bx.to(device), by.to(device)
+            opt.zero_grad(set_to_none=True)
+            out = prepared(bx)
+            loss, _ = criterion(out, by)
+            if not torch.isfinite(loss):
+                continue
+            loss.backward()
+            opt.step()
+            ep_loss += loss.item()
+            nb += 1
+        # eval the fake-quant model on GPU (proxy for int8) to pick the best epoch
+        prepared.eval()
+        v = evaluate(prepared, val_loader, device)
+        print(f"[qat] epoch {ep}/{epochs} train_loss={ep_loss / max(nb,1):.5f} "
+              f"val_mpjpe(fakequant)={v['mpjpe']:.5f} val_pck20={v['pck@20']*100:.2f}% "
+              f"({time.time()-t0:.0f}s)", flush=True)
+        if v['mpjpe'] < best_val:
+            best_val = v['mpjpe']
+            best_state = copy.deepcopy(prepared.state_dict())
+    if best_state is not None:
+        prepared.load_state_dict(best_state)
+    prepared.cpu().eval()
+    return convert_fx(prepared)
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--mode', choices=['ptq', 'qat', 'both'], default='both')
+    ap.add_argument('--qat-epochs', type=int, default=3)
+    ap.add_argument('--calib-batches', type=int, default=64)
+    args = ap.parse_args()
+    os.makedirs(OUTDIR, exist_ok=True)
+
+    cuda = torch.device('cuda')
+    cpu = 
torch.device('cpu')
+    print(f"torch {torch.__version__} | cuda {torch.cuda.get_device_name(0)} | "
+          f"quantized.engine candidates {torch.backends.quantized.supported_engines}", flush=True)
+
+    dataset, train_loader, val_loader, test_loader = loaders()
+    test_full, test_clean = eval_loaders(dataset, test_loader)
+
+    # ---------- fp32 baseline (loads half_best.pth strict; same arch as sweep) ----------
+    fp32 = build_half().eval()
+    state = torch.load(HALF_CKPT, map_location='cpu', weights_only=True)
+    fp32.load_state_dict(state, strict=True)
+    fp32_size = file_size_mb(HALF_CKPT)
+    params = describe(fp32)['params']
+    print(f"\n=== fp32 baseline: half_best.pth | params={params:,} | "
+          f"on-disk={fp32_size:.3f} MB ===", flush=True)
+
+    results = {
+        'host': os.uname().nodename, 'gpu': torch.cuda.get_device_name(0),
+        'torch': torch.__version__, 'date_utc': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
+        'locked_normalization': 'torso-diameter (neck idx2 -> pelvis idx12), '
+                                'upstream calculate_pck use_torso_norm=True (ADR-173 standard)',
+        'checkpoint': HALF_CKPT, 'params': params, 'fp32_size_mb': fp32_size,
+        'test_split': 'seed-42 file-level 70/15/15 test (full 54000 / clean 52560)',
+        'fp32': {}, 'int8': {},
+    }
+
+    fp32_gpu = build_half().to(cuda).eval()
+    fp32_gpu.load_state_dict(state, strict=True)
+    print('[fp32/gpu] full ...', flush=True)
+    results['fp32']['gpu_full'] = evaluate(fp32_gpu, test_full, cuda)
+    print(json.dumps(results['fp32']['gpu_full']), flush=True)
+    print('[fp32/gpu] clean ...', flush=True)
+    results['fp32']['gpu_clean'] = evaluate(fp32_gpu, test_clean, cuda)
+    print(json.dumps(results['fp32']['gpu_clean']), flush=True)
+
+    print('[fp32/cpu] full (device-matched ref for int8) ...', flush=True)
+    results['fp32']['cpu_full'] = evaluate(fp32.to(cpu), test_full, cpu)
+    print(json.dumps(results['fp32']['cpu_full']), flush=True)
+    print('[fp32/cpu] clean ...', flush=True)
+    results['fp32']['cpu_clean'] = evaluate(fp32.to(cpu), test_clean,
cpu)
+    print(json.dumps(results['fp32']['cpu_clean']), flush=True)
+
+    # ---------- int8 ----------
+    def measure_int8(label, qmodel):
+        path = os.path.join(OUTDIR, f'half_int8_{label}.pth')
+        size = state_dict_size_mb(qmodel, path)
+        print(f"[int8/{label}] on-disk={size:.3f} MB | full ...", flush=True)
+        full = evaluate(qmodel, test_full, cpu)
+        print(json.dumps(full), flush=True)
+        print(f"[int8/{label}] clean ...", flush=True)
+        clean = evaluate(qmodel, test_clean, cpu)
+        print(json.dumps(clean), flush=True)
+        results['int8'][label] = {'size_mb': size, 'checkpoint': path,
+                                  'cpu_full': full, 'cpu_clean': clean}
+
+    if args.mode in ('ptq', 'both'):
+        print("\n=== int8 PTQ (static, FX, fbgemm) ===", flush=True)
+        qp = ptq_static(fp32.to(cpu).eval(), train_loader, calib_batches=args.calib_batches)
+        measure_int8('ptq_static', qp)
+
+    if args.mode in ('qat', 'both'):
+        print(f"\n=== int8 QAT (FX, fbgemm, {args.qat_epochs} epochs from half_best) ===", flush=True)
+        qq = qat(fp32, train_loader, val_loader, cuda, epochs=args.qat_epochs)
+        measure_int8('qat', qq)
+
+    out = os.path.join(OUTDIR, 'int8_results.json')
+    with open(out, 'w') as f:
+        json.dump(results, f, indent=2)
+    print('\nwrote', out, flush=True)
+
+    # ---------- comparison table (MEASURED) ----------
+    print("\n================= MEASURED COMPARISON (clean test subset, torso-PCK) =================", flush=True)
+    base = results['fp32']['cpu_clean']
+    print(f"{'model':16s} {'size_MB':>8s} {'pck@20':>8s} {'pck@50':>8s} {'mpjpe':>9s}", flush=True)
+    print(f"{'fp32 (cpu)':16s} {fp32_size:8.3f} {base['pck@20']*100:7.2f}% {base['pck@50']*100:7.2f}% {base['mpjpe']:9.6f}", flush=True)
+    for label, r in results['int8'].items():
+        c = r['cpu_clean']
+        d20 = (c['pck@20'] - base['pck@20']) * 100
+        d50 = (c['pck@50'] - base['pck@50']) * 100
+        print(f"{'int8 '+label:16s} {r['size_mb']:8.3f} {c['pck@20']*100:7.2f}% {c['pck@50']*100:7.2f}% {c['mpjpe']:9.6f} "
+              f"(d_pck20={d20:+.2f}pp d_pck50={d50:+.2f}pp
size={fp32_size/r['size_mb']:.2f}x smaller)", flush=True)
+
+
+if __name__ == '__main__':
+    main()