feat(bench): int8 quantization of WiFlow-STD half pose model — MEASURED trade-off (ADR-175, honest negative) (#1095)
Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the
843,834-param "half" WiFlow-STD pose model (half_best.pth) to int8 two ways and
MEASURES the accuracy/size trade-off vs fp32 under ONE locked normalization
(ADR-173 torso-diameter PCK, upstream calculate_pck use_torso_norm=True), on the
same seed-42 file-level 70/15/15 test split that produced the fp32 sweep numbers.

MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):

  fp32             96.62% pck@20  99.47% pck@50  0.008981 mpjpe  3.351 MB
  int8 PTQ static  40.98% pck@20  94.98% pck@50  0.038262 mpjpe  1.046 MB  (-55.64pp)
  int8 QAT (3 ep)  67.48% pck@20  98.69% pck@50  0.026548 mpjpe  1.043 MB  (-29.15pp)

Verdict (honest no): int8 is NOT a win at the strict PCK@20 edge target. Static
PTQ collapses; QAT recovers a large share but still loses 29 pp @20 for a 3.2x
size win; keep fp32/fp16 on the edge. Disclosed: QAT fake-quant val pck@20 was
83.45% but converted int8 scores 67.48% (~16pp convert_fx gap, reported honestly).

Deliverables:
- v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py (reproducible:
  header carries the exact ssh command + run date; QAT primary, static PTQ fallback)
- docs/adr/ADR-175-int8-quantization-half-pose-model-measured.md (MEASURED table,
  locked normalization, QAT-vs-PTQ labeling, verdict, reproduction, limitations)
- CHANGELOG [Unreleased] ### Added entry

No production Rust or signal-pipeline change. Python deterministic proof unchanged
(f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a, bit-exact).
parent b209b8b778
commit 0f64d23516
@ -14,6 +14,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
|
|||
- **`homecore-recorder` security review (ADR-132 surfaces) — two real bounding fixes; SQL-injection & NaN-index dimensions confirmed clean with evidence.** Beyond-SOTA review of the HA-compat state recorder (DB persistence + history + ruvector semantic search), the crux being its DB-backed SQL-injection surface. **Findings + fixes:** (1) **Memory-DoS — unbounded `get_state_history`.** The history query carried no `LIMIT`, so a wide `[since, until]` window over a high-frequency entity (a per-second sensor ≈ 86k rows/day) would load an unbounded row set into a single in-memory `Vec`. Added a hard `LIMIT MAX_HISTORY_ROWS` (1,000,000 — generous enough never to truncate a realistic history graph, bounded enough to cap the worst case); the sibling search paths were already `k`-bounded. (2) **Disk-DoS / documented-but-missing `purge`.** The README + HA-compat table advertised `Recorder::purge(older_than)` as a capability, but **no such method existed** — i.e. no retention path at all → unbounded disk growth. Implemented a **transactional** `purge` that deletes `states` + `events` strictly **older than** the cutoff (**exclusive** boundary — idempotent, no off-by-one; a row at the cutoff instant is kept) and **garbage-collects** orphaned `state_attributes` blobs (a dedup-shared blob is dropped only once its last referencing state is gone); all three deletes run in one transaction so a mid-purge failure rolls back cleanly (no states-deleted-but-events-kept corruption). **Confirmed clean with evidence:** SQL injection — **every** query in `db.rs` uses bound `?` parameters (no `format!`/string-concat of user data into SQL); the lone `format!` builds the LIKE *pattern*, which is itself bound as a parameter with `ESCAPE '\\'` and metacharacter escaping. Pinned: a state value `'; DROP TABLE states; --` is stored/queried **literally** (table survives), and a `%`/`_` in a search query matches **literally**, not as a wildcard. 
NaN-index poisoning (the calibration/vitals/geo class) — **structurally impossible** here: embeddings are SHA-256 → `i32` → `f32` (an `i32` cast to `f32` is always finite, never NaN/Inf), with an all-zero-digest norm guard; probed empty-index search, empty-string query, and `k=0` — all return `Ok(0)`, **no panic**. Fail-closed write path — a removal event yields `Ok(None)`, semantic-index failure is logged not propagated (best-effort, never blocks the durable SQLite write), and `EntityId` parsing failures fall back rather than panic. **6 new pinning tests** (SQL-injection literal-storage, LIKE-metacharacter literalness, history `LIMIT`, purge exclusive-boundary, purge attribute-GC-keeps-shared, purge old-events): `homecore-recorder` **19 → 25** (`--no-default-features`) / **25 → 31** (`--features ruvector`), 0 failed; the purge-boundary test is a true pin (fails deleting 2 rows under an inclusive cutoff, passes deleting 1 under the exclusive cutoff). Behaviour otherwise unchanged; Python deterministic proof unchanged (recorder is off the signal proof path).
|
||||
|
||||
### Added
|
||||
- **ADR-175: int8 quantization of the WiFlow-STD "half" pose model — MEASURED fp32-vs-int8 accuracy/size trade-off (honest negative).** Sub-deliverable 8.2 of the benchmark/optimization milestone, and the reading of the SOTA brief's "one untested edge lever" (QAT-int8 on the 843,834-param half model that strictly dominates the published 2.23M model). A new committed script `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py` quantizes `half_best.pth` to int8 two ways and scores both with the **same** upstream `calculate_pck`/`calculate_mpjpe` that produced the fp32 sweep numbers, under **one locked normalization** (ADR-173 torso-diameter PCK — neck idx2→pelvis idx12, `use_torso_norm=True`, the standard MM-Fi/GraphPose-Fi convention), on the **same** seed-42 file-level 70/15/15 test split (52,560 NaN-free / 54,000 full windows). **MEASURED on ruvultra (RTX 5080, torch 2.11.0+cu128, fbgemm; clean test, torso-PCK):** fp32 = 96.62% PCK@20 / 99.47% PCK@50 / 0.008981 MPJPE / 3.351 MB (fp32-CPU reproduces fp32-GPU to 4 dp, so the int8 deltas are pure quantization, not CPU/GPU drift); **int8 static PTQ = 40.98% PCK@20 (−55.64 pp), 1.046 MB** — naive static QDQ **collapses** on this model (the brief's 2.23M "sweet spot" does NOT transfer to the 843k half model at the tight @20 threshold); **int8 QAT (3-epoch FX fake-quant fine-tune from half_best) = 67.48% PCK@20 (−29.15 pp) / 98.69% PCK@50 (−0.78 pp), 1.043 MB.** **Verdict (honest no):** int8 is **not a win** at the strict PCK@20 edge target — QAT recovers a large share of the PTQ collapse and is near-lossless at the loose PCK@50 (coarse localization survives int8, fine does not), but a **3.2× size win at −29 pp PCK@20** is a bad trade when the half model already fits edge flash at fp32 → **keep fp32/fp16 on the edge for now.** **Disclosed gap:** the QAT *fake-quant* val PCK@20 reached 83.45% but the *converted* int8 model scores 67.48% — a real ~16 pp `convert_fx` gap (fbgemm int8 kernels ≠ straight-through 
estimate, esp. the axial-attention einsum/softmax); we report the converted-int8 number, not the fake-quant proxy. **MEASURED:** every table number + the PTQ collapse + the QAT partial recovery + the conversion gap. **CLAIMED/not done:** ONNX/TFLite export, on-edge-SoC latency/energy (int8 measured on x86 fbgemm — size transfers, latency does NOT), mixed-precision keeping attention fp32, longer/better-tuned QAT. **Honest limitations:** single in-domain eval split (no cross-environment split), x86-int8 not edge-SoC-int8, lightly-tuned QAT. Additive only — no production Rust or signal-pipeline change; Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
|
||||
- **Metric-locked PCK/MPJPE accuracy harness — resolves the PCK-definition ambiguity (`wifi-densepose-train`, needs ADR slot 173).** The SOTA brief (`docs/research/sota-nn-train-benchmark-brief.md` §1, §3.1, §4) found the single biggest threat to any "beyond-SOTA" claim is **metric ambiguity**: three PCK@20 figures (96.09% WiFlow-STD image-normalized, 81.63% AetherArena torso-PCK, 61.1% GraphPose-Fi standard PCK) cannot be lined up because each silently uses a different normalization — the project was retracted twice over this (a withdrawn "92.9%" used *absolute* pixels, not torso). New `src/accuracy.rs` makes the normalizer **explicit, selectable, and carried with every reported number**: a `PckNormalization` enum (`TorsoDiameter` = standard MM-Fi/GraphPose-Fi hip↔hip; `BoundingBoxDiagonal` = looser WiFlow-STD image-normalized; `AbsolutePixels(threshold)` = the retracted convention, included so historical numbers are reproducible and clearly labeled non-comparable); one canonical `pck_at(pred, gt, vis, k, normalization)` reusing the `metrics_core` geometric primitives (hip distance, bbox diagonal — no duplicate kernel); `mpjpe(pred, gt, vis)` (2D/3D, mm); and a self-describing `PoseAccuracy { pck_at: BTreeMap<u8,f32>, mpjpe, normalization, n_keypoints, n_frames }` returned by `accuracy_report(frames, ks, normalization)` so an **unlabeled PCK number is structurally impossible**. **17 hand-computed deterministic tests** (no GPU, no datasets) prove the harness arithmetic: perfect→PCK=1.0/MPJPE=0; all-just-outside→0.0; half-in-half-out→0.5; the **key proof** that identical predictions score 0.50 (torso) / 1.00 (bbox) / 0.75 (abs) under the three normalizations (the ambiguity is real and the definitions are distinct); MPJPE 2D/3D fixtures; and graceful degenerate handling (zero torso, empty frames, NaN coords — no panic, never a false-perfect). 
**This is measurement infrastructure, not an accuracy claim** — the tests prove the harness is correct, not that any model is good. `wifi-densepose-train` lib 191→206, `test_metrics` 12→14, 0 failed. Python deterministic proof unchanged (off the signal proof path).
|
||||
- **CI bench-regression guard (`.github/workflows/bench-regression.yml`) — wires the v2/ criterion benches into CI as a real, hard-failing COMPILE-VERIFY gate + an informational fast-run; caught + fixed one already-bit-rotted bench (benchmark/optimization milestone sub-deliverable 8.3; needs ADR slot 174).** The v2/ workspace ships **26 criterion benches across 18 crates** (e.g. `nvsim/pipeline_throughput`, `wifi-densepose-ruvector/{ann,sketch,fusion}_bench`, `wifi-densepose-signal/{signal,dsp_perf,features,calibration,aether_prefilter,cir}_bench`, `wifi-densepose-mat/detection_bench`, `wifi-densepose-nn/{inference,native_conv,onnx}_bench`, `wifi-densepose-engine/engine_cycle`, …) but, because benches are **not** part of `cargo test`, nothing in CI compiled them — so they silently rot when a public API they call changes. **Proof this matters (MEASURED):** running the new gate on the current tree immediately caught `wifi-densepose-mat/detection_bench` failing to compile (`E0063: missing field last_rssi in initializer of SensorPosition` — the struct gained a field, the bench was never updated); fixed in this change (`last_rssi: None`, the simulated-zone convention) and re-verified (`cargo bench -p wifi-densepose-mat --no-default-features --bench detection_bench --no-run` → `Finished`, Executable produced). 
**HONEST SCOPE — what gates vs what is informational:** (1) `bench-compile` (HARD GATE) runs `cargo bench --workspace --no-default-features --no-run` (compile + link every default-feature bench, no measurement) plus a `--features cir` compile of the gated `cir_bench` — a deterministic, real regression guard against bench bit-rot; (2) `bench-fast-run` (INFORMATIONAL, `continue-on-error: true`, NEVER gates) runs a curated pure-CPU subset (`nvsim/pipeline_throughput`, `ruvector/{sketch,fusion}_bench`) in criterion quick-mode (1s warm-up / 2s measure / 10 samples), targeted per-`--bench` (the crates' libtest lib targets reject criterion flags), and uploads the logs as an artifact. **No timing-regression gate, by design and stated in the workflow header:** wall-clock on shared GitHub runners varies 2-3x run-to-run, so a hard threshold or a cross-runner `criterion --baseline` compare would manufacture false failures; that becomes honest only on a frequency-pinned self-hosted runner (documented as the re-add condition). The `crv`-gated `ruvector/crv_bench` is deliberately NOT compiled by the gate because its crates.io dep `ruvector-crv 0.1.1` currently fails to build on stable (upstream E0308 in its own `stage_iii.rs`) — noted in-workflow with the re-add condition. Checkout is `submodules: recursive` (the workspace path-deps `vendor/rufield`) and installs the Tauri/GTK dev libs like `ci.yml`'s rust-tests job (a `--workspace` bench link pulls the whole graph). **MEASURED locally (Windows, `--no-default-features`):** `nvsim`, `wifi-densepose-ruvector` (sketch/fusion/ann), `wifi-densepose-signal/cir_bench`, `wifi-densepose-mat/detection_bench` (post-fix), `wifi-densepose-vitals/vitals_bench`, and `ruview-swarm/swarm_bench` all compile + the fast subset runs (sample baseline: `nvsim pipeline_run/d1/256` ≈ 55 µs, `d16/1024` ≈ 315 µs; `ruvector sketch_hamming` ≈ 3-7 ns vs `float_l2` ≈ 63-371 ns). 
The full `--workspace` `--no-run` could **not** be fully validated on Windows (Tauri-`desktop` needs GTK, `candle-core` fails on MSVC, `swarm_bench` LTO-links OOM under parallel pressure) — those are Windows-env artifacts that build in the Linux CI runner (each affected bench was confirmed to compile standalone here). No baseline JSON is committed (a cross-runner baseline would be dishonest). Python deterministic proof unchanged (`f8e76f21…46f7a`, bit-exact — off the signal proof path).
|
||||
- **RuField `rufield-viewer` live-ingest mode — closes the RuView↔RuField visual loop (ADR-262 surfaces).** The dashboard gains `--source live --upstream <RuView-URL>`: it consumes RuView's `/ws/field` SSE (falling back to polling `/api/field`), **verifies every event's ed25519 provenance receipt on ingest** (`is_fusable`) — forged/tampered events are flagged ✗ and **never fused** into trusted inferences — and renders real RuView `FieldEvent`s through the same room-state/privacy-badge/fusion-graph/receipt path the synthetic mode uses (wire-compatible by construction: both sides use `rufield_core::FieldEvent` serde). **Strict banner honesty:** a single `BannerState` shows `SYNTHETIC` / `LIVE — <upstream>` / `DISCONNECTED — <upstream> unreachable`, mutually exclusive — never SYNTHETIC while showing live data or vice versa; live mode returns **409** on `/api/run` rather than fabricate a synthetic run, and starts DISCONNECTED until first verified contact. Default stays synthetic. 26 tests / 0 failed. `ruvnet/rufield` `crates/rufield-viewer`; `vendor/rufield` submodule bumped.
|
||||
|
|
|
|||
|
|
@ -0,0 +1,172 @@
|
|||
# ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
|
||||
|
||||
| Field | Value |
|-------|-------|
| **Status** | Accepted: MEASURED, reproducible (honest negative) |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **EDGE-INT8** |
| **Sub-deliverable** | 8.2 of the benchmark/optimization milestone |
| **Metric lock** | ADR-173 (one declared PCK normalization for every reported number) |
| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (§edge int8) |
|
||||
|
||||
## Context
|
||||
|
||||
The SOTA brief described the int8 edge story for the WiFlow-STD pose net as
"fully characterized" for PTQ on the **published 2.23M** model (static QDQ
conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named
**QAT-int8 on the strictly-dominating 843,834-param "half" model** as "the one
untested edge lever." This ADR takes the reading on that lever: a MEASURED
fp32-vs-int8 trade-off for the half model, not a claim.
|
||||
|
||||
The half model (`half_best.pth`, 843,834 params) is the efficiency-sweep winner
|
||||
from ADR-152 (`run_sweep.py` VARIANTS[0]: `tcn=[270,220,170,120]`,
|
||||
`conv=[4,8,16,32]`, `attn_groups=4`). Its fp32 accuracy was recorded in the sweep;
|
||||
this ADR re-measures it under the locked normalization and quantizes it.
|
||||
|
||||
**The whole point of this deliverable is reproducibility.** Every number below was
produced by running `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
on host `ruvultra` (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
the real seed-42 test split. Together, the script, the exact command, and the
recorded stdout **are** the proof artifact. Nothing here is estimated.
|
||||
|
||||
## Decision
|
||||
|
||||
Quantize the half model to int8 with **both** levers and report both honestly:
|
||||
|
||||
1. **QAT (primary target):** FX graph-mode quantization-aware training, fbgemm
   backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5,
   the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
2. **PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback):**
   FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
|
||||
|
||||
### Locked normalization (ADR-173)
|
||||
|
||||
**Torso-diameter PCK** — neck (keypoint idx 2) → pelvis (idx 12) distance — the
|
||||
standard MM-Fi/GraphPose-Fi convention. This is exactly the default
|
||||
`use_torso_norm=True` path of the upstream harness's `utils/metrics.calculate_pck`.
|
||||
The **same** `calculate_pck`/`calculate_mpjpe` that produced the sweep's fp32
|
||||
numbers scores **both** fp32 and int8 here, so the comparison is metric-locked: no
|
||||
normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
|
||||
`half` test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
|
||||
the same one.
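
To make the locked normalization concrete, here is a minimal sketch of torso-normalized PCK in plain Python. It is an illustration only, not the upstream `calculate_pck`: the function name `torso_pck`, the list-of-tuples input layout, and the choice to skip zero-torso frames are all assumptions made for this example. Only the neck (idx 2) and pelvis (idx 12) indices and the fractional thresholds come from this ADR and the script.

```python
import math

NECK, PELVIS = 2, 12  # keypoint indices named in this ADR


def torso_pck(pred, gt, threshold):
    """Fraction of keypoints whose error is within threshold * torso length.

    pred, gt: lists of frames; each frame is a list of (x, y) keypoints.
    Hypothetical helper for illustration; NOT the upstream calculate_pck.
    """
    hits, total = 0, 0
    for p_frame, g_frame in zip(pred, gt):
        # Normalizer: ground-truth neck -> pelvis distance for this frame.
        torso = math.dist(g_frame[NECK], g_frame[PELVIS])
        if torso == 0.0:
            continue  # assumption: skip degenerate frames instead of dividing by zero
        for p, g in zip(p_frame, g_frame):
            hits += math.dist(p, g) <= threshold * torso
            total += 1
    return hits / total if total else 0.0


# Toy frame: 13 keypoints on a line, so the torso length is 10 units.
gt_frame = [(float(i), 0.0) for i in range(13)]
# First 6 keypoints are off by 1 unit, the remaining 7 by 3 units.
pred_frame = [(x, y + (1.0 if i < 6 else 3.0)) for i, (x, y) in enumerate(gt_frame)]

pck20 = torso_pck([pred_frame], [gt_frame], 0.2)  # radius = 2 units
pck50 = torso_pck([pred_frame], [gt_frame], 0.5)  # radius = 5 units
print(f"PCK@20={pck20:.4f} PCK@50={pck50:.4f}")
```

In the toy frame the same predictions pass the loose threshold everywhere but fail the tight one for the keypoints that are 3 units off, which is the same tight-versus-loose pattern the results below report for int8.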
|
||||
|
||||
### Device note (why int8 is CPU)
|
||||
|
||||
PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA, so int8
evaluation runs on CPU. To keep the accuracy delta device-matched (and avoid
confounding int8-vs-fp32 with CPU-vs-GPU), the script also measures an **fp32-CPU**
baseline. fp32-CPU and fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs
0.96623), so CPU/GPU introduces no drift, and the int8 deltas below are pure
quantization effect.
|
||||
|
||||
## MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
|
||||
|
||||
Source: stdout of the run below + `~/wiflow-std-bench/sweep/int8/int8_results.json`.
|
||||
|
||||
| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
|-------|-------|-----------|--------|--------|-------|----------|----------|----------|
| **fp32** (cpu) | n/a | **3.351** | **96.62%** | **99.47%** | **0.008981** | n/a | n/a | 1.00× |
| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | **−55.64 pp** | −4.49 pp | 3.20× smaller |
| int8 QAT (3 ep) | **QAT** | 1.043 | 67.48% | 98.69% | 0.026548 | **−29.15 pp** | −0.78 pp | 3.21× smaller |
|
||||
|
||||
Full-test-set (54,000 windows incl. NaN-zero-filled files 487–499) tracks the
|
||||
clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape,
|
||||
recorded in the JSON.
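
The Δ and size-win columns can be re-derived from the other columns. The sketch below does that arithmetic on the rounded table values; the `rows` dictionary and the `summarize` helper are hypothetical names for this example and are not part of `quantize_half_int8.py`. One caveat: from the rounded percentages the QAT PCK@20 delta comes out as −29.14 pp, 0.01 pp off the reported −29.15 pp, which was presumably computed from unrounded scores (an assumption; the JSON is the authority).

```python
# Values copied from the MEASURED table (clean subset, torso-PCK).
rows = {
    "fp32": {"size_mb": 3.351, "pck20": 96.62, "pck50": 99.47},
    "int8 PTQ static": {"size_mb": 1.046, "pck20": 40.98, "pck50": 94.98},
    "int8 QAT (3 ep)": {"size_mb": 1.043, "pck20": 67.48, "pck50": 98.69},
}


def summarize(rows, baseline="fp32"):
    """Per-variant PCK deltas (percentage points) and size ratio vs the baseline."""
    base = rows[baseline]
    out = {}
    for name, r in rows.items():
        if name == baseline:
            continue
        out[name] = {
            "d_pck20": round(r["pck20"] - base["pck20"], 2),
            "d_pck50": round(r["pck50"] - base["pck50"], 2),
            "size_win": round(base["size_mb"] / r["size_mb"], 2),
        }
    return out


summary = summarize(rows)
for name, s in summary.items():
    print(f"{name}: {s['d_pck20']:+.2f} pp @20, {s['d_pck50']:+.2f} pp @50, "
          f"{s['size_win']:.2f}x smaller")
```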
|
||||
|
||||
### Verdict
|
||||
|
||||
**int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.**
|
||||
|
||||
- **PTQ static collapses** (−55.64 pp PCK@20). Naive static QDQ destroys the half
  model. The "sweet spot" characterization from the brief does not transfer from
  the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
- **QAT recovers a large share of the relative gap** (PTQ 40.98% → QAT 67.48%) but
  still **loses 29.15 pp** at PCK@20 for a 3.21× size reduction. At the loose
  PCK@50 threshold QAT is nearly lossless (−0.78 pp), i.e. coarse-localization
  survives int8 but fine-localization does not.
- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but
  **3.2× compression at −29 pp PCK@20 is a bad trade** when the half model already
  fits comfortably in edge flash at fp32. Recommendation: **keep fp32 (or fp16)
  for the half model on the edge**; do not ship this int8 variant as-is.
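
As a rough cross-check on the size numbers, the sketch below computes the weights-only lower bound for 843,834 parameters at 4 bytes (fp32) and 1 byte (int8), using the same 1024×1024 divisor as the script's `file_size_mb`. The helper name `ideal_size_mb` is hypothetical. The attribution of the remaining on-disk bytes is an assumption, not something measured here: checkpoint serialization overhead, non-parameter buffers, quantization scales and zero-points, and any tensors left in fp32 would all plausibly explain why the measured ratio is about 3.2× rather than the ideal 4×.

```python
PARAMS = 843_834     # parameter count of the half model, from this ADR
MIB = 1024 * 1024    # same divisor as file_size_mb in the script


def ideal_size_mb(n_params, bytes_per_param):
    """Weights-only size, ignoring all checkpoint and quantization overhead."""
    return n_params * bytes_per_param / MIB


fp32_ideal = ideal_size_mb(PARAMS, 4)
int8_ideal = ideal_size_mb(PARAMS, 1)

# Measured on-disk sizes from the table, for comparison.
fp32_measured, int8_qat_measured = 3.351, 1.043

print(f"fp32: ideal {fp32_ideal:.3f} MB vs measured {fp32_measured} MB")
print(f"int8: ideal {int8_ideal:.3f} MB vs measured {int8_qat_measured} MB")
print(f"ideal ratio {fp32_ideal / int8_ideal:.2f}x, "
      f"measured ratio {fp32_measured / int8_qat_measured:.2f}x")
```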
|
||||
|
||||
### Observed fake-quant → int8 conversion gap (disclosed, not hidden)
|
||||
|
||||
During QAT the **fake-quant** model's val PCK@20 reached 83.45% (epoch 3), but the
|
||||
**converted int8** model scores 67.48% on test. A ~16 pp drop on `convert_fx` is a
|
||||
real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant
|
||||
simulation (per-tensor activation quant + the axial-attention `einsum`/softmax path
|
||||
quantize worse than the straight-through estimate predicts). This gap is the honest
|
||||
reason QAT did not close the loss, and it is exactly the kind of number that would
|
||||
be invisible if one only reported the fake-quant proxy. We report the **converted
|
||||
int8** number as the deliverable, not the fake-quant proxy.
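
For readers unfamiliar with the term, "fake-quant" means a float tensor is quantized to the int8 grid and immediately dequantized, so training sees the rounding error while all arithmetic stays in floating point. The sketch below shows that round trip for a generic per-tensor affine scheme in plain Python. It is an assumption-laden illustration: the names `quant_params` and `fake_quant` are hypothetical, and this is neither the fbgemm observer nor the fbgemm int8 kernels. It does not reproduce the conversion gap; it only shows what the fake-quant proxy simulates.

```python
def quant_params(xs, qmin=-128, qmax=127):
    """Per-tensor affine scale and zero-point covering min(xs)..max(xs) and 0."""
    lo, hi = min(min(xs), 0.0), max(max(xs), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def fake_quant(x, scale, zero_point, qmin=-128, qmax=127):
    """Quantize to the int8 grid, then dequantize back to float."""
    q = max(qmin, min(qmax, round(x / scale) + zero_point))
    return (q - zero_point) * scale


xs = [-1.0, -0.31, 0.0, 0.42, 2.0]
scale, zp = quant_params(xs)
roundtrip = [fake_quant(x, scale, zp) for x in xs]
max_err = max(abs(a - b) for a, b in zip(xs, roundtrip))
print(f"scale={scale:.6f} zero_point={zp} max_err={max_err:.6f}")
```

For in-range values the round-trip error is bounded by half a quantization step. A converted model additionally runs integer kernels with requantization between ops, which this float simulation does not model.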
|
||||
|
||||
## Reproduction
|
||||
|
||||
```bash
ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
  python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
```
|
||||
|
||||
- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
|
||||
(scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
|
||||
- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`,
|
||||
`~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split),
|
||||
upstream `models`/`dataset`/`utils/metrics`/`losses` (DY2434/WiFlow @ 06899d29,
|
||||
Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
|
||||
- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/` —
|
||||
`half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`,
|
||||
`int8_run.log`. **No existing file under `~/wiflow-std-bench` was modified.**
|
||||
- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine,
|
||||
`date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
|
||||
|
||||
## What is MEASURED vs CLAIMED
|
||||
|
||||
- **MEASURED:** every PCK/MPJPE/size number in the table; the fp32 baseline (which
|
||||
reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial
|
||||
recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
|
||||
- **CLAIMED / not done here:** ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo)
|
||||
latency or energy (int8 here is measured on x86 fbgemm, the dev box, **not** an
|
||||
edge SoC — the size number transfers, a latency number does **not**); a
|
||||
per-layer mixed-precision search that might keep the attention block in fp32; QAT
|
||||
beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next
|
||||
levers if int8 is revisited; none is asserted as a result.
|
||||
|
||||
## Honest scope / limitations
|
||||
|
||||
- **Single eval split** — one seed-42 file-level test partition; no cross-room /
|
||||
cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is
|
||||
a separate, harder split and is not what is measured here).
|
||||
- **In-domain only** — these are in-distribution test numbers; they say nothing
|
||||
about the cross-environment robustness gap.
|
||||
- **x86 int8, not edge-SoC int8** — accuracy and size transfer to an edge int8
|
||||
runtime; the runtime/latency does not (different kernels, different SoC). No
|
||||
latency claim is made.
|
||||
- **QAT lightly tuned** — 3 epochs, single LR, default fbgemm qconfig. A longer /
|
||||
better-tuned QAT might narrow the −29 pp, but on the evidence here int8 does not
|
||||
reach fp32 at PCK@20, and that is the reportable result today.
|
||||
|
||||
## Consequences
|
||||
|
||||
### Positive
|
||||
- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The
|
||||
edge int8 question for the half model is answered with reproducible numbers: at
|
||||
the strict PCK@20 target it loses, and we can say so with a committed script.
|
||||
- Establishes a reusable, metric-locked quantization+eval harness
|
||||
(`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
|
||||
|
||||
### Negative
|
||||
- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
|
||||
or signal-pipeline change; Python deterministic proof hash
|
||||
`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
|
||||
|
||||
### Neutral
|
||||
- The negative verdict means the half model stays fp32/fp16 on the edge for now.
|
||||
int8 for these compact pose nets is parked pending the next-lever work above.
|
||||
|
||||
## Links
|
||||
- ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
|
||||
- ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep
|
||||
(produced `half_best.pth`)
|
||||
- `docs/research/sota-nn-train-benchmark-brief.md` — §edge int8 (the "one untested
|
||||
lever" this ADR measures)
|
||||
- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
|
||||
|
|
@ -0,0 +1,294 @@
|
|||
#!/usr/bin/env python3
"""ADR-175: int8 quantization of the WiFlow-STD "half" pose model + MEASURED accuracy/size trade-off.

Sub-deliverable 8.2 of the benchmark/optimization milestone. Quantizes the 843,834-param
"half" WiFlow-STD pose model to int8 (QAT primary, static-PTQ fallback) and MEASURES the
accuracy delta against the fp32 baseline under ONE locked PCK normalization.

LOCKED NORMALIZATION (ADR-173): torso-diameter PCK: neck(idx 2)->pelvis(idx 12) distance,
exactly the default `use_torso_norm=True` path of upstream `utils/metrics.calculate_pck`,
which is the standard MM-Fi/GraphPose-Fi convention. The SAME `calculate_pck` /
`calculate_mpjpe` from the upstream harness scores BOTH fp32 and int8 so the comparison is
metric-locked. The test split is the seed-42 file-level 70/15/15 test partition (54,000
windows full / 52,560 NaN-free) produced by the SAME loader that produced half_best.pth.

int8 backend: FX graph-mode quantization, fbgemm engine (server x86 int8). Quantized int8
kernels execute on CPU, so int8 eval is CPU; an fp32-CPU baseline is also measured so the
accuracy delta is device-matched (CPU fp32 vs CPU int8), and an fp32-GPU number is reported
for continuity with the sweep's recorded numbers.

REPRODUCE (exact command run for ADR-175, run date 2026-06-15, on host ruvultra / RTX 5080):
    ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
        python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'

(the script lives in-repo at v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py;
it was scp'd to ~/quantize_half_int8.py on ruvultra and invoked as above. It is read-only
to everything under ~/wiflow-std-bench except that it WRITES its int8 artifacts + a JSON
results file into ~/wiflow-std-bench/sweep/int8/; it never modifies half_best.pth or any
upstream file.)

Everything this script prints to stdout is MEASURED. Nothing is estimated.
"""
|
||||
import argparse
import copy
import json
import os
import random
import sys
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset

BENCH = os.path.expanduser('~/wiflow-std-bench')
SWEEP = os.path.join(BENCH, 'sweep')
OUTDIR = os.path.join(SWEEP, 'int8')
sys.path.insert(0, os.path.join(BENCH, 'upstream'))
sys.path.insert(0, SWEEP)

from dataset import (PreprocessedCSIKeypointsDataset,  # noqa: E402
                     create_preprocessed_train_val_test_loaders)
from losses.pose_loss import PoseLoss  # noqa: E402
from utils.metrics import calculate_pck, calculate_mpjpe  # noqa: E402  LOCKED metric (torso norm)
from model_compact import CompactWiFlowPoseModel, describe  # noqa: E402

# half variant config, IDENTICAL to sweep/run_sweep.py VARIANTS[0] that produced half_best.pth
HALF = dict(tcn=[270, 220, 170, 120], conv=[4, 8, 16, 32], attn_groups=4,
            groups_mode='gcd20', input_pw_groups=1)
HALF_CKPT = os.path.join(SWEEP, 'half_best.pth')
CORRUPT_FILE_START = 487  # files 487-499 were zero-filled by clean_nan.py (same as sweep)
SEED = 42
THRESHOLDS = (0.1, 0.2, 0.3, 0.4, 0.5)  # PCK@10..50
|
||||
|
||||
|
||||
def set_seed(seed=SEED):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
|
||||
|
||||
|
||||
def build_half(dropout=0.5):
    return CompactWiFlowPoseModel(
        tcn_channels=HALF['tcn'], conv_channels=HALF['conv'],
        attn_groups=HALF['attn_groups'], groups_mode=HALF['groups_mode'],
        input_pw_groups=HALF['input_pw_groups'], dropout=dropout)
|
||||
|
||||
|
||||
@torch.no_grad()
def evaluate(model, loader, device):
    """MEASURED PCK@10..50 + MPJPE under the LOCKED torso-diameter normalization."""
    model.eval()
    totals = {t: 0.0 for t in THRESHOLDS}
    total_mpe, n = 0.0, 0
    for bx, by in loader:
        bx, by = bx.to(device), by.to(device)
        out = model(bx)
        bs = by.size(0)
        # Weight each batch metric by its batch size so the final mean is per-sample.
        total_mpe += calculate_mpjpe(out, by) * bs
        pck = calculate_pck(out, by, thresholds=list(totals))  # use_torso_norm=True default
        for t in totals:
            totals[t] += pck[t] * bs
        n += bs
    return {'samples': n, 'mpjpe': total_mpe / n,
            **{f'pck@{int(t * 100)}': totals[t] / n for t in totals}}
|
||||
|
||||
|
||||
def file_size_mb(path):
    return os.path.getsize(path) / (1024 * 1024)


def state_dict_size_mb(model, path):
    """On-disk size of the *quantized* checkpoint (int8 weights are packed by fbgemm)."""
    torch.save(model.state_dict(), path)
    return file_size_mb(path)
|
||||
|
||||
|
||||
def loaders():
    set_seed(SEED)
    data_dir = os.path.join(BENCH, 'preprocessed_csi_data')
    dataset = PreprocessedCSIKeypointsDataset(data_dir=data_dir, keypoint_scale=1000.0,
                                              enable_temporal_clean=True)
    train_loader, val_loader, test_loader = create_preprocessed_train_val_test_loaders(
        dataset=dataset, batch_size=64, num_workers=2, random_seed=SEED)
    return dataset, train_loader, val_loader, test_loader
|
||||
|
||||
|
||||
def clean_loader_from(dataset, test_loader, bs=256):
    w2f = dataset.window_to_file
    clean_idx = [i for i in test_loader.dataset.indices if w2f[i] < CORRUPT_FILE_START]
    return DataLoader(Subset(dataset, clean_idx), batch_size=bs, shuffle=False, num_workers=2)


def eval_loaders(dataset, test_loader, bs=256):
    full = DataLoader(test_loader.dataset, batch_size=bs, shuffle=False, num_workers=2)
    clean = clean_loader_from(dataset, test_loader, bs=bs)
    return full, clean
|
||||
|
||||
|
||||
# --------------------------------------------------------------- int8 paths (FX graph mode)
def ptq_static(fp32_model, train_loader, calib_batches=64):
    """Static post-training quantization, FX graph mode, fbgemm. CPU int8."""
    from torch.ao.quantization import get_default_qconfig, QConfigMapping
    from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx
    torch.backends.quantized.engine = 'fbgemm'
    m = copy.deepcopy(fp32_model).cpu().eval()
    qconfig = get_default_qconfig('fbgemm')
    qmap = QConfigMapping().set_global(qconfig)
    example = torch.randn(1, 540, 20)
    prepared = prepare_fx(m, qmap, example_inputs=(example,))
    prepared.eval()
    # Calibrate the inserted observers on the first calib_batches train batches.
    with torch.no_grad():
        for i, (bx, _) in enumerate(train_loader):
            prepared(bx.cpu())
            if i + 1 >= calib_batches:
                break
    return convert_fx(prepared)
|
||||
|
||||
|
||||
def qat(fp32_model, train_loader, val_loader, device, epochs=3, lr=2e-5):
    """Quantization-aware training, FX graph mode, fbgemm. Fine-tune fake-quant from fp32, convert. CPU int8."""
    from torch.ao.quantization import get_default_qat_qconfig, QConfigMapping
    from torch.ao.quantization.quantize_fx import prepare_qat_fx, convert_fx
    torch.backends.quantized.engine = 'fbgemm'
    set_seed(SEED)
    m = copy.deepcopy(fp32_model).to(device).train()
    qconfig = get_default_qat_qconfig('fbgemm')
    qmap = QConfigMapping().set_global(qconfig)
    example = torch.randn(1, 540, 20).to(device)
    prepared = prepare_qat_fx(m, qmap, example_inputs=(example,))
    prepared.to(device)

    criterion = PoseLoss(position_weight=1.0, bone_weight=0.2, loss_type='smooth_l1')
    opt = torch.optim.AdamW(prepared.parameters(), lr=lr, weight_decay=5e-5, betas=(0.9, 0.999))

    best_val = float('inf')
    best_state = None
    for ep in range(1, epochs + 1):
        prepared.train()
        t0 = time.time()
        ep_loss, nb = 0.0, 0
        for bx, by in train_loader:
            bx, by = bx.to(device), by.to(device)
            opt.zero_grad(set_to_none=True)
            out = prepared(bx)
            loss, _ = criterion(out, by)
            if not torch.isfinite(loss):
                continue
            loss.backward()
            opt.step()
            ep_loss += loss.item()
            nb += 1
        # eval the fake-quant model on GPU (proxy for int8) to pick the best epoch
        prepared.eval()
        v = evaluate(prepared, val_loader, device)
        print(f"[qat] epoch {ep}/{epochs} train_loss={ep_loss / max(nb,1):.5f} "
              f"val_mpjpe(fakequant)={v['mpjpe']:.5f} val_pck20={v['pck@20']*100:.2f}% "
              f"({time.time()-t0:.0f}s)", flush=True)
        if v['mpjpe'] < best_val:
            best_val = v['mpjpe']
            best_state = copy.deepcopy(prepared.state_dict())
    if best_state is not None:
        prepared.load_state_dict(best_state)
    prepared.cpu().eval()
    return convert_fx(prepared)


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('--mode', choices=['ptq', 'qat', 'both'], default='both')
    ap.add_argument('--qat-epochs', type=int, default=3)
    ap.add_argument('--calib-batches', type=int, default=64)
    args = ap.parse_args()
    os.makedirs(OUTDIR, exist_ok=True)

    cuda = torch.device('cuda')
    cpu = torch.device('cpu')
    print(f"torch {torch.__version__} | cuda {torch.cuda.get_device_name(0)} | "
          f"quantized.engine candidates {torch.backends.quantized.supported_engines}", flush=True)

    dataset, train_loader, val_loader, test_loader = loaders()
    test_full, test_clean = eval_loaders(dataset, test_loader)

    # ---------- fp32 baseline (loads half_best.pth strict; same arch as sweep) ----------
    fp32 = build_half().eval()
    state = torch.load(HALF_CKPT, map_location='cpu', weights_only=True)
    fp32.load_state_dict(state, strict=True)
    fp32_size = file_size_mb(HALF_CKPT)
    params = describe(fp32)['params']
    print(f"\n=== fp32 baseline: half_best.pth | params={params:,} | "
          f"on-disk={fp32_size:.3f} MB ===", flush=True)

    results = {
        'host': os.uname().nodename, 'gpu': torch.cuda.get_device_name(0),
        'torch': torch.__version__, 'date_utc': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'locked_normalization': 'torso-diameter (neck idx2 -> pelvis idx12), '
                                'upstream calculate_pck use_torso_norm=True (ADR-173 standard)',
        'checkpoint': HALF_CKPT, 'params': params, 'fp32_size_mb': fp32_size,
        'test_split': 'seed-42 file-level 70/15/15 test (full 54000 / clean 52560)',
        'fp32': {}, 'int8': {},
    }

    fp32_gpu = build_half().to(cuda).eval()
    fp32_gpu.load_state_dict(state, strict=True)
    print('[fp32/gpu] full ...', flush=True)
    results['fp32']['gpu_full'] = evaluate(fp32_gpu, test_full, cuda)
    print(json.dumps(results['fp32']['gpu_full']), flush=True)
    print('[fp32/gpu] clean ...', flush=True)
    results['fp32']['gpu_clean'] = evaluate(fp32_gpu, test_clean, cuda)
    print(json.dumps(results['fp32']['gpu_clean']), flush=True)

    print('[fp32/cpu] full (device-matched ref for int8) ...', flush=True)
    results['fp32']['cpu_full'] = evaluate(fp32.to(cpu), test_full, cpu)
    print(json.dumps(results['fp32']['cpu_full']), flush=True)
    print('[fp32/cpu] clean ...', flush=True)
    results['fp32']['cpu_clean'] = evaluate(fp32.to(cpu), test_clean, cpu)
    print(json.dumps(results['fp32']['cpu_clean']), flush=True)

    # ---------- int8 ----------
    def measure_int8(label, qmodel):
        path = os.path.join(OUTDIR, f'half_int8_{label}.pth')
        size = state_dict_size_mb(qmodel, path)
        print(f"[int8/{label}] on-disk={size:.3f} MB | full ...", flush=True)
        full = evaluate(qmodel, test_full, cpu)
        print(json.dumps(full), flush=True)
        print(f"[int8/{label}] clean ...", flush=True)
        clean = evaluate(qmodel, test_clean, cpu)
        print(json.dumps(clean), flush=True)
        results['int8'][label] = {'size_mb': size, 'checkpoint': path,
                                  'cpu_full': full, 'cpu_clean': clean}

    if args.mode in ('ptq', 'both'):
        print("\n=== int8 PTQ (static, FX, fbgemm) ===", flush=True)
        qp = ptq_static(fp32.to(cpu).eval(), train_loader, calib_batches=args.calib_batches)
        measure_int8('ptq_static', qp)

    if args.mode in ('qat', 'both'):
        print(f"\n=== int8 QAT (FX, fbgemm, {args.qat_epochs} epochs from half_best) ===", flush=True)
        qq = qat(fp32, train_loader, val_loader, cuda, epochs=args.qat_epochs)
        measure_int8('qat', qq)

    out = os.path.join(OUTDIR, 'int8_results.json')
    with open(out, 'w') as f:
        json.dump(results, f, indent=2)
    print('\nwrote', out, flush=True)

    # ---------- comparison table (MEASURED) ----------
    print("\n================= MEASURED COMPARISON (clean test subset, torso-PCK) =================", flush=True)
    base = results['fp32']['cpu_clean']
    print(f"{'model':16s} {'size_MB':>8s} {'pck@20':>8s} {'pck@50':>8s} {'mpjpe':>9s}", flush=True)
    print(f"{'fp32 (cpu)':16s} {fp32_size:8.3f} {base['pck@20']*100:7.2f}% {base['pck@50']*100:7.2f}% {base['mpjpe']:9.6f}", flush=True)
    for label, r in results['int8'].items():
        c = r['cpu_clean']
        d20 = (c['pck@20'] - base['pck@20']) * 100
        d50 = (c['pck@50'] - base['pck@50']) * 100
        print(f"{'int8 '+label:16s} {r['size_mb']:8.3f} {c['pck@20']*100:7.2f}% {c['pck@50']*100:7.2f}% {c['mpjpe']:9.6f} "
              f"(d_pck20={d20:+.2f}pp d_pck50={d50:+.2f}pp size={fp32_size/r['size_mb']:.2f}x smaller)", flush=True)


if __name__ == '__main__':
    main()