ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
| Field | Value |
|---|---|
| Status | Accepted — MEASURED, reproducible (honest negative) |
| Date | 2026-06-15 |
| Deciders | ruv |
| Codename | EDGE-INT8 |
| Sub-deliverable | 8.2 of the benchmark/optimization milestone |
| Metric lock | ADR-173 (one declared PCK normalization for every reported number) |
| Motivated by | docs/research/sota-nn-train-benchmark-brief.md (§edge int8) |
Context
The SOTA brief characterized the int8 edge story for the WiFlow-STD pose net as "fully characterized" for PTQ on the published 2.23M model (static QDQ conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named QAT-int8 on the strictly-dominating 843,834-param "half" model as "the one untested edge lever." This ADR is the reading of that lever — a MEASURED fp32-vs-int8 trade-off for the half model, not a claim.
The half model (half_best.pth, 843,834 params) is the efficiency-sweep winner
from ADR-152 (run_sweep.py VARIANTS[0]: tcn=[270,220,170,120],
conv=[4,8,16,32], attn_groups=4). Its fp32 accuracy was recorded in the sweep;
this ADR re-measures it under the locked normalization and quantizes it.
The whole point of this deliverable is reproducibility. Every number below was
produced by running v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py
on host ruvultra (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
the real seed-42 test split. The script + the exact command + the recorded stdout
is the proof artifact. Nothing here is estimated.
Decision
Quantize the half model to int8 with both levers and report both honestly:
- QAT (primary target): FX graph-mode quantization-aware training, fbgemm backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5, the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
- PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback): FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
Locked normalization (ADR-173)
Torso-diameter PCK — neck (keypoint idx 2) → pelvis (idx 12) distance — the
standard MM-Fi/GraphPose-Fi convention. This is exactly the default
use_torso_norm=True path of the upstream harness's utils/metrics.calculate_pck.
The same calculate_pck/calculate_mpjpe that produced the sweep's fp32
numbers scores both fp32 and int8 here, so the comparison is metric-locked: no
normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
half test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
the same one.
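As a minimal sketch of what torso-normalized PCK means, the function below counts a keypoint as correct when its error is at most a fraction of the ground-truth neck-to-pelvis distance. It is an illustration only, not the upstream `calculate_pck`: the 2D keypoints, the 15-keypoint toy pose, and the handling of zero-length torsos are assumptions made for the example.

```python
import math

NECK_IDX, PELVIS_IDX = 2, 12  # keypoint indices named in this ADR


def torso_pck(pred, gt, threshold=0.2):
    """Fraction of keypoints whose error is <= threshold * torso length.

    pred, gt: lists of poses; each pose is a list of (x, y) keypoints.
    threshold=0.2 corresponds to PCK@20, threshold=0.5 to PCK@50.
    """
    hits = total = 0
    for p_pose, g_pose in zip(pred, gt):
        torso = math.dist(g_pose[NECK_IDX], g_pose[PELVIS_IDX])
        if torso == 0:
            continue  # assumption: skip degenerate poses instead of dividing by zero
        for p, g in zip(p_pose, g_pose):
            hits += math.dist(p, g) / torso <= threshold
            total += 1
    return hits / total if total else float("nan")


# Hypothetical toy pose: 15 keypoints on a vertical line, torso length 10.
gt = [(0.0, float(i)) for i in range(15)]
# First 12 keypoints are off by 1.0 (10% of torso), last 3 by 3.0 (30% of torso).
pred = [(1.0 if i < 12 else 3.0, float(i)) for i in range(15)]

print(torso_pck([pred], [gt], threshold=0.2))  # 12 of 15 keypoints pass
print(torso_pck([pred], [gt], threshold=0.5))  # all keypoints pass
```

The toy case shows the same pattern the results report: errors that pass at the loose threshold can fail at the strict one.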
Device note (why int8 is CPU)
PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA, so int8 eval runs on CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32 with CPU-vs-GPU), the script also measures an fp32-CPU baseline. fp32-CPU and fp32-GPU agree to every reported decimal (PCK@20 clean 0.96623 vs 0.96623), so the CPU/GPU switch introduces no measurable drift and the int8 deltas below are attributable to quantization alone.
MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
Source: stdout of the run below + ~/wiflow-std-bench/sweep/int8/int8_results.json.
| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
|---|---|---|---|---|---|---|---|---|
| fp32 (cpu) | — | 3.351 | 96.62% | 99.47% | 0.008981 | — | — | 1.00× |
| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | −55.64 pp | −4.49 pp | 3.20× smaller |
| int8 QAT (3 ep) | QAT | 1.043 | 67.48% | 98.69% | 0.026548 | −29.15 pp | −0.78 pp | 3.21× smaller |
Full-test-set (54,000 windows incl. NaN-zero-filled files 487–499) tracks the clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape, recorded in the JSON.
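The delta and size-win columns can be re-derived from the clean-subset rows with a few lines of arithmetic. The sketch below uses only the rounded values printed in the table; the dictionary layout and the `compare` helper are hypothetical names introduced for this example and are not part of `quantize_half_int8.py`.

```python
# (size_mb, pck20_pct, pck50_pct) copied from the clean-subset table above.
rows = {
    "fp32": (3.351, 96.62, 99.47),
    "int8_ptq_static": (1.046, 40.98, 94.98),
    "int8_qat": (1.043, 67.48, 98.69),
}


def compare(name, base="fp32"):
    """Return accuracy deltas (percentage points) and size ratio vs the baseline."""
    size, p20, p50 = rows[name]
    b_size, b_p20, b_p50 = rows[base]
    return {
        "d_pck20_pp": round(p20 - b_p20, 2),
        "d_pck50_pp": round(p50 - b_p50, 2),
        "size_ratio": round(b_size / size, 2),
    }


for name in ("int8_ptq_static", "int8_qat"):
    print(name, compare(name))
```

From the rounded inputs, the QAT PCK@20 delta comes out as −29.14 pp, one hundredth away from the tabulated −29.15 pp. A difference of that size is what one would expect if the table's deltas were computed from unrounded values, though that is an assumption and not something stated in the run log. The other deltas and both size ratios match the table.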
Verdict
int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.
- PTQ static collapses (−55.64 pp PCK@20). Naive static QDQ destroys the half model. The "sweet spot" characterization from the brief does not transfer from the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
- QAT recovers 26.50 pp of the 55.64 pp PTQ loss, a little under half (PTQ 40.98% → QAT 67.48%), but still loses 29.15 pp at PCK@20 for a 3.21× size reduction. At the loose PCK@50 threshold QAT is nearly lossless (−0.78 pp), i.e. coarse localization survives int8 but fine localization does not.
- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but 3.2× compression at −29 pp PCK@20 is a bad trade when the half model already fits comfortably in edge flash at fp32. Recommendation: keep fp32 (or fp16) for the half model on the edge; do not ship this int8 variant as-is.
Observed fake-quant → int8 conversion gap (disclosed, not hidden)
During QAT the fake-quant model's val PCK@20 reached 83.45% (epoch 3), but the
converted int8 model scores 67.48% on test. A ~16 pp drop across convert_fx is a
real effect, with the caveat that the two numbers come from different splits
(val vs test). The fbgemm int8 kernels are not bit-identical to the fake-quant
simulation (per-tensor activation quant + the axial-attention einsum/softmax path
quantize worse than the straight-through estimate predicts). This gap is the honest
reason QAT did not close the loss, and it is exactly the kind of number that would
be invisible if one only reported the fake-quant proxy. We report the converted
int8 number as the deliverable, not the fake-quant proxy.
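For readers unfamiliar with per-tensor activation quantization, the sketch below shows the general mechanism in plain Python: a single scale is chosen for the whole tensor, so one large value coarsens the grid for every small value. It is a toy illustration under stated assumptions (min/max range selection, unsigned 8-bit, made-up activation values); it is not the fbgemm kernel or the observer used in the run, and it does not model the fake-quant-to-int8 discrepancy itself.

```python
def affine_qparams(values, qmin=0, qmax=255):
    """Per-tensor scale and zero point from the min/max of the values."""
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)  # range must contain 0
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero tensor
    zero_point = round(qmin - lo / scale)
    return scale, max(qmin, min(qmax, zero_point))


def fake_quant(values, scale, zero_point, qmin=0, qmax=255):
    """Quantize to the integer grid, clamp, and map back to floats."""
    out = []
    for v in values:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def max_error(values, calibration):
    """Largest round-trip error on `values` using qparams from `calibration`."""
    scale, zero_point = affine_qparams(calibration)
    restored = fake_quant(values, scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


# Hypothetical activations: four small values, optionally sharing a tensor
# with one large outlier.
clean = [0.01, 0.02, 0.03, 0.04]
with_outlier = clean + [8.0]

err_clean = max_error(clean, clean)
err_outlier = max_error(clean, with_outlier)
print(f"max error, no outlier:   {err_clean:.6f}")
print(f"max error, with outlier: {err_outlier:.6f}")
```

In the toy case the error on the small values grows by more than two orders of magnitude once the outlier sets the range, which is one reason a per-layer mixed-precision or per-channel scheme is a plausible next lever.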
Reproduction
ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py` (scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`, `~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split), upstream `models/dataset/utils/metrics/losses` (DY2434/WiFlow @ 06899d29, Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/`, containing `half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`, `int8_run.log`. No existing file under `~/wiflow-std-bench` was modified.
- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine, `date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
What is MEASURED vs CLAIMED
- MEASURED: every PCK/MPJPE/size number in the table; the fp32 baseline (which reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
- CLAIMED / not done here: ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo) latency or energy (int8 here is measured on x86 fbgemm, the dev box, not an edge SoC; the size number transfers, a latency number does not); a per-layer mixed-precision search that might keep the attention block in fp32; QAT beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next levers if int8 is revisited; none is asserted as a result.
Honest scope / limitations
- Single eval split — one seed-42 file-level test partition; no cross-room / cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is a separate, harder split and is not what is measured here).
- In-domain only — these are in-distribution test numbers; they say nothing about the cross-environment robustness gap.
- x86 int8, not edge-SoC int8 — accuracy and size transfer to an edge int8 runtime; the runtime/latency does not (different kernels, different SoC). No latency claim is made.
- QAT lightly tuned — 3 epochs, single LR, default fbgemm qconfig. A longer / better-tuned QAT might narrow the −29 pp, but on the evidence here int8 does not reach fp32 at PCK@20, and that is the reportable result today.
Consequences
Positive
- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The edge int8 question for the half model is answered with reproducible numbers: at the strict PCK@20 target it loses, and we can say so with a committed script.
- Establishes a reusable, metric-locked quantization+eval harness (`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
Negative
- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
or signal-pipeline change; Python deterministic proof hash
`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
Neutral
- The negative verdict means the half model stays fp32/fp16 on the edge for now. int8 for these compact pose nets is parked pending the next-lever work above.
Links
- ADR-173: metric-locked PCK/MPJPE harness (the locked normalization used here)
- ADR-152: WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep (produced `half_best.pth`)
- `docs/research/sota-nn-train-benchmark-brief.md`: §edge int8 (the "one untested lever" this ADR measures)
- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`