wifi-densepose/docs/adr/ADR-175-int8-quantization-h...

9.3 KiB
Raw Blame History

ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off

Field Value
Status Accepted — MEASURED, reproducible (honest negative)
Date 2026-06-15
Deciders ruv
Codename EDGE-INT8
Sub-deliverable 8.2 of the benchmark/optimization milestone
Metric lock ADR-173 (one declared PCK normalization for every reported number)
Motivated by docs/research/sota-nn-train-benchmark-brief.md (§edge int8)

Context

The SOTA brief characterized the int8 edge story for the WiFlow-STD pose net as "fully characterized" for PTQ on the published 2.23M model (static QDQ conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named QAT-int8 on the strictly-dominating 843,834-param "half" model as "the one untested edge lever." This ADR is the reading of that lever — a MEASURED fp32-vs-int8 trade-off for the half model, not a claim.

The half model (half_best.pth, 843,834 params) is the efficiency-sweep winner from ADR-152 (run_sweep.py VARIANTS[0]: tcn=[270,220,170,120], conv=[4,8,16,32], attn_groups=4). Its fp32 accuracy was recorded in the sweep; this ADR re-measures it under the locked normalization and quantizes it.

The whole point of this deliverable is reproducibility. Every number below was produced by running v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py on host ruvultra (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and the real seed-42 test split. The script + the exact command + the recorded stdout is the proof artifact. Nothing here is estimated.

Decision

Quantize the half model to int8 with both levers and report both honestly:

  1. QAT (primary target) — FX graph-mode quantization-aware training, fbgemm backend, 3 epochs of fake-quant fine-tuning from half_best.pth (AdamW lr 2e-5, the existing PoseLoss), then convert_fx to a true int8 graph.
  2. PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback) — FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.

Locked normalization (ADR-173)

Torso-diameter PCK — neck (keypoint idx 2) → pelvis (idx 12) distance — the standard MM-Fi/GraphPose-Fi convention. This is exactly the default use_torso_norm=True path of the upstream harness's utils/metrics.calculate_pck. The same calculate_pck/calculate_mpjpe that produced the sweep's fp32 numbers scores both fp32 and int8 here, so the comparison is metric-locked: no normalization is mixed, and the fp32 baseline reproduces the sweep's recorded half test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is the same one.

Device note (why int8 is CPU)

PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA. So int8 eval is CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32 with CPU-vs-GPU), the script measures an fp32-CPU baseline too. fp32-CPU and fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs 0.96623), so CPU/GPU introduces no drift — the int8 deltas below are pure quantization effect.

MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)

Source: stdout of the run below + ~/wiflow-std-bench/sweep/int8/int8_results.json.

model quant size (MB) PCK@20 PCK@50 MPJPE Δ PCK@20 Δ PCK@50 size win
fp32 (cpu) 3.351 96.62% 99.47% 0.008981 1.00×
int8 PTQ static PTQ 1.046 40.98% 94.98% 0.038262 55.64 pp 4.49 pp 3.20× smaller
int8 QAT (3 ep) QAT 1.043 67.48% 98.69% 0.026548 29.15 pp 0.78 pp 3.21× smaller

Full-test-set (54,000 windows incl. NaN-zero-filled files 487499) tracks the clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape, recorded in the JSON.

Verdict

int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.

  • PTQ static collapses (55.64 pp PCK@20). Naive static QDQ destroys the half model. The "sweet spot" characterization from the brief does not transfer from the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
  • QAT recovers a large share of the relative gap (PTQ 40.98% → QAT 67.48%) but still loses 29.15 pp at PCK@20 for a 3.21× size reduction. At the loose PCK@50 threshold QAT is nearly lossless (0.78 pp), i.e. coarse-localization survives int8 but fine-localization does not.
  • The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but 3.2× compression at 29 pp PCK@20 is a bad trade when the half model already fits comfortably in edge flash at fp32. Recommendation: keep fp32 (or fp16) for the half model on the edge; do not ship this int8 variant as-is.

Observed fake-quant → int8 conversion gap (disclosed, not hidden)

During QAT the fake-quant model's val PCK@20 reached 83.45% (epoch 3), but the converted int8 model scores 67.48% on test. A ~16 pp drop on convert_fx is a real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant simulation (per-tensor activation quant + the axial-attention einsum/softmax path quantize worse than the straight-through estimate predicts). This gap is the honest reason QAT did not close the loss, and it is exactly the kind of number that would be invisible if one only reported the fake-quant proxy. We report the converted int8 number as the deliverable, not the fake-quant proxy.

Reproduction

ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
  python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
  • Script (committed): v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py (scp'd to ~/quantize_half_int8.py on ruvultra for the run).
  • Inputs (on ruvultra, unmodified): ~/wiflow-std-bench/sweep/half_best.pth, ~/wiflow-std-bench/preprocessed_csi_data/ (seed-42 file-level 70/15/15 split), upstream models/dataset/utils/metrics/losses (DY2434/WiFlow @ 06899d29, Apache-2.0), and sweep/model_compact.py (the half-model definition).
  • Outputs (written, non-destructive): ~/wiflow-std-bench/sweep/int8/half_int8_qat.pth, half_int8_ptq_static.pth, int8_results.json, int8_run.log. No existing file under ~/wiflow-std-bench was modified.
  • Run metadata: host ruvultra, GPU RTX 5080, torch 2.11.0+cu128, fbgemm engine, date_utc 2026-06-15T12:35:06Z, QAT ≈ 97 s/epoch.

What is MEASURED vs CLAIMED

  • MEASURED: every PCK/MPJPE/size number in the table; the fp32 baseline (which reproduces the recorded sweep half numbers); the PTQ collapse; the QAT partial recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
  • CLAIMED / not done here: ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo) latency or energy (int8 here is measured on x86 fbgemm, the dev box, not an edge SoC — the size number transfers, a latency number does not); a per-layer mixed-precision search that might keep the attention block in fp32; QAT beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next levers if int8 is revisited; none is asserted as a result.

Honest scope / limitations

  • Single eval split — one seed-42 file-level test partition; no cross-room / cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is a separate, harder split and is not what is measured here).
  • In-domain only — these are in-distribution test numbers; they say nothing about the cross-environment robustness gap.
  • x86 int8, not edge-SoC int8 — accuracy and size transfer to an edge int8 runtime; the runtime/latency does not (different kernels, different SoC). No latency claim is made.
  • QAT lightly tuned — 3 epochs, single LR, default fbgemm qconfig. A longer / better-tuned QAT might narrow the 29 pp, but on the evidence here int8 does not reach fp32 at PCK@20, and that is the reportable result today.

Consequences

Positive

  • The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The edge int8 question for the half model is answered with reproducible numbers: at the strict PCK@20 target it loses, and we can say so with a committed script.
  • Establishes a reusable, metric-locked quantization+eval harness (quantize_half_int8.py) for any future int8 attempt on these compact variants.

Negative

  • None to the codebase (additive script + ADR + CHANGELOG only; no production Rust or signal-pipeline change; Python deterministic proof hash f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a unchanged).

Neutral

  • The negative verdict means the half model stays fp32/fp16 on the edge for now. int8 for these compact pose nets is parked pending the next-lever work above.
  • ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
  • ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep (produced half_best.pth)
  • docs/research/sota-nn-train-benchmark-brief.md — §edge int8 (the "one untested lever" this ADR measures)
  • Script: v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py