wifi-densepose/docs/adr/ADR-175-int8-quantization-h...

173 lines
9.3 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-175: int8 Quantization of the WiFlow-STD "half" Pose Model — MEASURED accuracy/size trade-off
| Field | Value |
|-------|-------|
| **Status** | Accepted — MEASURED, reproducible (honest negative) |
| **Date** | 2026-06-15 |
| **Deciders** | ruv |
| **Codename** | **EDGE-INT8** |
| **Sub-deliverable** | 8.2 of the benchmark/optimization milestone |
| **Metric lock** | ADR-173 (one declared PCK normalization for every reported number) |
| **Motivated by** | `docs/research/sota-nn-train-benchmark-brief.md` (§edge int8) |
## Context
The SOTA brief characterized the int8 edge story for the WiFlow-STD pose net as
"fully characterized" for PTQ on the **published 2.23M** model (static QDQ
conv-only = the sweet spot; dynamic int8 ≈ no-op on this all-conv net), and named
**QAT-int8 on the strictly-dominating 843,834-param "half" model** as "the one
untested edge lever." This ADR is the reading of that lever — a MEASURED
fp32-vs-int8 trade-off for the half model, not a claim.
The half model (`half_best.pth`, 843,834 params) is the efficiency-sweep winner
from ADR-152 (`run_sweep.py` VARIANTS[0]: `tcn=[270,220,170,120]`,
`conv=[4,8,16,32]`, `attn_groups=4`). Its fp32 accuracy was recorded in the sweep;
this ADR re-measures it under the locked normalization and quantizes it.
**The whole point of this deliverable is reproducibility.** Every number below was
produced by running `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
on host `ruvultra` (RTX 5080, torch 2.11.0+cu128) against the real checkpoint and
the real seed-42 test split. The script + the exact command + the recorded stdout
**is** the proof artifact. Nothing here is estimated.
## Decision
Quantize the half model to int8 with **both** levers and report both honestly:
1. **QAT (primary target)** — FX graph-mode quantization-aware training, fbgemm
backend, 3 epochs of fake-quant fine-tuning from `half_best.pth` (AdamW lr 2e-5,
the existing `PoseLoss`), then `convert_fx` to a true int8 graph.
2. **PTQ static QDQ (the brief's "sweet spot", measured as the honest fallback)**
FX graph-mode static PTQ, fbgemm, calibrated on 64 train batches.
### Locked normalization (ADR-173)
**Torso-diameter PCK** — neck (keypoint idx 2) → pelvis (idx 12) distance — the
standard MM-Fi/GraphPose-Fi convention. This is exactly the default
`use_torso_norm=True` path of the upstream harness's `utils/metrics.calculate_pck`.
The **same** `calculate_pck`/`calculate_mpjpe` that produced the sweep's fp32
numbers scores **both** fp32 and int8 here, so the comparison is metric-locked: no
normalization is mixed, and the fp32 baseline reproduces the sweep's recorded
`half` test numbers bit-for-bit (PCK@20 clean = 96.62%), confirming the harness is
the same one.
### Device note (why int8 is CPU)
PyTorch int8 quantized kernels execute on CPU (fbgemm/x86), not CUDA. So int8 eval
is CPU. To keep the accuracy delta device-matched (not confounding int8-vs-fp32
with CPU-vs-GPU), the script measures an **fp32-CPU** baseline too. fp32-CPU and
fp32-GPU agree to 4 decimals (PCK@20 clean 0.96623 vs 0.96623), so CPU/GPU
introduces no drift — the int8 deltas below are pure quantization effect.
## MEASURED results (clean test subset = 52,560 NaN-free windows; torso-PCK)
Source: stdout of the run below + `~/wiflow-std-bench/sweep/int8/int8_results.json`.
| model | quant | size (MB) | PCK@20 | PCK@50 | MPJPE | Δ PCK@20 | Δ PCK@50 | size win |
|-------|-------|-----------|--------|--------|-------|----------|----------|----------|
| **fp32** (cpu) | — | **3.351** | **96.62%** | **99.47%** | **0.008981** | — | — | 1.00× |
| int8 PTQ static | PTQ | 1.046 | 40.98% | 94.98% | 0.038262 | **55.64 pp** | 4.49 pp | 3.20× smaller |
| int8 QAT (3 ep) | **QAT** | 1.043 | 67.48% | 98.69% | 0.026548 | **29.15 pp** | 0.78 pp | 3.21× smaller |
Full-test-set (54,000 windows incl. NaN-zero-filled files 487499) tracks the
clean subset: fp32 96.10% / int8-PTQ 41.11% / int8-QAT 67.48% PCK@20 — same shape,
recorded in the JSON.
### Verdict
**int8 is NOT a win for this model at the tight PCK@20 edge target — honest no.**
- **PTQ static collapses** (55.64 pp PCK@20). Naive static QDQ destroys the half
model. The "sweet spot" characterization from the brief does not transfer from
the 2.23M model to this 843k model at the strict torso-PCK@20 threshold.
- **QAT recovers a large share of the relative gap** (PTQ 40.98% → QAT 67.48%) but
still **loses 29.15 pp** at PCK@20 for a 3.21× size reduction. At the loose
PCK@50 threshold QAT is nearly lossless (0.78 pp), i.e. coarse-localization
survives int8 but fine-localization does not.
- The size win is real and consistent (3.2× smaller, 3.351 MB → ~1.04 MB), but
**3.2× compression at 29 pp PCK@20 is a bad trade** when the half model already
fits comfortably in edge flash at fp32. Recommendation: **keep fp32 (or fp16)
for the half model on the edge**; do not ship this int8 variant as-is.
### Observed fake-quant → int8 conversion gap (disclosed, not hidden)
During QAT the **fake-quant** model's val PCK@20 reached 83.45% (epoch 3), but the
**converted int8** model scores 67.48% on test. A ~16 pp drop on `convert_fx` is a
real effect — the fbgemm int8 kernels are not bit-identical to the fake-quant
simulation (per-tensor activation quant + the axial-attention `einsum`/softmax path
quantize worse than the straight-through estimate predicts). This gap is the honest
reason QAT did not close the loss, and it is exactly the kind of number that would
be invisible if one only reported the fake-quant proxy. We report the **converted
int8** number as the deliverable, not the fake-quant proxy.
## Reproduction
```bash
ssh ruvultra 'cd ~/wiflow-std-bench && source venv/bin/activate && \
python ~/quantize_half_int8.py --mode both --qat-epochs 3 2>&1'
```
- Script (committed): `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`
(scp'd to `~/quantize_half_int8.py` on ruvultra for the run).
- Inputs (on ruvultra, unmodified): `~/wiflow-std-bench/sweep/half_best.pth`,
`~/wiflow-std-bench/preprocessed_csi_data/` (seed-42 file-level 70/15/15 split),
upstream `models`/`dataset`/`utils/metrics`/`losses` (DY2434/WiFlow @ 06899d29,
Apache-2.0), and `sweep/model_compact.py` (the half-model definition).
- Outputs (written, non-destructive): `~/wiflow-std-bench/sweep/int8/`
`half_int8_qat.pth`, `half_int8_ptq_static.pth`, `int8_results.json`,
`int8_run.log`. **No existing file under `~/wiflow-std-bench` was modified.**
- Run metadata: host `ruvultra`, GPU RTX 5080, torch `2.11.0+cu128`, fbgemm engine,
`date_utc 2026-06-15T12:35:06Z`, QAT ≈ 97 s/epoch.
## What is MEASURED vs CLAIMED
- **MEASURED:** every PCK/MPJPE/size number in the table; the fp32 baseline (which
reproduces the recorded sweep `half` numbers); the PTQ collapse; the QAT partial
recovery; the fake-quant→int8 conversion gap; the 3.2× size reduction.
- **CLAIMED / not done here:** ONNX/TFLite export; on-real-edge (ESP32/Pi/Hailo)
latency or energy (int8 here is measured on x86 fbgemm, the dev box, **not** an
edge SoC — the size number transfers, a latency number does **not**); a
per-layer mixed-precision search that might keep the attention block in fp32; QAT
beyond 3 epochs or with learned-quant-range schedules. Those are the obvious next
levers if int8 is revisited; none is asserted as a result.
## Honest scope / limitations
- **Single eval split** — one seed-42 file-level test partition; no cross-room /
cross-environment generalization split (the GraphPose-Fi frontier from ADR-173 is
a separate, harder split and is not what is measured here).
- **In-domain only** — these are in-distribution test numbers; they say nothing
about the cross-environment robustness gap.
- **x86 int8, not edge-SoC int8** — accuracy and size transfer to an edge int8
runtime; the runtime/latency does not (different kernels, different SoC). No
latency claim is made.
- **QAT lightly tuned** — 3 epochs, single LR, default fbgemm qconfig. A longer /
better-tuned QAT might narrow the 29 pp, but on the evidence here int8 does not
reach fp32 at PCK@20, and that is the reportable result today.
## Consequences
### Positive
- The "one untested edge lever" (QAT-int8 on the half model) is now MEASURED. The
edge int8 question for the half model is answered with reproducible numbers: at
the strict PCK@20 target it loses, and we can say so with a committed script.
- Establishes a reusable, metric-locked quantization+eval harness
(`quantize_half_int8.py`) for any future int8 attempt on these compact variants.
### Negative
- None to the codebase (additive script + ADR + CHANGELOG only; no production Rust
or signal-pipeline change; Python deterministic proof hash
`f8e76f21a0f9852b70b6d9dd5318239f6b20cbcb4cdd995863263cecdc446f7a` unchanged).
### Neutral
- The negative verdict means the half model stays fp32/fp16 on the edge for now.
int8 for these compact pose nets is parked pending the next-lever work above.
## Links
- ADR-173 — metric-locked PCK/MPJPE harness (the locked normalization used here)
- ADR-152 — WiFi-Pose SOTA 2026 intake / WiFlow-STD benchmark / efficiency sweep
(produced `half_best.pth`)
- `docs/research/sota-nn-train-benchmark-brief.md` — §edge int8 (the "one untested
lever" this ADR measures)
- Script: `v2/crates/wifi-densepose-train/scripts/quantize_half_int8.py`