Commit Graph

1 Commit

Author SHA1 Message Date
Claude 7b6a18a059
Add ANE int8/int4 quantization probe
Probe whether the Apple Neural Engine (ANE) executes quantized ops
natively (a faster int8-int8 compute path) or just dequantizes to fp16
at load time.

Tests 5 approaches at transformer-representative dimensions:
1. FP16 baseline conv (baked weights)
2. INT8 via constexpr_affine_dequantize (per-channel scale+zp)
3. UINT4 via constexpr_affine_dequantize (per-channel)
4. UINT4 via constexpr_blockwise_shift_scale (block_size=32)
5. 4-bit palettized via constexpr_lut_to_dense (16-entry LUT)
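
For reference, the arithmetic behind approaches 2 to 5 can be sketched in plain Python. This is only an illustration of what each dequantization scheme computes, not code from quant_probe.m. The function names, the list-of-lists weight layout, and the example values are assumptions made for this sketch. The formulas (scale * (q - zero_point) for the affine and blockwise cases, a table lookup for the palettized case) follow the usual definitions of these MIL ops as I understand them.

```python
# Illustrative sketch only: reference arithmetic for the dequantization
# schemes listed above. Names and data layout are hypothetical.

def affine_dequantize(q_rows, scales, zero_points):
    """Per-output-channel affine dequantization: w = scale * (q - zero_point).

    q_rows holds one list of integer weights per output channel;
    scales and zero_points hold one entry per output channel.
    """
    return [
        [s * (q - zp) for q in row]
        for row, s, zp in zip(q_rows, scales, zero_points)
    ]

def blockwise_dequantize(q_row, block_scales, block_offsets, block_size=32):
    """Blockwise variant: each run of block_size weights shares a scale/offset."""
    out = []
    for i, q in enumerate(q_row):
        b = i // block_size  # index of the block this weight belongs to
        out.append(block_scales[b] * (q - block_offsets[b]))
    return out

def lut_to_dense(indices, lut):
    """4-bit palettization: each weight is an index into a 16-entry table."""
    assert len(lut) == 16, "a 4-bit palette has exactly 16 entries"
    return [lut[i] for i in indices]

# Example: one int8-style channel with scale 0.5 and zero point 128.
weights = affine_dequantize([[0, 128, 255]], scales=[0.5], zero_points=[128])
```

In every scheme the stored weights are small integers and the output is a dense float tensor, which is why a dequant-only implementation would save memory without changing the compute path.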

Each test compiles MIL → ANE kernel, benchmarks 100 evals, and reports
TFLOPS. If int8 shows ~2x the fp16 TFLOPS, the ANE has native int8
compute. If the TFLOPS match, it's dequant-only (still useful for
memory savings).
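
The decision rule above can be sketched as follows. This is a minimal illustration, not the probe's actual reporting code. It assumes a 1x1 convolution (2 * C_in * C_out * spatial FLOPs per eval). The function names, the 1.5x cutoff used to separate "about 2x" from "about the same", and the example dimensions and timings are all hypothetical.

```python
# Illustrative sketch only: throughput and classification logic for the
# benchmark described above. Thresholds and names are assumptions.

def conv1x1_flops(c_in, c_out, spatial):
    """FLOPs for one 1x1 conv eval: one multiply and one add per MAC."""
    return 2 * c_in * c_out * spatial

def tflops(flops_per_eval, n_evals, elapsed_s):
    """Sustained throughput in TFLOPS over n_evals timed evaluations."""
    return flops_per_eval * n_evals / elapsed_s / 1e12

def classify(fp16_tflops, quant_tflops, native_threshold=1.5):
    """Label a quantized run relative to the fp16 baseline.

    A ratio near 2x suggests a native low-precision compute path; a ratio
    near 1x suggests weights are dequantized to fp16 before compute.
    native_threshold is an arbitrary midpoint chosen for this sketch.
    """
    ratio = quant_tflops / fp16_tflops
    return "native" if ratio >= native_threshold else "dequant-only"

# Hypothetical numbers: 100 evals of a 1024 -> 1024 channel conv over 64 positions.
flops = conv1x1_flops(1024, 1024, 64)
baseline = tflops(flops, n_evals=100, elapsed_s=0.010)
quantized = tflops(flops, n_evals=100, elapsed_s=0.0098)
verdict = classify(baseline, quantized)
```

With the made-up timings above the two runs are within a few percent of each other, so the sketch would label the quantized path dequant-only.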

Build: xcrun clang -O2 -fobjc-arc -o quant_probe quant_probe.m \
       -framework Foundation -framework IOSurface -ldl

https://claude.ai/code/session_01U5HLjsm4iUzL9iDaHbxeRB
2026-03-03 01:02:05 +00:00