Claude
|
7b6a18a059
|
Add ANE int8/int4 quantization probe
Probe whether the Apple Neural Engine (ANE) executes quantized ops natively
(a faster int8 x int8 compute path) or only dequantizes weights to fp16 at load time.
Tests 5 approaches at transformer-representative dimensions:
1. FP16 baseline conv (baked weights)
2. INT8 via constexpr_affine_dequantize (per-channel scale+zp)
3. UINT4 via constexpr_affine_dequantize (per-channel)
4. UINT4 via constexpr_blockwise_shift_scale (block_size=32)
5. 4-bit palettized via constexpr_lut_to_dense (16-entry LUT)
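
For reference, the arithmetic behind approaches 2, 3 and 5 can be sketched in plain Python. This is only an illustration, not code from quant_probe.m. It assumes the usual affine form w = scale * (q - zero_point), applied per output channel, and a plain table lookup for the 16-entry LUT. The function names and sample values are hypothetical.

```python
def affine_dequantize(q_rows, scales, zero_points):
    """Per-channel affine dequantization (assumed form): w = scale * (q - zp).

    q_rows: one list of integer codes per output channel.
    scales, zero_points: one value per output channel.
    """
    return [
        [s * (q - zp) for q in row]
        for row, s, zp in zip(q_rows, scales, zero_points)
    ]


def lut_to_dense(index_rows, lut):
    """4-bit palettization: each stored code is an index into a 16-entry table."""
    if len(lut) != 16:
        raise ValueError("a 4-bit palette has exactly 16 entries")
    return [[lut[i] for i in row] for row in index_rows]


# Hypothetical sample values, chosen only to make the arithmetic easy to check.
uint8_weights = affine_dequantize([[0, 128, 255]], scales=[0.5], zero_points=[128])
palette = [i * 0.25 for i in range(16)]
palettized = lut_to_dense([[0, 15]], palette)
```

Either way the stored codes expand back to floating-point weights. What the probe asks is whether the ANE does this expansion once at load time or computes directly on the integer codes.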
Each test compiles MIL → ANE kernel, benchmarks 100 evals, and reports
TFLOPS. If int8 reaches ~2x the fp16 TFLOPS, the ANE has native int8 compute.
If TFLOPS match fp16, quantization is dequant-only (still useful for memory savings).
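
The decision rule above can be written out as a small sketch. It is not taken from quant_probe.m. It assumes a 1x1 convolution (2 FLOPs per multiply-add), and the 1.5x cutoff between "about 2x" and "same" is a hypothetical threshold chosen for illustration.

```python
def conv_tflops(c_in, c_out, spatial, seconds_per_eval):
    """TFLOPS of a 1x1 conv (assumption): 2 * C_in * C_out * spatial FLOPs per eval."""
    flops = 2.0 * c_in * c_out * spatial
    return flops / seconds_per_eval / 1e12


def classify(fp16_tflops, quant_tflops, threshold=1.5):
    """Interpret the speedup; the threshold value is a hypothetical cutoff."""
    ratio = quant_tflops / fp16_tflops
    return "native int8 compute" if ratio >= threshold else "dequant-only"
```

For example, `classify(10.0, 19.5)` returns "native int8 compute", while `classify(10.0, 10.2)` returns "dequant-only".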
Build: xcrun clang -O2 -fobjc-arc -o quant_probe quant_probe.m \
-framework Foundation -framework IOSurface -ldl
https://claude.ai/code/session_01U5HLjsm4iUzL9iDaHbxeRB
|
2026-03-03 01:02:05 +00:00 |