berkus/ANE - ANE

Commit Graph


Author: Claude
SHA1: 7b6a18a059
Message:

Add ANE int8/int4 quantization probe

Probe whether Apple Neural Engine executes quantized ops natively
(faster int8-int8 compute path) or just dequantizes to fp16 at load time.

Tests 5 approaches at transformer-representative dimensions:
1. FP16 baseline conv (baked weights)
2. INT8 via constexpr_affine_dequantize (per-channel scale+zp)
3. UINT4 via constexpr_affine_dequantize (per-channel)
4. UINT4 via constexpr_blockwise_shift_scale (block_size=32)
5. 4-bit palettized via constexpr_lut_to_dense (16-entry LUT)
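
For orientation, the three weight encodings named in the list above can be sketched in plain Python. This is only an illustration of the arithmetic, not the probe's code and not the MIL ops themselves. It assumes the usual affine form `w = scale * (q - zero_point)` for both the per-channel and blockwise cases, and a plain table lookup for the palettized case. The function names and the tiny shapes are hypothetical.

```python
from typing import List


def affine_dequantize(q: List[List[int]], scale: List[float],
                      zero_point: List[int]) -> List[List[float]]:
    """Per-channel affine dequantization (assumed form).

    Row c of q is one output channel; it shares scale[c] and zero_point[c].
    """
    return [[scale[c] * (v - zero_point[c]) for v in row]
            for c, row in enumerate(q)]


def blockwise_dequantize(q_row: List[int], scales: List[float],
                         offsets: List[int],
                         block_size: int = 32) -> List[float]:
    """Blockwise variant (assumed form): every run of block_size
    consecutive weights shares one scale and one offset."""
    if len(q_row) != len(scales) * block_size:
        raise ValueError("row length must equal len(scales) * block_size")
    return [scales[i // block_size] * (v - offsets[i // block_size])
            for i, v in enumerate(q_row)]


def lut_to_dense(indices: List[List[int]],
                 lut: List[float]) -> List[List[float]]:
    """4-bit palettization: each stored value is an index (0..15)
    into a 16-entry lookup table of real-valued weights."""
    if len(lut) != 16:
        raise ValueError("a 4-bit palette has exactly 16 entries")
    return [[lut[i] for i in row] for row in indices]


if __name__ == "__main__":
    # Hypothetical 2-channel int8 example with per-channel scale and zero point.
    print(affine_dequantize([[-128, 0, 127], [0, 8, 15]], [0.5, 0.25], [0, 8]))
```

In all three cases the stored weights are small integers and the real-valued weights are recovered by a cheap transform. The question the probe asks is whether the ANE applies that transform once at load time or computes on the integers directly.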

Each test compiles MIL → ANE kernel, benchmarks 100 evals, reports
TFLOPS. If int8 shows ~2x fp16 TFLOPS, ANE has native int8 compute.
If same TFLOPS, it's dequant-only (still useful for memory savings).
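
The decision rule above can be written out as a small calculation. The sketch below is an assumption-laden illustration, not the probe's code: it assumes the benchmarked op is a 1x1 convolution (so one eval costs 2 * C_in * C_out * spatial FLOPs), and the 1.5x cutoff between "native" and "dequant-only" is an arbitrary midpoint chosen for the example. All names and dimensions are hypothetical.

```python
def conv1x1_flops(c_in: int, c_out: int, spatial: int) -> int:
    """FLOPs for one 1x1 conv eval: one multiply and one add per
    (input channel, output channel, spatial position) triple."""
    return 2 * c_in * c_out * spatial


def tflops(flops_per_eval: int, n_evals: int, elapsed_s: float) -> float:
    """Sustained throughput in TFLOPS over n_evals timed evaluations."""
    return flops_per_eval * n_evals / elapsed_s / 1e12


def classify(fp16_tflops: float, quant_tflops: float,
             native_threshold: float = 1.5) -> str:
    """Apply the rule from the message: roughly 2x the fp16 rate suggests
    a native integer path, roughly 1x suggests dequantize-to-fp16.
    native_threshold is a hypothetical cutoff between the two."""
    ratio = quant_tflops / fp16_tflops
    return "native" if ratio >= native_threshold else "dequant-only"


if __name__ == "__main__":
    # Hypothetical dimensions and timings, for illustration only.
    flops = conv1x1_flops(c_in=1024, c_out=1024, spatial=64)
    fp16 = tflops(flops, n_evals=100, elapsed_s=0.010)
    int8 = tflops(flops, n_evals=100, elapsed_s=0.0098)
    print(classify(fp16, int8))
```

A ratio that lands well between 1x and 2x would be ambiguous under this rule and would need a closer look, for example at whether the kernel is bandwidth-bound at the chosen dimensions.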

Build: xcrun clang -O2 -fobjc-arc -o quant_probe quant_probe.m \
       -framework Foundation -framework IOSurface -ldl

https://claude.ai/code/session_01U5HLjsm4iUzL9iDaHbxeRB

Date: 2026-03-03 01:02:05 +00:00

1 Commit