Key discovery: kernel compilation and evaluation can run in parallel via GCD (Grand Central Dispatch).
119 foreground evals completed during a 26.8 ms background compile.
Architecture:
- Two kernel sets (A/B) alternate active/pending
- Background GCD thread compiles the pending set while the active set keeps running evals
- Active and pending sets swap atomically at the batch boundary
- Eliminates 88% compilation bottleneck
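
Sketch of the A/B scheme above, for illustration only. This is not the code in train_double_buffer.m: it uses pthreads and C11 atomics in place of GCD so it builds anywhere, and KernelSet, DoubleBuffer, compile_pending and run_double_buffered are hypothetical names. The "compile" is a stand-in that only stamps a version number.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a compiled kernel set. */
typedef struct { int version; } KernelSet;

typedef struct {
    KernelSet sets[2];          /* set A and set B */
    int active;                 /* index of the set the foreground evaluates with */
    atomic_bool pending_ready;  /* raised by the background thread when done */
    int next_version;           /* version the next compile will produce */
} DoubleBuffer;

/* Background work: fill the pending set, then publish it. */
static void *compile_pending(void *arg) {
    DoubleBuffer *db = arg;
    KernelSet *pending = &db->sets[1 - db->active];
    pending->version = db->next_version;  /* the slow compile would go here */
    atomic_store_explicit(&db->pending_ready, true, memory_order_release);
    return NULL;
}

/* Evaluates on the active set while the pending set compiles, and swaps
   at a batch boundary. Returns the version of the active set after
   `swaps` swaps, or -1 if a thread could not be started. */
static int run_double_buffered(int swaps, long *evals_out) {
    assert(swaps >= 0);
    DoubleBuffer db = { .sets = {{0}, {0}}, .active = 0, .next_version = 1 };
    atomic_init(&db.pending_ready, false);
    long evals = 0;

    for (int i = 0; i < swaps; i++) {
        pthread_t compiler;
        if (pthread_create(&compiler, NULL, compile_pending, &db) != 0)
            return -1;

        /* Each loop iteration is one batch; the flag is checked only
           between batches, so a swap never lands mid-batch. */
        do {
            evals += 1;  /* an eval with db.sets[db.active] would go here */
        } while (!atomic_load_explicit(&db.pending_ready, memory_order_acquire));

        pthread_join(compiler, NULL);
        db.active = 1 - db.active;  /* pending becomes active */
        atomic_store(&db.pending_ready, false);
        db.next_version += 1;
    }

    if (evals_out) *evals_out = evals;
    return db.sets[db.active].version;
}
```

In this sketch only the foreground reads `active`, so the swap itself is a plain index flip; the release/acquire pair on `pending_ready` is what guarantees the foreground sees a fully written pending set before switching to it. Build with `-pthread`. The eval count returned through `evals_out` depends on scheduling and is not meant to reproduce the probe numbers.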
Includes:
- train_double_buffer.m: modified train_large.m with async compilation
- PROBE_RESULTS.md: full benchmark data from M4 probe
- Updated Makefile