ANE/training/ANESDK_roadmap.md

# ANE SDK Roadmap: General-Purpose Neural Engine Development Kit

This roadmap outlines the evolution of the current Apple Neural Engine (ANE) training infrastructure into a modular, high-level SDK for developing and training arbitrary neural network architectures on Apple Silicon.

## 🌟 Strategic Vision: "PyTorch for ANE"
Transform low-level, transformer-specific MIL (Model Intermediate Language) generation into a modular, layer-based system that allows developers to define, train, and benchmark any architecture (CNNs, MLPs, RNNs) with minimal boilerplate.

---

## 🛠 Phase 1: Modular Layer Abstractions (Short Term)
**Goal:** Decouple MIL generation from the Transformer-specific logic.
- [x] **ANE-MIL Layer Library**: Created a repository of optimized MIL builders for core primitives:
  - `Linear(in, out)`, `Conv2D(kernel, stride, padding)`
  - `ReLU`, `GELU`, `Sigmoid`, `Softmax` activations
  - `LayerNorm` and `RMSNorm`
- [x] **Unified Tensor API**: High-level wrapper around `IOSurface` and `NEON` via `anesdk.h`.
- [x] **Weights-as-Tensors by Default**: Every layer automatically utilizes the dynamic weight update optimization (zero-recompile).

## 🚀 Phase 2: Automated Graph Engine (Medium Term)
**Goal:** Automate the orchestration of multiple kernels into a cohesive model.
- [x] **ANEGraph Orchestrator**: Implemented **Sequential** model container that automates execution order.
- [ ] **Automatic Backward Pass**: Orchestration of backward kernels in reverse order.
- [ ] **Automatic Gradient Management**: Logic to handle gradient accumulation and weight updates across multi-layer graphs.
- [ ] **Optimizer Library**: Implement standard optimizers (SGD, Adam, AdamW) as native C++ components using the Accelerate framework.

## 📈 Phase 3: Developer Ecosystem & Tooling (Long Term)
**Goal:** Improve developer velocity and integration.
- [ ] **Python Bridge (PyANE)**: A lightweight Python library for defining models that compiles directly to ANE-executable graph binaries.
- [ ] **Model Profiler**: Native tools to measure TFLOPS, memory bandwidth, and ANE utilization per-layer.
- [ ] **Deployment Export**: One-click export to CoreML `.mlpackage` for final production deployment.

---

## 🏁 Success Metrics
- **Agnosticism**: Ability to run a CIFAR-10 CNN and a Stories110M Transformer using the same core runtime.
- **Performance**: Maintain >90 TFLOPS sustained throughput across various architectures.
- **Simplicity**: Reduce the lines of code required to define a new model by >70%.

> [!NOTE]
> This SDK leverages private ANE infrastructure to bypass the limitations of public CoreML training, specifically focusing on high-throughput, on-device weight updates.