Replaces the finite-difference PPO placeholder with a real GPU-capable Candle
0.9 autodiff trainer, adds A-MAPPO heterogeneous-role attention, a runnable
training binary, and right-sized GCP/local launch scripts. This is the unlock
that makes "GPU long training cycles" actually mean something — the previous
ppo_update did no gradient descent.
## Real autodiff PPO (feature `train`, optional `cuda`)
- candle_ppo.rs: CandleActorCritic (64→128→64 MLP + action/value heads +
learnable log_std), CandlePpoConfig, CandleTrainer with GAE and a genuine
optimizer.backward_step over the network. select_device() picks CUDA when
built --features cuda and a GPU is present, else CPU.
- Verified: 5-episode CPU smoke run shows value_loss 12643→12375 (critic
actually learning); safetensors checkpoint saved. Placeholder never moved weights.
## A-MAPPO heterogeneous-role attention (role_attention.rs, always compiled)
Addresses the four sensor-vs-relay edge cases:
- relay attention floor (prevents collapse — relays produce no CSI)
- role-segmented sensor/relay attention pools (variable neighbor cardinality)
- sensor-gated triangulation-geometry penalty (protects 3-view fusion baseline,
ADR-148 §4.2 — relays not dragged into triangulation geometry)
- one-hot role embeddings for keys
## Training binary
- src/bin/train_marl.rs (required-features=["train"], excluded from default build)
- CLI: --episodes --drones --profile --steps --checkpoint-dir --checkpoint-every
- Wires CandleTrainer to the SwarmOrchestrator rollout loop; GAE + PPO update
per episode; periodic safetensors checkpoints
## Right-sized launch (scripts/gcp/)
- provision_marl.sh: g2-standard-16 (1× L4, 16 vCPU, ~$1.40/hr) — NOT the
$29/hr A100×8 box. MARL is rollout-bound not matmul-bound; ~21× cheaper.
- run_marl_train.sh: GCP rsync + train + checkpoint pull
- run_marl_train_local.sh: local RTX 5080, $0
- A100×8 provision_training.sh left for OccWorld (which saturates the GPUs)
## Tests
- --no-default-features: 91/91 (87 + 4 role_attention)
- --features train: 96/96 (+ 5 candle_ppo, incl. real-autodiff verification)
- --features ruflo,itar-unrestricted: 104/104
- default build stays light: train_marl excluded via required-features
Co-Authored-By: claude-flow <ruv@ruv.net>