wifi-densepose/scripts/gcp
ruv 4f004e018b feat(swarm): real Candle autodiff PPO + A-MAPPO role attention + GPU training (M4)
Replaces the finite-difference PPO placeholder with a real GPU-capable Candle
0.9 autodiff trainer, adds A-MAPPO heterogeneous-role attention, a runnable
training binary, and right-sized GCP/local launch scripts. This is what makes
"GPU long training cycles" meaningful: the previous ppo_update performed no
gradient descent.

## Real autodiff PPO (feature `train`, optional `cuda`)
- candle_ppo.rs: CandleActorCritic (64→128→64 MLP + action/value heads +
  learnable log_std), CandlePpoConfig, CandleTrainer with GAE and a genuine
  optimizer.backward_step over the network. select_device() picks CUDA when
  built --features cuda and a GPU is present, else CPU.
- Verified: a 5-episode CPU smoke run shows value_loss falling 12643→12375 (the
  critic is actually learning) and saves a safetensors checkpoint. The
  placeholder never changed the weights.
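
For readers unfamiliar with GAE, the advantage computation that feeds the PPO update can be sketched in plain Rust as below. This is a minimal, std-only illustration and not the code in candle_ppo.rs: the function name `gae`, its signature, and the convention that `values` carries one extra bootstrap entry are all assumptions made for the example.

```rust
/// Generalized Advantage Estimation over a single episode (illustrative sketch).
/// Assumption: `values` has one more entry than `rewards`; the last entry is the
/// bootstrap value of the state after the final step (0.0 if terminal).
fn gae(rewards: &[f32], values: &[f32], gamma: f32, lambda: f32) -> (Vec<f32>, Vec<f32>) {
    assert_eq!(values.len(), rewards.len() + 1, "need one bootstrap value");
    let n = rewards.len();
    let mut advantages = vec![0.0f32; n];
    let mut running = 0.0f32;
    // Walk backwards so each step can reuse the discounted sum of later TD errors.
    for t in (0..n).rev() {
        // One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        let delta = rewards[t] + gamma * values[t + 1] - values[t];
        running = delta + gamma * lambda * running;
        advantages[t] = running;
    }
    // Critic regression targets: advantage plus the baseline value.
    let returns: Vec<f32> = advantages.iter().zip(values).map(|(a, v)| a + v).collect();
    (advantages, returns)
}

fn main() {
    // Hypothetical rollout data, chosen only to exercise the function.
    let rewards = [1.0, 0.5, -0.2, 2.0];
    let values = [0.8, 0.6, 0.1, 1.5, 0.0];
    let (adv, ret) = gae(&rewards, &values, 0.99, 0.95);
    println!("advantages = {:?}", adv);
    println!("returns    = {:?}", ret);
}
```

The returns serve as the critic's regression targets, so a falling value_loss like the one in the smoke run is what one would expect once gradients actually flow.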

## A-MAPPO heterogeneous-role attention (role_attention.rs, always compiled)
Addresses the four sensor-vs-relay edge cases:
- relay attention floor (prevents collapse — relays produce no CSI)
- role-segmented sensor/relay attention pools (variable neighbor cardinality)
- sensor-gated triangulation-geometry penalty (protects 3-view fusion baseline,
  ADR-148 §4.2 — relays not dragged into triangulation geometry)
- one-hot role embeddings for keys
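
One way a relay attention floor could work is sketched below. The commit does not describe the mechanism used in role_attention.rs, so everything here is an assumption for illustration: the `Role` enum, the function names, and the choice to reserve `floor` probability mass per relay and scale the ordinary softmax into the remainder.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Sensor,
    Relay,
}

/// Numerically stable softmax over raw attention scores.
fn softmax(scores: &[f32]) -> Vec<f32> {
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Attention weights in which every relay neighbor keeps at least `floor` weight.
/// Hypothetical scheme: reserve `floor` per relay, then spread the remaining
/// mass according to the plain softmax, so the weights still sum to 1.
fn attention_with_relay_floor(scores: &[f32], roles: &[Role], floor: f32) -> Vec<f32> {
    assert_eq!(scores.len(), roles.len());
    let n_relay = roles.iter().filter(|r| **r == Role::Relay).count() as f32;
    let reserved = floor * n_relay;
    assert!(reserved < 1.0, "floor too large for this many relays");
    softmax(scores)
        .iter()
        .zip(roles)
        .map(|(w, role)| {
            let base = (1.0 - reserved) * w;
            if *role == Role::Relay { base + floor } else { base }
        })
        .collect()
}

fn main() {
    // A sensor with a dominant score would otherwise starve the relay.
    let scores = [10.0, 0.0, 0.0];
    let roles = [Role::Sensor, Role::Relay, Role::Sensor];
    let weights = attention_with_relay_floor(&scores, &roles, 0.05);
    println!("{:?}", weights);
}
```

Under this scheme a relay's weight can never drop below `floor`, however strongly the sensor scores dominate.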

## Training binary
- src/bin/train_marl.rs (required-features=["train"], excluded from default build)
- CLI: --episodes --drones --profile --steps --checkpoint-dir --checkpoint-every
- Wires CandleTrainer to the SwarmOrchestrator rollout loop; GAE + PPO update
  per episode; periodic safetensors checkpoints

## Right-sized launch (scripts/gcp/)
- provision_marl.sh: g2-standard-16 (1× L4, 16 vCPU, ~$1.40/hr), not the
  $29/hr A100×8 box: ~21× cheaper. MARL is rollout-bound, not matmul-bound.
- run_marl_train.sh: GCP rsync + train + checkpoint pull
- run_marl_train_local.sh: local RTX 5080, $0
- A100×8 provision_training.sh left for OccWorld (which saturates the GPUs)
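
The cost comparison above is simple arithmetic, spelled out in the sketch below. The hourly rates are the approximate figures quoted in this message; the 24-hour run length and the function names are hypothetical.

```rust
/// How many times more expensive `a` is than `b`, per hour.
fn cost_ratio(a_per_hour: f64, b_per_hour: f64) -> f64 {
    a_per_hour / b_per_hour
}

/// Total cost of a run of `hours` at a given hourly rate.
fn run_cost(per_hour: f64, hours: f64) -> f64 {
    per_hour * hours
}

fn main() {
    let a100x8 = 29.0; // $/hr, approximate figure quoted above
    let l4 = 1.40; // $/hr, approximate figure quoted above
    let hours = 24.0; // hypothetical run length
    println!("ratio: {:.1}x", cost_ratio(a100x8, l4));
    println!("24 h on L4:     ${:.2}", run_cost(l4, hours));
    println!("24 h on A100x8: ${:.2}", run_cost(a100x8, hours));
}
```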

## Tests
- --no-default-features: 91/91 (87 + 4 role_attention)
- --features train: 96/96 (+ 5 candle_ppo, incl. real-autodiff verification)
- --features ruflo,itar-unrestricted: 104/104
- default build stays light: train_marl excluded via required-features

Co-Authored-By: claude-flow <ruv@ruv.net>
2026-05-30 12:43:56 -04:00
cosmos_eval.sh feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6) 2026-05-29 20:52:51 -04:00
provision_cosmos.sh feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6) 2026-05-29 20:52:51 -04:00
provision_marl.sh feat(swarm): real Candle autodiff PPO + A-MAPPO role attention + GPU training (M4) 2026-05-30 12:43:56 -04:00
provision_training.sh feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6) 2026-05-29 20:52:51 -04:00
run_marl_train.sh feat(swarm): real Candle autodiff PPO + A-MAPPO role attention + GPU training (M4) 2026-05-30 12:43:56 -04:00
run_marl_train_local.sh feat(swarm): real Candle autodiff PPO + A-MAPPO role attention + GPU training (M4) 2026-05-30 12:43:56 -04:00
run_training.sh feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6) 2026-05-29 20:52:51 -04:00
teardown.sh feat(worldmodel): Candle Rust port + GCP GPU scripts (ADR-147 Phase 4+6) 2026-05-29 20:52:51 -04:00