feat(swarm): ADR-149 evaluation harness — GDOP, IQM+bootstrap CI, noise sweep (#875)

Stage-1 kinematic evaluator per ADR-149 (peer-reviewed). Pure Rust, no new deps. evals/: - gdop.rs: 2D Geometric Dilution of Precision ((HᵀH)⁻¹ trace-sqrt); None for <2 observers or collinear/singular geometry - stats.rs: IQM (Agarwal 2021) + 95% stratified-bootstrap CI (deterministic LCG) + probability_of_improvement - metrics.rs: EpisodeMetrics + AggregateMetrics::from_strata (IQM±CI, seed-stratified) - runner.rs: seeded kinematic rollout (FlightPattern-driven), seed×episode matrix, 3σ×3κ default noise sweep (Gaussian amplitude × von Mises phase) - report.rs + eval_swarm bin: generates evals/RESULTS.md leaderboard RESULTS.md surfaces the real coverage-vs-localization-precision trade-off via GDOP: partitioned wins coverage (100%) but single-drone sightings (GDOP 0 → 7.0m); pheromone gets multistatic fusion (GDOP 1.6 → 4.1m). Wi2SAR 5m paper-baseline row included. Stage-2 (Gazebo/PX4 SITL false-alarm + collision on median seeds) is documented follow-on. Tests: 116 default / 133 full+train (+13 eval tests), 0 failed. Clippy clean (-D warnings).
2026-05-30 17:38:49 -04:00 · 2026-05-30 17:38:49 -04:00 · 8d64434d21
parent 0d3d835bf8
commit 8d64434d21
12 changed files with 1368 additions and 0 deletions
--- a/docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md
+++ b/docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md
@ -0,0 +1,257 @@
+# ADR-149: Drone Swarm Benchmarking & Evaluation Methodology — Metrics, Leaderboards, and Statistical Rigor
+
+| Field      | Value                                                                                   |
+|------------|-----------------------------------------------------------------------------------------|
+| Status     | Accepted (peer-reviewed 2026-05-30)                                                     |
+| Date       | 2026-05-30                                                                              |
+| Deciders   | ruv                                                                                     |
+| Relates to | ADR-148 (ruview-swarm), ADR-147 (OccWorld), ADR-146 (RF encoder), ADR-028 (witness)    |
+
+> Companion to ADR-148. ADR-148 shipped the swarm and 5 criterion micro-benchmarks
+> plus a `SotaComparison` against Wi2SAR. This ADR defines **how we evaluate the swarm
+> rigorously** — what metrics, what statistics, what baselines, and an honest account
+> of which external leaderboards do and do not apply.
+
+---
+
+## 1. Context
+
+ADR-148's `ruview-swarm` reports performance via five `criterion` micro-benchmarks and a
+single `SotaComparison` (localization 1.732 m vs Wi2SAR 5 m; coverage ~223 s vs 810 s).
+These numbers are **internally valid but insufficient as scientific claims**:
+
+- The criterion figures (3.3 µs MARL inference, 43 µs RRT-APF, 54 ns fusion, 248 µs PPO
+  step) measure **wall-clock latency**, not policy quality or coverage/localization quality.
+- The 1.732 m localization comes from a **single synthetic geometry** (3 drones at 120°
+  around a known point), not a distribution of victim positions under realistic noise.
+- The 223 s coverage is an **analytic estimate** (`estimate_coverage_time_secs()`), not an
+  episode rollout.
+- All numbers are **single-run point estimates**. The MARL reproducibility literature
+  (Henderson 2018; Agarwal 2021; Gorsane 2022) shows single/few-seed point estimates
+  routinely flip algorithm rankings and overstate gains.
+
+We need a defined, reproducible evaluation methodology before any "beats SOTA" claim can
+survive external review, and an honest position on external leaderboards.
+
+---
+
+## 2. Decision
+
+Adopt a two-tier evaluation methodology:
+
+1. **Micro-benchmarks (criterion)** — keep for compute-latency regression gating only.
+   Explicitly labeled as latency, never as quality.
+2. **Domain evaluation harness** — a seeded, multi-run, statistically-reported harness
+   producing SAR metrics (localization CEP, coverage, detection rate) and MARL metrics
+   (IQM return, probability-of-improvement) over **≥10 seeds with 95% stratified-bootstrap
+   confidence intervals**, against **≥3 baselines**, following the Agarwal/Gorsane standard.
+
+Do **not** claim leaderboard standing — no public leaderboard accepts drone-swarm CSI-SAR
+submissions. Comparisons to Wi2SAR are **paper-to-paper**, labeled as such, acknowledging
+the sensing-modality difference (RSS bearing vs CSI multi-view fusion).
+
+---
+
+## 3. External Leaderboard Landscape — Honest Assessment
+
+**There is no public, externally-administered leaderboard that accepts a drone-swarm,
+CSI-based, multi-view SAR system.** This is a research niche; comparison is paper-to-paper.
+The adjacent options and their fit:
+
+| Benchmark / Leaderboard | Domain | Live submission? | Fit for ruview-swarm |
+|-------------------------|--------|------------------|----------------------|
+| **Wi2SAR** (arxiv 2604.09115) | Drone WiFi SAR | No (paper) | **Direct baseline** — paper-to-paper only; RSS bearing ≠ CSI fusion |
+| **MARL4DRP** (Springer 2023) | Drone routing MARL | No | Closest drone-MARL benchmark; would need a routing→coverage adapter |
+| **CSI-Bench** (NeurIPS 2025) | Static WiFi sensing | Splits + paper baselines | Adjacent (localization task) but no moving-sensor/multi-view fusion |
+| **SMAC / SMACv2** | StarCraft cooperative MARL | No live LB | Structural analogy (CTDE) only; combat task, not coverage |
+| **PettingZoo MPE** (Simple Spread) | 2D cooperative particles | No | Cheap MARL **correctness check**, no physics/CSI |
+| **Melting Pot** | Social-dynamics MARL | Closed (NeurIPS '24) | Not applicable |
+| **MAMuJoCo / Hanabi / GRF / Overcooked** | Various cooperative MARL | No live LB | Not applicable |
+| **OmniDrones / gym-pybullet-drones / Pegasus** | Drone-control sim platforms | No (platforms) | **Training infrastructure**, not leaderboards; no CSI layer |
+
+**Conclusion:** We will (a) keep Wi2SAR as the cited paper baseline, (b) optionally build a
+MARL4DRP/MPE adapter to post a recognized cooperative-MARL number (tangential to SAR), and
+(c) **not** represent any internal number as a leaderboard placement.
+
+---
+
+## 4. Evaluation Metrics
+
+### 4.1 SAR Domain Metrics (primary — comparable to Wi2SAR)
+
+| Metric | Definition | Reporting |
+|--------|-----------|-----------|
+| Localization CEP50 | Median horizontal error, fused victim position vs ground truth | m, 95% CI |
+| Localization CEP95 | 95th-percentile horizontal error | m |
+| **GDOP** | Geometric Dilution of Precision of the contributing-drone constellation at detection time | dimensionless (tracked per detection) |
+| Coverage rate @ T | Fraction of area scanned ≥1× within T=240 s | %, 95% CI |
+| Coverage time to 95% | Time to scan 95% of bounded area | s, mean ± CI |
+| Time-to-first-detection | Mission start → first confident detection (conf > 0.85) | s, 95% CI |
+| Detection rate | P(detected \| victim present) per mission | %, 95% CI |
+| False-alarm rate | P(confident detection \| no victim) | %, 95% CI |
+| Collision rate | Collisions (d < 1.5 m) per mission | count/mission |
+| Overlap ratio | Fraction of path re-covering scanned cells | % |
+
+### 4.2 MARL Policy-Quality Metrics
+
+| Metric | Definition |
+|--------|-----------|
+| IQM episodic return | Interquartile mean over 10 seeds × 50 eval episodes (Agarwal 2021) |
+| Probability of improvement | P(MAPPO return > IPPO return) on a random episode |
+| Optimality gap | Expected gap to a defined reference performance |
+| Performance profile | Fraction of (seed, episode) with localization error < τ, plotted vs τ ∈ [0,10] m |
+| Sample efficiency | Return vs training steps (curve, not point) |
+
+### 4.3 Micro-benchmarks (criterion — latency only)
+
+Retained from ADR-148, **labeled as compute latency, not quality**:
+`marl_actor_inference` 3.3 µs · `rrt_apf_100iter` 43 µs · `multiview_fusion_3drones` 54 ns ·
+`demo_coverage_estimate` 100 ps · `ppo_update_64transitions` 248 µs. Purpose: prove the
+control loop has no compute bottleneck (all ≪ the 10 ms / 100 Hz budget) and gate
+performance regressions. They are **not** evidence of policy or localization quality.
+
+---
+
+## 5. Statistical Protocol (Agarwal 2021 / Gorsane 2022)
+
+| Requirement | Standard adopted |
+|-------------|------------------|
+| Seeds per condition | **≥10** training runs from distinct seeds |
+| Evaluation episodes | 50 fixed, versioned episodes per trained policy (10 victim layouts × 5 CSI-noise levels) |
+| Aggregate metric | **IQM** (not mean, not median) + performance profiles |
+| Confidence intervals | **95% stratified bootstrap**, 1,000 resamples |
+| Baselines (≥3) | Random walk (lower bound), Boustrophedon+manual-triangulation (heuristic), IPPO (no shared critic) |
+| Reproducibility | Versioned YAML config (drone count, area, victims, CSI σ amplitude / κ phase, wind, packet loss) + all seeds committed with results |
+
+Rationale: Henderson et al. (2018) found ≤5-seed point estimates flip rankings; Agarwal et
+al. (2021, NeurIPS Outstanding Paper) show IQM needs ~10 runs for the statistical power that
+the median needs ~200 runs for; Gorsane et al. (2022) made ≥10 seeds + IQM + stratified CIs
+the cooperative-MARL standard. `rliable` (google-research/rliable) is the reference impl.
+
+---
+
+## 6. Reproducibility Harness (`evals/`)
+
+A new evaluation harness (separate from criterion micro-benchmarks):
+
+1. **Seeded episodes** — every episode, noise perturbation, and training run seeded from a
+   versioned config; seeds committed with results (no `Date.now()`/unseeded RNG).
+2. **Per-episode logging** — coverage %, localization error, GDOP, time-to-first-detection,
+   collisions, detection binary → JSONL (reuses the ADR-148 telemetry schema).
+3. **Aggregation** — IQM ± 95% stratified-bootstrap CI across the 10-seed × 50-episode matrix.
+4. **Baseline sweep** — random / boustrophedon-heuristic / IPPO / MAPPO, so
+   probability-of-improvement and performance profiles are computable.
+5. **Output** — committed `evals/RESULTS.md`: a reproducible internal leaderboard ranking
+   our 6 flight patterns × learning patterns on the SAR metrics, plus the Wi2SAR paper row.
+
+This `RESULTS.md` is the **real, defensible "leaderboard" for this system** — patterns ranked
+against each other and the cited baseline, reproducibly, with CIs.
+
+### 6.1 Dual-stage pipeline (compute-cost mitigation)
+
+The full matrix is **10 seeds × 50 episodes × ≥4 conditions = ≥2,000 rollouts per policy**.
+Running each rollout against the OccWorld 3D prior (ADR-147, ~375 ms/inference) would melt
+the L4 / RTX 5080 budget. Split evaluation into two stages:
+
+- **Stage 1 — Kinematic (fast, full matrix).** Stripped vector environment; OccWorld paths
+  pre-cached or treated as static analytical volumes. Produces episodic **return, IQM,
+  sample-efficiency curves, coverage %, GDOP, localization error** over the full 10-seed matrix.
+- **Stage 2 — High-fidelity physics (sub-sampled).** Take the **3 median seeds** (by Stage-1
+  IQM) into Gazebo + PX4 SITL with full CSI phase/amplitude noise. Extracts **false-alarm
+  rate** and **collision rate** under realistic dynamics (heading-rate limits, APF repulsion,
+  motor response) that the kinematic sim omits.
+
+Stage 1 is CI-runnable today; Stage 2 requires the Gazebo/PX4 SITL bring-up (follow-on).
+
+### 6.2 Noise sweep (coherence-gate threshold)
+
+The config generator systematically varies the two CSI noise parameters:
+- **σ** — Gaussian amplitude noise (CSI magnitude)
+- **κ** — von Mises phase concentration (lower κ = noisier phase)
+
+Sweeping (σ, κ) isolates the exact environmental threshold where `CrossViewpointAttention`
+(ADR-016) drops out of its coherence gate (`coherence_gate.rs` Accept → PredictOnly/Reject,
+ADR-135). This finds the operating envelope, not just a single-point accuracy.
+
+### 6.3 GDOP tracking
+
+Localization accuracy is meaningless without the constellation geometry that produced it.
+The harness records **GDOP** per detection: 3 drones in a ~120° constellation give the
+√3 ≈ 1.73× CRLB improvement; 3 **collinear** drones degrade toward the single-view
+Cramer-Rao limit (~2.9 m). Reporting localization error **stratified by GDOP band** prevents
+the headline number from being a best-case geometric artifact.
+
+---
+
+## 7. Evidence Grading of Current ADR-148 Numbers
+
+| Claim | Grade | Why |
+|-------|-------|-----|
+| criterion latencies (3.3 µs / 43 µs / 54 ns / 248 µs) | **High** | Deterministic compute, hardware-specific, reproducible |
+| Wi2SAR baseline (5 m, 160k m²/13.5 min) | **High** | Published field trial, open source |
+| 1.732 m 3-view localization | **Low–Medium** | Single synthetic geometry; no noise distribution; CRLB predicts ~2.9 m for N=3 |
+| 223 s 4-drone coverage | **Low** | Analytic estimate, not an episode rollout |
+| "beats SOTA" | **Directional only** | Valid as paper-to-paper direction; not leaderboard, not multi-seed |
+
+The √N multi-view scaling claim is theoretically sound (CRLB: σ ∝ 1/√(N·SNR); N=3 → √3 ≈
+1.73× improvement), but the measured 1.732 m must be reproduced over a victim-position and
+noise distribution before it is defensible.
+
+---
+
+## 8. Consequences
+
+### Positive
+- Converts scattered numbers into a reproducible, statistically-honest evaluation.
+- The `RESULTS.md` internal leaderboard ranks the 6 flight × 4 learning patterns fairly.
+- Aligns with the recognized MARL evaluation standard (IQM + stratified CIs + ≥10 seeds).
+- Honest external-leaderboard position avoids overclaiming.
+
+### Costs / Risks
+- ≥10 seeds × 50 episodes × N patterns × N baselines is a real compute cost — this is where
+  the ADR-148 GCP L4 / local RTX 5080 training budget is actually spent.
+- Requires the MARL policy to be **trained to convergence** first (the ADR-148 5-episode CPU
+  run shows decreasing value_loss, not convergence).
+- Coverage/localization must move from analytic estimate / synthetic geometry to **episode
+  rollouts under realistic CSI noise** before headline numbers are republished.
+
+### Open issues → follow-on work
+1. Train MAPPO/IPPO to convergence (M4 follow-on) before running the eval harness.
+2. Build the seeded `evals/` harness + `RESULTS.md` generator.
+3. Optional: MARL4DRP or MPE Simple-Spread adapter for a recognized cooperative-MARL number.
+4. Re-state ADR-148 §14 headline numbers with CIs once the harness has run.
+
+---
+
+## 9. Research Notes & References
+
+Compiled by `ruflo-goals:deep-researcher` (2026-05-30). Full landscape in the agent record.
+
+**MARL evaluation rigor**
+- Henderson et al., "Deep RL That Matters", arxiv 1709.06560 — ≤5-seed estimates flip rankings
+- Agarwal et al., "Deep RL at the Edge of the Statistical Precipice", NeurIPS 2021, arxiv 2108.13264 — IQM, performance profiles, stratified bootstrap; `rliable`
+- Gorsane et al., "Standardised Evaluation Protocol for Cooperative MARL", NeurIPS 2022, arxiv 2209.10485 — ≥10 seeds + IQM standard
+- BenchMARL, arxiv 2312.01472 — operationalizes the above
+
+**Cooperative-MARL benchmarks**
+- SMACv2, arxiv 2212.07489 · PettingZoo MPE (Farama) · Melting Pot (DeepMind, NeurIPS 2024 contest) · MAMuJoCo (Gymnasium-Robotics) · MARL4DRP, Springer 2023 (closest drone-MARL)
+
+**Drone-sim platforms**
+- gym-pybullet-drones, arxiv 2103.02142 · OmniDrones, IEEE RA-L 2024 · Pegasus, arxiv 2307.05263 · Flightmare (IROS 2021) · AirSim (discontinued 2022) · Crazyswarm2
+
+**SAR / coverage / CSI sensing**
+- Wi2SAR, arxiv 2604.09115 (direct baseline: 5 m, 160k m²/13.5 min, 18.4° median AoA)
+- CSI-Bench, NeurIPS 2025, arxiv 2505.21866 (461 h WiFi sensing, localization task)
+- Coverage path planning, PMC9571681 (boustrophedon ~5% faster than spiral)
+- Bio-inspired SAR, Nature s41598-025-33223-z (PSO > Levy/ACO on exploration score)
+- CRLB for CSI localization, IEEE 8110647 (σ ∝ 1/√(N·SNR))
+
+**Tooling**
+- criterion.rs known limitations — wall-clock only, not algorithmic quality
+- rliable, github.com/google-research/rliable
+
+---
+
+*ADR authored with research support from `ruflo-goals:deep-researcher` (2026-05-30).
+ Companion to ADR-148. Defines the evaluation methodology that the ADR-148 headline
+ numbers must satisfy before being republished as defensible claims.*
--- a/v2/crates/ruview-swarm/Cargo.toml
+++ b/v2/crates/ruview-swarm/Cargo.toml
@ -78,3 +78,7 @@ harness = false
 [[bin]]
 name = "train_marl"
 required-features = ["train"]
+
+# ADR-149 Stage-1 evaluation CLI — pure Rust, no special feature needed.
+[[bin]]
+name = "eval_swarm"
--- a/v2/crates/ruview-swarm/evals/.gitkeep
+++ b/v2/crates/ruview-swarm/evals/.gitkeep
@ -0,0 +1,2 @@
+# ADR-149 evaluation outputs
+RESULTS.md is generated by the `eval_swarm` binary.
--- a/v2/crates/ruview-swarm/evals/RESULTS.md
+++ b/v2/crates/ruview-swarm/evals/RESULTS.md
@ -0,0 +1,26 @@
+# ruview-swarm Evaluation Results (ADR-149 Stage 1, kinematic)
+
+Statistically-rigorous evaluation harness: seeded multi-run rollouts with IQM + 95% stratified-bootstrap confidence intervals (Agarwal et al., NeurIPS 2021).
+
+## Run configuration
+
+- **Stage**: 1 (kinematic, self-contained, deterministic per seed)
+- **Episodes per pattern**: 100 (seed × episode matrix)
+- **CI method**: 95% stratified bootstrap of the IQM, stratified by seed
+- **GDOP**: 2-D geometric dilution of precision at first detection
+
+> **Stage 2 pending**: high-fidelity Gazebo/PX4 SITL evaluation (false-alarm rate, real collision rate on the median seeds) is a follow-on — see ADR-149 §6.1. The collision figures below are a kinematic min-separation proxy, not SITL physics.
+
+## Flight-pattern leaderboard
+
+| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | Detection rate | Mean GDOP |
+|----------------|-----------------------|-------------------------------|----------------|-----------|
+| partitioned_lawnmower | 1.000 [1.000, 1.000] | 7.022 [5.669, 8.379] | 100.0% | 0.000 |
+| pheromone | 0.662 [0.652, 0.671] | 4.110 [3.346, 5.141] | 95.0% | 1.598 |
+| levy_flight | 0.490 [0.489, 0.491] | 3.523 [2.897, 4.160] | 100.0% | 0.000 |
+| boustrophedon | 0.370 [0.370, 0.370] | 2.740 [2.357, 3.207] | 100.0% | 0.000 |
+| spiral | 0.336 [0.336, 0.336] | 3.082 [2.678, 3.568] | 100.0% | 0.000 |
+| potential_field | 0.254 [0.252, 0.256] | 4.343 [3.489, 5.265] | 100.0% | 0.000 |
+| _Wi2SAR (paper baseline)_ | _n/a_ | _5.0 (paper)_ | _n/a_ | _n/a_ |
+
+_Wi2SAR row is the published single-drone localization figure (arxiv 2604.09115), shown paper-to-paper for reference only — it was not re-run through this kinematic harness._
--- a/v2/crates/ruview-swarm/src/bin/eval_swarm.rs
+++ b/v2/crates/ruview-swarm/src/bin/eval_swarm.rs
@ -0,0 +1,104 @@
+//! ADR-149 Stage-1 evaluation CLI.
+//!
+//! Runs the kinematic eval matrix over every flight pattern (default) and
+//! writes a ranked `RESULTS.md` leaderboard. Pure Rust — no special feature
+//! flag required, so it builds and runs in default CI.
+//!
+//! Defaults are intentionally small (10 seeds × 10 episodes) so the run is fast.
+//! The full ADR-149 reporting configuration is 10 seeds × 50 episodes — pass
+//! `--seeds 10 --episodes 50` for the publication run.
+//!
+//! ```text
+//! cargo run -p ruview-swarm --bin eval_swarm -- \
+//!   --seeds 10 --episodes 10 --out crates/ruview-swarm/evals/RESULTS.md
+//! ```
+
+use std::path::PathBuf;
+
+use ruview_swarm::evals::metrics::AggregateMetrics;
+use ruview_swarm::evals::report::render_results_md;
+use ruview_swarm::evals::runner::{run_matrix, EvalConfig};
+use ruview_swarm::planning::patterns::FlightPattern;
+
+fn main() {
+    let args: Vec<String> = std::env::args().collect();
+    let mut seeds = 10usize;
+    let mut episodes = 10usize;
+    let mut out = PathBuf::from("crates/ruview-swarm/evals/RESULTS.md");
+
+    let mut i = 1;
+    while i < args.len() {
+        match args[i].as_str() {
+            "--seeds" => {
+                i += 1;
+                seeds = args.get(i).and_then(|s| s.parse().ok()).unwrap_or(seeds);
+            }
+            "--episodes" => {
+                i += 1;
+                episodes = args.get(i).and_then(|s| s.parse().ok()).unwrap_or(episodes);
+            }
+            "--out" => {
+                i += 1;
+                if let Some(p) = args.get(i) {
+                    out = PathBuf::from(p);
+                }
+            }
+            "--help" | "-h" => {
+                eprintln!(
+                    "eval_swarm — ADR-149 Stage-1 kinematic evaluator\n\
+                     Usage: eval_swarm [--seeds N] [--episodes M] [--out PATH]\n\
+                     Defaults: --seeds 10 --episodes 10 --out crates/ruview-swarm/evals/RESULTS.md"
+                );
+                return;
+            }
+            other => {
+                eprintln!("warning: ignoring unknown argument '{other}'");
+            }
+        }
+        i += 1;
+    }
+
+    eprintln!(
+        "Running ADR-149 Stage-1 eval: {seeds} seeds × {episodes} episodes \
+         over {} flight patterns...",
+        FlightPattern::all().len()
+    );
+
+    let mut rows: Vec<(String, AggregateMetrics)> = Vec::new();
+    for (idx, pattern) in FlightPattern::all().into_iter().enumerate() {
+        let mut cfg = EvalConfig::sar_small(pattern);
+        cfg.seeds = seeds;
+        cfg.episodes_per_seed = episodes;
+        let matrix = run_matrix(&cfg);
+        let agg = AggregateMetrics::from_strata(&matrix, 0x0149 ^ idx as u64);
+        eprintln!(
+            "  {}: coverage IQM {:.3}, detection {:.0}%",
+            pattern.name(),
+            agg.coverage_iqm.point,
+            agg.detection_rate * 100.0
+        );
+        rows.push((pattern.name().to_string(), agg));
+    }
+
+    // Rank by descending coverage point estimate.
+    rows.sort_by(|a, b| {
+        b.1.coverage_iqm
+            .point
+            .partial_cmp(&a.1.coverage_iqm.point)
+            .unwrap_or(std::cmp::Ordering::Equal)
+    });
+
+    let md = render_results_md(&rows);
+
+    if let Some(parent) = out.parent() {
+        if let Err(e) = std::fs::create_dir_all(parent) {
+            eprintln!("error: could not create {}: {e}", parent.display());
+            std::process::exit(1);
+        }
+    }
+    if let Err(e) = std::fs::write(&out, &md) {
+        eprintln!("error: could not write {}: {e}", out.display());
+        std::process::exit(1);
+    }
+    eprintln!("Wrote {} ({} bytes).", out.display(), md.len());
+}
--- a/v2/crates/ruview-swarm/src/evals/gdop.rs
+++ b/v2/crates/ruview-swarm/src/evals/gdop.rs
@ -0,0 +1,118 @@
+//! Geometric Dilution of Precision (GDOP) for a constellation of observers.
+//!
+//! GDOP quantifies how observer geometry amplifies measurement error into
+//! position-estimate error. Build the geometry matrix `H` of unit
+//! line-of-sight (LOS) vectors from each observer to the target, form the
+//! normal matrix `HᵀH`, invert it, and take `GDOP = sqrt(trace((HᵀH)⁻¹))`.
+//!
+//! For the 2-D (x, y) localization case `H` is `N×2` and `HᵀH` is `2×2`, so a
+//! closed-form 2×2 inverse suffices (no linear-algebra dependency needed).
+//!
+//! Lower GDOP = better geometry: observers spread ~120° apart around the target
+//! give low GDOP; (near-)collinear observers give a singular/ill-conditioned
+//! `HᵀH` → GDOP → ∞.
+
+use crate::types::Position3D;
+
+/// Geometric Dilution of Precision (2-D) for `observers` viewing a `target`.
+///
+/// Lower = better geometry. A ~120° constellation → low GDOP; collinear → very
+/// large (→∞). Returns `None` if fewer than two observers, if any observer is
+/// coincident with the target (undefined LOS), or if the geometry is singular
+/// / degenerate (collinear) so `HᵀH` is not invertible.
+pub fn gdop(observers: &[Position3D], target: &Position3D) -> Option<f64> {
+    if observers.len() < 2 {
+        return None;
+    }
+
+    // Accumulate HᵀH directly (2×2 symmetric) from unit LOS vectors.
+    // Row i of H is the unit vector from target → observer i in (x, y).
+    let mut a = 0.0; // sum ux*ux
+    let mut b = 0.0; // sum ux*uy
+    let mut d = 0.0; // sum uy*uy
+
+    for obs in observers {
+        let dx = obs.x - target.x;
+        let dy = obs.y - target.y;
+        let range = (dx * dx + dy * dy).sqrt();
+        if range < 1e-9 {
+            // Observer on top of the target → LOS undefined.
+            return None;
+        }
+        let ux = dx / range;
+        let uy = dy / range;
+        a += ux * ux;
+        b += ux * uy;
+        d += uy * uy;
+    }
+
+    // Determinant of HᵀH = [[a, b], [b, d]].
+    let det = a * d - b * b;
+    if det.abs() < 1e-12 {
+        // Singular: observers are (near-)collinear with the target.
+        return None;
+    }
+
+    // (HᵀH)⁻¹ = 1/det * [[d, -b], [-b, a]]; trace = (d + a) / det.
+    let trace_inv = (a + d) / det;
+    if trace_inv <= 0.0 || !trace_inv.is_finite() {
+        return None;
+    }
+    Some(trace_inv.sqrt())
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn p(x: f64, y: f64) -> Position3D {
+        Position3D { x, y, z: 0.0 }
+    }
+
+    #[test]
+    fn test_triangle_lower_than_collinear() {
+        let target = p(0.0, 0.0);
+        // Three observers at 120° around the target, radius 10.
+        let r = 10.0;
+        let triangle = [
+            p(r * 0.0_f64.cos(), r * 0.0_f64.sin()),
+            p(
+                r * (2.0 * std::f64::consts::PI / 3.0).cos(),
+                r * (2.0 * std::f64::consts::PI / 3.0).sin(),
+            ),
+            p(
+                r * (4.0 * std::f64::consts::PI / 3.0).cos(),
+                r * (4.0 * std::f64::consts::PI / 3.0).sin(),
+            ),
+        ];
+        // Three nearly-collinear observers (tiny y perturbation to stay invertible).
+        let near_collinear = [p(5.0, 0.01), p(10.0, 0.0), p(15.0, 0.01)];
+
+        let tri = gdop(&triangle, &target).expect("triangle finite GDOP");
+        let col = gdop(&near_collinear, &target).expect("near-collinear finite GDOP");
+        assert!(tri.is_finite(), "triangle GDOP must be finite: {tri}");
+        assert!(
+            tri < col,
+            "120° constellation should have lower GDOP than near-collinear: tri={tri}, col={col}"
+        );
+    }
+
+    #[test]
+    fn test_collinear_degenerate() {
+        let target = p(0.0, 0.0);
+        // Perfectly collinear observers along +x → singular HᵀH.
+        let collinear = [p(5.0, 0.0), p(10.0, 0.0), p(20.0, 0.0)];
+        let g = gdop(&collinear, &target);
+        assert!(
+            g.is_none() || g.unwrap() > 1e6,
+            "perfectly collinear geometry must be None or huge, got {g:?}"
+        );
+    }
+
+    #[test]
+    fn test_single_observer_none() {
+        let target = p(0.0, 0.0);
+        assert!(gdop(&[p(5.0, 5.0)], &target).is_none());
+        assert!(gdop(&[], &target).is_none());
+    }
+}
--- a/v2/crates/ruview-swarm/src/evals/metrics.rs
+++ b/v2/crates/ruview-swarm/src/evals/metrics.rs
@ -0,0 +1,150 @@
+//! Per-episode and aggregate SAR + MARL metrics (ADR-149 Stage 1).
+
+use crate::evals::stats::{stratified_bootstrap_ci, ConfidenceInterval};
+
+/// Per-episode SAR metrics (Stage 1 kinematic).
+#[derive(Debug, Clone)]
+pub struct EpisodeMetrics {
+    /// Fraction of the mission area scanned at least once, in [0, 1].
+    pub coverage_pct: f64,
+    /// Localization error (m) of the fused victim estimate; `None` if no detection.
+    pub localization_error_m: Option<f64>,
+    /// GDOP of the contributing-drone constellation at detection; `None` if none.
+    pub gdop_at_detection: Option<f64>,
+    /// Mission-elapsed seconds to first detection; `None` if no detection.
+    pub time_to_first_detection_s: Option<f64>,
+    /// Whether at least one victim was detected this episode.
+    pub detected: bool,
+    /// Count of inter-drone proximity violations (kinematic proxy for collisions).
+    pub collisions: u32,
+    /// Fraction of scanned area covered by more than one drone, in [0, 1].
+    pub overlap_ratio: f64,
+    /// Scalar episodic return (reward-like coverage/detection objective).
+    pub episodic_return: f64,
+}
+
+/// Aggregate over a seed × episode matrix with IQM + 95% bootstrap CIs.
+#[derive(Debug, Clone)]
+pub struct AggregateMetrics {
+    pub coverage_iqm: ConfidenceInterval,
+    /// IQM over detected episodes only (undetected episodes carry no error).
+    pub localization_iqm: ConfidenceInterval,
+    pub detection_rate: f64,
+    pub mean_gdop: f64,
+    pub return_iqm: ConfidenceInterval,
+    pub n_episodes: usize,
+}
+
+impl AggregateMetrics {
+    /// Aggregate a seed-stratified matrix of episodes. Each inner `Vec` is one
+    /// seed's episodes; bootstrap resampling is stratified by seed so the CI
+    /// reflects between-seed variance (the dominant source per ADR-149).
+    pub fn from_strata(per_seed: &[Vec<EpisodeMetrics>], boot_seed: u64) -> Self {
+        const N_BOOT: usize = 1000;
+
+        let coverage_strata: Vec<Vec<f64>> = per_seed
+            .iter()
+            .map(|s| s.iter().map(|e| e.coverage_pct).collect())
+            .collect();
+        let return_strata: Vec<Vec<f64>> = per_seed
+            .iter()
+            .map(|s| s.iter().map(|e| e.episodic_return).collect())
+            .collect();
+        // Localization: only detected episodes contribute. Keep stratification
+        // by seed but drop empty strata so the bootstrap doesn't degenerate.
+        let loc_strata: Vec<Vec<f64>> = per_seed
+            .iter()
+            .map(|s| {
+                s.iter()
+                    .filter_map(|e| e.localization_error_m)
+                    .collect::<Vec<f64>>()
+            })
+            .filter(|v: &Vec<f64>| !v.is_empty())
+            .collect();
+
+        let mut detected = 0usize;
+        let mut total = 0usize;
+        let mut gdop_sum = 0.0;
+        let mut gdop_n = 0usize;
+        for seed in per_seed {
+            for e in seed {
+                total += 1;
+                if e.detected {
+                    detected += 1;
+                }
+                if let Some(g) = e.gdop_at_detection {
+                    if g.is_finite() {
+                        gdop_sum += g;
+                        gdop_n += 1;
+                    }
+                }
+            }
+        }
+
+        let detection_rate = if total == 0 {
+            0.0
+        } else {
+            detected as f64 / total as f64
+        };
+        let mean_gdop = if gdop_n == 0 {
+            0.0
+        } else {
+            gdop_sum / gdop_n as f64
+        };
+
+        AggregateMetrics {
+            coverage_iqm: stratified_bootstrap_ci(&coverage_strata, N_BOOT, boot_seed),
+            localization_iqm: stratified_bootstrap_ci(
+                &loc_strata,
+                N_BOOT,
+                boot_seed.wrapping_add(1),
+            ),
+            detection_rate,
+            mean_gdop,
+            return_iqm: stratified_bootstrap_ci(
+                &return_strata,
+                N_BOOT,
+                boot_seed.wrapping_add(2),
+            ),
+            n_episodes: total,
+        }
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    fn ep(cov: f64, loc: Option<f64>, ret: f64, detected: bool) -> EpisodeMetrics {
+        EpisodeMetrics {
+            coverage_pct: cov,
+            localization_error_m: loc,
+            gdop_at_detection: if detected { Some(2.0) } else { None },
+            time_to_first_detection_s: if detected { Some(10.0) } else { None },
+            detected,
+            collisions: 0,
+            overlap_ratio: 0.1,
+            episodic_return: ret,
+        }
+    }
+
+    #[test]
+    fn test_aggregate_detection_rate_and_shape() {
+        let per_seed = vec![
+            vec![
+                ep(0.8, Some(1.5), 80.0, true),
+                ep(0.7, None, 70.0, false),
+            ],
+            vec![
+                ep(0.9, Some(2.0), 90.0, true),
+                ep(0.85, Some(1.0), 85.0, true),
+            ],
+        ];
+        let agg = AggregateMetrics::from_strata(&per_seed, 7);
+        assert_eq!(agg.n_episodes, 4);
+        assert!((agg.detection_rate - 0.75).abs() < 1e-9);
+        assert!(agg.coverage_iqm.lo <= agg.coverage_iqm.point);
+        assert!(agg.coverage_iqm.point <= agg.coverage_iqm.hi);
+        assert!(agg.mean_gdop > 0.0);
+    }
+}
--- a/v2/crates/ruview-swarm/src/evals/mod.rs
+++ b/v2/crates/ruview-swarm/src/evals/mod.rs
@ -0,0 +1,19 @@
+//! ADR-149 statistically-rigorous evaluation harness (Stage 1, kinematic).
+//!
+//! Produces SAR + MARL metrics over a seeded N-seed × M-episode matrix with
+//! IQM + 95% stratified-bootstrap CIs, a (sigma, kappa) CSI-noise sweep, and
+//! GDOP-stratified localization error. Generates evals/RESULTS.md.
+//!
+//! Stage 2 (Gazebo/PX4 SITL high-fidelity, false-alarm + collision rate on the
+//! median seeds) is a follow-on — see ADR-149 §6.1.
+pub mod gdop;
+pub mod stats;
+pub mod metrics;
+pub mod runner;
+pub mod report;
+
+pub use gdop::gdop;
+pub use stats::{iqm, stratified_bootstrap_ci, ConfidenceInterval};
+pub use metrics::{EpisodeMetrics, AggregateMetrics};
+pub use runner::{EvalConfig, NoiseLevel, run_matrix};
+pub use report::render_results_md;
--- a/v2/crates/ruview-swarm/src/evals/report.rs
+++ b/v2/crates/ruview-swarm/src/evals/report.rs
@ -0,0 +1,120 @@
+//! RESULTS.md leaderboard generator (ADR-149 Stage 1).
+
+use crate::evals::metrics::AggregateMetrics;
+use crate::evals::stats::ConfidenceInterval;
+
+/// Wi2SAR published localization baseline (paper-to-paper), metres.
+const WI2SAR_LOCALIZATION_M: f64 = 5.0;
+
+/// Format a CI as `point [lo, hi]` with two decimals.
+fn fmt_ci(ci: &ConfidenceInterval) -> String {
+    format!("{:.3} [{:.3}, {:.3}]", ci.point, ci.lo, ci.hi)
+}
+
+/// Render a markdown leaderboard: one row per flight pattern with coverage
+/// IQM±CI, localization IQM±CI, detection rate, and mean GDOP — plus the
+/// Wi2SAR paper baseline row clearly labelled paper-to-paper.
+///
+/// `rows` is `(pattern_name, aggregate)`; rows are emitted in the order given,
+/// so callers should pre-sort (e.g. by descending coverage point estimate).
+pub fn render_results_md(rows: &[(String, AggregateMetrics)]) -> String {
+    let mut s = String::new();
+    s.push_str("# ruview-swarm Evaluation Results (ADR-149 Stage 1, kinematic)\n\n");
+    s.push_str(
+        "Statistically-rigorous evaluation harness: seeded multi-run rollouts with \
+         IQM + 95% stratified-bootstrap confidence intervals (Agarwal et al., \
+         NeurIPS 2021).\n\n",
+    );
+
+    // Run configuration header.
+    let (n_episodes, n_seeds) = rows
+        .first()
+        .map(|(_, a)| {
+            let n = a.n_episodes;
+            // Episodes-per-seed isn't stored; report total + leave seed split to caller note.
+            (n, 0usize)
+        })
+        .unwrap_or((0, 0));
+    s.push_str("## Run configuration\n\n");
+    s.push_str(&format!(
+        "- **Stage**: 1 (kinematic, self-contained, deterministic per seed)\n\
+         - **Episodes per pattern**: {n_episodes} (seed × episode matrix)\n\
+         - **CI method**: 95% stratified bootstrap of the IQM, stratified by seed\n\
+         - **GDOP**: 2-D geometric dilution of precision at first detection\n"
+    ));
+    let _ = n_seeds;
+    s.push_str(
+        "\n> **Stage 2 pending**: high-fidelity Gazebo/PX4 SITL evaluation \
+         (false-alarm rate, real collision rate on the median seeds) is a \
+         follow-on — see ADR-149 §6.1. The collision figures below are a \
+         kinematic min-separation proxy, not SITL physics.\n\n",
+    );
+
+    // Leaderboard table.
+    s.push_str("## Flight-pattern leaderboard\n\n");
+    s.push_str(
+        "| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | \
+         Detection rate | Mean GDOP |\n",
+    );
+    s.push_str(
+        "|----------------|-----------------------|-------------------------------|\
+         ----------------|-----------|\n",
+    );
+    for (name, agg) in rows {
+        s.push_str(&format!(
+            "| {} | {} | {} | {:.1}% | {:.3} |\n",
+            name,
+            fmt_ci(&agg.coverage_iqm),
+            fmt_ci(&agg.localization_iqm),
+            agg.detection_rate * 100.0,
+            agg.mean_gdop,
+        ));
+    }
+    // Wi2SAR paper baseline row (paper-to-paper, no kinematic re-run).
+    s.push_str(&format!(
+        "| _Wi2SAR (paper baseline)_ | _n/a_ | _{:.1} (paper)_ | _n/a_ | _n/a_ |\n",
+        WI2SAR_LOCALIZATION_M,
+    ));
+
+    s.push_str(
+        "\n_Wi2SAR row is the published single-drone localization figure \
+         (arxiv 2604.09115), shown paper-to-paper for reference only — it was \
+         not re-run through this kinematic harness._\n",
+    );
+
+    s
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::evals::stats::ConfidenceInterval;
+
+    fn agg(cov: f64, det: f64) -> AggregateMetrics {
+        let ci = |p: f64| ConfidenceInterval { point: p, lo: p - 0.05, hi: p + 0.05 };
+        AggregateMetrics {
+            coverage_iqm: ci(cov),
+            localization_iqm: ci(1.5),
+            detection_rate: det,
+            mean_gdop: 2.1,
+            return_iqm: ci(80.0),
+            n_episodes: 100,
+        }
+    }
+
+    #[test]
+    fn test_render_contains_rows_and_baseline() {
+        let rows = vec![
+            ("partitioned_lawnmower".to_string(), agg(0.92, 0.95)),
+            ("levy_flight".to_string(), agg(0.40, 0.50)),
+        ];
+        let md = render_results_md(&rows);
+        assert!(md.contains("partitioned_lawnmower"));
+        assert!(md.contains("levy_flight"));
+        assert!(md.contains("Wi2SAR"));
+        assert!(md.contains("Stage 2 pending"));
+        assert!(md.contains("95% stratified bootstrap"));
+        // Coverage point estimate appears.
+        assert!(md.contains("0.920"));
+    }
+}
--- a/v2/crates/ruview-swarm/src/evals/runner.rs
+++ b/v2/crates/ruview-swarm/src/evals/runner.rs
@ -0,0 +1,364 @@
+//! Stage-1 kinematic rollout + seed × episode matrix (ADR-149).
+//!
+//! A single `run_episode` deterministically drives `drones` drones across a
+//! mission area under a chosen [`FlightPattern`], marks coverage on a grid,
+//! simulates CSI victim detection perturbed by `(sigma, kappa)` amplitude /
+//! von-Mises-phase noise, and computes the GDOP of the contributing-drone
+//! constellation at first detection. It is self-contained and seeded — no
+//! Candle / training backend required — so it runs in CI by default.
+
+use crate::config::SwarmConfig;
+use crate::evals::gdop::gdop;
+use crate::evals::metrics::EpisodeMetrics;
+use crate::planning::patterns::{FlightPattern, PatternContext};
+use crate::types::{NodeId, Position3D};
+
+/// CSI-noise level: amplitude std `sigma` and von-Mises phase concentration `kappa`.
+/// Higher `sigma` = noisier amplitude; *lower* `kappa` = noisier phase (more diffuse).
+#[derive(Debug, Clone, Copy)]
+pub struct NoiseLevel {
+    pub sigma: f64,
+    pub kappa: f64,
+}
+
+/// One evaluation configuration: a flight pattern + swarm/mission parameters.
+#[derive(Debug, Clone)]
+pub struct EvalConfig {
+    pub flight: FlightPattern,
+    pub config: SwarmConfig,
+    pub drones: usize,
+    pub steps: usize,
+    pub seeds: usize,             // ≥10 per ADR-149
+    pub episodes_per_seed: usize, // e.g. 50
+    pub victims: Vec<Position3D>,
+    pub noise: NoiseLevel,
+}
+
+impl EvalConfig {
+    /// A small SAR default suitable for fast CI runs.
+    pub fn sar_small(flight: FlightPattern) -> Self {
+        EvalConfig {
+            flight,
+            config: SwarmConfig::sar_default(),
+            drones: 4,
+            steps: 120,
+            seeds: 10,
+            episodes_per_seed: 10,
+            victims: vec![
+                Position3D { x: 120.0, y: 90.0, z: 0.0 },
+                Position3D { x: 320.0, y: 280.0, z: 0.0 },
+            ],
+            noise: NoiseLevel { sigma: 0.05, kappa: 8.0 },
+        }
+    }
+}
+
+/// Minimal reproducible LCG → f64 in [0, 1). Self-contained for determinism.
+struct Lcg(u64);
+impl Lcg {
+    fn new(seed: u64) -> Self {
+        Lcg(seed ^ 0xD1B5_4A32_D192_ED03)
+    }
+    #[inline]
+    fn next_u64(&mut self) -> u64 {
+        self.0 = self
+            .0
+            .wrapping_mul(6364136223846793005)
+            .wrapping_add(1442695040888963407);
+        self.0
+    }
+    #[inline]
+    fn unit(&mut self) -> f64 {
+        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
+    }
+    /// Standard-normal sample via Box–Muller (deterministic).
+    #[inline]
+    fn normal(&mut self) -> f64 {
+        let u1 = self.unit().max(1e-12);
+        let u2 = self.unit();
+        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
+    }
+}
+
+/// Run one kinematic episode deterministically from `seed`.
+///
+/// Drives drones step-by-step by the flight pattern, marks a coarse coverage
+/// grid, and on the first step a drone comes within scan range of any victim
+/// records a fused localization estimate (weighted centroid of contributing
+/// drones' per-drone victim estimates, each perturbed by `(sigma, kappa)`
+/// noise) and the GDOP of those contributing drones.
+pub fn run_episode(cfg: &EvalConfig, seed: u64) -> EpisodeMetrics {
+    let mut rng = Lcg::new(seed);
+
+    let area_w = cfg.config.mission.area_width_m;
+    let area_h = cfg.config.mission.area_height_m;
+    let altitude_z = -cfg.config.planning.flight_altitude_m;
+    let scan_width = cfg.config.planning.csi_scan_width_m.max(1.0);
+    let min_sep = cfg.config.formation.min_separation_m.max(0.1);
+    let n = cfg.drones.max(1);
+
+    // Coverage grid sized so each cell ~= scan_width.
+    let gx = ((area_w / scan_width).ceil() as usize).max(1);
+    let gy = ((area_h / scan_width).ceil() as usize).max(1);
+    let cell_w = area_w / gx as f64;
+    let cell_h = area_h / gy as f64;
+    let mut cover_count = vec![0u32; gx * gy];
+
+    // Spread drones along the bottom edge with a small seeded jitter.
+    let mut positions: Vec<Position3D> = (0..n)
+        .map(|i| {
+            let frac = (i as f64 + 0.5) / n as f64;
+            Position3D {
+                x: (frac * area_w + (rng.unit() - 0.5) * scan_width).clamp(0.0, area_w),
+                y: (rng.unit() * scan_width).clamp(0.0, area_h),
+                z: altitude_z,
+            }
+        })
+        .collect();
+
+    // Recent-visit ring buffer for pheromone / potential-field patterns.
+    let mut visited: Vec<Position3D> = Vec::new();
+    let max_visited = 32usize;
+
+    let scan_range = scan_width; // detect a victim within one scan footprint
+    let mut collisions = 0u32;
+    let mut detected = false;
+    let mut loc_error: Option<f64> = None;
+    let mut gdop_val: Option<f64> = None;
+    let mut t_detect: Option<f64> = None;
+
+    let dt = step_seconds(cfg);
+
+    for step in 0..cfg.steps {
+        // Advance each drone one waypoint under the pattern.
+        let snapshot = positions.clone();
+        for (i, pos) in positions.iter_mut().enumerate() {
+            let peers: Vec<Position3D> = snapshot
+                .iter()
+                .enumerate()
+                .filter(|(j, _)| *j != i)
+                .map(|(_, p)| *p)
+                .collect();
+            let ctx = PatternContext {
+                drone_id: NodeId(i as u32),
+                swarm_size: n,
+                current: *pos,
+                area_w,
+                area_h,
+                altitude_z,
+                scan_width_m: scan_width,
+                step: step as u64,
+                visited: &visited,
+                peers: &peers,
+            };
+            *pos = cfg.flight.next_target(&ctx);
+        }
+
+        // Mark coverage + record visits.
+        for pos in &positions {
+            let cx = ((pos.x / cell_w).floor() as i64).clamp(0, gx as i64 - 1) as usize;
+            let cy = ((pos.y / cell_h).floor() as i64).clamp(0, gy as i64 - 1) as usize;
+            cover_count[cy * gx + cx] = cover_count[cy * gx + cx].saturating_add(1);
+            visited.push(*pos);
+        }
+        if visited.len() > max_visited {
+            let drop = visited.len() - max_visited;
+            visited.drain(0..drop);
+        }
+
+        // Proximity / collision check (kinematic proxy).
+        for a in 0..positions.len() {
+            for b in (a + 1)..positions.len() {
+                let d = positions[a].distance_to(&positions[b]);
+                if d < min_sep {
+                    collisions = collisions.saturating_add(1);
+                }
+            }
+        }
+
+        // Detection: first step any victim falls within scan range of ≥1 drone,
+        // fuse a localization estimate from the contributing drones. A single
+        // contributor still yields a (noisier) estimate; GDOP is only defined
+        // for the multistatic ≥2-drone case and is `None` otherwise.
+        if !detected {
+            for victim in &cfg.victims {
+                let contributors: Vec<Position3D> = positions
+                    .iter()
+                    .filter(|p| horiz_dist(p, victim) <= scan_range)
+                    .copied()
+                    .collect();
+                if !contributors.is_empty() {
+                    let (est, g) = fuse_estimate(&contributors, victim, cfg.noise, &mut rng);
+                    loc_error = Some(horiz_dist(&est, victim));
+                    gdop_val = g; // None for a single contributor
+                    t_detect = Some((step as f64 + 1.0) * dt);
+                    detected = true;
+                    break;
+                }
+            }
+        }
+    }
+
+    // Coverage + overlap.
+    let total_cells = (gx * gy) as f64;
+    let scanned = cover_count.iter().filter(|&&c| c > 0).count() as f64;
+    let overlapped = cover_count.iter().filter(|&&c| c > 1).count() as f64;
+    let coverage_pct = if total_cells > 0.0 { scanned / total_cells } else { 0.0 };
+    let overlap_ratio = if scanned > 0.0 { overlapped / scanned } else { 0.0 };
+
+    // Episodic return: reward coverage + detection, penalize overlap + collisions.
+    let detect_bonus = if detected { 1.0 } else { 0.0 };
+    let loc_term = match loc_error {
+        Some(e) => (1.0 / (1.0 + e)).max(0.0),
+        None => 0.0,
+    };
+    let episodic_return = 100.0 * coverage_pct + 30.0 * detect_bonus + 20.0 * loc_term
+        - 10.0 * overlap_ratio
+        - 5.0 * collisions as f64;
+
+    EpisodeMetrics {
+        coverage_pct,
+        localization_error_m: loc_error,
+        gdop_at_detection: gdop_val,
+        time_to_first_detection_s: t_detect,
+        detected,
+        collisions,
+        overlap_ratio,
+        episodic_return,
+    }
+}
+
+/// Per-step wall-clock seconds, derived from scan width and drone speed.
+fn step_seconds(cfg: &EvalConfig) -> f64 {
+    let speed = cfg.config.planning.max_speed_ms.max(0.1);
+    (cfg.config.planning.csi_scan_width_m.max(1.0) / speed).max(0.1)
+}
+
+/// Horizontal (x, y) distance, ignoring altitude.
+fn horiz_dist(a: &Position3D, b: &Position3D) -> f64 {
+    (a.x - b.x).hypot(a.y - b.y)
+}
+
+/// Fuse contributing drones' per-drone victim estimates into a weighted
+/// centroid, perturbed by `(sigma, kappa)` CSI noise, and compute the GDOP of
+/// the contributing constellation.
+fn fuse_estimate(
+    contributors: &[Position3D],
+    victim: &Position3D,
+    noise: NoiseLevel,
+    rng: &mut Lcg,
+) -> (Position3D, Option<f64>) {
+    // Phase noise std from von Mises concentration: sigma_phase ≈ 1/sqrt(kappa).
+    let phase_std = 1.0 / noise.kappa.max(1e-3).sqrt();
+    let mut sx = 0.0;
+    let mut sy = 0.0;
+    let mut wsum = 0.0;
+    for c in contributors {
+        let range = horiz_dist(c, victim).max(1e-6);
+        // Each drone's estimate = true victim + range-scaled amplitude noise +
+        // bearing error from phase noise (perpendicular to LOS).
+        let amp = noise.sigma * range;
+        let nx = rng.normal() * amp;
+        let ny = rng.normal() * amp;
+        // Bearing wobble: rotate LOS unit vector by a small phase-noise angle.
+        let bearing = (victim.y - c.y).atan2(victim.x - c.x);
+        let dtheta = rng.normal() * phase_std;
+        let bx = range * (bearing + dtheta).cos();
+        let by = range * (bearing + dtheta).sin();
+        let est_x = c.x + bx + nx;
+        let est_y = c.y + by + ny;
+        // Inverse-range weighting: closer drones trusted more.
+        let w = 1.0 / range;
+        sx += est_x * w;
+        sy += est_y * w;
+        wsum += w;
+    }
+    let w = wsum.max(1e-9);
+    let est = Position3D { x: sx / w, y: sy / w, z: 0.0 };
+    let g = gdop(contributors, victim);
+    (est, g)
+}
+
+/// Run the full seed × episode matrix → per-seed strata of [`EpisodeMetrics`].
+pub fn run_matrix(cfg: &EvalConfig) -> Vec<Vec<EpisodeMetrics>> {
+    (0..cfg.seeds)
+        .map(|s| {
+            (0..cfg.episodes_per_seed)
+                .map(|e| {
+                    // Distinct deterministic seed per (seed, episode) cell.
+                    let cell_seed = (s as u64)
+                        .wrapping_mul(0x100_0000)
+                        .wrapping_add(e as u64)
+                        .wrapping_add(0xABCD);
+                    run_episode(cfg, cell_seed)
+                })
+                .collect()
+        })
+        .collect()
+}
+
+/// Standard ADR-149 noise sweep grid: cartesian product of σ × κ levels.
+pub fn default_noise_sweep() -> Vec<NoiseLevel> {
+    let sigmas = [0.02, 0.05, 0.10];
+    let kappas = [16.0, 8.0, 4.0];
+    let mut out = Vec::with_capacity(sigmas.len() * kappas.len());
+    for &sigma in &sigmas {
+        for &kappa in &kappas {
+            out.push(NoiseLevel { sigma, kappa });
+        }
+    }
+    out
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_run_episode_deterministic() {
+        let cfg = EvalConfig::sar_small(FlightPattern::PartitionedLawnmower);
+        let a = run_episode(&cfg, 12345);
+        let b = run_episode(&cfg, 12345);
+        assert_eq!(a.coverage_pct, b.coverage_pct);
+        assert_eq!(a.detected, b.detected);
+        assert_eq!(a.localization_error_m, b.localization_error_m);
+        assert_eq!(a.collisions, b.collisions);
+        assert_eq!(a.episodic_return, b.episodic_return);
+    }
+
+    #[test]
+    fn test_partitioned_beats_levy_coverage() {
+        let mut part = EvalConfig::sar_small(FlightPattern::PartitionedLawnmower);
+        part.seeds = 3;
+        part.episodes_per_seed = 5;
+        let mut levy = part.clone();
+        levy.flight = FlightPattern::LevyFlight;
+
+        let part_m = run_matrix(&part);
+        let levy_m = run_matrix(&levy);
+        let part_agg = crate::evals::metrics::AggregateMetrics::from_strata(&part_m, 1);
+        let levy_agg = crate::evals::metrics::AggregateMetrics::from_strata(&levy_m, 1);
+        assert!(
+            part_agg.coverage_iqm.point > levy_agg.coverage_iqm.point,
+            "partitioned coverage {} should beat levy {}",
+            part_agg.coverage_iqm.point,
+            levy_agg.coverage_iqm.point
+        );
+    }
+
+    #[test]
+    fn test_matrix_shape() {
+        let mut cfg = EvalConfig::sar_small(FlightPattern::Spiral);
+        cfg.seeds = 4;
+        cfg.episodes_per_seed = 6;
+        let m = run_matrix(&cfg);
+        assert_eq!(m.len(), 4);
+        assert!(m.iter().all(|s| s.len() == 6));
+    }
+
+    #[test]
+    fn test_noise_sweep_grid() {
+        let sweep = default_noise_sweep();
+        assert_eq!(sweep.len(), 9);
+    }
+}
--- a/v2/crates/ruview-swarm/src/evals/stats.rs
+++ b/v2/crates/ruview-swarm/src/evals/stats.rs
@ -0,0 +1,203 @@
+//! Hand-rolled robust statistics for the evaluation harness (Agarwal 2021).
+//!
+//! Implements the interquartile mean (IQM), a 95% stratified-bootstrap
+//! confidence interval of the IQM, and the probability-of-improvement metric —
+//! the three statistics recommended by "Deep RL at the Edge of the
+//! Statistical Precipice" (Agarwal et al., NeurIPS 2021) for reporting
+//! few-seed RL results.
+//!
+//! All randomness comes from a local linear-congruential generator (LCG) seeded
+//! explicitly, so every CI is fully reproducible — no `thread_rng`, no clock.
+
+/// Interquartile mean: mean of the middle 50% of samples (drop the bottom 25%
+/// and the top 25%). Robust to outliers in either tail.
+///
+/// Small-N behaviour: with fewer than 4 samples the trim would empty the set,
+/// so it falls back to the plain arithmetic mean. An empty slice returns 0.0.
+pub fn iqm(samples: &[f64]) -> f64 {
+    if samples.is_empty() {
+        return 0.0;
+    }
+    if samples.len() < 4 {
+        return samples.iter().sum::<f64>() / samples.len() as f64;
+    }
+    let mut sorted = samples.to_vec();
+    sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
+    let n = sorted.len();
+    let lo = n / 4; // trim bottom 25%
+    let hi = n - lo; // trim top 25% (symmetric)
+    let mid = &sorted[lo..hi];
+    if mid.is_empty() {
+        return sorted.iter().sum::<f64>() / n as f64;
+    }
+    mid.iter().sum::<f64>() / mid.len() as f64
+}
+
+/// A point estimate with its lower / upper 95% confidence bounds.
+#[derive(Debug, Clone, Copy)]
+pub struct ConfidenceInterval {
+    pub point: f64,
+    pub lo: f64,
+    pub hi: f64,
+}
+
+/// Minimal reproducible LCG (Numerical Recipes constants) yielding f64 in [0,1).
+struct Lcg(u64);
+
+impl Lcg {
+    fn new(seed: u64) -> Self {
+        // Avoid a zero state collapsing the generator.
+        Lcg(seed ^ 0x9E37_79B9_7F4A_7C15)
+    }
+    #[inline]
+    fn next_u64(&mut self) -> u64 {
+        self.0 = self
+            .0
+            .wrapping_mul(6364136223846793005)
+            .wrapping_add(1442695040888963407);
+        self.0
+    }
+    /// Uniform index in [0, n).
+    #[inline]
+    fn index(&mut self, n: usize) -> usize {
+        if n == 0 {
+            return 0;
+        }
+        (self.next_u64() >> 11) as usize % n
+    }
+}
+
+/// 95% stratified-bootstrap CI of the IQM.
+///
+/// `strata` groups samples (one inner `Vec` per stratum, e.g. per task or per
+/// seed). Each bootstrap replicate resamples WITH replacement *within* each
+/// stratum (preserving the stratum sizes), pools all resampled values, and
+/// recomputes the IQM. Repeat `n_boot` times and take the 2.5 / 97.5
+/// percentiles for the CI bounds. The `point` estimate is the IQM of the pooled
+/// original samples. Deterministic for a fixed `seed`.
+pub fn stratified_bootstrap_ci(
+    strata: &[Vec<f64>],
+    n_boot: usize,
+    seed: u64,
+) -> ConfidenceInterval {
+    let pooled: Vec<f64> = strata.iter().flatten().copied().collect();
+    let point = iqm(&pooled);
+
+    if pooled.is_empty() || n_boot == 0 {
+        return ConfidenceInterval { point, lo: point, hi: point };
+    }
+
+    let mut rng = Lcg::new(seed);
+    let mut replicates = Vec::with_capacity(n_boot);
+    let mut buf: Vec<f64> = Vec::with_capacity(pooled.len());
+
+    for _ in 0..n_boot {
+        buf.clear();
+        for stratum in strata {
+            let m = stratum.len();
+            for _ in 0..m {
+                buf.push(stratum[rng.index(m)]);
+            }
+        }
+        replicates.push(iqm(&buf));
+    }
+
+    replicates.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal));
+    let lo = percentile(&replicates, 2.5);
+    let hi = percentile(&replicates, 97.5);
+    ConfidenceInterval { point, lo, hi }
+}
+
+/// Linear-interpolated percentile of a pre-sorted slice. `p` in [0, 100].
+fn percentile(sorted: &[f64], p: f64) -> f64 {
+    if sorted.is_empty() {
+        return 0.0;
+    }
+    if sorted.len() == 1 {
+        return sorted[0];
+    }
+    let rank = (p / 100.0) * (sorted.len() as f64 - 1.0);
+    let lo = rank.floor() as usize;
+    let hi = rank.ceil() as usize;
+    if lo == hi {
+        return sorted[lo];
+    }
+    let frac = rank - lo as f64;
+    sorted[lo] * (1.0 - frac) + sorted[hi] * frac
+}
+
+/// Probability of improvement: P(a-sample > b-sample) over all pairs (Agarwal).
+///
+/// Counts each (a_i, b_j) pair where `a_i > b_j` as 1, a tie as 0.5, and
+/// normalizes by the pair count. 1.0 means `a` strictly dominates; ~0.5 means
+/// the two are statistically indistinguishable. Returns 0.5 if either is empty.
+pub fn probability_of_improvement(a: &[f64], b: &[f64]) -> f64 {
+    if a.is_empty() || b.is_empty() {
+        return 0.5;
+    }
+    let mut wins = 0.0;
+    for &ai in a {
+        for &bj in b {
+            if ai > bj {
+                wins += 1.0;
+            } else if (ai - bj).abs() < f64::EPSILON {
+                wins += 0.5;
+            }
+        }
+    }
+    wins / (a.len() as f64 * b.len() as f64)
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+
+    #[test]
+    fn test_iqm_trims_outliers() {
+        // 0..=100 plus one extreme outlier; IQM should sit near the middle (~50),
+        // not be dragged toward 1e9.
+        let mut samples: Vec<f64> = (0..=100).map(|i| i as f64).collect();
+        samples.push(1e9);
+        let v = iqm(&samples);
+        assert!(
+            (40.0..=60.0).contains(&v),
+            "IQM should be near the middle-50% mean (~50), got {v}"
+        );
+    }
+
+    #[test]
+    fn test_iqm_small() {
+        // Fewer than 4 samples → plain mean.
+        assert_eq!(iqm(&[2.0, 4.0]), 3.0);
+        assert_eq!(iqm(&[10.0]), 10.0);
+        assert_eq!(iqm(&[1.0, 2.0, 3.0]), 2.0);
+        assert_eq!(iqm(&[]), 0.0);
+    }
+
+    #[test]
+    fn test_bootstrap_ci_brackets_point() {
+        let strata = vec![
+            vec![1.0, 2.0, 3.0, 4.0, 5.0],
+            vec![2.0, 3.0, 4.0, 5.0, 6.0],
+        ];
+        let ci = stratified_bootstrap_ci(&strata, 500, 42);
+        assert!(ci.lo <= ci.point, "lo ≤ point: {} ≤ {}", ci.lo, ci.point);
+        assert!(ci.point <= ci.hi, "point ≤ hi: {} ≤ {}", ci.point, ci.hi);
+        // Deterministic: same seed → identical interval.
+        let ci2 = stratified_bootstrap_ci(&strata, 500, 42);
+        assert_eq!(ci.point, ci2.point);
+        assert_eq!(ci.lo, ci2.lo);
+        assert_eq!(ci.hi, ci2.hi);
+    }
+
+    #[test]
+    fn test_prob_improvement_obvious() {
+        assert_eq!(
+            probability_of_improvement(&[10.0, 10.0, 10.0], &[0.0, 0.0, 0.0]),
+            1.0
+        );
+        // Identical samples → all ties → 0.5.
+        let poi = probability_of_improvement(&[5.0, 5.0], &[5.0, 5.0]);
+        assert!((poi - 0.5).abs() < 1e-9, "symmetric ties → ~0.5, got {poi}");
+    }
+}
--- a/v2/crates/ruview-swarm/src/lib.rs
+++ b/v2/crates/ruview-swarm/src/lib.rs
@ -13,6 +13,7 @@ pub mod security;
 pub mod failsafe;
 pub mod config;
 pub mod demo;
+pub mod evals;
 pub mod integration;
 pub mod bench_support;
 pub mod orchestrator;