diff --git a/docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md b/docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md new file mode 100644 index 00000000..204e1456 --- /dev/null +++ b/docs/adr/ADR-149-swarm-benchmarking-evaluation-methodology.md @@ -0,0 +1,257 @@ +# ADR-149: Drone Swarm Benchmarking & Evaluation Methodology — Metrics, Leaderboards, and Statistical Rigor + +| Field | Value | +|------------|-----------------------------------------------------------------------------------------| +| Status | Accepted (peer-reviewed 2026-05-30) | +| Date | 2026-05-30 | +| Deciders | ruv | +| Relates to | ADR-148 (ruview-swarm), ADR-147 (OccWorld), ADR-146 (RF encoder), ADR-028 (witness) | + +> Companion to ADR-148. ADR-148 shipped the swarm and 5 criterion micro-benchmarks +> plus a `SotaComparison` against Wi2SAR. This ADR defines **how we evaluate the swarm +> rigorously** — what metrics, what statistics, what baselines, and an honest account +> of which external leaderboards do and do not apply. + +--- + +## 1. Context + +ADR-148's `ruview-swarm` reports performance via five `criterion` micro-benchmarks and a +single `SotaComparison` (localization 1.732 m vs Wi2SAR 5 m; coverage ~223 s vs 810 s). +These numbers are **internally valid but insufficient as scientific claims**: + +- The criterion figures (3.3 µs MARL inference, 43 µs RRT-APF, 54 ns fusion, 248 µs PPO + step) measure **wall-clock latency**, not policy quality or coverage/localization quality. +- The 1.732 m localization comes from a **single synthetic geometry** (3 drones at 120° + around a known point), not a distribution of victim positions under realistic noise. +- The 223 s coverage is an **analytic estimate** (`estimate_coverage_time_secs()`), not an + episode rollout. +- All numbers are **single-run point estimates**. The MARL reproducibility literature + (Henderson 2018; Agarwal 2021; Gorsane 2022) shows single/few-seed point estimates + routinely flip algorithm rankings and overstate gains. + +We need a defined, reproducible evaluation methodology before any "beats SOTA" claim can +survive external review, and an honest position on external leaderboards. + +--- + +## 2. Decision + +Adopt a two-tier evaluation methodology: + +1. **Micro-benchmarks (criterion)** — keep for compute-latency regression gating only. + Explicitly labeled as latency, never as quality. +2. **Domain evaluation harness** — a seeded, multi-run, statistically-reported harness + producing SAR metrics (localization CEP, coverage, detection rate) and MARL metrics + (IQM return, probability-of-improvement) over **≥10 seeds with 95% stratified-bootstrap + confidence intervals**, against **≥3 baselines**, following the Agarwal/Gorsane standard. + +Do **not** claim leaderboard standing — no public leaderboard accepts drone-swarm CSI-SAR +submissions. Comparisons to Wi2SAR are **paper-to-paper**, labeled as such, acknowledging +the sensing-modality difference (RSS bearing vs CSI multi-view fusion). + +--- + +## 3. External Leaderboard Landscape — Honest Assessment + +**There is no public, externally-administered leaderboard that accepts a drone-swarm, +CSI-based, multi-view SAR system.** This is a research niche; comparison is paper-to-paper. +The adjacent options and their fit: + +| Benchmark / Leaderboard | Domain | Live submission? | Fit for ruview-swarm | +|-------------------------|--------|------------------|----------------------| +| **Wi2SAR** (arxiv 2604.09115) | Drone WiFi SAR | No (paper) | **Direct baseline** — paper-to-paper only; RSS bearing ≠ CSI fusion | +| **MARL4DRP** (Springer 2023) | Drone routing MARL | No | Closest drone-MARL benchmark; would need a routing→coverage adapter | +| **CSI-Bench** (NeurIPS 2025) | Static WiFi sensing | Splits + paper baselines | Adjacent (localization task) but no moving-sensor/multi-view fusion | +| **SMAC / SMACv2** | StarCraft cooperative MARL | No live LB | Structural analogy (CTDE) only; combat task, not coverage | +| **PettingZoo MPE** (Simple Spread) | 2D cooperative particles | No | Cheap MARL **correctness check**, no physics/CSI | +| **Melting Pot** | Social-dynamics MARL | Closed (NeurIPS '24) | Not applicable | +| **MAMuJoCo / Hanabi / GRF / Overcooked** | Various cooperative MARL | No live LB | Not applicable | +| **OmniDrones / gym-pybullet-drones / Pegasus** | Drone-control sim platforms | No (platforms) | **Training infrastructure**, not leaderboards; no CSI layer | + +**Conclusion:** We will (a) keep Wi2SAR as the cited paper baseline, (b) optionally build a +MARL4DRP/MPE adapter to post a recognized cooperative-MARL number (tangential to SAR), and +(c) **not** represent any internal number as a leaderboard placement. + +--- + +## 4. Evaluation Metrics + +### 4.1 SAR Domain Metrics (primary — comparable to Wi2SAR) + +| Metric | Definition | Reporting | +|--------|-----------|-----------| +| Localization CEP50 | Median horizontal error, fused victim position vs ground truth | m, 95% CI | +| Localization CEP95 | 95th-percentile horizontal error | m | +| **GDOP** | Geometric Dilution of Precision of the contributing-drone constellation at detection time | dimensionless (tracked per detection) | +| Coverage rate @ T | Fraction of area scanned ≥1× within T=240 s | %, 95% CI | +| Coverage time to 95% | Time to scan 95% of bounded area | s, mean ± CI | +| Time-to-first-detection | Mission start → first confident detection (conf > 0.85) | s, 95% CI | +| Detection rate | P(detected \| victim present) per mission | %, 95% CI | +| False-alarm rate | P(confident detection \| no victim) | %, 95% CI | +| Collision rate | Collisions (d < 1.5 m) per mission | count/mission | +| Overlap ratio | Fraction of path re-covering scanned cells | % | + +### 4.2 MARL Policy-Quality Metrics + +| Metric | Definition | +|--------|-----------| +| IQM episodic return | Interquartile mean over 10 seeds × 50 eval episodes (Agarwal 2021) | +| Probability of improvement | P(MAPPO return > IPPO return) on a random episode | +| Optimality gap | Expected gap to a defined reference performance | +| Performance profile | Fraction of (seed, episode) with localization error < τ, plotted vs τ ∈ [0,10] m | +| Sample efficiency | Return vs training steps (curve, not point) | + +### 4.3 Micro-benchmarks (criterion — latency only) + +Retained from ADR-148, **labeled as compute latency, not quality**: +`marl_actor_inference` 3.3 µs · `rrt_apf_100iter` 43 µs · `multiview_fusion_3drones` 54 ns · +`demo_coverage_estimate` 100 ps · `ppo_update_64transitions` 248 µs. Purpose: prove the +control loop has no compute bottleneck (all ≪ the 10 ms / 100 Hz budget) and gate +performance regressions. They are **not** evidence of policy or localization quality. + +--- + +## 5. Statistical Protocol (Agarwal 2021 / Gorsane 2022) + +| Requirement | Standard adopted | +|-------------|------------------| +| Seeds per condition | **≥10** training runs from distinct seeds | +| Evaluation episodes | 50 fixed, versioned episodes per trained policy (10 victim layouts × 5 CSI-noise levels) | +| Aggregate metric | **IQM** (not mean, not median) + performance profiles | +| Confidence intervals | **95% stratified bootstrap**, 1,000 resamples | +| Baselines (≥3) | Random walk (lower bound), Boustrophedon+manual-triangulation (heuristic), IPPO (no shared critic) | +| Reproducibility | Versioned YAML config (drone count, area, victims, CSI σ amplitude / κ phase, wind, packet loss) + all seeds committed with results | + +Rationale: Henderson et al. (2018) found ≤5-seed point estimates flip rankings; Agarwal et +al. (2021, NeurIPS Outstanding Paper) show IQM needs ~10 runs for the statistical power that +the median needs ~200 runs for; Gorsane et al. (2022) made ≥10 seeds + IQM + stratified CIs +the cooperative-MARL standard. `rliable` (google-research/rliable) is the reference impl. + +--- + +## 6. Reproducibility Harness (`evals/`) + +A new evaluation harness (separate from criterion micro-benchmarks): + +1. **Seeded episodes** — every episode, noise perturbation, and training run seeded from a + versioned config; seeds committed with results (no `Date.now()`/unseeded RNG). +2. **Per-episode logging** — coverage %, localization error, GDOP, time-to-first-detection, + collisions, detection binary → JSONL (reuses the ADR-148 telemetry schema). +3. **Aggregation** — IQM ± 95% stratified-bootstrap CI across the 10-seed × 50-episode matrix. +4. **Baseline sweep** — random / boustrophedon-heuristic / IPPO / MAPPO, so + probability-of-improvement and performance profiles are computable. +5. **Output** — committed `evals/RESULTS.md`: a reproducible internal leaderboard ranking + our 6 flight patterns × learning patterns on the SAR metrics, plus the Wi2SAR paper row. + +This `RESULTS.md` is the **real, defensible "leaderboard" for this system** — patterns ranked +against each other and the cited baseline, reproducibly, with CIs. + +### 6.1 Dual-stage pipeline (compute-cost mitigation) + +The full matrix is **10 seeds × 50 episodes × ≥4 conditions = ≥2,000 rollouts per policy**. +Running each rollout against the OccWorld 3D prior (ADR-147, ~375 ms/inference) would melt +the L4 / RTX 5080 budget. Split evaluation into two stages: + +- **Stage 1 — Kinematic (fast, full matrix).** Stripped vector environment; OccWorld paths + pre-cached or treated as static analytical volumes. Produces episodic **return, IQM, + sample-efficiency curves, coverage %, GDOP, localization error** over the full 10-seed matrix. +- **Stage 2 — High-fidelity physics (sub-sampled).** Take the **3 median seeds** (by Stage-1 + IQM) into Gazebo + PX4 SITL with full CSI phase/amplitude noise. Extracts **false-alarm + rate** and **collision rate** under realistic dynamics (heading-rate limits, APF repulsion, + motor response) that the kinematic sim omits. + +Stage 1 is CI-runnable today; Stage 2 requires the Gazebo/PX4 SITL bring-up (follow-on). + +### 6.2 Noise sweep (coherence-gate threshold) + +The config generator systematically varies the two CSI noise parameters: +- **σ** — Gaussian amplitude noise (CSI magnitude) +- **κ** — von Mises phase concentration (lower κ = noisier phase) + +Sweeping (σ, κ) isolates the exact environmental threshold where `CrossViewpointAttention` +(ADR-016) drops out of its coherence gate (`coherence_gate.rs` Accept → PredictOnly/Reject, +ADR-135). This finds the operating envelope, not just a single-point accuracy. + +### 6.3 GDOP tracking + +Localization accuracy is meaningless without the constellation geometry that produced it. +The harness records **GDOP** per detection: 3 drones in a ~120° constellation give the +√3 ≈ 1.73× CRLB improvement; 3 **collinear** drones degrade toward the single-view +Cramer-Rao limit (~2.9 m). Reporting localization error **stratified by GDOP band** prevents +the headline number from being a best-case geometric artifact. + +--- + +## 7. Evidence Grading of Current ADR-148 Numbers + +| Claim | Grade | Why | +|-------|-------|-----| +| criterion latencies (3.3 µs / 43 µs / 54 ns / 248 µs) | **High** | Deterministic compute, hardware-specific, reproducible | +| Wi2SAR baseline (5 m, 160k m²/13.5 min) | **High** | Published field trial, open source | +| 1.732 m 3-view localization | **Low–Medium** | Single synthetic geometry; no noise distribution; CRLB predicts ~2.9 m for N=3 | +| 223 s 4-drone coverage | **Low** | Analytic estimate, not an episode rollout | +| "beats SOTA" | **Directional only** | Valid as paper-to-paper direction; not leaderboard, not multi-seed | + +The √N multi-view scaling claim is theoretically sound (CRLB: σ ∝ 1/√(N·SNR); N=3 → √3 ≈ +1.73× improvement), but the measured 1.732 m must be reproduced over a victim-position and +noise distribution before it is defensible. + +--- + +## 8. Consequences + +### Positive +- Converts scattered numbers into a reproducible, statistically-honest evaluation. +- The `RESULTS.md` internal leaderboard ranks the 6 flight × 4 learning patterns fairly. +- Aligns with the recognized MARL evaluation standard (IQM + stratified CIs + ≥10 seeds). +- Honest external-leaderboard position avoids overclaiming. + +### Costs / Risks +- ≥10 seeds × 50 episodes × N patterns × N baselines is a real compute cost — this is where + the ADR-148 GCP L4 / local RTX 5080 training budget is actually spent. +- Requires the MARL policy to be **trained to convergence** first (the ADR-148 5-episode CPU + run shows decreasing value_loss, not convergence). +- Coverage/localization must move from analytic estimate / synthetic geometry to **episode + rollouts under realistic CSI noise** before headline numbers are republished. + +### Open issues → follow-on work +1. Train MAPPO/IPPO to convergence (M4 follow-on) before running the eval harness. +2. Build the seeded `evals/` harness + `RESULTS.md` generator. +3. Optional: MARL4DRP or MPE Simple-Spread adapter for a recognized cooperative-MARL number. +4. Re-state ADR-148 §14 headline numbers with CIs once the harness has run. + +--- + +## 9. Research Notes & References + +Compiled by `ruflo-goals:deep-researcher` (2026-05-30). Full landscape in the agent record. + +**MARL evaluation rigor** +- Henderson et al., "Deep RL That Matters", arxiv 1709.06560 — ≤5-seed estimates flip rankings +- Agarwal et al., "Deep RL at the Edge of the Statistical Precipice", NeurIPS 2021, arxiv 2108.13264 — IQM, performance profiles, stratified bootstrap; `rliable` +- Gorsane et al., "Standardised Evaluation Protocol for Cooperative MARL", NeurIPS 2022, arxiv 2209.10485 — ≥10 seeds + IQM standard +- BenchMARL, arxiv 2312.01472 — operationalizes the above + +**Cooperative-MARL benchmarks** +- SMACv2, arxiv 2212.07489 · PettingZoo MPE (Farama) · Melting Pot (DeepMind, NeurIPS 2024 contest) · MAMuJoCo (Gymnasium-Robotics) · MARL4DRP, Springer 2023 (closest drone-MARL) + +**Drone-sim platforms** +- gym-pybullet-drones, arxiv 2103.02142 · OmniDrones, IEEE RA-L 2024 · Pegasus, arxiv 2307.05263 · Flightmare (IROS 2021) · AirSim (discontinued 2022) · Crazyswarm2 + +**SAR / coverage / CSI sensing** +- Wi2SAR, arxiv 2604.09115 (direct baseline: 5 m, 160k m²/13.5 min, 18.4° median AoA) +- CSI-Bench, NeurIPS 2025, arxiv 2505.21866 (461 h WiFi sensing, localization task) +- Coverage path planning, PMC9571681 (boustrophedon ~5% faster than spiral) +- Bio-inspired SAR, Nature s41598-025-33223-z (PSO > Levy/ACO on exploration score) +- CRLB for CSI localization, IEEE 8110647 (σ ∝ 1/√(N·SNR)) + +**Tooling** +- criterion.rs known limitations — wall-clock only, not algorithmic quality +- rliable, github.com/google-research/rliable + +--- + +*ADR authored with research support from `ruflo-goals:deep-researcher` (2026-05-30). + Companion to ADR-148. Defines the evaluation methodology that the ADR-148 headline + numbers must satisfy before being republished as defensible claims.* diff --git a/v2/crates/ruview-swarm/Cargo.toml b/v2/crates/ruview-swarm/Cargo.toml index 0e0436a4..0fbab7d6 100644 --- a/v2/crates/ruview-swarm/Cargo.toml +++ b/v2/crates/ruview-swarm/Cargo.toml @@ -78,3 +78,7 @@ harness = false [[bin]] name = "train_marl" required-features = ["train"] + +# ADR-149 Stage-1 evaluation CLI — pure Rust, no special feature needed. +[[bin]] +name = "eval_swarm" diff --git a/v2/crates/ruview-swarm/evals/.gitkeep b/v2/crates/ruview-swarm/evals/.gitkeep new file mode 100644 index 00000000..0a53c5df --- /dev/null +++ b/v2/crates/ruview-swarm/evals/.gitkeep @@ -0,0 +1,2 @@ +# ADR-149 evaluation outputs +RESULTS.md is generated by the `eval_swarm` binary. diff --git a/v2/crates/ruview-swarm/evals/RESULTS.md b/v2/crates/ruview-swarm/evals/RESULTS.md new file mode 100644 index 00000000..10047bcb --- /dev/null +++ b/v2/crates/ruview-swarm/evals/RESULTS.md @@ -0,0 +1,26 @@ +# ruview-swarm Evaluation Results (ADR-149 Stage 1, kinematic) + +Statistically-rigorous evaluation harness: seeded multi-run rollouts with IQM + 95% stratified-bootstrap confidence intervals (Agarwal et al., NeurIPS 2021). + +## Run configuration + +- **Stage**: 1 (kinematic, self-contained, deterministic per seed) +- **Episodes per pattern**: 100 (seed × episode matrix) +- **CI method**: 95% stratified bootstrap of the IQM, stratified by seed +- **GDOP**: 2-D geometric dilution of precision at first detection + +> **Stage 2 pending**: high-fidelity Gazebo/PX4 SITL evaluation (false-alarm rate, real collision rate on the median seeds) is a follow-on — see ADR-149 §6.1. The collision figures below are a kinematic min-separation proxy, not SITL physics. + +## Flight-pattern leaderboard + +| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | Detection rate | Mean GDOP | +|----------------|-----------------------|-------------------------------|----------------|-----------| +| partitioned_lawnmower | 1.000 [1.000, 1.000] | 7.022 [5.669, 8.379] | 100.0% | 0.000 | +| pheromone | 0.662 [0.652, 0.671] | 4.110 [3.346, 5.141] | 95.0% | 1.598 | +| levy_flight | 0.490 [0.489, 0.491] | 3.523 [2.897, 4.160] | 100.0% | 0.000 | +| boustrophedon | 0.370 [0.370, 0.370] | 2.740 [2.357, 3.207] | 100.0% | 0.000 | +| spiral | 0.336 [0.336, 0.336] | 3.082 [2.678, 3.568] | 100.0% | 0.000 | +| potential_field | 0.254 [0.252, 0.256] | 4.343 [3.489, 5.265] | 100.0% | 0.000 | +| _Wi2SAR (paper baseline)_ | _n/a_ | _5.0 (paper)_ | _n/a_ | _n/a_ | + +_Wi2SAR row is the published single-drone localization figure (arxiv 2604.09115), shown paper-to-paper for reference only — it was not re-run through this kinematic harness._ diff --git a/v2/crates/ruview-swarm/src/bin/eval_swarm.rs b/v2/crates/ruview-swarm/src/bin/eval_swarm.rs new file mode 100644 index 00000000..ad39b54c --- /dev/null +++ b/v2/crates/ruview-swarm/src/bin/eval_swarm.rs @@ -0,0 +1,104 @@ +//! ADR-149 Stage-1 evaluation CLI. +//! +//! Runs the kinematic eval matrix over every flight pattern (default) and +//! writes a ranked `RESULTS.md` leaderboard. Pure Rust — no special feature +//! flag required, so it builds and runs in default CI. +//! +//! Defaults are intentionally small (10 seeds × 10 episodes) so the run is fast. +//! The full ADR-149 reporting configuration is 10 seeds × 50 episodes — pass +//! `--seeds 10 --episodes 50` for the publication run. +//! +//! ```text +//! cargo run -p ruview-swarm --bin eval_swarm -- \ +//! --seeds 10 --episodes 10 --out crates/ruview-swarm/evals/RESULTS.md +//! ``` + +use std::path::PathBuf; + +use ruview_swarm::evals::metrics::AggregateMetrics; +use ruview_swarm::evals::report::render_results_md; +use ruview_swarm::evals::runner::{run_matrix, EvalConfig}; +use ruview_swarm::planning::patterns::FlightPattern; + +fn main() { + let args: Vec = std::env::args().collect(); + let mut seeds = 10usize; + let mut episodes = 10usize; + let mut out = PathBuf::from("crates/ruview-swarm/evals/RESULTS.md"); + + let mut i = 1; + while i < args.len() { + match args[i].as_str() { + "--seeds" => { + i += 1; + seeds = args.get(i).and_then(|s| s.parse().ok()).unwrap_or(seeds); + } + "--episodes" => { + i += 1; + episodes = args.get(i).and_then(|s| s.parse().ok()).unwrap_or(episodes); + } + "--out" => { + i += 1; + if let Some(p) = args.get(i) { + out = PathBuf::from(p); + } + } + "--help" | "-h" => { + eprintln!( + "eval_swarm — ADR-149 Stage-1 kinematic evaluator\n\ + Usage: eval_swarm [--seeds N] [--episodes M] [--out PATH]\n\ + Defaults: --seeds 10 --episodes 10 --out crates/ruview-swarm/evals/RESULTS.md" + ); + return; + } + other => { + eprintln!("warning: ignoring unknown argument '{other}'"); + } + } + i += 1; + } + + eprintln!( + "Running ADR-149 Stage-1 eval: {seeds} seeds × {episodes} episodes \ + over {} flight patterns...", + FlightPattern::all().len() + ); + + let mut rows: Vec<(String, AggregateMetrics)> = Vec::new(); + for (idx, pattern) in FlightPattern::all().into_iter().enumerate() { + let mut cfg = EvalConfig::sar_small(pattern); + cfg.seeds = seeds; + cfg.episodes_per_seed = episodes; + let matrix = run_matrix(&cfg); + let agg = AggregateMetrics::from_strata(&matrix, 0x0149 ^ idx as u64); + eprintln!( + " {}: coverage IQM {:.3}, detection {:.0}%", + pattern.name(), + agg.coverage_iqm.point, + agg.detection_rate * 100.0 + ); + rows.push((pattern.name().to_string(), agg)); + } + + // Rank by descending coverage point estimate. + rows.sort_by(|a, b| { + b.1.coverage_iqm + .point + .partial_cmp(&a.1.coverage_iqm.point) + .unwrap_or(std::cmp::Ordering::Equal) + }); + + let md = render_results_md(&rows); + + if let Some(parent) = out.parent() { + if let Err(e) = std::fs::create_dir_all(parent) { + eprintln!("error: could not create {}: {e}", parent.display()); + std::process::exit(1); + } + } + if let Err(e) = std::fs::write(&out, &md) { + eprintln!("error: could not write {}: {e}", out.display()); + std::process::exit(1); + } + eprintln!("Wrote {} ({} bytes).", out.display(), md.len()); +} diff --git a/v2/crates/ruview-swarm/src/evals/gdop.rs b/v2/crates/ruview-swarm/src/evals/gdop.rs new file mode 100644 index 00000000..200c3443 --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/gdop.rs @@ -0,0 +1,118 @@ +//! Geometric Dilution of Precision (GDOP) for a constellation of observers. +//! +//! GDOP quantifies how observer geometry amplifies measurement error into +//! position-estimate error. Build the geometry matrix `H` of unit +//! line-of-sight (LOS) vectors from each observer to the target, form the +//! normal matrix `HᵀH`, invert it, and take `GDOP = sqrt(trace((HᵀH)⁻¹))`. +//! +//! For the 2-D (x, y) localization case `H` is `N×2` and `HᵀH` is `2×2`, so a +//! closed-form 2×2 inverse suffices (no linear-algebra dependency needed). +//! +//! Lower GDOP = better geometry: observers spread ~120° apart around the target +//! give low GDOP; (near-)collinear observers give a singular/ill-conditioned +//! `HᵀH` → GDOP → ∞. + +use crate::types::Position3D; + +/// Geometric Dilution of Precision (2-D) for `observers` viewing a `target`. +/// +/// Lower = better geometry. A ~120° constellation → low GDOP; collinear → very +/// large (→∞). Returns `None` if fewer than two observers, if any observer is +/// coincident with the target (undefined LOS), or if the geometry is singular +/// / degenerate (collinear) so `HᵀH` is not invertible. +pub fn gdop(observers: &[Position3D], target: &Position3D) -> Option { + if observers.len() < 2 { + return None; + } + + // Accumulate HᵀH directly (2×2 symmetric) from unit LOS vectors. + // Row i of H is the unit vector from target → observer i in (x, y). + let mut a = 0.0; // sum ux*ux + let mut b = 0.0; // sum ux*uy + let mut d = 0.0; // sum uy*uy + + for obs in observers { + let dx = obs.x - target.x; + let dy = obs.y - target.y; + let range = (dx * dx + dy * dy).sqrt(); + if range < 1e-9 { + // Observer on top of the target → LOS undefined. + return None; + } + let ux = dx / range; + let uy = dy / range; + a += ux * ux; + b += ux * uy; + d += uy * uy; + } + + // Determinant of HᵀH = [[a, b], [b, d]]. + let det = a * d - b * b; + if det.abs() < 1e-12 { + // Singular: observers are (near-)collinear with the target. + return None; + } + + // (HᵀH)⁻¹ = 1/det * [[d, -b], [-b, a]]; trace = (d + a) / det. + let trace_inv = (a + d) / det; + if trace_inv <= 0.0 || !trace_inv.is_finite() { + return None; + } + Some(trace_inv.sqrt()) +} + +#[cfg(test)] +mod tests { + use super::*; + + fn p(x: f64, y: f64) -> Position3D { + Position3D { x, y, z: 0.0 } + } + + #[test] + fn test_triangle_lower_than_collinear() { + let target = p(0.0, 0.0); + // Three observers at 120° around the target, radius 10. + let r = 10.0; + let triangle = [ + p(r * 0.0_f64.cos(), r * 0.0_f64.sin()), + p( + r * (2.0 * std::f64::consts::PI / 3.0).cos(), + r * (2.0 * std::f64::consts::PI / 3.0).sin(), + ), + p( + r * (4.0 * std::f64::consts::PI / 3.0).cos(), + r * (4.0 * std::f64::consts::PI / 3.0).sin(), + ), + ]; + // Three nearly-collinear observers (tiny y perturbation to stay invertible). + let near_collinear = [p(5.0, 0.01), p(10.0, 0.0), p(15.0, 0.01)]; + + let tri = gdop(&triangle, &target).expect("triangle finite GDOP"); + let col = gdop(&near_collinear, &target).expect("near-collinear finite GDOP"); + assert!(tri.is_finite(), "triangle GDOP must be finite: {tri}"); + assert!( + tri < col, + "120° constellation should have lower GDOP than near-collinear: tri={tri}, col={col}" + ); + } + + #[test] + fn test_collinear_degenerate() { + let target = p(0.0, 0.0); + // Perfectly collinear observers along +x → singular HᵀH. + let collinear = [p(5.0, 0.0), p(10.0, 0.0), p(20.0, 0.0)]; + let g = gdop(&collinear, &target); + assert!( + g.is_none() || g.unwrap() > 1e6, + "perfectly collinear geometry must be None or huge, got {g:?}" + ); + } + + #[test] + fn test_single_observer_none() { + let target = p(0.0, 0.0); + assert!(gdop(&[p(5.0, 5.0)], &target).is_none()); + assert!(gdop(&[], &target).is_none()); + } +} diff --git a/v2/crates/ruview-swarm/src/evals/metrics.rs b/v2/crates/ruview-swarm/src/evals/metrics.rs new file mode 100644 index 00000000..c1669911 --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/metrics.rs @@ -0,0 +1,150 @@ +//! Per-episode and aggregate SAR + MARL metrics (ADR-149 Stage 1). + +use crate::evals::stats::{stratified_bootstrap_ci, ConfidenceInterval}; + +/// Per-episode SAR metrics (Stage 1 kinematic). +#[derive(Debug, Clone)] +pub struct EpisodeMetrics { + /// Fraction of the mission area scanned at least once, in [0, 1]. + pub coverage_pct: f64, + /// Localization error (m) of the fused victim estimate; `None` if no detection. + pub localization_error_m: Option, + /// GDOP of the contributing-drone constellation at detection; `None` if none. + pub gdop_at_detection: Option, + /// Mission-elapsed seconds to first detection; `None` if no detection. + pub time_to_first_detection_s: Option, + /// Whether at least one victim was detected this episode. + pub detected: bool, + /// Count of inter-drone proximity violations (kinematic proxy for collisions). + pub collisions: u32, + /// Fraction of scanned area covered by more than one drone, in [0, 1]. + pub overlap_ratio: f64, + /// Scalar episodic return (reward-like coverage/detection objective). + pub episodic_return: f64, +} + +/// Aggregate over a seed × episode matrix with IQM + 95% bootstrap CIs. +#[derive(Debug, Clone)] +pub struct AggregateMetrics { + pub coverage_iqm: ConfidenceInterval, + /// IQM over detected episodes only (undetected episodes carry no error). + pub localization_iqm: ConfidenceInterval, + pub detection_rate: f64, + pub mean_gdop: f64, + pub return_iqm: ConfidenceInterval, + pub n_episodes: usize, +} + +impl AggregateMetrics { + /// Aggregate a seed-stratified matrix of episodes. Each inner `Vec` is one + /// seed's episodes; bootstrap resampling is stratified by seed so the CI + /// reflects between-seed variance (the dominant source per ADR-149). + pub fn from_strata(per_seed: &[Vec], boot_seed: u64) -> Self { + const N_BOOT: usize = 1000; + + let coverage_strata: Vec> = per_seed + .iter() + .map(|s| s.iter().map(|e| e.coverage_pct).collect()) + .collect(); + let return_strata: Vec> = per_seed + .iter() + .map(|s| s.iter().map(|e| e.episodic_return).collect()) + .collect(); + // Localization: only detected episodes contribute. Keep stratification + // by seed but drop empty strata so the bootstrap doesn't degenerate. + let loc_strata: Vec> = per_seed + .iter() + .map(|s| { + s.iter() + .filter_map(|e| e.localization_error_m) + .collect::>() + }) + .filter(|v: &Vec| !v.is_empty()) + .collect(); + + let mut detected = 0usize; + let mut total = 0usize; + let mut gdop_sum = 0.0; + let mut gdop_n = 0usize; + for seed in per_seed { + for e in seed { + total += 1; + if e.detected { + detected += 1; + } + if let Some(g) = e.gdop_at_detection { + if g.is_finite() { + gdop_sum += g; + gdop_n += 1; + } + } + } + } + + let detection_rate = if total == 0 { + 0.0 + } else { + detected as f64 / total as f64 + }; + let mean_gdop = if gdop_n == 0 { + 0.0 + } else { + gdop_sum / gdop_n as f64 + }; + + AggregateMetrics { + coverage_iqm: stratified_bootstrap_ci(&coverage_strata, N_BOOT, boot_seed), + localization_iqm: stratified_bootstrap_ci( + &loc_strata, + N_BOOT, + boot_seed.wrapping_add(1), + ), + detection_rate, + mean_gdop, + return_iqm: stratified_bootstrap_ci( + &return_strata, + N_BOOT, + boot_seed.wrapping_add(2), + ), + n_episodes: total, + } + } +} + +#[cfg(test)] +mod tests { + use super::*; + + fn ep(cov: f64, loc: Option, ret: f64, detected: bool) -> EpisodeMetrics { + EpisodeMetrics { + coverage_pct: cov, + localization_error_m: loc, + gdop_at_detection: if detected { Some(2.0) } else { None }, + time_to_first_detection_s: if detected { Some(10.0) } else { None }, + detected, + collisions: 0, + overlap_ratio: 0.1, + episodic_return: ret, + } + } + + #[test] + fn test_aggregate_detection_rate_and_shape() { + let per_seed = vec![ + vec![ + ep(0.8, Some(1.5), 80.0, true), + ep(0.7, None, 70.0, false), + ], + vec![ + ep(0.9, Some(2.0), 90.0, true), + ep(0.85, Some(1.0), 85.0, true), + ], + ]; + let agg = AggregateMetrics::from_strata(&per_seed, 7); + assert_eq!(agg.n_episodes, 4); + assert!((agg.detection_rate - 0.75).abs() < 1e-9); + assert!(agg.coverage_iqm.lo <= agg.coverage_iqm.point); + assert!(agg.coverage_iqm.point <= agg.coverage_iqm.hi); + assert!(agg.mean_gdop > 0.0); + } +} diff --git a/v2/crates/ruview-swarm/src/evals/mod.rs b/v2/crates/ruview-swarm/src/evals/mod.rs new file mode 100644 index 00000000..a8b72ec9 --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/mod.rs @@ -0,0 +1,19 @@ +//! ADR-149 statistically-rigorous evaluation harness (Stage 1, kinematic). +//! +//! Produces SAR + MARL metrics over a seeded N-seed × M-episode matrix with +//! IQM + 95% stratified-bootstrap CIs, a (sigma, kappa) CSI-noise sweep, and +//! GDOP-stratified localization error. Generates evals/RESULTS.md. +//! +//! Stage 2 (Gazebo/PX4 SITL high-fidelity, false-alarm + collision rate on the +//! median seeds) is a follow-on — see ADR-149 §6.1. +pub mod gdop; +pub mod stats; +pub mod metrics; +pub mod runner; +pub mod report; + +pub use gdop::gdop; +pub use stats::{iqm, stratified_bootstrap_ci, ConfidenceInterval}; +pub use metrics::{EpisodeMetrics, AggregateMetrics}; +pub use runner::{EvalConfig, NoiseLevel, run_matrix}; +pub use report::render_results_md; diff --git a/v2/crates/ruview-swarm/src/evals/report.rs b/v2/crates/ruview-swarm/src/evals/report.rs new file mode 100644 index 00000000..31bea04b --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/report.rs @@ -0,0 +1,120 @@ +//! RESULTS.md leaderboard generator (ADR-149 Stage 1). + +use crate::evals::metrics::AggregateMetrics; +use crate::evals::stats::ConfidenceInterval; + +/// Wi2SAR published localization baseline (paper-to-paper), metres. +const WI2SAR_LOCALIZATION_M: f64 = 5.0; + +/// Format a CI as `point [lo, hi]` with two decimals. +fn fmt_ci(ci: &ConfidenceInterval) -> String { + format!("{:.3} [{:.3}, {:.3}]", ci.point, ci.lo, ci.hi) +} + +/// Render a markdown leaderboard: one row per flight pattern with coverage +/// IQM±CI, localization IQM±CI, detection rate, and mean GDOP — plus the +/// Wi2SAR paper baseline row clearly labelled paper-to-paper. +/// +/// `rows` is `(pattern_name, aggregate)`; rows are emitted in the order given, +/// so callers should pre-sort (e.g. by descending coverage point estimate). +pub fn render_results_md(rows: &[(String, AggregateMetrics)]) -> String { + let mut s = String::new(); + s.push_str("# ruview-swarm Evaluation Results (ADR-149 Stage 1, kinematic)\n\n"); + s.push_str( + "Statistically-rigorous evaluation harness: seeded multi-run rollouts with \ + IQM + 95% stratified-bootstrap confidence intervals (Agarwal et al., \ + NeurIPS 2021).\n\n", + ); + + // Run configuration header. + let (n_episodes, n_seeds) = rows + .first() + .map(|(_, a)| { + let n = a.n_episodes; + // Episodes-per-seed isn't stored; report total + leave seed split to caller note. + (n, 0usize) + }) + .unwrap_or((0, 0)); + s.push_str("## Run configuration\n\n"); + s.push_str(&format!( + "- **Stage**: 1 (kinematic, self-contained, deterministic per seed)\n\ + - **Episodes per pattern**: {n_episodes} (seed × episode matrix)\n\ + - **CI method**: 95% stratified bootstrap of the IQM, stratified by seed\n\ + - **GDOP**: 2-D geometric dilution of precision at first detection\n" + )); + let _ = n_seeds; + s.push_str( + "\n> **Stage 2 pending**: high-fidelity Gazebo/PX4 SITL evaluation \ + (false-alarm rate, real collision rate on the median seeds) is a \ + follow-on — see ADR-149 §6.1. The collision figures below are a \ + kinematic min-separation proxy, not SITL physics.\n\n", + ); + + // Leaderboard table. + s.push_str("## Flight-pattern leaderboard\n\n"); + s.push_str( + "| Flight pattern | Coverage IQM [95% CI] | Localization (m) IQM [95% CI] | \ + Detection rate | Mean GDOP |\n", + ); + s.push_str( + "|----------------|-----------------------|-------------------------------|\ + ----------------|-----------|\n", + ); + for (name, agg) in rows { + s.push_str(&format!( + "| {} | {} | {} | {:.1}% | {:.3} |\n", + name, + fmt_ci(&agg.coverage_iqm), + fmt_ci(&agg.localization_iqm), + agg.detection_rate * 100.0, + agg.mean_gdop, + )); + } + // Wi2SAR paper baseline row (paper-to-paper, no kinematic re-run). + s.push_str(&format!( + "| _Wi2SAR (paper baseline)_ | _n/a_ | _{:.1} (paper)_ | _n/a_ | _n/a_ |\n", + WI2SAR_LOCALIZATION_M, + )); + + s.push_str( + "\n_Wi2SAR row is the published single-drone localization figure \ + (arxiv 2604.09115), shown paper-to-paper for reference only — it was \ + not re-run through this kinematic harness._\n", + ); + + s +} + +#[cfg(test)] +mod tests { + use super::*; + use crate::evals::stats::ConfidenceInterval; + + fn agg(cov: f64, det: f64) -> AggregateMetrics { + let ci = |p: f64| ConfidenceInterval { point: p, lo: p - 0.05, hi: p + 0.05 }; + AggregateMetrics { + coverage_iqm: ci(cov), + localization_iqm: ci(1.5), + detection_rate: det, + mean_gdop: 2.1, + return_iqm: ci(80.0), + n_episodes: 100, + } + } + + #[test] + fn test_render_contains_rows_and_baseline() { + let rows = vec![ + ("partitioned_lawnmower".to_string(), agg(0.92, 0.95)), + ("levy_flight".to_string(), agg(0.40, 0.50)), + ]; + let md = render_results_md(&rows); + assert!(md.contains("partitioned_lawnmower")); + assert!(md.contains("levy_flight")); + assert!(md.contains("Wi2SAR")); + assert!(md.contains("Stage 2 pending")); + assert!(md.contains("95% stratified bootstrap")); + // Coverage point estimate appears. + assert!(md.contains("0.920")); + } +} diff --git a/v2/crates/ruview-swarm/src/evals/runner.rs b/v2/crates/ruview-swarm/src/evals/runner.rs new file mode 100644 index 00000000..997df703 --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/runner.rs @@ -0,0 +1,364 @@ +//! Stage-1 kinematic rollout + seed × episode matrix (ADR-149). +//! +//! A single `run_episode` deterministically drives `drones` drones across a +//! mission area under a chosen [`FlightPattern`], marks coverage on a grid, +//! simulates CSI victim detection perturbed by `(sigma, kappa)` amplitude / +//! von-Mises-phase noise, and computes the GDOP of the contributing-drone +//! constellation at first detection. It is self-contained and seeded — no +//! Candle / training backend required — so it runs in CI by default. + +use crate::config::SwarmConfig; +use crate::evals::gdop::gdop; +use crate::evals::metrics::EpisodeMetrics; +use crate::planning::patterns::{FlightPattern, PatternContext}; +use crate::types::{NodeId, Position3D}; + +/// CSI-noise level: amplitude std `sigma` and von-Mises phase concentration `kappa`. +/// Higher `sigma` = noisier amplitude; *lower* `kappa` = noisier phase (more diffuse). +#[derive(Debug, Clone, Copy)] +pub struct NoiseLevel { + pub sigma: f64, + pub kappa: f64, +} + +/// One evaluation configuration: a flight pattern + swarm/mission parameters. +#[derive(Debug, Clone)] +pub struct EvalConfig { + pub flight: FlightPattern, + pub config: SwarmConfig, + pub drones: usize, + pub steps: usize, + pub seeds: usize, // ≥10 per ADR-149 + pub episodes_per_seed: usize, // e.g. 50 + pub victims: Vec, + pub noise: NoiseLevel, +} + +impl EvalConfig { + /// A small SAR default suitable for fast CI runs. + pub fn sar_small(flight: FlightPattern) -> Self { + EvalConfig { + flight, + config: SwarmConfig::sar_default(), + drones: 4, + steps: 120, + seeds: 10, + episodes_per_seed: 10, + victims: vec![ + Position3D { x: 120.0, y: 90.0, z: 0.0 }, + Position3D { x: 320.0, y: 280.0, z: 0.0 }, + ], + noise: NoiseLevel { sigma: 0.05, kappa: 8.0 }, + } + } +} + +/// Minimal reproducible LCG → f64 in [0, 1). Self-contained for determinism. +struct Lcg(u64); +impl Lcg { + fn new(seed: u64) -> Self { + Lcg(seed ^ 0xD1B5_4A32_D192_ED03) + } + #[inline] + fn next_u64(&mut self) -> u64 { + self.0 = self + .0 + .wrapping_mul(6364136223846793005) + .wrapping_add(1442695040888963407); + self.0 + } + #[inline] + fn unit(&mut self) -> f64 { + (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64 + } + /// Standard-normal sample via Box–Muller (deterministic). + #[inline] + fn normal(&mut self) -> f64 { + let u1 = self.unit().max(1e-12); + let u2 = self.unit(); + (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos() + } +} + +/// Run one kinematic episode deterministically from `seed`. +/// +/// Drives drones step-by-step by the flight pattern, marks a coarse coverage +/// grid, and on the first step a drone comes within scan range of any victim +/// records a fused localization estimate (weighted centroid of contributing +/// drones' per-drone victim estimates, each perturbed by `(sigma, kappa)` +/// noise) and the GDOP of those contributing drones. +pub fn run_episode(cfg: &EvalConfig, seed: u64) -> EpisodeMetrics { + let mut rng = Lcg::new(seed); + + let area_w = cfg.config.mission.area_width_m; + let area_h = cfg.config.mission.area_height_m; + let altitude_z = -cfg.config.planning.flight_altitude_m; + let scan_width = cfg.config.planning.csi_scan_width_m.max(1.0); + let min_sep = cfg.config.formation.min_separation_m.max(0.1); + let n = cfg.drones.max(1); + + // Coverage grid sized so each cell ~= scan_width. + let gx = ((area_w / scan_width).ceil() as usize).max(1); + let gy = ((area_h / scan_width).ceil() as usize).max(1); + let cell_w = area_w / gx as f64; + let cell_h = area_h / gy as f64; + let mut cover_count = vec![0u32; gx * gy]; + + // Spread drones along the bottom edge with a small seeded jitter. + let mut positions: Vec = (0..n) + .map(|i| { + let frac = (i as f64 + 0.5) / n as f64; + Position3D { + x: (frac * area_w + (rng.unit() - 0.5) * scan_width).clamp(0.0, area_w), + y: (rng.unit() * scan_width).clamp(0.0, area_h), + z: altitude_z, + } + }) + .collect(); + + // Recent-visit ring buffer for pheromone / potential-field patterns. + let mut visited: Vec = Vec::new(); + let max_visited = 32usize; + + let scan_range = scan_width; // detect a victim within one scan footprint + let mut collisions = 0u32; + let mut detected = false; + let mut loc_error: Option = None; + let mut gdop_val: Option = None; + let mut t_detect: Option = None; + + let dt = step_seconds(cfg); + + for step in 0..cfg.steps { + // Advance each drone one waypoint under the pattern. + let snapshot = positions.clone(); + for (i, pos) in positions.iter_mut().enumerate() { + let peers: Vec = snapshot + .iter() + .enumerate() + .filter(|(j, _)| *j != i) + .map(|(_, p)| *p) + .collect(); + let ctx = PatternContext { + drone_id: NodeId(i as u32), + swarm_size: n, + current: *pos, + area_w, + area_h, + altitude_z, + scan_width_m: scan_width, + step: step as u64, + visited: &visited, + peers: &peers, + }; + *pos = cfg.flight.next_target(&ctx); + } + + // Mark coverage + record visits. + for pos in &positions { + let cx = ((pos.x / cell_w).floor() as i64).clamp(0, gx as i64 - 1) as usize; + let cy = ((pos.y / cell_h).floor() as i64).clamp(0, gy as i64 - 1) as usize; + cover_count[cy * gx + cx] = cover_count[cy * gx + cx].saturating_add(1); + visited.push(*pos); + } + if visited.len() > max_visited { + let drop = visited.len() - max_visited; + visited.drain(0..drop); + } + + // Proximity / collision check (kinematic proxy). + for a in 0..positions.len() { + for b in (a + 1)..positions.len() { + let d = positions[a].distance_to(&positions[b]); + if d < min_sep { + collisions = collisions.saturating_add(1); + } + } + } + + // Detection: first step any victim falls within scan range of ≥1 drone, + // fuse a localization estimate from the contributing drones. A single + // contributor still yields a (noisier) estimate; GDOP is only defined + // for the multistatic ≥2-drone case and is `None` otherwise. + if !detected { + for victim in &cfg.victims { + let contributors: Vec = positions + .iter() + .filter(|p| horiz_dist(p, victim) <= scan_range) + .copied() + .collect(); + if !contributors.is_empty() { + let (est, g) = fuse_estimate(&contributors, victim, cfg.noise, &mut rng); + loc_error = Some(horiz_dist(&est, victim)); + gdop_val = g; // None for a single contributor + t_detect = Some((step as f64 + 1.0) * dt); + detected = true; + break; + } + } + } + } + + // Coverage + overlap. + let total_cells = (gx * gy) as f64; + let scanned = cover_count.iter().filter(|&&c| c > 0).count() as f64; + let overlapped = cover_count.iter().filter(|&&c| c > 1).count() as f64; + let coverage_pct = if total_cells > 0.0 { scanned / total_cells } else { 0.0 }; + let overlap_ratio = if scanned > 0.0 { overlapped / scanned } else { 0.0 }; + + // Episodic return: reward coverage + detection, penalize overlap + collisions. + let detect_bonus = if detected { 1.0 } else { 0.0 }; + let loc_term = match loc_error { + Some(e) => (1.0 / (1.0 + e)).max(0.0), + None => 0.0, + }; + let episodic_return = 100.0 * coverage_pct + 30.0 * detect_bonus + 20.0 * loc_term + - 10.0 * overlap_ratio + - 5.0 * collisions as f64; + + EpisodeMetrics { + coverage_pct, + localization_error_m: loc_error, + gdop_at_detection: gdop_val, + time_to_first_detection_s: t_detect, + detected, + collisions, + overlap_ratio, + episodic_return, + } +} + +/// Per-step wall-clock seconds, derived from scan width and drone speed. +fn step_seconds(cfg: &EvalConfig) -> f64 { + let speed = cfg.config.planning.max_speed_ms.max(0.1); + (cfg.config.planning.csi_scan_width_m.max(1.0) / speed).max(0.1) +} + +/// Horizontal (x, y) distance, ignoring altitude. +fn horiz_dist(a: &Position3D, b: &Position3D) -> f64 { + (a.x - b.x).hypot(a.y - b.y) +} + +/// Fuse contributing drones' per-drone victim estimates into a weighted +/// centroid, perturbed by `(sigma, kappa)` CSI noise, and compute the GDOP of +/// the contributing constellation. +fn fuse_estimate( + contributors: &[Position3D], + victim: &Position3D, + noise: NoiseLevel, + rng: &mut Lcg, +) -> (Position3D, Option) { + // Phase noise std from von Mises concentration: sigma_phase ≈ 1/sqrt(kappa). + let phase_std = 1.0 / noise.kappa.max(1e-3).sqrt(); + let mut sx = 0.0; + let mut sy = 0.0; + let mut wsum = 0.0; + for c in contributors { + let range = horiz_dist(c, victim).max(1e-6); + // Each drone's estimate = true victim + range-scaled amplitude noise + + // bearing error from phase noise (perpendicular to LOS). + let amp = noise.sigma * range; + let nx = rng.normal() * amp; + let ny = rng.normal() * amp; + // Bearing wobble: rotate LOS unit vector by a small phase-noise angle. + let bearing = (victim.y - c.y).atan2(victim.x - c.x); + let dtheta = rng.normal() * phase_std; + let bx = range * (bearing + dtheta).cos(); + let by = range * (bearing + dtheta).sin(); + let est_x = c.x + bx + nx; + let est_y = c.y + by + ny; + // Inverse-range weighting: closer drones trusted more. + let w = 1.0 / range; + sx += est_x * w; + sy += est_y * w; + wsum += w; + } + let w = wsum.max(1e-9); + let est = Position3D { x: sx / w, y: sy / w, z: 0.0 }; + let g = gdop(contributors, victim); + (est, g) +} + +/// Run the full seed × episode matrix → per-seed strata of [`EpisodeMetrics`]. +pub fn run_matrix(cfg: &EvalConfig) -> Vec> { + (0..cfg.seeds) + .map(|s| { + (0..cfg.episodes_per_seed) + .map(|e| { + // Distinct deterministic seed per (seed, episode) cell. + let cell_seed = (s as u64) + .wrapping_mul(0x100_0000) + .wrapping_add(e as u64) + .wrapping_add(0xABCD); + run_episode(cfg, cell_seed) + }) + .collect() + }) + .collect() +} + +/// Standard ADR-149 noise sweep grid: cartesian product of σ × κ levels. +pub fn default_noise_sweep() -> Vec { + let sigmas = [0.02, 0.05, 0.10]; + let kappas = [16.0, 8.0, 4.0]; + let mut out = Vec::with_capacity(sigmas.len() * kappas.len()); + for &sigma in &sigmas { + for &kappa in &kappas { + out.push(NoiseLevel { sigma, kappa }); + } + } + out +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_run_episode_deterministic() { + let cfg = EvalConfig::sar_small(FlightPattern::PartitionedLawnmower); + let a = run_episode(&cfg, 12345); + let b = run_episode(&cfg, 12345); + assert_eq!(a.coverage_pct, b.coverage_pct); + assert_eq!(a.detected, b.detected); + assert_eq!(a.localization_error_m, b.localization_error_m); + assert_eq!(a.collisions, b.collisions); + assert_eq!(a.episodic_return, b.episodic_return); + } + + #[test] + fn test_partitioned_beats_levy_coverage() { + let mut part = EvalConfig::sar_small(FlightPattern::PartitionedLawnmower); + part.seeds = 3; + part.episodes_per_seed = 5; + let mut levy = part.clone(); + levy.flight = FlightPattern::LevyFlight; + + let part_m = run_matrix(&part); + let levy_m = run_matrix(&levy); + let part_agg = crate::evals::metrics::AggregateMetrics::from_strata(&part_m, 1); + let levy_agg = crate::evals::metrics::AggregateMetrics::from_strata(&levy_m, 1); + assert!( + part_agg.coverage_iqm.point > levy_agg.coverage_iqm.point, + "partitioned coverage {} should beat levy {}", + part_agg.coverage_iqm.point, + levy_agg.coverage_iqm.point + ); + } + + #[test] + fn test_matrix_shape() { + let mut cfg = EvalConfig::sar_small(FlightPattern::Spiral); + cfg.seeds = 4; + cfg.episodes_per_seed = 6; + let m = run_matrix(&cfg); + assert_eq!(m.len(), 4); + assert!(m.iter().all(|s| s.len() == 6)); + } + + #[test] + fn test_noise_sweep_grid() { + let sweep = default_noise_sweep(); + assert_eq!(sweep.len(), 9); + } +} diff --git a/v2/crates/ruview-swarm/src/evals/stats.rs b/v2/crates/ruview-swarm/src/evals/stats.rs new file mode 100644 index 00000000..4dc6315a --- /dev/null +++ b/v2/crates/ruview-swarm/src/evals/stats.rs @@ -0,0 +1,203 @@ +//! Hand-rolled robust statistics for the evaluation harness (Agarwal 2021). +//! +//! Implements the interquartile mean (IQM), a 95% stratified-bootstrap +//! confidence interval of the IQM, and the probability-of-improvement metric — +//! the three statistics recommended by "Deep RL at the Edge of the +//! Statistical Precipice" (Agarwal et al., NeurIPS 2021) for reporting +//! few-seed RL results. +//! +//! All randomness comes from a local linear-congruential generator (LCG) seeded +//! explicitly, so every CI is fully reproducible — no `thread_rng`, no clock. + +/// Interquartile mean: mean of the middle 50% of samples (drop the bottom 25% +/// and the top 25%). Robust to outliers in either tail. +/// +/// Small-N behaviour: with fewer than 4 samples the trim would empty the set, +/// so it falls back to the plain arithmetic mean. An empty slice returns 0.0. +pub fn iqm(samples: &[f64]) -> f64 { + if samples.is_empty() { + return 0.0; + } + if samples.len() < 4 { + return samples.iter().sum::() / samples.len() as f64; + } + let mut sorted = samples.to_vec(); + sorted.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)); + let n = sorted.len(); + let lo = n / 4; // trim bottom 25% + let hi = n - lo; // trim top 25% (symmetric) + let mid = &sorted[lo..hi]; + if mid.is_empty() { + return sorted.iter().sum::() / n as f64; + } + mid.iter().sum::() / mid.len() as f64 +} + +/// A point estimate with its lower / upper 95% confidence bounds. +#[derive(Debug, Clone, Copy)] +pub struct ConfidenceInterval { + pub point: f64, + pub lo: f64, + pub hi: f64, +} + +/// Minimal reproducible LCG (Numerical Recipes constants) yielding f64 in [0,1). +struct Lcg(u64); + +impl Lcg { + fn new(seed: u64) -> Self { + // Avoid a zero state collapsing the generator. + Lcg(seed ^ 0x9E37_79B9_7F4A_7C15) + } + #[inline] + fn next_u64(&mut self) -> u64 { + self.0 = self + .0 + .wrapping_mul(6364136223846793005) + .wrapping_add(1442695040888963407); + self.0 + } + /// Uniform index in [0, n). + #[inline] + fn index(&mut self, n: usize) -> usize { + if n == 0 { + return 0; + } + (self.next_u64() >> 11) as usize % n + } +} + +/// 95% stratified-bootstrap CI of the IQM. +/// +/// `strata` groups samples (one inner `Vec` per stratum, e.g. per task or per +/// seed). Each bootstrap replicate resamples WITH replacement *within* each +/// stratum (preserving the stratum sizes), pools all resampled values, and +/// recomputes the IQM. Repeat `n_boot` times and take the 2.5 / 97.5 +/// percentiles for the CI bounds. The `point` estimate is the IQM of the pooled +/// original samples. Deterministic for a fixed `seed`. +pub fn stratified_bootstrap_ci( + strata: &[Vec], + n_boot: usize, + seed: u64, +) -> ConfidenceInterval { + let pooled: Vec = strata.iter().flatten().copied().collect(); + let point = iqm(&pooled); + + if pooled.is_empty() || n_boot == 0 { + return ConfidenceInterval { point, lo: point, hi: point }; + } + + let mut rng = Lcg::new(seed); + let mut replicates = Vec::with_capacity(n_boot); + let mut buf: Vec = Vec::with_capacity(pooled.len()); + + for _ in 0..n_boot { + buf.clear(); + for stratum in strata { + let m = stratum.len(); + for _ in 0..m { + buf.push(stratum[rng.index(m)]); + } + } + replicates.push(iqm(&buf)); + } + + replicates.sort_by(|a, b| a.partial_cmp(b).unwrap_or(std::cmp::Ordering::Equal)); + let lo = percentile(&replicates, 2.5); + let hi = percentile(&replicates, 97.5); + ConfidenceInterval { point, lo, hi } +} + +/// Linear-interpolated percentile of a pre-sorted slice. `p` in [0, 100]. +fn percentile(sorted: &[f64], p: f64) -> f64 { + if sorted.is_empty() { + return 0.0; + } + if sorted.len() == 1 { + return sorted[0]; + } + let rank = (p / 100.0) * (sorted.len() as f64 - 1.0); + let lo = rank.floor() as usize; + let hi = rank.ceil() as usize; + if lo == hi { + return sorted[lo]; + } + let frac = rank - lo as f64; + sorted[lo] * (1.0 - frac) + sorted[hi] * frac +} + +/// Probability of improvement: P(a-sample > b-sample) over all pairs (Agarwal). +/// +/// Counts each (a_i, b_j) pair where `a_i > b_j` as 1, a tie as 0.5, and +/// normalizes by the pair count. 1.0 means `a` strictly dominates; ~0.5 means +/// the two are statistically indistinguishable. Returns 0.5 if either is empty. +pub fn probability_of_improvement(a: &[f64], b: &[f64]) -> f64 { + if a.is_empty() || b.is_empty() { + return 0.5; + } + let mut wins = 0.0; + for &ai in a { + for &bj in b { + if ai > bj { + wins += 1.0; + } else if (ai - bj).abs() < f64::EPSILON { + wins += 0.5; + } + } + } + wins / (a.len() as f64 * b.len() as f64) +} + +#[cfg(test)] +mod tests { + use super::*; + + #[test] + fn test_iqm_trims_outliers() { + // 0..=100 plus one extreme outlier; IQM should sit near the middle (~50), + // not be dragged toward 1e9. + let mut samples: Vec = (0..=100).map(|i| i as f64).collect(); + samples.push(1e9); + let v = iqm(&samples); + assert!( + (40.0..=60.0).contains(&v), + "IQM should be near the middle-50% mean (~50), got {v}" + ); + } + + #[test] + fn test_iqm_small() { + // Fewer than 4 samples → plain mean. + assert_eq!(iqm(&[2.0, 4.0]), 3.0); + assert_eq!(iqm(&[10.0]), 10.0); + assert_eq!(iqm(&[1.0, 2.0, 3.0]), 2.0); + assert_eq!(iqm(&[]), 0.0); + } + + #[test] + fn test_bootstrap_ci_brackets_point() { + let strata = vec![ + vec![1.0, 2.0, 3.0, 4.0, 5.0], + vec![2.0, 3.0, 4.0, 5.0, 6.0], + ]; + let ci = stratified_bootstrap_ci(&strata, 500, 42); + assert!(ci.lo <= ci.point, "lo ≤ point: {} ≤ {}", ci.lo, ci.point); + assert!(ci.point <= ci.hi, "point ≤ hi: {} ≤ {}", ci.point, ci.hi); + // Deterministic: same seed → identical interval. + let ci2 = stratified_bootstrap_ci(&strata, 500, 42); + assert_eq!(ci.point, ci2.point); + assert_eq!(ci.lo, ci2.lo); + assert_eq!(ci.hi, ci2.hi); + } + + #[test] + fn test_prob_improvement_obvious() { + assert_eq!( + probability_of_improvement(&[10.0, 10.0, 10.0], &[0.0, 0.0, 0.0]), + 1.0 + ); + // Identical samples → all ties → 0.5. + let poi = probability_of_improvement(&[5.0, 5.0], &[5.0, 5.0]); + assert!((poi - 0.5).abs() < 1e-9, "symmetric ties → ~0.5, got {poi}"); + } +} diff --git a/v2/crates/ruview-swarm/src/lib.rs b/v2/crates/ruview-swarm/src/lib.rs index b0f87de2..cd9b8ae4 100644 --- a/v2/crates/ruview-swarm/src/lib.rs +++ b/v2/crates/ruview-swarm/src/lib.rs @@ -13,6 +13,7 @@ pub mod security; pub mod failsafe; pub mod config; pub mod demo; +pub mod evals; pub mod integration; pub mod bench_support; pub mod orchestrator;