wifi-densepose/api-docs/adr/ADR-080-qe-remediation-plan.md

8.4 KiB
Raw Blame History

ADR-080: QE Analysis Remediation Plan

Context

An 8-agent QE swarm analyzed ~305K lines across Rust, Python, C firmware, and TypeScript on 2026-04-05. The overall score was 55/100 (C+) — Quality Gate FAILED. This ADR captures the findings and establishes a remediation plan.

Decision

Address the 15 prioritized issues from the QE analysis in three waves: P0 (immediate), P1 (this sprint), P2 (this quarter).

Security P0 closure note (2026-06-13) — Rust sensing-server boundary

The three P0 security findings below were logged against the Python v1 API (archive/v1/src/…). ADR-164 G11 re-scoped them to the shipped boundary: wifi-densepose-sensing-server (Rust). They were verified against the current Rust crate and closed on branch fix/adr-080-sensing-server-security. Each fix (or already-fixed finding) is pinned by a test that fails on the old behavior. The Python v1 paths remain as-is — v1 is archived and not the shipped surface; this closure governs the live Rust server only.

P0 — Fix Immediately

1. Rate Limiter Bypass / XFF spoofing (Security HIGH) — RESOLVED (verified absent on Rust boundary)

  • Original location (v1): archive/v1/src/middleware/rate_limit.py:200-206
  • Problem: Trusts X-Forwarded-For without validation. Any client bypasses rate limits via header spoofing.
  • Rust verification (2026-06-13): The Rust sensing-server has no XFF-trusting control to bypass — there is no IP-based rate-limiter and no IP-allowlist, and neither security middleware reads a forwarded header. bearer_auth.rs authenticates on the token alone (require_bearer inspects only the AUTHORIZATION header); host_validation.rs decides on the Host header only. A repo-wide grep for x-forwarded-for|forwarded|peer_addr|client_ip|real-ip over wifi-densepose-sensing-server returns nothing. The only "rate limiter" is the MQTT sample-rate gate (mqtt/state.rs), a per-entity publish throttle with no IP/header input.
  • Resolution: No code change needed (no vulnerable surface). Regression tests pin the immunity: bearer_auth::tests::xff_header_never_affects_auth_decision (spoofed XFF never flips a 401↔200 decision) and host_validation::tests::forwarded_headers_never_bypass_host_allowlist (spoofed X-Forwarded-Host: localhost never lets a foreign Host: evil.com past the allowlist). Residual: if an IP-based control is ever added, it must derive the peer from the socket (ConnectInfo<SocketAddr>) and only honor XFF from an explicit --trusted-proxy CIDR — captured as guidance in the test docstrings.

2. Exception Details Leaked in Responses (Security HIGH, CWE-209) — RESOLVED

  • Original location (v1): archive/v1/src/api/routers/pose.py:140, stream.py:297, +5 endpoints
  • Problem: Internal error/stack-trace detail serialized into client responses.
  • Rust finding (2026-06-13): Six handlers in wifi-densepose-sensing-server/src/main.rs serialized the internal error Display into the JSON body: edge_registry_endpoint returned a panicked spawn_blocking JoinError ("task … panicked") in a 500 and the raw upstream error in a 503; delete_model/delete_recording/start_recording returned std::io::Error strings (OS detail / path); calibration_start/calibration_stop returned the FieldModel error chain.
  • Fix: New src/error_response.rs module — internal_error / internal_error_json / upstream_unavailable log the full detail server-side only (tagged with a correlation id) and return a generic body ({"error":"internal_error","correlation_id":…}) with no panicked, no file paths, no Debug chain. All six call-sites rewired. Pinned by error_response::tests::internal_error_body_does_not_leak_detail (leak-substring guard, verified to fail on the reverted old body) + 4 sibling tests.

3. WebSocket JWT in URL (Security HIGH, CWE-598) — RESOLVED (verified absent on Rust boundary)

  • Original location (v1): archive/v1/src/api/routers/stream.py:74, archive/v1/src/middleware/auth.py:243
  • Problem: Tokens in query strings visible in logs/proxies/browser history.
  • Rust verification (2026-06-13): The Rust sensing-server never reads a token from the URL. require_bearer (bearer_auth.rs) inspects only the Authorization header; the WebSocket handlers (ws_sensing_handler/ws_introspection_handler/ws_pose_handler) take a bare WebSocketUpgrade with no Query extractor; the single Query in the crate (EdgeRegistryParams) is a non-secret refresh flag.
  • Resolution: No code change needed (no query-token path exists). Regression test bearer_auth::tests::query_string_token_is_never_accepted proves ?token=/?access_token= in the URL never authenticates (stays 401) while the same token in the header succeeds (200) — verified to fail if a query-token path is re-introduced.

4. Rust Tests Not in CI

  • Problem: 2,618 tests across 153K lines of Rust — zero run in any GitHub Actions workflow. Regressions ship undetected.
  • Fix: Add cargo test --workspace --no-default-features to CI. 1-2 hour task.

5. WebSocket Path Mismatch (Bug)

  • Location: ui/mobile/src/services/ws.service.ts:104 constructs /ws/sensing, but constants/websocket.ts:1 defines WS_PATH = '/api/v1/stream/pose'.
  • Problem: Mobile WebSocket silently fails.
  • Fix: Align paths. Verify which endpoint the server actually serves.

P1 — Fix This Sprint

# Issue Location Impact
6 God file: 4,846 lines, CC=121 sensing-server/src/main.rs Untestable monolith
7 O(L×V) voxel scan per frame ruvsense/tomography.rs:345-383 ~10ms wasted; use DDA ray march
8 Sequential neural inference wifi-densepose-nn inference.rs:334-336 2-4× GPU latency penalty
9 720 .unwrap() in Rust Workspace-wide Each = potential panic in RT paths
10 112KB alloc/frame in Python csi_processor.py:412-414 Deque→list→numpy every frame

P2 — Fix This Quarter

# Issue Impact
11 11/12 Python modules have zero unit tests (12,280 LOC) Services, middleware, DB untested
12 Firmware at 19% coverage (WASM runtime, OTA, swarm) Security-critical code untested
13 MAT screen auto-falls back to simulated data Disaster responders could monitor fake data
14 Token blacklist never consulted during auth Revoked tokens remain valid
15 50ms frame budget never benchmarked Real-time requirement unverified

Bright Spots

  • 79 ADRs (exceptional governance)
  • Witness bundle system (ADR-028) with SHA-256 proof
  • 2,618 Rust tests with mathematical rigor
  • Daily security scanning (Bandit, Semgrep, Safety)
  • Ed25519 WASM signature verification on firmware
  • Clean mobile state management with good test coverage

Full QE Reports (9 files, 4,914 lines)

Report What it covers
EXECUTIVE-SUMMARY.md Top-level synthesis with all scores and priority matrix
00-qe-queen-summary.md Master coordination, quality posture, test pyramid
01-code-quality-complexity.md Cyclomatic complexity, code smells, top 20 hotspots
02-security-review.md 15 security findings (3 HIGH, 7 MEDIUM), OWASP coverage
03-performance-analysis.md 23 perf findings (4 CRITICAL), frame budget analysis
04-test-analysis.md 3,353 tests inventoried, duplication, quality grading
05-quality-experience.md API/CLI/Mobile/DX UX assessment
06-product-assessment-sfdipot.md SFDIPOT analysis, 57 test ideas, 14 session charters
07-coverage-gaps.md Coverage matrix, top 20 risk gaps, 8-week roadmap

Consequences

  • P0 fixes eliminate 3 security vulnerabilities and 2 functional bugs
  • P1 fixes improve performance, reliability, and maintainability
  • P2 fixes close coverage gaps and harden the system for production
  • Target score improvement: 55 → 75+ after P0+P1 completion

Generated from QE swarm analysis (fleet-02558e91) on 2026-04-05