docs(adr): ADR-161 — HOMECORE server-layer security & honest-labeling sweep (M7)

Records the Milestone 7 audit: library cores are real (anti-slop positive) but
the network boundary had a CRITICAL WS auth bypass (A1) + reply-theater (A2) +
documented-but-no-op automation (A3-A7) + a network-exposed dev bin (A8), all
fixed and graded MEASURED with failing-on-old tests. Cites the NO-ACTION
security positives (uuid::v4 CSPRNG refuted-suspicion, hardened CORS,
no-traversal migrate, no-secrets-in-logs, honest HAP stub) and the deferred
backlog (plugin authority-isolation P5, sig-verification P4, HAP real pairing
P2, bounded run-modes, YAML load-at-boot).

Co-Authored-By: claude-flow <ruv@ruv.net>
This commit is contained in:
ruv 2026-06-12 00:55:52 -04:00
parent e51704cd25
commit d0da5888e3
1 changed files with 260 additions and 0 deletions

View File

@ -0,0 +1,260 @@
# ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)
- **Status**: accepted
- **Date**: 2026-06-12
- **Deciders**: ruv
- **Tags**: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
- **Amends**: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)
## Context
Beyond-SOTA sweep **Milestone 7**, over the HOMECORE **server/network layer**
crates only — `homecore-api`, `homecore-server`, `homecore-automation`,
`homecore-hap`, `homecore-plugins` — executed under the project's
**prove-everything / anti-"AI-slop"** directive.
### Headline — the library cores are real, but the network boundary was unsound
The same audit pattern as ADR-160 held for the *library logic*: the automation
trigger/condition/template/action evaluators, the REST handlers, the HAP
mapping, and the plugin manifest parser are **real, tested code** — not stubs.
That is the anti-slop positive and it is cited here as such.
What the audit found was **not fake business logic but an unsound trust
boundary plus documented-but-no-op features**:
1. A **CRITICAL WebSocket authentication bypass** — the WS handshake accepted
any non-empty token, ignoring the provisioned token whitelist the REST path
enforces.
2. **Reply-theater** — WS command responses were computed, then logged and
**discarded**; no `result`/`pong`/`event` ever reached the client.
3. **Documented-but-idle automation** — the engine was constructed and dropped
(never started); time triggers, `RunMode`, `Choose` branches, and template
conditions were each **documented as working but were no-ops in the live
path**.
This is a worse class than ADR-160's over-naming: here the **doc claimed a
capability the code did not deliver** (auth enforcement, reply transport,
running automations). The fix is **implement where feasible, honestly relabel
where not — never leave a false doc.** Every fix is pinned by a test that
**fails on the old code**.
Grading vocabulary (ADR-152 / ADR-158 / ADR-160):
- **MEASURED** — reproduced in this worktree, command + failing-on-old test recorded.
- **NO-ACTION (already-honest/already-hardened)** — audited, found correct, cited as a positive.
- **ACCEPTED-FUTURE** — deliberately deferred, nothing dropped.
## Decision — Fixes Landed
### §A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED
`homecore-api/src/ws.rs` handshake checked only `token.trim().is_empty()` and
sent `auth_ok` for **any** non-empty token. It never called
`state.tokens().is_valid()` — the check the REST path uses via
`auth::BearerAuth`. With a provisioned `HOMECORE_TOKENS` whitelist, **any
attacker-chosen non-empty token got full WS access** (read all states, call any
service, subscribe to all events).
**Real fix:** the handshake now calls
`state.tokens().is_valid(&token).await` (the *same* store + method as REST).
A wrong token receives `auth_invalid` and the socket closes. DEV (`allow_any`)
mode still accepts any non-empty bearer with a warn, so smoke tests keep
working; the empty token is rejected inside `is_valid`.
**Failing-on-old test** (`tests/ws_handshake.rs`):
`wrong_token_is_rejected` — provisions a real (non-dev) store with one good
token, sends a DIFFERENT non-empty token over the WS handshake, asserts
`auth_invalid`. On the old source the client received
`{"type":"auth_ok",…}` (verified: the test panics on old `ws.rs` with
`left: "auth_ok", right: "auth_invalid"`). Companion: `correct_token_is_accepted`.
**Grade: MEASURED. This is the milestone headline.**
### §A2 — WS replies never transmitted (HIGH, functional) — MEASURED
`ws.rs::Connection::run` moved the socket into a recv-only task; the only
consumer of the response mpsc just did `debug!("ws emit: {msg}")` and dropped
every message. No command reply ever reached the wire.
**Real fix:** the socket is split with `futures_util::StreamExt::split`. A
dedicated **writer task** drains the response channel onto `sink.send(...)`
(text frames; a `__pong:<n>` sentinel maps to a Pong control frame); the reader
task parses commands concurrently. On reader exit the senders drop and the
writer task ends cleanly.
**Failing-on-old tests:** `result_reply_is_received` (connect → auth →
`get_states` → assert a `result` reply is RECEIVED within 5s) and
`ping_pong_reply_is_received`. Both time out on the old source (verified:
`Elapsed` panic). **Grade: MEASURED.**
### §A8 — `homecore-api` bin: no env-token path, network-exposed (HIGH, security) — MEASURED
`homecore-api/src/bin/server.rs` bound `0.0.0.0:8123` with
`SharedState::new()``allow_any_non_empty()` and **no** `HOMECORE_TOKENS`
path (unlike `homecore-server`), so a provisioned operator had no way to lock
it down.
**Real fix:** the bin now mirrors `homecore-server`'s provisioning — prefer the
`HOMECORE_TOKENS` whitelist (`LongLivedTokenStore::from_env()`), fall back to an
**explicitly warn-logged** DEV mode only when unset. It also defaults the bind
address to **`127.0.0.1`** (loopback) so a bare `cargo run` is not
network-exposed, with `HOMECORE_BIND` to opt into LAN.
**Failing-on-old test** (`tests/server_bin_auth.rs`):
`provisioned_bin_rejects_wrong_bearer` reproduces the bin's exact provisioning
path (a populated, non-dev store) and asserts a wrong bearer → 401;
`from_env_path_enforces_whitelist` proves `from_env()` is not dev mode and
enforces the list. The old bin's `allow_any_non_empty()` accepted the wrong
bearer. **Grade: MEASURED.**
### §A3 — Automation engine never started (HIGH) — MEASURED
`homecore-server/src/main.rs` did `let _automation_engine = AutomationEngine::new(...)`
then dropped it immediately, while the header doc claimed "Automation engine
subscribed to the state machine."
**Real fix:** the engine is now built into a long-lived binding and `.start()`
is called, spawning the event loop + timer task; the header/log lines state it
is started with N automations and which trigger classes are active. (With A4A7
the running engine is genuinely functional, not theater.)
**Evidence:** the engine-behavior tests below run against the same
`AutomationEngine::start()` path now wired into the bin. **Grade: MEASURED.**
### §A4 — `Trigger::Time` hard-coded `false`, no timer (HIGH) — MEASURED
`trigger.rs::matches_sync` returned `false` for `Time` and there was **no timer
task** anywhere, so time automations could never fire.
**Real fix:** `AutomationEngine::start_timer` — a 1 Hz tokio interval that
compares each `time:` automation's `at` (`HH:MM` or `HH:MM:SS`) against the
local wall-clock second and fires it once per match (conditions still gate it).
`matches_sync` returning `false` for `Time` is now **correct and documented**
(it is a wall-clock trigger with no state-change context); a public
`fire_time_for_test` exposes the same path deterministically.
**Failing-on-old test** (`tests/engine_behaviors.rs`):
`time_trigger_fires_via_timer_path` (+ unit `time_at_matches_handles_hh_mm_and_hh_mm_ss`).
The method does not exist on the old engine. **Grade: MEASURED.**
### §A5 — `RunMode` documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED
`engine.rs` doc claimed "RunMode::Single is enforced via a per-automation
AtomicBool" — but no such code existed and **every** trigger spawned an
unbounded parallel task regardless of `mode`.
**Real fix:** each registered automation carries a `running: Arc<AtomicBool>`.
`Single`/`IgnoreFirst` modes `compare_exchange` the flag before spawning and
**skip** the trigger if a run is already in flight, clearing it on completion;
`Parallel` (and, for now, `Restart`/`Queued`) spawn on every trigger.
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
`single_mode_does_not_double_fire_on_rapid_triggers` (two rapid triggers while
the first run sleeps → exactly **1** run; old code fired **2**, verified) and
`parallel_mode_does_fire_concurrently` (→ 2). **Grade: MEASURED (Single/Parallel
honored; bounded `Queued`/`Restart`/`max` ordering → ACCEPTED-FUTURE, see below).**
### §A6 — `Action::Choose` ignored branches (HIGH) — MEASURED
`action.rs` discarded `choices` and always ran `default`.
**Real fix:** `ChoiceBranch::matches` deserialises each branch's
`serde_yaml::Value` conditions into `Condition` and evaluates them (AND
semantics, against an `EvalContext` now carried on `ExecutionContext`). `Choose`
runs the **first matching branch's** sequence and falls to `default` only if
none match.
**Failing-on-old tests** (`action.rs` inline):
`choose_runs_matching_branch_not_default` (matching branch runs, default does
NOT — old code ran default, verified) and
`choose_falls_to_default_when_no_branch_matches`. **Grade: MEASURED.**
### §A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED
`condition.rs` returned `false` for `Template` whenever `template_env` was
`None`, and the engine built every `EvalContext` with `template_env: None`
(`EvalContext::new`), so `template:` conditions could never be true in
production — only in unit tests that hand-built a template env.
**Real fix:** the engine constructs one `TemplateEnvironment` over the state
machine and threads it into every `EvalContext` via
`EvalContext::with_templates` (event loop, timer task, and
`ExecutionContext` for `Choose` branches).
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
`template_condition_evaluates_true_in_engine` (a `{{ is_state(...) }}` condition
gates an action true) and `template_condition_evaluates_false_blocks_action`.
On the old engine the action never ran (template always false, verified).
**Grade: MEASURED.**
### §B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled
`homecore-plugins/src/manifest.rs` documented `wasm_module_hash` as "verified
before execution" and carried `wasm_module_sig` / `publisher_key`, but these
fields are **never read** for verification (only ever set to `None` in tests).
**Fix (honest labeling — no false capability claimed):** the three fields are
re-doc'd **"(P4 — not yet enforced, ADR-161/B5)"** — parsed and round-tripped,
but no integrity/signature check happens before a plugin runs. No verification
code was added (that is P4); the doc now matches the code.
**Grade: doc-honesty (no behavior change).**
## Negative Results (NO-ACTION positives — audited, found correct, cited not edited)
These were checked and are genuinely sound/honest; cited as positives, **not**
touched:
- **CSPRNG correctness** — all IDs are `uuid::v4`; the rng/`randn` suspicion was
**REFUTED**. No weak-randomness issue exists.
- **CORS allowlist** (`app.rs`) — already hardened (explicit `AllowOrigin::list`,
no `permissive()`, `allow_credentials(false)`, env override). NO-ACTION.
- **No path traversal in `homecore-migrate`** — audited, clean.
- **No secrets in logs** — audited, clean.
- **HAP pairing stub** — honestly disclaimed as a surface stub; not over-claimed.
- **`InProcessRuntime` "no sandbox" disclaimer** — honest; left as-is.
## Deferred Backlog (Nothing Dropped)
- **Plugin authority-isolation (P5)**`homecore_permissions` claims are parsed
but not enforced at the host-call boundary. **ACCEPTED-FUTURE.**
- **Plugin signature/hash verification (P4)** — implement the
`wasm_module_hash`/`wasm_module_sig`/`publisher_key` gate that B5 now honestly
says is absent. **ACCEPTED-FUTURE.**
- **HAP real pairing (P2)** — SRP/HKDF pairing + encrypted sessions; current
bridge is an accessory-mapping surface. **ACCEPTED-FUTURE (honestly stubbed).**
- **`RunMode::Queued`/`Restart`/`max` ordering** — `Single`/`Parallel` are
honored; bounded queueing, restart-kill, and `max` concurrency are not yet
wired (every non-Single mode is parallel). **ACCEPTED-FUTURE** — the
`engine.rs` doc states exactly this, no over-claim.
- **Automation YAML load-at-boot** — the engine starts empty; a YAML loader is
P-next. The bin log states "0 automations registered" honestly.
## Reproduction (MEASURED)
```bash
cd v2
cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
cargo test -p homecore-plugins --features wasmtime
cargo build --workspace --no-default-features
```
Result at time of writing (all 0 failed):
- **homecore-api****25 passed** (lib 18; `server_bin_auth` 3; `ws_handshake` 4)
- **homecore-automation****42 passed** (lib 37; `engine_behaviors` 5)
- **homecore-hap** — **17 passed**
- **homecore-server** — bin, **0 tests**
- (**homecore-plugins** — **15 passed**: lib 12; integration 3)
- Full workspace `cargo build --workspace --no-default-features` succeeds.
## Consequences
- The WebSocket path can no longer be entered with a forged token — it enforces
the same `LongLivedTokenStore` whitelist as REST (A1).
- WS clients now actually receive `result`/`pong`/`event` frames (A2).
- The `homecore-api` dev bin defaults to loopback and honors `HOMECORE_TOKENS`
(A8); it is no longer an open `0.0.0.0` accept-any endpoint by default.
- The automation engine is started for real and its time triggers, `Single`
run-mode, `Choose` branches, and `template:` conditions all function — no doc
claims a capability the code lacks (A3A7).
- The plugin manifest no longer claims signature verification it does not
perform (B5).
- Files kept under the 500-line guideline (`engine.rs` 462; behavioral tests
moved to `tests/engine_behaviors.rs`).