wifi-densepose/api-docs/adr/ADR-161-homecore-server-lay...

268 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)
- **Status**: accepted
- **Date**: 2026-06-12
- **Deciders**: ruv
- **Tags**: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
- **Amends**: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)
## Context
Beyond-SOTA sweep **Milestone 7**, over the HOMECORE **server/network layer**
crates only — `homecore-api`, `homecore-server`, `homecore-automation`,
`homecore-hap`, `homecore-plugins` — executed under the project's
**prove-everything / anti-"AI-slop"** directive.
### Headline — the library cores are real, but the network boundary was unsound
The same audit pattern as ADR-160 held for the *library logic*: the automation
trigger/condition/template/action evaluators, the REST handlers, the HAP
mapping, and the plugin manifest parser are **real, tested code** — not stubs.
That is the anti-slop positive and it is cited here as such.
What the audit found was **not fake business logic but an unsound trust
boundary plus documented-but-no-op features**:
1. A **CRITICAL WebSocket authentication bypass** — the WS handshake accepted
any non-empty token, ignoring the provisioned token whitelist the REST path
enforces.
2. **Reply-theater** — WS command responses were computed, then logged and
**discarded**; no `result`/`pong`/`event` ever reached the client.
3. **Documented-but-idle automation** — the engine was constructed and dropped
(never started); time triggers, `RunMode`, `Choose` branches, and template
conditions were each **documented as working but were no-ops in the live
path**.
This is a worse class than ADR-160's over-naming: here the **doc claimed a
capability the code did not deliver** (auth enforcement, reply transport,
running automations). The fix is **implement where feasible, honestly relabel
where not — never leave a false doc.** Every fix is pinned by a test that
**fails on the old code**.
Grading vocabulary (ADR-152 / ADR-158 / ADR-160):
- **MEASURED** — reproduced in this worktree, command + failing-on-old test recorded.
- **NO-ACTION (already-honest/already-hardened)** — audited, found correct, cited as a positive.
- **ACCEPTED-FUTURE** — deliberately deferred, nothing dropped.
## Decision — Fixes Landed
### §A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED
`homecore-api/src/ws.rs` handshake checked only `token.trim().is_empty()` and
sent `auth_ok` for **any** non-empty token. It never called
`state.tokens().is_valid()` — the check the REST path uses via
`auth::BearerAuth`. With a provisioned `HOMECORE_TOKENS` whitelist, **any
attacker-chosen non-empty token got full WS access** (read all states, call any
service, subscribe to all events).
**Real fix:** the handshake now calls
`state.tokens().is_valid(&token).await` (the *same* store + method as REST).
A wrong token receives `auth_invalid` and the socket closes. DEV (`allow_any`)
mode still accepts any non-empty bearer with a warn, so smoke tests keep
working; the empty token is rejected inside `is_valid`.
**Failing-on-old test** (`tests/ws_handshake.rs`):
`wrong_token_is_rejected` — provisions a real (non-dev) store with one good
token, sends a DIFFERENT non-empty token over the WS handshake, asserts
`auth_invalid`. On the old source the client received
`{"type":"auth_ok",…}` (verified: the test panics on old `ws.rs` with
`left: "auth_ok", right: "auth_invalid"`). Companion: `correct_token_is_accepted`.
**Grade: MEASURED. This is the milestone headline.**
### §A2 — WS replies never transmitted (HIGH, functional) — MEASURED
`ws.rs::Connection::run` moved the socket into a recv-only task; the only
consumer of the response mpsc just did `debug!("ws emit: {msg}")` and dropped
every message. No command reply ever reached the wire.
**Real fix:** the socket is split with `futures_util::StreamExt::split`. A
dedicated **writer task** drains the response channel onto `sink.send(...)`
(text frames; a `__pong:<n>` sentinel maps to a Pong control frame); the reader
task parses commands concurrently. On reader exit the senders drop and the
writer task ends cleanly.
**Failing-on-old tests:** `result_reply_is_received` (connect → auth →
`get_states` → assert a `result` reply is RECEIVED within 5s) and
`ping_pong_reply_is_received`. Both time out on the old source (verified:
`Elapsed` panic). **Grade: MEASURED.**
### §A8 — `homecore-api` bin: no env-token path, network-exposed (HIGH, security) — MEASURED
`homecore-api/src/bin/server.rs` bound `0.0.0.0:8123` with
`SharedState::new()``allow_any_non_empty()` and **no** `HOMECORE_TOKENS`
path (unlike `homecore-server`), so a provisioned operator had no way to lock
it down.
**Real fix:** the bin now mirrors `homecore-server`'s provisioning — prefer the
`HOMECORE_TOKENS` whitelist (`LongLivedTokenStore::from_env()`), fall back to an
**explicitly warn-logged** DEV mode only when unset. It also defaults the bind
address to **`127.0.0.1`** (loopback) so a bare `cargo run` is not
network-exposed, with `HOMECORE_BIND` to opt into LAN.
**Failing-on-old test** (`tests/server_bin_auth.rs`):
`provisioned_bin_rejects_wrong_bearer` reproduces the bin's exact provisioning
path (a populated, non-dev store) and asserts a wrong bearer → 401;
`from_env_path_enforces_whitelist` proves `from_env()` is not dev mode and
enforces the list. The old bin's `allow_any_non_empty()` accepted the wrong
bearer. **Grade: MEASURED.**
### §A3 — Automation engine never started (HIGH) — MEASURED
`homecore-server/src/main.rs` did `let _automation_engine = AutomationEngine::new(...)`
then dropped it immediately, while the header doc claimed "Automation engine
subscribed to the state machine."
**Real fix:** the engine is now built into a long-lived binding and `.start()`
is called, spawning the event loop + timer task; the header/log lines state it
is started with N automations and which trigger classes are active. (With A4A7
the running engine is genuinely functional, not theater.)
**Evidence:** the engine-behavior tests below run against the same
`AutomationEngine::start()` path now wired into the bin. **Grade: MEASURED.**
### §A4 — `Trigger::Time` hard-coded `false`, no timer (HIGH) — MEASURED
`trigger.rs::matches_sync` returned `false` for `Time` and there was **no timer
task** anywhere, so time automations could never fire.
**Real fix:** `AutomationEngine::start_timer` — a 1 Hz tokio interval that
compares each `time:` automation's `at` (`HH:MM` or `HH:MM:SS`) against the
local wall-clock second and fires it once per match (conditions still gate it).
`matches_sync` returning `false` for `Time` is now **correct and documented**
(it is a wall-clock trigger with no state-change context); a public
`fire_time_for_test` exposes the same path deterministically.
**Failing-on-old test** (`tests/engine_behaviors.rs`):
`time_trigger_fires_via_timer_path` (+ unit `time_at_matches_handles_hh_mm_and_hh_mm_ss`).
The method does not exist on the old engine. **Grade: MEASURED.**
### §A5 — `RunMode` documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED
`engine.rs` doc claimed "RunMode::Single is enforced via a per-automation
AtomicBool" — but no such code existed and **every** trigger spawned an
unbounded parallel task regardless of `mode`.
**Real fix:** each registered automation carries a `running: Arc<AtomicBool>`.
`Single`/`IgnoreFirst` modes `compare_exchange` the flag before spawning and
**skip** the trigger if a run is already in flight, clearing it on completion;
`Parallel` (and, for now, `Restart`/`Queued`) spawn on every trigger.
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
`single_mode_does_not_double_fire_on_rapid_triggers` (two rapid triggers while
the first run sleeps → exactly **1** run; old code fired **2**, verified) and
`parallel_mode_does_fire_concurrently` (→ 2). **Grade: MEASURED (Single/Parallel
honored; bounded `Queued`/`Restart`/`max` ordering → ACCEPTED-FUTURE, see below).**
### §A6 — `Action::Choose` ignored branches (HIGH) — MEASURED
`action.rs` discarded `choices` and always ran `default`.
**Real fix:** `ChoiceBranch::matches` deserialises each branch's
`serde_yaml::Value` conditions into `Condition` and evaluates them (AND
semantics, against an `EvalContext` now carried on `ExecutionContext`). `Choose`
runs the **first matching branch's** sequence and falls to `default` only if
none match.
**Failing-on-old tests** (`action.rs` inline):
`choose_runs_matching_branch_not_default` (matching branch runs, default does
NOT — old code ran default, verified) and
`choose_falls_to_default_when_no_branch_matches`. **Grade: MEASURED.**
### §A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED
`condition.rs` returned `false` for `Template` whenever `template_env` was
`None`, and the engine built every `EvalContext` with `template_env: None`
(`EvalContext::new`), so `template:` conditions could never be true in
production — only in unit tests that hand-built a template env.
**Real fix:** the engine constructs one `TemplateEnvironment` over the state
machine and threads it into every `EvalContext` via
`EvalContext::with_templates` (event loop, timer task, and
`ExecutionContext` for `Choose` branches).
**Failing-on-old tests** (`tests/engine_behaviors.rs`):
`template_condition_evaluates_true_in_engine` (a `{{ is_state(...) }}` condition
gates an action true) and `template_condition_evaluates_false_blocks_action`.
On the old engine the action never ran (template always false, verified).
**Grade: MEASURED.**
### §B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled
`homecore-plugins/src/manifest.rs` documented `wasm_module_hash` as "verified
before execution" and carried `wasm_module_sig` / `publisher_key`, but these
fields are **never read** for verification (only ever set to `None` in tests).
**Fix (honest labeling — no false capability claimed):** the three fields are
re-doc'd **"(P4 — not yet enforced, ADR-161/B5)"** — parsed and round-tripped,
but no integrity/signature check happens before a plugin runs. No verification
code was added (that is P4); the doc now matches the code.
**Grade: doc-honesty (no behavior change).** *(Superseded by ADR-162 §P4:
the hash/signature gate is now implemented and enforced.)*
## Negative Results (NO-ACTION positives — audited, found correct, cited not edited)
These were checked and are genuinely sound/honest; cited as positives, **not**
touched:
- **CSPRNG correctness** — all IDs are `uuid::v4`; the rng/`randn` suspicion was
**REFUTED**. No weak-randomness issue exists.
- **CORS allowlist** (`app.rs`) — already hardened (explicit `AllowOrigin::list`,
no `permissive()`, `allow_credentials(false)`, env override). NO-ACTION.
- **No path traversal in `homecore-migrate`** — audited, clean.
- **No secrets in logs** — audited, clean.
- **HAP pairing stub** — honestly disclaimed as a surface stub; not over-claimed.
- **`InProcessRuntime` "no sandbox" disclaimer** — honest; left as-is.
## Deferred Backlog (Nothing Dropped)
- **Plugin authority-isolation (P5)** — ~~`homecore_permissions` claims are parsed
but not enforced at the host-call boundary.~~ **DONE — ADR-162 §P5.**
`hc_state_set` now consults a `PermissionSet` distilled from the manifest;
an undeclared write returns a typed `-3` to the guest.
- **Plugin signature/hash verification (P4)** — ~~implement the
`wasm_module_hash`/`wasm_module_sig`/`publisher_key` gate that B5 now honestly
says is absent.~~ **DONE — ADR-162 §P4.** `WasmtimeRuntime::load_plugin` now
SHA-256-checks the module, Ed25519-verifies the signature against
`publisher_key`, and enforces a `PluginPolicy` trust allowlist
(secure-default rejects unsigned/untrusted/tampered modules).
- **HAP real pairing (P2)** — SRP/HKDF pairing + encrypted sessions; current
bridge is an accessory-mapping surface. **ACCEPTED-FUTURE (honestly stubbed).**
- **`RunMode::Queued`/`Restart`/`max` ordering** — ~~`Single`/`Parallel` are
honored; bounded queueing, restart-kill, and `max` concurrency are not yet
wired (every non-Single mode is parallel).~~ **DONE — ADR-162 §A5.** Restart
aborts the in-flight task, Queued serializes via a per-automation async mutex,
and `max: N` caps concurrency via a per-automation semaphore.
- **Automation YAML load-at-boot** — the engine starts empty; a YAML loader is
P-next. The bin log states "0 automations registered" honestly.
## Reproduction (MEASURED)
```bash
cd v2
cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
cargo test -p homecore-plugins --features wasmtime
cargo build --workspace --no-default-features
```
Result at time of writing (all 0 failed):
- **homecore-api** — **25 passed** (lib 18; `server_bin_auth` 3; `ws_handshake` 4)
- **homecore-automation** — **42 passed** (lib 37; `engine_behaviors` 5)
- **homecore-hap** — **17 passed**
- **homecore-server** — bin, **0 tests**
- (**homecore-plugins** — **15 passed**: lib 12; integration 3)
- Full workspace `cargo build --workspace --no-default-features` succeeds.
## Consequences
- The WebSocket path can no longer be entered with a forged token — it enforces
the same `LongLivedTokenStore` whitelist as REST (A1).
- WS clients now actually receive `result`/`pong`/`event` frames (A2).
- The `homecore-api` dev bin defaults to loopback and honors `HOMECORE_TOKENS`
(A8); it is no longer an open `0.0.0.0` accept-any endpoint by default.
- The automation engine is started for real and its time triggers, `Single`
run-mode, `Choose` branches, and `template:` conditions all function — no doc
claims a capability the code lacks (A3A7).
- The plugin manifest no longer claims signature verification it does not
perform (B5).
- Files kept under the 500-line guideline (`engine.rs` 462; behavioral tests
moved to `tests/engine_behaviors.rs`).