14 KiB
ADR-161: HOMECORE Server Layer — WebSocket Auth Bypass, Reply-Theater & Documented-but-No-Op Automation (Security & Honest Labeling)
- Status: accepted
- Date: 2026-06-12
- Deciders: ruv
- Tags: homecore, http-ws-boundary, websocket-auth-bypass, security, automation-engine, documented-no-op, prove-everything, soundness, honest-labeling
- Amends: ADR-130 (HOMECORE-API WS protocol), ADR-129 (HOMECORE-AUTO automation engine), ADR-128 (plugin manifest)
Context
Beyond-SOTA sweep Milestone 7, over the HOMECORE server/network layer
crates only — homecore-api, homecore-server, homecore-automation,
homecore-hap, homecore-plugins — executed under the project's
prove-everything / anti-"AI-slop" directive.
Headline — the library cores are real, but the network boundary was unsound
The same audit pattern as ADR-160 held for the library logic: the automation trigger/condition/template/action evaluators, the REST handlers, the HAP mapping, and the plugin manifest parser are real, tested code — not stubs. That is the anti-slop positive and it is cited here as such.
What the audit found was not fake business logic but an unsound trust boundary plus documented-but-no-op features:
- A CRITICAL WebSocket authentication bypass — the WS handshake accepted any non-empty token, ignoring the provisioned token whitelist the REST path enforces.
- Reply-theater — WS command responses were computed, then logged and
discarded; no
result/pong/eventever reached the client. - Documented-but-idle automation — the engine was constructed and dropped
(never started); time triggers,
RunMode,Choosebranches, and template conditions were each documented as working but were no-ops in the live path.
This is a worse class than ADR-160's over-naming: here the doc claimed a capability the code did not deliver (auth enforcement, reply transport, running automations). The fix is implement where feasible, honestly relabel where not — never leave a false doc. Every fix is pinned by a test that fails on the old code.
Grading vocabulary (ADR-152 / ADR-158 / ADR-160):
- MEASURED — reproduced in this worktree, command + failing-on-old test recorded.
- NO-ACTION (already-honest/already-hardened) — audited, found correct, cited as a positive.
- ACCEPTED-FUTURE — deliberately deferred, nothing dropped.
Decision — Fixes Landed
§A1 — WebSocket auth bypass (CRITICAL, security) — MEASURED
homecore-api/src/ws.rs handshake checked only token.trim().is_empty() and
sent auth_ok for any non-empty token. It never called
state.tokens().is_valid() — the check the REST path uses via
auth::BearerAuth. With a provisioned HOMECORE_TOKENS whitelist, any
attacker-chosen non-empty token got full WS access (read all states, call any
service, subscribe to all events).
Real fix: the handshake now calls
state.tokens().is_valid(&token).await (the same store + method as REST).
A wrong token receives auth_invalid and the socket closes. DEV (allow_any)
mode still accepts any non-empty bearer with a warn, so smoke tests keep
working; the empty token is rejected inside is_valid.
Failing-on-old test (tests/ws_handshake.rs):
wrong_token_is_rejected — provisions a real (non-dev) store with one good
token, sends a DIFFERENT non-empty token over the WS handshake, asserts
auth_invalid. On the old source the client received
{"type":"auth_ok",…} (verified: the test panics on old ws.rs with
left: "auth_ok", right: "auth_invalid"). Companion: correct_token_is_accepted.
Grade: MEASURED. This is the milestone headline.
§A2 — WS replies never transmitted (HIGH, functional) — MEASURED
ws.rs::Connection::run moved the socket into a recv-only task; the only
consumer of the response mpsc just did debug!("ws emit: {msg}") and dropped
every message. No command reply ever reached the wire.
Real fix: the socket is split with futures_util::StreamExt::split. A
dedicated writer task drains the response channel onto sink.send(...)
(text frames; a __pong:<n> sentinel maps to a Pong control frame); the reader
task parses commands concurrently. On reader exit the senders drop and the
writer task ends cleanly.
Failing-on-old tests: result_reply_is_received (connect → auth →
get_states → assert a result reply is RECEIVED within 5s) and
ping_pong_reply_is_received. Both time out on the old source (verified:
Elapsed panic). Grade: MEASURED.
§A8 — homecore-api bin: no env-token path, network-exposed (HIGH, security) — MEASURED
homecore-api/src/bin/server.rs bound 0.0.0.0:8123 with
SharedState::new() → allow_any_non_empty() and no HOMECORE_TOKENS
path (unlike homecore-server), so a provisioned operator had no way to lock
it down.
Real fix: the bin now mirrors homecore-server's provisioning — prefer the
HOMECORE_TOKENS whitelist (LongLivedTokenStore::from_env()), fall back to an
explicitly warn-logged DEV mode only when unset. It also defaults the bind
address to 127.0.0.1 (loopback) so a bare cargo run is not
network-exposed, with HOMECORE_BIND to opt into LAN.
Failing-on-old test (tests/server_bin_auth.rs):
provisioned_bin_rejects_wrong_bearer reproduces the bin's exact provisioning
path (a populated, non-dev store) and asserts a wrong bearer → 401;
from_env_path_enforces_whitelist proves from_env() is not dev mode and
enforces the list. The old bin's allow_any_non_empty() accepted the wrong
bearer. Grade: MEASURED.
§A3 — Automation engine never started (HIGH) — MEASURED
homecore-server/src/main.rs did let _automation_engine = AutomationEngine::new(...)
then dropped it immediately, while the header doc claimed "Automation engine
subscribed to the state machine."
Real fix: the engine is now built into a long-lived binding and .start()
is called, spawning the event loop + timer task; the header/log lines state it
is started with N automations and which trigger classes are active. (With A4–A7
the running engine is genuinely functional, not theater.)
Evidence: the engine-behavior tests below run against the same
AutomationEngine::start() path now wired into the bin. Grade: MEASURED.
§A4 — Trigger::Time hard-coded false, no timer (HIGH) — MEASURED
trigger.rs::matches_sync returned false for Time and there was no timer
task anywhere, so time automations could never fire.
Real fix: AutomationEngine::start_timer — a 1 Hz tokio interval that
compares each time: automation's at (HH:MM or HH:MM:SS) against the
local wall-clock second and fires it once per match (conditions still gate it).
matches_sync returning false for Time is now correct and documented
(it is a wall-clock trigger with no state-change context); a public
fire_time_for_test exposes the same path deterministically.
Failing-on-old test (tests/engine_behaviors.rs):
time_trigger_fires_via_timer_path (+ unit time_at_matches_handles_hh_mm_and_hh_mm_ss).
The method does not exist on the old engine. Grade: MEASURED.
§A5 — RunMode documented as AtomicBool-enforced but unbounded-parallel (HIGH) — MEASURED
engine.rs doc claimed "RunMode::Single is enforced via a per-automation
AtomicBool" — but no such code existed and every trigger spawned an
unbounded parallel task regardless of mode.
Real fix: each registered automation carries a running: Arc<AtomicBool>.
Single/IgnoreFirst modes compare_exchange the flag before spawning and
skip the trigger if a run is already in flight, clearing it on completion;
Parallel (and, for now, Restart/Queued) spawn on every trigger.
Failing-on-old tests (tests/engine_behaviors.rs):
single_mode_does_not_double_fire_on_rapid_triggers (two rapid triggers while
the first run sleeps → exactly 1 run; old code fired 2, verified) and
parallel_mode_does_fire_concurrently (→ 2). Grade: MEASURED (Single/Parallel
honored; bounded Queued/Restart/max ordering → ACCEPTED-FUTURE, see below).
§A6 — Action::Choose ignored branches (HIGH) — MEASURED
action.rs discarded choices and always ran default.
Real fix: ChoiceBranch::matches deserialises each branch's
serde_yaml::Value conditions into Condition and evaluates them (AND
semantics, against an EvalContext now carried on ExecutionContext). Choose
runs the first matching branch's sequence and falls to default only if
none match.
Failing-on-old tests (action.rs inline):
choose_runs_matching_branch_not_default (matching branch runs, default does
NOT — old code ran default, verified) and
choose_falls_to_default_when_no_branch_matches. Grade: MEASURED.
§A7 — Template conditions always false in the live engine (MEDIUM) — MEASURED
condition.rs returned false for Template whenever template_env was
None, and the engine built every EvalContext with template_env: None
(EvalContext::new), so template: conditions could never be true in
production — only in unit tests that hand-built a template env.
Real fix: the engine constructs one TemplateEnvironment over the state
machine and threads it into every EvalContext via
EvalContext::with_templates (event loop, timer task, and
ExecutionContext for Choose branches).
Failing-on-old tests (tests/engine_behaviors.rs):
template_condition_evaluates_true_in_engine (a {{ is_state(...) }} condition
gates an action true) and template_condition_evaluates_false_blocks_action.
On the old engine the action never ran (template always false, verified).
Grade: MEASURED.
§B5 — Plugin manifest sig/hash "verified before execution" doc was false (LOW, honesty) — relabeled
homecore-plugins/src/manifest.rs documented wasm_module_hash as "verified
before execution" and carried wasm_module_sig / publisher_key, but these
fields are never read for verification (only ever set to None in tests).
Fix (honest labeling — no false capability claimed): the three fields are re-doc'd "(P4 — not yet enforced, ADR-161/B5)" — parsed and round-tripped, but no integrity/signature check happens before a plugin runs. No verification code was added (that is P4); the doc now matches the code. Grade: doc-honesty (no behavior change). (Superseded by ADR-162 §P4: the hash/signature gate is now implemented and enforced.)
Negative Results (NO-ACTION positives — audited, found correct, cited not edited)
These were checked and are genuinely sound/honest; cited as positives, not touched:
- CSPRNG correctness — all IDs are
uuid::v4; the rng/randnsuspicion was REFUTED. No weak-randomness issue exists. - CORS allowlist (
app.rs) — already hardened (explicitAllowOrigin::list, nopermissive(),allow_credentials(false), env override). NO-ACTION. - No path traversal in
homecore-migrate— audited, clean. - No secrets in logs — audited, clean.
- HAP pairing stub — honestly disclaimed as a surface stub; not over-claimed.
InProcessRuntime"no sandbox" disclaimer — honest; left as-is.
Deferred Backlog (Nothing Dropped)
- Plugin authority-isolation (P5) —
DONE — ADR-162 §P5.homecore_permissionsclaims are parsed but not enforced at the host-call boundary.hc_state_setnow consults aPermissionSetdistilled from the manifest; an undeclared write returns a typed-3to the guest. - Plugin signature/hash verification (P4) —
implement theDONE — ADR-162 §P4.wasm_module_hash/wasm_module_sig/publisher_keygate that B5 now honestly says is absent.WasmtimeRuntime::load_pluginnow SHA-256-checks the module, Ed25519-verifies the signature againstpublisher_key, and enforces aPluginPolicytrust allowlist (secure-default rejects unsigned/untrusted/tampered modules). - HAP real pairing (P2) — SRP/HKDF pairing + encrypted sessions; current bridge is an accessory-mapping surface. ACCEPTED-FUTURE (honestly stubbed).
RunMode::Queued/Restart/maxordering —DONE — ADR-162 §A5. Restart aborts the in-flight task, Queued serializes via a per-automation async mutex, andSingle/Parallelare honored; bounded queueing, restart-kill, andmaxconcurrency are not yet wired (every non-Single mode is parallel).max: Ncaps concurrency via a per-automation semaphore.- Automation YAML load-at-boot — the engine starts empty; a YAML loader is P-next. The bin log states "0 automations registered" honestly.
Reproduction (MEASURED)
cd v2
cargo test -p homecore-api -p homecore-server -p homecore-automation -p homecore-hap --no-default-features
cargo test -p homecore-plugins --features wasmtime
cargo build --workspace --no-default-features
Result at time of writing (all 0 failed):
- homecore-api — 25 passed (lib 18;
server_bin_auth3;ws_handshake4) - homecore-automation — 42 passed (lib 37;
engine_behaviors5) - homecore-hap — 17 passed
- homecore-server — bin, 0 tests
- (homecore-plugins — 15 passed: lib 12; integration 3)
- Full workspace
cargo build --workspace --no-default-featuressucceeds.
Consequences
- The WebSocket path can no longer be entered with a forged token — it enforces
the same
LongLivedTokenStorewhitelist as REST (A1). - WS clients now actually receive
result/pong/eventframes (A2). - The
homecore-apidev bin defaults to loopback and honorsHOMECORE_TOKENS(A8); it is no longer an open0.0.0.0accept-any endpoint by default. - The automation engine is started for real and its time triggers,
Singlerun-mode,Choosebranches, andtemplate:conditions all function — no doc claims a capability the code lacks (A3–A7). - The plugin manifest no longer claims signature verification it does not perform (B5).
- Files kept under the 500-line guideline (
engine.rs462; behavioral tests moved totests/engine_behaviors.rs).