diff --git a/CHANGELOG.md b/CHANGELOG.md index e32a5fb1..4bd369b0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] +### Known Issues +- **`model.safetensors` published at `huggingface.co/ruvnet/wifi-densepose-pretrained` is malformed** and rejected by every strict safetensors reader (Rust `safetensors` crate, Candle, Python `safetensors.torch.load_file`). The 8-byte header-length prefix declares 1464 bytes but the JSON document ends at byte 1461 — the trailing 3 padding bytes are literal `\x00` instead of the spec-mandated `0x20` (space). Strict readers fail with `SafetensorError: trailing characters at line 1 column 1462`. Lenient readers (the JS `SafeTensorsReader` in `vendor/ruvector` and the hand-rolled `load_safetensors` in `scripts/export-onnx.py`) accept the file because they strip trailing NULs before `JSON.parse`. **Bundle needs to be re-exported and re-published** once the upstream writer is patched. Full byte-level analysis, repro, and proposed upstream fix in [`docs/huggingface/SAFETENSORS-HEADER-BUG.md`](docs/huggingface/SAFETENSORS-HEADER-BUG.md). Origin: `SafeTensorsWriter.build()` in `vendor/ruvector/npm/packages/ruvllm/src/export.js` (separate repo, `ruvnet/ruvector`) leaves the padding zone of a zero-initialised `Uint8Array` untouched after copying in the JSON bytes — three trainer scripts (`train-wiflow.js`, `train-ruvllm.js`, `train-camera-free.js`) go through this code path. + +### Added +- **Workaround utility for the safetensors header bug**: `scripts/fix-safetensors-header.py` loads any `.safetensors` file, detects `\x00` padding in the header zone, and rewrites it in-place with `0x20` (space) padding. Declared header length, JSON content, and every tensor byte are preserved — only the padding bytes flip from NUL to space, so the tensor-data hash is unchanged. Idempotent; supports `--dry-run`. 
Lets users patch the broken HuggingFace artifact locally until the upstream writer is fixed and the model is re-uploaded. + ### Security - **ESP32 OTA upload now fails closed when no PSK is provisioned** (#596 audit finding — critical, **breaking change for unprovisioned nodes**). `ota_check_auth()` previously returned `true` when `s_ota_psk[0] == '\0'`, so a freshly-flashed node would accept attacker-controlled firmware over plain HTTP on port 8032 from any host on the WiFi. No Secure Boot V2, no signed-image verification — a single LAN call could brick or backdoor a node. The fix rejects every OTA upload until a PSK is written to NVS (the OTA HTTP server still starts so operators can run `provision.py --ota-psk ` over USB-CDC without reflashing). **Operators affected**: any deployment that relied on the unauthenticated OTA endpoint working out of the box now needs to provision a PSK before subsequent OTA pushes will succeed. Boot-time `ESP_LOGW` makes the new posture visible. - **Path-traversal vulnerabilities patched in five sensing-server endpoints** (closes #615 — critical). New `wifi_densepose_sensing_server::path_safety::safe_id()` enforces `[A-Za-z0-9._-]` only (no leading `.`, max 64 chars) before any user-controlled identifier reaches a `format!()` building a filesystem path. Applied at: diff --git a/docs/huggingface/SAFETENSORS-HEADER-BUG.md b/docs/huggingface/SAFETENSORS-HEADER-BUG.md new file mode 100644 index 00000000..c1050e76 --- /dev/null +++ b/docs/huggingface/SAFETENSORS-HEADER-BUG.md @@ -0,0 +1,214 @@ +# Safetensors Header Padding Bug — `ruvnet/wifi-densepose-pretrained` + +**Status:** Open. Affects the `model.safetensors` file currently published at +[`huggingface.co/ruvnet/wifi-densepose-pretrained`](https://huggingface.co/ruvnet/wifi-densepose-pretrained). +Workaround available — see [Workaround](#workaround) below. 
+ +## TL;DR + +The header in our published `model.safetensors` is padded to an 8-byte boundary +with literal `\x00` bytes instead of the `0x20` (space) padding the +[safetensors spec](https://github.com/huggingface/safetensors#format) requires. +Strict readers — including the Rust `safetensors` crate, Candle, and the Python +`safetensors.torch.load_file` helper that wraps the Rust binding — reject the +file with `SafetensorError: trailing characters at line 1 column 1462`. Lenient +readers (e.g. the hand-rolled parsers in `scripts/export-onnx.py` and the JS +`SafeTensorsReader` in `vendor/ruvector/.../export.js`) accept it because they +strip trailing NULs before `JSON.parse`. + +## Byte-level evidence + +Inspecting the file downloaded from the HF repo: + +| Offset | Bytes | Meaning | +|--------|-------|---------| +| `0..8` | `b8 05 00 00 00 00 00 00` | `u64 little-endian` declared header length = **1464** | +| `8..1469` | `{"...":{...}}` (1461 JSON bytes) | The actual JSON header terminates at byte **1461** | +| `1469..1472` | `00 00 00` | **Three NUL bytes** padding the JSON up to the declared 1464 | +| `1472..EOF` | `...` | Tensor data section | + +`1461 % 8 == 5`, so the writer pads 3 bytes to reach the next 8-byte boundary +(1464). The padding bytes are left as `\x00` because the writer zero-initializes +the buffer up front and never overwrites the padding zone. + +## What the spec actually says + +[https://github.com/huggingface/safetensors#format](https://github.com/huggingface/safetensors#format) + +> 8 bytes: N, an unsigned little-endian 64-bit integer, containing the size of +> the header. +> +> N bytes: a JSON UTF-8 string representing the header. The header data MUST +> begin with a `{` character (0x7B). The header data MAY be trailing padded with +> whitespace (0x20). + +Whitespace = `0x20` (space). NUL (`0x00`) is not whitespace, and the strict +parsers correctly refuse to ignore it. 
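The layout quoted above can be checked mechanically. The following is a minimal sketch, not the repository's `scripts/fix-safetensors-header.py`; the names `classify_header_padding` and `build` are hypothetical helpers introduced only for this illustration. It reads the 8-byte little-endian length prefix, finds where the JSON document ends, and reports which byte value fills the remaining padding.

```python
import json
import struct


def classify_header_padding(blob: bytes) -> str:
    """Return 'none', 'space', 'nul', or 'other' for a safetensors header."""
    (declared_len,) = struct.unpack("<Q", blob[:8])  # N: u64 little-endian
    header = blob[8:8 + declared_len]
    text = header.decode("utf-8")
    # raw_decode stops at the end of the first JSON value and ignores the rest.
    _, end_chars = json.JSONDecoder().raw_decode(text)
    json_len = len(text[:end_chars].encode("utf-8"))  # char index -> byte length
    padding = header[json_len:]
    if not padding:
        return "none"
    if set(padding) == {0x20}:
        return "space"
    if set(padding) == {0x00}:
        return "nul"
    return "other"


def build(header_json: bytes, pad_byte: bytes) -> bytes:
    """Hypothetical helper: length prefix + header padded to 8 bytes with pad_byte."""
    pad = (8 - len(header_json) % 8) % 8
    padded = header_json + pad_byte * pad
    return struct.pack("<Q", len(padded)) + padded


hdr = json.dumps(
    {"w": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
).encode("utf-8")
print(classify_header_padding(build(hdr, b"\x00") + bytes(4)))  # NUL-padded, like the published file
print(classify_header_padding(build(hdr, b" ") + bytes(4)))     # space-padded, as the spec allows
```

Using `raw_decode` rather than `json.loads` is deliberate: `json.loads` would raise on the NUL padding, whereas `raw_decode` returns the end position of the JSON value so the padding can be inspected separately.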
+ +## Where the bug originates + +The bad header is produced by `SafeTensorsWriter.build()` in +[`vendor/ruvector/npm/packages/ruvllm/src/export.js`](../../vendor/ruvector/npm/packages/ruvllm/src/export.js) +(part of the vendored `ruvnet/ruvector` submodule, source at +[https://github.com/ruvnet/ruvector](https://github.com/ruvnet/ruvector)), +specifically lines 95-105: + +```js +// Pad header to 8-byte alignment +const headerPadding = (8 - (headerBytes.length % 8)) % 8; +const paddedHeaderLength = headerBytes.length + headerPadding; +// ... +const totalLength = 8 + paddedHeaderLength + offset; +const buffer = new Uint8Array(totalLength); // zero-initialised +const view = new DataView(buffer.buffer); +view.setBigUint64(0, BigInt(paddedHeaderLength), true); +buffer.set(headerBytes, 8); // padding zone untouched +``` + +`new Uint8Array(totalLength)` zero-fills the buffer, then only the JSON bytes +are copied in. The padding region between `headerBytes.length` and +`paddedHeaderLength` is never overwritten, so it stays `\x00`. + +The corresponding `SafeTensorsReader.parseHeader()` in the same file masks the +bug by stripping trailing NULs (`headerJson.replace(/\0+$/, '')`) before +`JSON.parse`. Round-tripping through the same writer/reader pair therefore +succeeds, and the bug only surfaces in third-party strict readers. + +Three trainer scripts go through this exact code path, and a fourth imports the same writer: + +- `scripts/train-wiflow.js`: `SafeTensorsWriter` → `model.safetensors` (line 933) +- `scripts/train-ruvllm.js`: same (line 1541) +- `scripts/train-camera-free.js`: same (line 2276) +- `scripts/train-wiflow-supervised.js`: same import (line 60) + +The HF publisher (`scripts/publish-huggingface.py`) just uploads whatever files +sit in `dist/models/`; it does not generate or modify the `.safetensors` bytes, +so the fix is **not** in this repo's publishing script. 
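To make the missing step concrete, here is a minimal Python sketch of the same buffer layout with the padding zone explicitly overwritten. It is an illustration under assumptions, not the upstream patch to `export.js`: the function name `build_safetensors` and the sample header are hypothetical, and the real writer also computes tensor offsets, which this sketch takes as given.

```python
import json
import struct


def build_safetensors(header: dict, tensor_data: bytes) -> bytes:
    """Serialise a header dict plus raw tensor bytes, padding the header with spaces."""
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    header_padding = (8 - len(header_bytes) % 8) % 8
    padded_len = len(header_bytes) + header_padding

    buf = bytearray(8 + padded_len + len(tensor_data))  # zero-initialised
    buf[0:8] = struct.pack("<Q", padded_len)            # u64 little-endian length
    buf[8:8 + len(header_bytes)] = header_bytes
    # The step the quoted writer omits: overwrite the padding zone with 0x20.
    buf[8 + len(header_bytes):8 + padded_len] = b" " * header_padding
    buf[8 + padded_len:] = tensor_data
    return bytes(buf)


# Hypothetical single-tensor header, used only to exercise the function.
header = {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
blob = build_safetensors(header, bytes(8))
(n,) = struct.unpack("<Q", blob[:8])
# Trailing spaces are valid JSON whitespace, so a plain json.loads accepts the header.
parsed = json.loads(blob[8:8 + n].decode("utf-8"))
```

The declared length and the tensor offsets are identical to what the NUL-padding writer would produce; only the fill byte of the padding differs.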
+ +The Python writer used by `scripts/train-count.py::write_safetensors` (lines +128-167) produces `count_v1.safetensors` and is independent of the JS writer. +It writes the JSON header at exactly its UTF-8 byte length with no padding, +which is also spec-compliant (the spec allows no padding), so that writer is +**not** affected. + +## Affected consumers + +| Reader | Behaviour | +|--------|-----------| +| Rust `safetensors::SafeTensors::deserialize` (`safetensors 0.4.x` / `0.5.x` / `0.7.x`) | **Rejects** with `Error while deserializing header: invalid JSON in header: trailing characters at line 1 column 1462` | +| Candle (`candle_core::safetensors::load`, uses the Rust crate) | **Rejects** with the same error | +| Python `safetensors.torch.load_file` (wraps the Rust crate) | **Rejects** with `SafetensorError: trailing characters at line 1 column 1462` | +| Python `safetensors.safe_open` | **Rejects** with the same error | +| HuggingFace Hub safetensors metadata indexer | Marks the file as malformed in the repo's metadata view | +| `scripts/export-onnx.py::load_safetensors` (our hand-rolled reader) | **Accepts**. It reads `f.read(header_len)` and, as noted in the TL;DR, strips the trailing NULs before parsing. Python's `json.loads` does not treat NUL as whitespace, so without that stripping it would reject the header with an `Extra data` error. 
| +| `SafeTensorsReader.parseHeader()` (JS, in the vendored ruvllm) | **Accepts** — strips trailing NULs explicitly | + +## Repro + +A 10-line script that reproduces the exact strict failure mode against a +synthetic file constructed the same way the buggy writer does: + +```python +import json, struct, tempfile, os +from safetensors import safe_open + +tensors = {"lora.A": {"dtype": "F32", "shape": [4, 4], "data_offsets": [0, 64]}, + "lora.B": {"dtype": "F32", "shape": [4, 4], "data_offsets": [64, 128]}} +hdr = json.dumps(tensors).encode("utf-8") +pad = (8 - len(hdr) % 8) % 8 # mimic the JS writer +buf = bytearray(8 + len(hdr) + pad + 128) # zero-initialised, like new Uint8Array(...) +buf[0:8] = struct.pack("` flag | binary RVF (`RVFS` magic) | ⚠️ Does **not** accept the JSONL file yet — see gap below | **Known gap (tracked):** `v2/crates/wifi-densepose-sensing-server/src/rvf_container.rs` only parses the binary RVF segment format (magic `0x52564653`). Pointing `--model` at `model.rvf.jsonl` causes the progressive loader to error with `invalid magic at offset 0: expected 0x52564653, got 0x7974227B` (`0x7974227B` is the ASCII bytes `{"ty…` from the JSONL header), and the live pipeline degrades to null output rather than falling back to heuristic mode. Until a JSONL adapter lands (or the model is re-published as binary RVF), run the sensing-server **without** `--model` and consume the HF weights from Python or the training pipeline. ```bash -# Works today — Python side (training, evaluation, embedding extraction): +# Step 1 (REQUIRED until republish): patch the broken safetensors header in place. +# The published file pads the 8-byte-aligned header with NUL bytes instead of the +# spec-required 0x20 spaces, so strict readers reject it with +# `SafetensorError: trailing characters at line 1 column 1462`. The fix only +# touches padding bytes; tensor data and declared header length are unchanged. +# See docs/huggingface/SAFETENSORS-HEADER-BUG.md for the full analysis. 
+python scripts/fix-safetensors-header.py \ + models/wifi-densepose-pretrained/model.safetensors + +# Step 2: load with the strict Python reader (training, evaluation, embedding extraction). python -c " from safetensors.torch import load_file state = load_file('models/wifi-densepose-pretrained/model.safetensors')