docs(huggingface): document safetensors header padding bug + workaround

The model.safetensors file currently published at huggingface.co/ruvnet/wifi-densepose-pretrained has a malformed header: the 8-byte u64 declares 1464 header bytes, the JSON document ends at byte 1461, and the last 3 bytes of the header zone are literal 0x00 padding instead of the spec-required 0x20 spaces. Strict safetensors readers (Rust safetensors crate, Candle, safetensors.torch.load_file) reject it with 'SafetensorError: trailing characters at line 1 column 1462'.

This commit:
- adds docs/huggingface/SAFETENSORS-HEADER-BUG.md with byte-level evidence, spec citation, source-of-bug location (the SafeTensorsWriter in vendor/ruvector/.../export.js, which lives in the separate ruvnet/ruvector repo), list of three trainer scripts that go through this path (train-wiflow.js, train-ruvllm.js, train-camera-free.js), table of affected vs lenient consumers, a short strict-reader repro that reproduces the exact error class against a synthetic file, proposed upstream fix (0x20 padding or no padding), and a follow-ups checklist including the need to re-train/re-export and re-upload the HF artifact
- flags the bundle as needing republish under [Unreleased] in CHANGELOG.md
- updates the HF model section of docs/user-guide.md so the load example now patches the header with scripts/fix-safetensors-header.py before calling safetensors.torch.load_file (which would otherwise crash on the current bundle), and flips the Python/PyTorch row of the consumer-status table from 'Works' to a broken-header warning that points at scripts/fix-safetensors-header.py
2026-05-25 17:03:42 -06:00 · 67d186549a
parent 5354726d15
commit 67d186549a
3 changed files with 231 additions and 2 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]

+### Known Issues
+- **`model.safetensors` published at `huggingface.co/ruvnet/wifi-densepose-pretrained` is malformed** and rejected by every strict safetensors reader (Rust `safetensors` crate, Candle, Python `safetensors.torch.load_file`). The 8-byte header-length prefix declares 1464 bytes but the JSON document ends at byte 1461 — the trailing 3 padding bytes are literal `\x00` instead of the spec-mandated `0x20` (space). Strict readers fail with `SafetensorError: trailing characters at line 1 column 1462`. Lenient readers (the JS `SafeTensorsReader` in `vendor/ruvector` and the hand-rolled `load_safetensors` in `scripts/export-onnx.py`) accept the file because they strip trailing NULs before `JSON.parse`. **Bundle needs to be re-exported and re-published** once the upstream writer is patched. Full byte-level analysis, repro, and proposed upstream fix in [`docs/huggingface/SAFETENSORS-HEADER-BUG.md`](docs/huggingface/SAFETENSORS-HEADER-BUG.md). Origin: `SafeTensorsWriter.build()` in `vendor/ruvector/npm/packages/ruvllm/src/export.js` (separate repo, `ruvnet/ruvector`) leaves the padding zone of a zero-initialised `Uint8Array` untouched after copying in the JSON bytes — three trainer scripts (`train-wiflow.js`, `train-ruvllm.js`, `train-camera-free.js`) go through this code path.
+
+### Added
+- **Workaround utility for the safetensors header bug**: `scripts/fix-safetensors-header.py` loads any `.safetensors` file, detects `\x00` padding in the header zone, and rewrites it in-place with `0x20` (space) padding. Declared header length, JSON content, and every tensor byte are preserved — only the padding bytes flip from NUL to space, so the tensor-data hash is unchanged. Idempotent; supports `--dry-run`. Lets users patch the broken HuggingFace artifact locally until the upstream writer is fixed and the model is re-uploaded.
+
 ### Security
 - **ESP32 OTA upload now fails closed when no PSK is provisioned** (#596 audit finding — critical, **breaking change for unprovisioned nodes**). `ota_check_auth()` previously returned `true` when `s_ota_psk[0] == '\0'`, so a freshly-flashed node would accept attacker-controlled firmware over plain HTTP on port 8032 from any host on the WiFi. No Secure Boot V2, no signed-image verification — a single LAN call could brick or backdoor a node. The fix rejects every OTA upload until a PSK is written to NVS (the OTA HTTP server still starts so operators can run `provision.py --ota-psk <hex>` over USB-CDC without reflashing). **Operators affected**: any deployment that relied on the unauthenticated OTA endpoint working out of the box now needs to provision a PSK before subsequent OTA pushes will succeed. Boot-time `ESP_LOGW` makes the new posture visible.
 - **Path-traversal vulnerabilities patched in five sensing-server endpoints** (closes #615 — critical). New `wifi_densepose_sensing_server::path_safety::safe_id()` enforces `[A-Za-z0-9._-]` only (no leading `.`, max 64 chars) before any user-controlled identifier reaches a `format!()` building a filesystem path. Applied at:
--- a/docs/huggingface/SAFETENSORS-HEADER-BUG.md
+++ b/docs/huggingface/SAFETENSORS-HEADER-BUG.md
@@ -0,0 +1,214 @@
+# Safetensors Header Padding Bug — `ruvnet/wifi-densepose-pretrained`
+
+**Status:** Open. Affects the `model.safetensors` file currently published at
+[`huggingface.co/ruvnet/wifi-densepose-pretrained`](https://huggingface.co/ruvnet/wifi-densepose-pretrained).
+Workaround available — see [Workaround](#workaround) below.
+
+## TL;DR
+
+The header in our published `model.safetensors` is padded to an 8-byte boundary
+with literal `\x00` bytes instead of the `0x20` (space) padding the
+[safetensors spec](https://github.com/huggingface/safetensors#format) requires.
+Strict readers — including the Rust `safetensors` crate, Candle, and the Python
+`safetensors.torch.load_file` helper that wraps the Rust binding — reject the
+file with `SafetensorError: trailing characters at line 1 column 1462`. Lenient
+readers (e.g. the hand-rolled parsers in `scripts/export-onnx.py` and the JS
+`SafeTensorsReader` in `vendor/ruvector/.../export.js`) accept it because they
+strip trailing NULs before parsing the JSON.
+
+## Byte-level evidence
+
+Inspecting the file downloaded from the HF repo:
+
+| Offset | Bytes | Meaning |
+|--------|-------|---------|
+| `0..8`     | `b8 05 00 00 00 00 00 00` | `u64 little-endian` declared header length = **1464** |
+| `8..1469`  | `{"...":{...}}` (1461 JSON bytes) | The actual JSON header terminates at byte **1461** |
+| `1469..1472` | `00 00 00` | **Three NUL bytes** padding the JSON up to the declared 1464 |
+| `1472..EOF` | `...` | Tensor data section |
+
+`1461 % 8 == 5`, so the writer pads 3 bytes to reach the next 8-byte boundary
+(1464). The padding bytes are left as `\x00` because the writer zero-initializes
+the buffer up front and never overwrites the padding zone.
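
The layout in the table above can be checked mechanically. The following is a minimal sketch using only the Python standard library; `make_sample` and `inspect_header` are hypothetical helper names introduced for illustration, and the sample is a tiny synthetic file built with the same alignment rule, not the published artifact.

```python
import json
import struct


def make_sample(pad_byte: bytes) -> bytes:
    """Build a tiny synthetic file (hypothetical helper, not the real artifact)."""
    hdr = json.dumps(
        {"t": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}
    ).encode("utf-8")
    pad = (8 - len(hdr) % 8) % 8  # same alignment rule as the JS writer
    # length prefix + JSON + padding + 4 bytes of dummy tensor data
    return struct.pack("<Q", len(hdr) + pad) + hdr + pad_byte * pad + b"\x00" * 4


def inspect_header(blob: bytes) -> dict:
    """Report the declared header length, the JSON length and the padding bytes."""
    (declared,) = struct.unpack("<Q", blob[:8])  # u64 little-endian prefix
    zone = blob[8:8 + declared]                  # header zone: JSON plus padding
    text = zone.decode("utf-8")
    # raw_decode parses one JSON value and returns the index where it stopped,
    # so everything after that index is padding.
    _, end = json.JSONDecoder().raw_decode(text)
    json_len = len(text[:end].encode("utf-8"))   # convert char index to byte count
    padding = zone[json_len:]
    return {
        "declared": declared,
        "json_len": json_len,
        "padding": padding,
        "spec_compliant": padding.strip(b" ") == b"",  # only 0x20 is allowed
    }


for label, pad_byte in (("NUL-padded", b"\x00"), ("space-padded", b" ")):
    print(label, inspect_header(make_sample(pad_byte)))
```

Passing the bytes of a downloaded `model.safetensors` to `inspect_header` should, under the analysis above, report a declared length of 1464, a JSON length of 1461 and three NUL padding bytes.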
+
+## What the spec actually says
+
+[https://github.com/huggingface/safetensors#format](https://github.com/huggingface/safetensors#format)
+
+> 8 bytes: N, an unsigned little-endian 64-bit integer, containing the size of
+> the header.
+>
+> N bytes: a JSON UTF-8 string representing the header. The header data MUST
+> begin with a `{` character (0x7B). The header data MAY be trailing padded with
+> whitespace (0x20).
+
+Whitespace = `0x20` (space). NUL (`0x00`) is not whitespace, and the strict
+parsers correctly refuse to ignore it.
+
+## Where the bug originates
+
+The bad header is produced by `SafeTensorsWriter.build()` in
+[`vendor/ruvector/npm/packages/ruvllm/src/export.js`](../../vendor/ruvector/npm/packages/ruvllm/src/export.js)
+(part of the vendored `ruvnet/ruvector` submodule, source at
+[https://github.com/ruvnet/ruvector](https://github.com/ruvnet/ruvector)),
+specifically lines 95-105:
+
+```js
+// Pad header to 8-byte alignment
+const headerPadding = (8 - (headerBytes.length % 8)) % 8;
+const paddedHeaderLength = headerBytes.length + headerPadding;
+// ...
+const totalLength = 8 + paddedHeaderLength + offset;
+const buffer = new Uint8Array(totalLength);          // zero-initialised
+const view = new DataView(buffer.buffer);
+view.setBigUint64(0, BigInt(paddedHeaderLength), true);
+buffer.set(headerBytes, 8);                          // padding zone untouched
+```
+
+`new Uint8Array(totalLength)` zero-fills the buffer, then only the JSON bytes
+are copied in. The padding region between `headerBytes.length` and
+`paddedHeaderLength` is never overwritten, so it stays `\x00`.
+
+The corresponding `SafeTensorsReader.parseHeader()` in the same file masks the
+bug by stripping trailing NULs (`headerJson.replace(/\0+$/, '')`) before
+`JSON.parse` — round-tripping through the same writer/reader pair therefore
+succeeds, and the bug only surfaces in third-party strict readers.
+
+Three trainer scripts go through this exact code path, and a fourth imports the same writer:
+
+- `scripts/train-wiflow.js` — `SafeTensorsWriter` → `model.safetensors` (line 933)
+- `scripts/train-ruvllm.js` — same (line 1541)
+- `scripts/train-camera-free.js` — same (line 2276)
+- `scripts/train-wiflow-supervised.js` — same import (line 60)
+
+The HF publisher (`scripts/publish-huggingface.py`) just uploads whatever files
+sit in `dist/models/`; it does not generate or modify the `.safetensors` bytes,
+so the fix is **not** in this repo's publishing script.
+
+The Python writer used by `scripts/train-count.py::write_safetensors` (lines
+128-167) produces `count_v1.safetensors` and is independent of the JS writer.
+It writes the JSON header at exactly its UTF-8 byte length with no padding,
+which is also spec-compliant (the spec allows no padding), so that writer is
+**not** affected.
+
+## Affected consumers
+
+| Reader | Behaviour |
+|--------|-----------|
+| Rust `safetensors::SafeTensors::deserialize` (`safetensors 0.4.x` / `0.5.x` / `0.7.x`) | **Rejects** with `Error while deserializing header: invalid JSON in header: trailing characters at line 1 column 1462` |
+| Candle (`candle_core::safetensors::load`, uses the Rust crate) | **Rejects** with the same error |
+| Python `safetensors.torch.load_file` (wraps the Rust crate) | **Rejects** with `SafetensorError: trailing characters at line 1 column 1462` |
+| Python `safetensors.safe_open` | **Rejects** with the same error |
+| HuggingFace Hub safetensors metadata indexer | Marks the file as malformed in the repo's metadata view |
+| `scripts/export-onnx.py::load_safetensors` (our hand-rolled reader) | **Accepts**. It slices `f.read(header_len)` and strips the trailing NULs before `json.loads`. Without that strip, `json.loads` would itself raise `JSONDecodeError: Extra data`, because NUL is not JSON whitespace. |
+| `SafeTensorsReader.parseHeader()` (JS, in the vendored ruvllm) | **Accepts** — strips trailing NULs explicitly |
+
+## Repro
+
+A short script that reproduces the exact strict failure mode against a
+synthetic file constructed the same way the buggy writer does:
+
+```python
+import json, struct, tempfile, os
+from safetensors import safe_open
+
+tensors = {"lora.A": {"dtype": "F32", "shape": [4, 4], "data_offsets": [0, 64]},
+           "lora.B": {"dtype": "F32", "shape": [4, 4], "data_offsets": [64, 128]}}
+hdr = json.dumps(tensors).encode("utf-8")
+pad = (8 - len(hdr) % 8) % 8                 # mimic the JS writer
+buf = bytearray(8 + len(hdr) + pad + 128)    # zero-initialised, like new Uint8Array(...)
+buf[0:8] = struct.pack("<Q", len(hdr) + pad) # declared length includes the padding
+buf[8:8 + len(hdr)] = hdr                    # JSON only; padding zone stays \x00
+fd, p = tempfile.mkstemp(suffix=".safetensors"); os.write(fd, bytes(buf)); os.close(fd)
+with safe_open(p, framework="numpy") as f:   # raises SafetensorError
+    print(list(f.keys()))
+```
+
+Running this against `safetensors==0.7.0` raises:
+
+```
+SafetensorError: Error while deserializing header: invalid JSON in header:
+trailing characters at line 1 column 143
+```
+
+(Column 143 rather than 1462 because the synthetic JSON header is only 142
+bytes long, so the first NUL sits at column 143. The **class** of error is
+identical: on the real artifact the 1461-byte JSON puts the first NUL at
+column 1462.)
+
+## Proposed upstream fix
+
+In `vendor/ruvector/npm/packages/ruvllm/src/export.js`, the writer must
+either:
+
+**Option A — spec-correct padding (preferred):** fill the padding zone with
+`0x20` instead of leaving it `\x00`:
+
+```js
+const buffer = new Uint8Array(totalLength);
+buffer.fill(0x20, 8 + headerBytes.length, 8 + paddedHeaderLength); // pad with spaces
+const view = new DataView(buffer.buffer);
+view.setBigUint64(0, BigInt(paddedHeaderLength), true);
+buffer.set(headerBytes, 8);
+```
+
+**Option B — no padding:** size the declared header to the exact JSON length and
+drop the alignment step. The spec doesn't require alignment; the 8-byte
+alignment only serves to keep the tensor payload that follows naturally
+aligned, and the Rust reference reader handles unaligned payloads fine.
+
+The corresponding `SafeTensorsReader.parseHeader()` can stop stripping NULs
+once writers are fixed (it remains safe to keep it as a backwards-compat
+guard for already-published artifacts).
+
+A drive-by patch would live in `ruvnet/ruvector` (not in this repo). Once
+the upstream fix lands and the submodule is bumped, the model needs to be
+**re-trained or re-exported and re-uploaded** to HuggingFace — there is no way
+to fix the published artifact in place from the writer side, only from the
+file side (see workaround below).
+
+## Workaround
+
+A small utility ships at [`scripts/fix-safetensors-header.py`](../../scripts/fix-safetensors-header.py)
+that loads any `.safetensors` file, detects `\x00` padding in the header
+region, and rewrites it in-place with `0x20` (space) padding — preserving the
+declared header length and every tensor byte, so the SHA-256 of the **tensor
+data** is unchanged. Only the header padding bytes flip from NUL to space.
+
+Usage:
+
+```bash
+# Download the broken file
+huggingface-cli download ruvnet/wifi-densepose-pretrained \
+    model.safetensors --local-dir models/wifi-densepose-pretrained
+
+# Fix it in place
+python scripts/fix-safetensors-header.py \
+    models/wifi-densepose-pretrained/model.safetensors
+
+# Load with strict tooling
+python -c "
+from safetensors.torch import load_file
+state = load_file('models/wifi-densepose-pretrained/model.safetensors')
+print({k: tuple(v.shape) for k, v in state.items()})
+"
+```
+
+The utility is idempotent: a fixed file with no `\x00` padding bytes in the
+header zone reports `already clean` and exits 0 without rewriting.
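
The rewrite described above comes down to swapping trailing NUL bytes in the header zone for spaces. The following is a minimal in-memory sketch of that idea, not the source of `scripts/fix-safetensors-header.py`; the function name `fix_header_padding` and the synthetic sample are assumptions made for illustration.

```python
import json
import struct


def fix_header_padding(blob: bytes) -> bytes:
    """Return a copy of `blob` whose trailing header NULs are replaced by spaces."""
    (declared,) = struct.unpack("<Q", blob[:8])  # declared header length, kept as-is
    zone = blob[8:8 + declared]
    stripped = zone.rstrip(b"\x00")              # JSON ends in '}', so only padding goes
    nul_count = len(zone) - len(stripped)
    if nul_count == 0:
        return blob                              # already clean, nothing to rewrite
    # Same length prefix, same JSON, same tensor bytes; only the padding changes.
    return blob[:8] + stripped + b" " * nul_count + blob[8 + declared:]


# Synthetic broken file: NUL-padded header followed by 4 bytes of tensor data.
hdr = json.dumps({"t": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]}}).encode("utf-8")
pad = (8 - len(hdr) % 8) % 8
broken = struct.pack("<Q", len(hdr) + pad) + hdr + b"\x00" * pad + b"\x01\x02\x03\x04"

fixed = fix_header_padding(broken)
declared = len(hdr) + pad
print(json.loads(fixed[8:8 + declared]))         # parses: trailing spaces are whitespace
print(fixed[8 + declared:] == broken[8 + declared:])
```

Because the file length and every offset stay the same, the same substitution can be applied in place on disk by writing only the padding bytes.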
+
+## Follow-ups
+
+- [ ] Patch the upstream writer in
+      [`ruvnet/ruvector`](https://github.com/ruvnet/ruvector) (Option A above).
+- [ ] Bump the `vendor/ruvector` submodule once the upstream fix lands.
+- [ ] Re-train (or re-export) `model.safetensors` with the fixed writer and
+      re-upload to `ruvnet/wifi-densepose-pretrained`. The HuggingFace LFS
+      pointer should change; consumers who pinned by `revision=` will keep
+      pulling the broken file until they update.
+- [ ] Add a release-time check (`scripts/publish-huggingface.py`) that opens
+      every `.safetensors` file in `dist/models/` with the strict Python loader
+      and aborts the upload on rejection — prevents future regressions.
+- [ ] Remove the `headerJson.replace(/\0+$/, '')` workaround from
+      `SafeTensorsReader.parseHeader()` once no published artifacts depend on
+      it (lenient readers mask the bug for round-trip tests inside the
+      training pipeline).
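
The release-time check in the list above could look roughly like the sketch below. It is an assumption-laden stand-in: instead of the strict `safetensors` loader the follow-up calls for, it approximates strict parsing with the standard-library `json.loads` (which tolerates trailing whitespace but not NUL), and the names `header_is_strictly_valid` and `find_bad_headers` are hypothetical.

```python
import json
import struct
from pathlib import Path


def header_is_strictly_valid(path: Path) -> bool:
    """Approximate a strict reader: the header zone must parse as JSON as-is."""
    size = path.stat().st_size
    with path.open("rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            return False
        (declared,) = struct.unpack("<Q", prefix)
        if declared > size - 8:          # declared header cannot exceed the file
            return False
        zone = f.read(declared)
    if not zone.startswith(b"{"):        # spec: header must begin with '{'
        return False
    try:
        json.loads(zone)                 # trailing spaces pass; trailing NULs raise
    except ValueError:                   # covers JSONDecodeError and bad UTF-8
        return False
    return True


def find_bad_headers(directory: Path) -> list[Path]:
    """Return every .safetensors file in `directory` that fails the check."""
    return sorted(
        p for p in directory.glob("*.safetensors")
        if not header_is_strictly_valid(p)
    )
```

A publishing script could call `find_bad_headers(Path("dist/models"))` before uploading and exit non-zero when the returned list is non-empty. Swapping the `json.loads` call for the real strict loader would make the gate match what downstream consumers actually run.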
--- a/docs/user-guide.md
+++ b/docs/user-guide.md
@@ -995,14 +995,23 @@ The HF artifact is in **JSONL RVF** format (one JSON object per line: `metadata`

 | Consumer | Format it reads | Status |
 |----------|-----------------|--------|
-| Python / PyTorch training pipeline | `model.safetensors` | ✅ Works — load with `safetensors.torch.load_file` |
+| Python / PyTorch training pipeline | `model.safetensors` | ⚠️ **Broken header — strict readers reject** (see below). Patch with `scripts/fix-safetensors-header.py` then `safetensors.torch.load_file` works. |
 | RVF JSONL inspection / re-export | `model.rvf.jsonl` | ✅ Works — plain JSONL, parse line-by-line |
 | Sensing-server `--model <PATH>` flag | binary RVF (`RVFS` magic) | ⚠️ Does **not** accept the JSONL file yet — see gap below |

 **Known gap (tracked):** `v2/crates/wifi-densepose-sensing-server/src/rvf_container.rs` only parses the binary RVF segment format (magic `0x52564653`). Pointing `--model` at `model.rvf.jsonl` causes the progressive loader to error with `invalid magic at offset 0: expected 0x52564653, got 0x7974227B` (`0x7974227B` is the ASCII bytes `{"ty…` from the JSONL header), and the live pipeline degrades to null output rather than falling back to heuristic mode. Until a JSONL adapter lands (or the model is re-published as binary RVF), run the sensing-server **without** `--model` and consume the HF weights from Python or the training pipeline.

 ```bash
-# Works today — Python side (training, evaluation, embedding extraction):
+# Step 1 (REQUIRED until republish): patch the broken safetensors header in place.
+# The published file pads the 8-byte-aligned header with NUL bytes instead of the
+# spec-required 0x20 spaces, so strict readers reject it with
+# `SafetensorError: trailing characters at line 1 column 1462`. The fix only
+# touches padding bytes; tensor data and declared header length are unchanged.
+# See docs/huggingface/SAFETENSORS-HEADER-BUG.md for the full analysis.
+python scripts/fix-safetensors-header.py \
+    models/wifi-densepose-pretrained/model.safetensors
+
+# Step 2: load with the strict Python reader (training, evaluation, embedding extraction).
 python -c "
 from safetensors.torch import load_file
 state = load_file('models/wifi-densepose-pretrained/model.safetensors')