mirror of https://git.sr.ht/~stygianentity/bincode
Finally got around to updating the spec based on feedback (#741)
* Finally got around to updating the spec based on feedback * Fixed failing spec test
This commit is contained in:
parent
4488a6496a
commit
6a316617f4
195
docs/spec.md
195
docs/spec.md
|
|
@ -1,22 +1,72 @@
|
||||||
# Serialization specification
|
# Serialization Specification
|
||||||
|
|
||||||
*NOTE*: Serialization is done by `bincode_derive` by default. If you enable the `serde` flag, serialization with `serde-derive` is supported as well. `serde-derive` has the same guarantees as `bincode_derive` for now.
|
_NOTE_: This specification is primarily defined in the context of Rust, but aims to be implementable across different programming languages.
|
||||||
|
|
||||||
Related issue: <https://github.com/serde-rs/serde/issues/1756#issuecomment-689682123>
|
## Definitions
|
||||||
|
|
||||||
## Endian
|
- **Variant**: A specific constructor or case of an enum type.
|
||||||
|
- **Variant Payload**: The associated data of a specific enum variant.
|
||||||
|
- **Discriminant**: A unique identifier for an enum variant, typically represented as an integer.
|
||||||
|
- **Basic Types**: Primitive types that have a direct, well-defined binary representation.
|
||||||
|
|
||||||
By default `bincode` will serialize values in little endian encoding. This can be overwritten in the `Config`.
|
## Endianness
|
||||||
|
|
||||||
## Basic types
|
By default, this serialization format uses little-endian byte order for basic numeric types. This means multi-byte values are encoded with their least significant byte first.
|
||||||
|
|
||||||
Boolean types are encoded with 1 byte for each boolean type, with `0` being `false`, `1` being true. Whilst deserializing every other value will throw an error.
|
Endianness can be configured with the following methods, allowing for big-endian serialization when required:
|
||||||
|
|
||||||
All basic numeric types will be encoded based on the configured [IntEncoding](#intencoding).
|
- [`with_big_endian`](https://docs.rs/bincode/2.0.0-rc/bincode/config/struct.Configuration.html#method.with_big_endian)
|
||||||
|
- [`with_little_endian`](https://docs.rs/bincode/2.0.0-rc/bincode/config/struct.Configuration.html#method.with_little_endian)
|
||||||
|
|
||||||
All floating point types will take up exactly 4 (for `f32`) or 8 (for `f64`) bytes.
|
### Byte Order Considerations
|
||||||
|
|
||||||
|
- Multi-byte values (integers, floats) are affected by endianness
|
||||||
|
- Single-byte values (u8, i8) are not affected
|
||||||
|
- Struct and collection serialization order is not changed by endianness
|
||||||
|
|
||||||
|
## Basic Types
|
||||||
|
|
||||||
|
### Boolean Encoding
|
||||||
|
|
||||||
|
- Encoded as a single byte
|
||||||
|
- `false` is represented by `0`
|
||||||
|
- `true` is represented by `1`
|
||||||
|
- During deserialization, values other than 0 and 1 will result in an error [`DecodeError::InvalidBooleanValue`](https://docs.rs/bincode/2.0.0-rc/bincode/error/enum.DecodeError.html#variant.InvalidBooleanValue)
|
||||||
|
|
||||||
|
### Numeric Types
|
||||||
|
|
||||||
|
- Encoded based on the configured [IntEncoding](#intencoding)
|
||||||
|
- Signed integers use 2's complement representation
|
||||||
|
- Floating point types use IEEE 754-2008 standard
|
||||||
|
- `f32`: 4 bytes (binary32)
|
||||||
|
- `f64`: 8 bytes (binary64)
|
||||||
|
|
||||||
|
#### Floating Point Special Values
|
||||||
|
|
||||||
|
- Subnormal numbers are preserved
|
||||||
|
- Also known as denormalized numbers
|
||||||
|
- Maintain their exact bit representation
|
||||||
|
- `NaN` values are preserved
|
||||||
|
- Both quiet and signaling `NaN` are kept as-is
|
||||||
|
- Bit pattern of `NaN` is maintained exactly
|
||||||
|
- No normalization or transformation of special values occurs
|
||||||
|
- Serialization and deserialization do not alter the bit-level representation
|
||||||
|
- Consistent with IEEE 754-2008 standard for floating-point arithmetic
|
||||||
|
|
||||||
|
### Character Encoding
|
||||||
|
|
||||||
|
- `char` is encoded as a 32-bit unsigned integer representing its Unicode Scalar Value
|
||||||
|
- Valid Unicode Scalar Value range:
|
||||||
|
- 0x0000 to 0xD7FF (Basic Multilingual Plane)
|
||||||
|
- 0xE000 to 0x10FFFF (Supplementary Planes)
|
||||||
|
- Surrogate code points (0xD800 to 0xDFFF) are not valid
|
||||||
|
- Invalid Unicode characters can be acquired via unsafe code, this is handled as:
|
||||||
|
- during serialization: data is written as-is
|
||||||
|
- during deserialization: an error is raised [`DecodeError::InvalidCharEncoding`](https://docs.rs/bincode/2.0.0-rc/bincode/error/enum.DecodeError.html#variant.InvalidCharEncoding)
|
||||||
|
- No additional metadata or encoding scheme beyond the raw code point value
|
||||||
|
|
||||||
All tuples have no additional bytes, and are encoded in their specified order, e.g.
|
All tuples have no additional bytes, and are encoded in their specified order, e.g.
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
let tuple = (u32::min_value(), i32::max_value()); // 8 bytes
|
let tuple = (u32::min_value(), i32::max_value()); // 8 bytes
|
||||||
let encoded = bincode::encode_to_vec(tuple, bincode::config::legacy()).unwrap();
|
let encoded = bincode::encode_to_vec(tuple, bincode::config::legacy()).unwrap();
|
||||||
|
|
@ -27,9 +77,11 @@ assert_eq!(encoded.as_slice(), &[
|
||||||
```
|
```
|
||||||
|
|
||||||
## IntEncoding
|
## IntEncoding
|
||||||
|
|
||||||
Bincode currently supports 2 different types of `IntEncoding`. With the default config, `VarintEncoding` is selected.
|
Bincode currently supports 2 different types of `IntEncoding`. With the default config, `VarintEncoding` is selected.
|
||||||
|
|
||||||
### VarintEncoding
|
### VarintEncoding
|
||||||
|
|
||||||
Encoding an unsigned integer v (of any type excepting u8/i8) works as follows:
|
Encoding an unsigned integer v (of any type excepting u8/i8) works as follows:
|
||||||
|
|
||||||
1. If `u < 251`, encode it as a single byte with that value.
|
1. If `u < 251`, encode it as a single byte with that value.
|
||||||
|
|
@ -87,6 +139,7 @@ assert_eq!(encoded.as_slice(), &[
|
||||||
```
|
```
|
||||||
|
|
||||||
### Options
|
### Options
|
||||||
|
|
||||||
`Option<T>` is always serialized using a single byte for the discriminant, even in `Fixint` encoding (which normally uses a `u32` for discriminant).
|
`Option<T>` is always serialized using a single byte for the discriminant, even in `Fixint` encoding (which normally uses a `u32` for discriminant).
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
|
|
@ -105,17 +158,29 @@ assert_eq!(encoded.as_slice(), &[
|
||||||
|
|
||||||
# Collections
|
# Collections
|
||||||
|
|
||||||
Collections are encoded with their length value first, following by each entry of the collection. The length value is based on your `IntEncoding`.
|
## General Collection Serialization
|
||||||
|
|
||||||
**note**: fixed array length may not have their `len` encoded. See [Arrays](#arrays)
|
Collections are encoded with their length value first, followed by each entry of the collection. The length value is based on the configured `IntEncoding`.
|
||||||
|
|
||||||
|
### Serialization Considerations
|
||||||
|
|
||||||
|
- Length is always serialized first
|
||||||
|
- Entries are serialized in the order they are returned from the iterator implementation.
|
||||||
|
- Iteration order depends on the collection type
|
||||||
|
- Ordered collections (e.g., `Vec`): Iteration from lowest to highest index
|
||||||
|
- Unordered collections (e.g., `HashMap`): Implementation-defined iteration order
|
||||||
|
- Duplicate keys are not checked in bincode, but may be resulting in an error when decoding a container from a list of pairs.
|
||||||
|
|
||||||
|
### Handling of Specific Collection Types
|
||||||
|
|
||||||
|
#### Linear Collections (`Vec`, Arrays, etc.)
|
||||||
|
|
||||||
|
- Serialized by iterating from lowest to highest index
|
||||||
|
- Length prefixed
|
||||||
|
- Each item serialized sequentially
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
let list = vec![
|
let list = vec![0u8, 1u8, 2u8];
|
||||||
0u8,
|
|
||||||
1u8,
|
|
||||||
2u8
|
|
||||||
];
|
|
||||||
|
|
||||||
let encoded = bincode::encode_to_vec(list, bincode::config::legacy()).unwrap();
|
let encoded = bincode::encode_to_vec(list, bincode::config::legacy()).unwrap();
|
||||||
assert_eq!(encoded.as_slice(), &[
|
assert_eq!(encoded.as_slice(), &[
|
||||||
3, 0, 0, 0, 0, 0, 0, 0, // length of 3u64
|
3, 0, 0, 0, 0, 0, 0, 0, // length of 3u64
|
||||||
|
|
@ -125,29 +190,63 @@ assert_eq!(encoded.as_slice(), &[
|
||||||
]);
|
]);
|
||||||
```
|
```
|
||||||
|
|
||||||
This also applies to e.g. `HashMap`, where each entry is a [tuple](#basic-types) of the key and value.
|
#### Key-Value Collections (`HashMap`, etc.)
|
||||||
|
|
||||||
|
- Serialized as a sequence of key-value pairs
|
||||||
|
- Iteration order is implementation-defined
|
||||||
|
- Each entry is a tuple of (key, value)
|
||||||
|
|
||||||
|
### Special Collection Considerations
|
||||||
|
|
||||||
|
- Bincode will serialize the entries based on the iterator order.
|
||||||
|
- Deserialization is deterministic but the collection implementation might not guarantee the same order as serialization.
|
||||||
|
|
||||||
|
**Note**: Fixed-length arrays do not have their length encoded. See [Arrays](#arrays) for details.
|
||||||
|
|
||||||
# String and &str
|
# String and &str
|
||||||
|
|
||||||
Both `String` and `&str` are treated as a `Vec<u8>`. See [Collections](#collections) for more information.
|
## Encoding Principles
|
||||||
|
|
||||||
|
- Strings are encoded as UTF-8 byte sequences
|
||||||
|
- No null terminator is added
|
||||||
|
- No Byte Order Mark (BOM) is written
|
||||||
|
- Unicode non-characters are preserved
|
||||||
|
|
||||||
|
### Encoding Details
|
||||||
|
|
||||||
|
- Length is encoded first using the configured `IntEncoding`
|
||||||
|
- Raw UTF-8 bytes follow the length
|
||||||
|
- Supports the full range of valid UTF-8 sequences
|
||||||
|
- `U+0000` and other code points can appear freely within the string
|
||||||
|
|
||||||
|
### Unicode Handling
|
||||||
|
|
||||||
|
- During serialization, the string is encoded as a sequence of the given bytes.
|
||||||
|
- Rust strings are UTF-8 encoded by default, but this is not enforced by bincode
|
||||||
|
- No normalization or transformation of text
|
||||||
|
- If an invalid UTF-8 sequence is encountered during decoding, an [`DecodeError::Utf8`](https://docs.rs/bincode/2.0.0-rc/bincode/error/enum.DecodeError.html#variant.Utf8) error is raised
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
let str = "Hello"; // Could also be `String::new(...)`
|
let str = "Hello 🌍"; // Mixed ASCII and Unicode
|
||||||
|
|
||||||
let encoded = bincode::encode_to_vec(str, bincode::config::legacy()).unwrap();
|
let encoded = bincode::encode_to_vec(str, bincode::config::legacy()).unwrap();
|
||||||
assert_eq!(encoded.as_slice(), &[
|
assert_eq!(encoded.as_slice(), &[
|
||||||
5, 0, 0, 0, 0, 0, 0, 0, // length of the string, 5 bytes
|
10, 0, 0, 0, 0, 0, 0, 0, // length of the string, 10 bytes
|
||||||
b'H', b'e', b'l', b'l', b'o'
|
b'H', b'e', b'l', b'l', b'o', b' ', 0xF0, 0x9F, 0x8C, 0x8D // UTF-8 encoded string
|
||||||
]);
|
]);
|
||||||
```
|
```
|
||||||
|
|
||||||
|
### Comparison with Other Types
|
||||||
|
|
||||||
|
- Treated similarly to `Vec<u8>` in serialization
|
||||||
|
- See [Collections](#collections) for more information about length and entry encoding
|
||||||
|
|
||||||
# Arrays
|
# Arrays
|
||||||
|
|
||||||
Array length is never encoded.
|
Array length is never encoded.
|
||||||
|
|
||||||
Note that `&[T]` is encoded as a [Collection](#collections).
|
Note that `&[T]` is encoded as a [Collection](#collections).
|
||||||
|
|
||||||
|
|
||||||
```rust
|
```rust
|
||||||
let arr: [u8; 5] = [10, 20, 30, 40, 50];
|
let arr: [u8; 5] = [10, 20, 30, 40, 50];
|
||||||
let encoded = bincode::encode_to_vec(arr, bincode::config::legacy()).unwrap();
|
let encoded = bincode::encode_to_vec(arr, bincode::config::legacy()).unwrap();
|
||||||
|
|
@ -184,3 +283,53 @@ assert_eq!(encoded.as_slice(), &[
|
||||||
]);
|
]);
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## TupleEncoding
|
||||||
|
|
||||||
|
Tuple fields are serialized in first-to-last declaration order, with no additional metadata.
|
||||||
|
|
||||||
|
- No length prefix is added
|
||||||
|
- Fields are encoded sequentially
|
||||||
|
- No padding or alignment adjustments are made
|
||||||
|
- Order of serialization is deterministic and matches the tuple's declaration order
|
||||||
|
|
||||||
|
## StructEncoding
|
||||||
|
|
||||||
|
Struct fields are serialized in first-to-last declaration order, with no metadata representing field names.
|
||||||
|
|
||||||
|
- No length prefix is added
|
||||||
|
- Fields are encoded sequentially
|
||||||
|
- No padding or alignment adjustments are made
|
||||||
|
- Order of serialization is deterministic and matches the struct's field declaration order
|
||||||
|
- Both named and unnamed fields are serialized identically
|
||||||
|
|
||||||
|
## EnumEncoding
|
||||||
|
|
||||||
|
Enum variants are encoded with a discriminant followed by optional variant payload.
|
||||||
|
|
||||||
|
### Discriminant Allocation
|
||||||
|
|
||||||
|
- Discriminants are automatically assigned by the derive macro in declaration order
|
||||||
|
- First variant starts at 0
|
||||||
|
- Subsequent variants increment by 1
|
||||||
|
- Explicit discriminant indices are currently not supported
|
||||||
|
- Discriminant is always represented as a `u32` during serialization. See [Discriminant Representation](#discriminant-representation) for more details.
|
||||||
|
- Maintains the original enum variant semantics during encoding
|
||||||
|
|
||||||
|
### Variant Payload Encoding
|
||||||
|
|
||||||
|
- Tuple variants: Fields serialized in declaration order
|
||||||
|
- Struct variants: Fields serialized in declaration order
|
||||||
|
- Unit variants: No additional data encoded
|
||||||
|
|
||||||
|
### Discriminant Representation
|
||||||
|
|
||||||
|
- Always encoded as a `u32`
|
||||||
|
- Encoding method depends on the configured `IntEncoding`
|
||||||
|
- `VarintEncoding`: Variable-length encoding
|
||||||
|
- `FixintEncoding`: Fixed 4-byte representation
|
||||||
|
|
||||||
|
### Handling of Variant Payloads
|
||||||
|
|
||||||
|
- Payload is serialized immediately after the discriminant
|
||||||
|
- No additional metadata about field names or types
|
||||||
|
- Payload structure matches the variant's definition
|
||||||
|
|
|
||||||
Loading…
Reference in New Issue