From 5a7b339ed44f9464af04483547bbad0aa328ccf6 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kat=20March=C3=A1n?=
Date: Tue, 12 Dec 2023 22:10:26 -0800
Subject: [PATCH] Constrain code points to unicode scalar values

Fixes: https://github.com/kdl-org/kdl/issues/207
---
 CHANGELOG.md |  4 ++++
 SPEC.md      | 20 +++++++++++---------
 2 files changed, 15 insertions(+), 9 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 98e1561..927a048 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -39,6 +39,10 @@
 * A statement in the spec prose that said "It is reasonable for an
   implementation to ignore null values altogether when deserializing". This is
   no longer encouraged or desired.
+* Code points have been constrained to [Unicode Scalar
+  Values](https://unicode.org/glossary/#unicode_scalar_value) only, including
+  values used in string escapes (`\u{}`). All KDL documents and string values
+  should be valid UTF-8 now, as was intended.
 
 ### KQL
 
diff --git a/SPEC.md b/SPEC.md
index facb0b1..c0f665f 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -92,13 +92,15 @@ foo 1 key="val" 3 {
 
 ### Identifier
 
-A bare Identifier is composed of any Unicode codepoint other than [non-initial
-characters](#non-initial-characters), followed by any number of Unicode code
-points other than [non-identifier characters](#non-identifier-characters), so
-long as this doesn't produce something confusable for a [Number](#number). For
-example, both a [Number](#number) and an Identifier can start with `-`, but
-when an Identifier starts with `-` the second character cannot be a digit.
-This is precicely specified in the [Full Grammar](#full-grammar) below.
+A bare Identifier is composed of any [Unicode Scalar
+Value](https://unicode.org/glossary/#unicode_scalar_value) other than
+[non-initial characters](#non-initial-characters), followed by any number of
+Unicode Scalar Values other than [non-identifier
+characters](#non-identifier-characters), so long as this doesn't produce
+something confusable for a [Number](#number). For example, both a
+[Number](#number) and an Identifier can start with `-`, but when an Identifier
+starts with `-` the second character cannot be a digit. This is precicely
+specified in the [Full Grammar](#full-grammar) below.
 
 When Identifiers are used as the values in [Arguments](#argument) and
 [Properties](#property), they are treated as strings, just like they are with
@@ -342,7 +344,7 @@ interpreted as described in the following table:
 | Quotation Mark (Double Quote) | `\"` | `U+0022` |
 | Backspace | `\b` | `U+0008` |
 | Form Feed | `\f` | `U+000C` |
-| Unicode Escape | `\u{(1-6 hex chars)}` | Code point described by hex characters, up to `10FFFF` |
+| Unicode Escape | `\u{(1-6 hex chars)}` | Code point described by hex characters, as long as it represents a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value) |
 | Whitespace Escape | See below | N/A |
 
 ##### Escaped Whitespace
@@ -504,7 +506,7 @@ They may be represented in Strings (but not Raw Strings) using `\u{}`.
 
 * Any codepoint with hexadecimal value `0x20` or below (various control characters).
 * `0x7F` (the Delete control character).
-* Any codepoint with hexadecimal value higher than `0x10FFFF`.
+* Any codepoint that is not a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value).
 * `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
 
 ## Full Grammar
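
Note for implementers: the practical effect of this change on `\u{}` handling can be sketched as below. A Unicode Scalar Value is any code point from `0` to `10FFFF` except the surrogate range `D800` to `DFFF`, so the old "up to `10FFFF`" check needs one extra exclusion. This is only an illustrative sketch, not part of the patch or of any KDL implementation; the function names `is_unicode_scalar_value` and `decode_unicode_escape` are hypothetical, and the input is assumed to be the text between the braces of the escape.

```python
import re

# 1-6 hex digits, matching the "(1-6 hex chars)" wording in the escape table.
_HEX_1_TO_6 = re.compile(r"[0-9A-Fa-f]{1,6}")


def is_unicode_scalar_value(code_point: int) -> bool:
    """Return True for code points in 0..=0x10FFFF outside the surrogate range."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


def decode_unicode_escape(hex_digits: str) -> str:
    """Decode the body of a \\u{...} escape (hypothetical helper).

    Raises ValueError if the body is not 1-6 hex digits or does not
    name a Unicode Scalar Value.
    """
    if not _HEX_1_TO_6.fullmatch(hex_digits):
        raise ValueError(f"expected 1-6 hex digits, got {hex_digits!r}")
    code_point = int(hex_digits, 16)
    if not is_unicode_scalar_value(code_point):
        raise ValueError(f"U+{code_point:04X} is not a Unicode Scalar Value")
    return chr(code_point)
```

Under this sketch, `decode_unicode_escape("1F600")` returns the corresponding character, while `decode_unicode_escape("D800")` (a surrogate) and `decode_unicode_escape("110000")` (out of range) both raise `ValueError`. Rejecting surrogates is what keeps every decoded string encodable as valid UTF-8, which is the intent stated in the CHANGELOG entry.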