Constrain code points to unicode scalar values

Fixes: https://github.com/kdl-org/kdl/issues/207
2023-12-12 22:10:26 -08:00 · 5a7b339ed4
parent b42b6c80f0
commit 5a7b339ed4
2 changed files with 15 additions and 9 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -39,6 +39,10 @@
 * A statement in the spec prose that said "It is reasonable for an
  implementation to ignore null values altogether when deserializing". This is
  no longer encouraged or desired.
+* Code points have been constrained to [Unicode Scalar
+  Values](https://unicode.org/glossary/#unicode_scalar_value) only, including
+  values used in string escapes (`\u{}`). All KDL documents and string values
+  should be valid UTF-8 now, as was intended.

 ### KQL

--- a/SPEC.md
+++ b/SPEC.md
@@ -92,13 +92,15 @@ foo 1 key="val" 3 {

 ### Identifier

-A bare Identifier is composed of any Unicode codepoint other than [non-initial
-characters](#non-initial-characters), followed by any number of Unicode code
-points other than [non-identifier characters](#non-identifier-characters), so
-long as this doesn't produce something confusable for a [Number](#number). For
-example, both a [Number](#number) and an Identifier can start with `-`, but
-when an Identifier starts with `-` the second character cannot be a digit.
-This is precicely specified in the [Full Grammar](#full-grammar) below.
+A bare Identifier is composed of any [Unicode Scalar
+Value](https://unicode.org/glossary/#unicode_scalar_value) other than
+[non-initial characters](#non-initial-characters), followed by any number of
+Unicode Scalar Values other than [non-identifier
+characters](#non-identifier-characters), so long as this doesn't produce
+something confusable for a [Number](#number). For example, both a
+[Number](#number) and an Identifier can start with `-`, but when an Identifier
+starts with `-` the second character cannot be a digit. This is precisely
+specified in the [Full Grammar](#full-grammar) below.

 When Identifiers are used as the values in [Arguments](#argument) and
 [Properties](#property), they are treated as strings, just like they are with
@@ -342,7 +344,7 @@ interpreted as described in the following table:
 | Quotation Mark (Double Quote) | `\"`   | `U+0022` |
 | Backspace                     | `\b`   | `U+0008` |
 | Form Feed                     | `\f`   | `U+000C` |
-| Unicode Escape                | `\u{(1-6 hex chars)}` | Code point described by hex characters, up to `10FFFF` |
+| Unicode Escape                | `\u{(1-6 hex chars)}` | Code point described by hex characters, as long as it represents a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value) |
 | Whitespace Escape             | See below | N/A   |

 ##### Escaped Whitespace
@@ -504,7 +506,7 @@ They may be represented in Strings (but not Raw Strings) using `\u{}`.

 * Any codepoint with hexadecimal value `0x20` or below (various control characters).
 * `0x7F` (the Delete control character).
-* Any codepoint with hexadecimal value higher than `0x10FFFF`.
+* Any codepoint that is not a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value).
 * `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)

 ## Full Grammar
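
The constraint introduced by this change can be illustrated with a short sketch. It is not part of the commit or of any KDL implementation: the function names `is_unicode_scalar_value` and `decode_unicode_escape` are hypothetical, and the sketch assumes only the rules stated above (1-6 hex characters in `\u{}`, and a result that must be a Unicode Scalar Value, that is, any code point in `0x0..=0x10FFFF` except the surrogate range `0xD800..=0xDFFF`).

```python
import string


def is_unicode_scalar_value(code_point: int) -> bool:
    """Return True if code_point is a Unicode Scalar Value.

    Scalar values are 0x0000..0xD7FF and 0xE000..0x10FFFF, i.e. every
    code point except the surrogates 0xD800..0xDFFF.
    """
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF


def decode_unicode_escape(hex_digits: str) -> str:
    """Decode the hex digits found inside a `\\u{...}` escape.

    Hypothetical helper: `hex_digits` is the text between the braces.
    Raises ValueError if it is not 1-6 hex characters or if the value
    is not a Unicode Scalar Value.
    """
    # Check the characters explicitly: int(..., 16) alone would also
    # accept signs, underscores, and surrounding whitespace.
    if not 1 <= len(hex_digits) <= 6 or any(
        c not in string.hexdigits for c in hex_digits
    ):
        raise ValueError(f"expected 1-6 hex characters, got {hex_digits!r}")

    code_point = int(hex_digits, 16)
    if not is_unicode_scalar_value(code_point):
        raise ValueError(f"U+{code_point:04X} is not a Unicode Scalar Value")
    return chr(code_point)
```

Under these assumptions, `decode_unicode_escape("E9")` returns `"é"`, while `decode_unicode_escape("D800")` (a surrogate) and `decode_unicode_escape("110000")` (above `10FFFF`) both raise `ValueError`. Rejecting surrogates is what keeps every decoded string encodable as valid UTF-8.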