disallow more code points and outright ban certain ones from KDL documents altogether

Fixes: https://github.com/kdl-org/kdl/issues/250
Kat Marchán 2023-12-10 16:35:48 -08:00
parent 9c5a06117a
commit c30cc1cab4
2 changed files with 36 additions and 7 deletions


@ -16,6 +16,13 @@
characters).
* `,`, `<`, and `>` are not legal identifier characters. They were previously
reserved for KQL but this is no longer necessary.
* Code points under `0x20`, code points above `0x10FFFF`, the Delete control
character (`0x7F`), and the [Unicode "direction control"
characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
are now completely banned from appearing literally in KDL documents. They
can only be represented in regular strings, using `\u{}` escapes, and there
are no facilities to represent them in raw strings. This should be considered
a security improvement.
### KQL

SPEC.md (36 lines changed)

@ -94,7 +94,7 @@ foo 1 key="val" 3 {
A bare Identifier is composed of any Unicode codepoint other than [non-initial
characters](#non-initial-characters), followed by any number of Unicode
codepoints other than [non-identifier characters](#non-identifier-characters),
code points other than [non-identifier characters](#non-identifier-characters),
so long as this doesn't produce something confusable for a [Number](#number),
[Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
and an Identifier can start with `-`, but when an Identifier starts with `-`
@ -122,9 +122,9 @@ of having an identifier look like a negative number.
The following characters cannot be used anywhere in a bare
[Identifier](#identifier):
* Any codepoint with hexadecimal value `0x20` or below.
* Any codepoint with hexadecimal value higher than `0x10FFFF`.
* Any of `\/(){};[]="`
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.
### Line Continuation
@ -137,6 +137,7 @@ characters and an optional single-line comment. It must be terminated by a
Following a line continuation, processing of a Node can continue as usual.
#### Example
```kdl
my-node 1 2 \ // comments are ok after \
3 4 // This is the actual end of the Node.
@ -309,6 +310,10 @@ String Value can encompass multiple lines without behaving like a Newline for
Strings _MUST_ be represented as UTF-8 values.
Strings _MUST NOT_ include any of the [disallowed literal
code points](#disallowed-literal-code-points) directly. If needed, they can be
specified with their corresponding `\u{}` escape.
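
As a rough illustration of the rule above, the following Python sketch rewrites a string so that every disallowed code point is replaced by its `\u{}` escape. The function names are hypothetical and not part of any KDL library. The boundary at `0x20` follows the changelog wording ("under `0x20`"), so a literal space is kept; that reading is an assumption. The sketch also makes no exception for tab or newline characters, and it does not escape quotes or backslashes.

```python
# Hypothetical helper: the name and the 0x20 boundary are assumptions.
BIDI_CONTROL_RANGES = ((0x2066, 0x2069), (0x202A, 0x202E))


def is_disallowed_literal(cp: int) -> bool:
    """Return True if code point `cp` may not appear literally."""
    if cp < 0x20 or cp == 0x7F or cp > 0x10FFFF:
        return True
    return any(lo <= cp <= hi for lo, hi in BIDI_CONTROL_RANGES)


def escape_disallowed(text: str) -> str:
    """Replace each disallowed code point with a \\u{...} escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if is_disallowed_literal(cp):
            # Upper-case hex, no padding; the grammar allows 1 to 6 digits.
            out.append("\\u{%X}" % cp)
        else:
            out.append(ch)
    return "".join(out)
```

For example, `escape_disallowed("a\u202eb")` yields the text `a\u{202E}b`, which can then be placed between the quotes of a regular String.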
#### Escapes
In addition to literal code points, a number of "escapes" are supported.
@ -368,6 +373,11 @@ closed by a `"` followed by a _matching_ number of `#` characters. This means
that the string sequence `"` or `"#` and such must not match the closing `"`
with the same or more `#` characters as the opening `r`.
Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
code points](#disallowed-literal-code-points) as code points in their body.
Unlike with Strings, these cannot simply be escaped, and are thus
unrepresentable when using Raw Strings.
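
One practical consequence is that a serializer has to fall back to a regular String whenever the value contains a disallowed code point. The Python sketch below shows that check only; the function names are hypothetical, the `0x20` boundary follows the changelog wording ("under `0x20`") as an assumption, and the `#` delimiter matching for Raw Strings is deliberately left out.

```python
# Hypothetical helper: the name and the 0x20 boundary are assumptions.
BIDI_CONTROL_RANGES = ((0x2066, 0x2069), (0x202A, 0x202E))


def is_disallowed_literal(cp: int) -> bool:
    """Return True if code point `cp` may not appear literally."""
    if cp < 0x20 or cp == 0x7F or cp > 0x10FFFF:
        return True
    return any(lo <= cp <= hi for lo, hi in BIDI_CONTROL_RANGES)


def representable_as_raw_string(text: str) -> bool:
    """A Raw String has no escapes, so any disallowed code point rules it out."""
    return not any(is_disallowed_literal(ord(ch)) for ch in text)
```

Under these assumptions, `representable_as_raw_string("C:\\path")` is true, while a value containing `U+202E` is not representable and would need a regular String with a `\u{202E}` escape.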
#### Example
```kdl
@ -470,6 +480,16 @@ lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):
Note that for the purpose of new lines, CRLF is considered _a single newline_.
### Disallowed Literal Code Points
The following code points may not appear literally anywhere in the document.
They may be represented in Strings (but not Raw Strings) using `\u{}`.
* Any code point with hexadecimal value below `0x20` (various control characters).
* `0x7F` (the Delete control character).
* Any code point with hexadecimal value higher than `0x10FFFF`.
* `0x2066-0x2069` and `0x202A-0x202E`, the [Unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls).
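
The list above can be read as a simple predicate over code points. Here is a minimal Python sketch of it; the function and constant names are hypothetical. It treats the space character (`0x20`) itself as allowed, following the changelog's wording "under `0x20`", and that boundary is an assumption. It also makes no exception for tab or newline characters, which fall in the same range; how a parser reconciles those with this rule is not covered here.

```python
# Hypothetical names; the 0x20 boundary is an assumption (see above).
BIDI_CONTROL_RANGES = ((0x2066, 0x2069), (0x202A, 0x202E))


def is_disallowed_literal(cp: int) -> bool:
    """Return True if code point `cp` may not appear literally in a document."""
    if cp < 0x20:       # control characters below space
        return True
    if cp == 0x7F:      # the Delete control character
        return True
    if cp > 0x10FFFF:   # beyond the Unicode code space
        return True
    # Unicode "direction control" characters
    return any(lo <= cp <= hi for lo, hi in BIDI_CONTROL_RANGES)


def first_disallowed(text: str) -> int:
    """Return the index of the first disallowed code point, or -1 if none."""
    for index, ch in enumerate(text):
        if is_disallowed_literal(ord(ch)):
            return index
    return -1
```

A scanner could call `first_disallowed` on the raw document text to report the position of an offending character before any further parsing.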
## Full Grammar
This is the full official grammar for KDL and should be considered
@ -498,15 +518,15 @@ bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - key
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
identifier-char := unicode - line-space - [\\/(){};\[\]="]
identifier-char := unicode - line-space - [\\/(){};\[\]="] - disallowed-literal-code-points
keyword := boolean | 'null'
prop := identifier '=' value
prop := identifier '=' valuel
value := type? (string | number | keyword)
type := '(' identifier ')'
string := raw-string | escaped-string
escaped-string := '"' character* '"'
character := '\' escape | [^\"]
escaped-string := '"' string-character* '"'
string-character := '\' escape | [^\"] - disallowed-literal-code-points
escape := ["\\bfnrt] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
hex-digit := [0-9a-fA-F]
@ -536,6 +556,8 @@ ws := bom | unicode-space | multi-line-comment
bom := '\u{FEFF}'
disallowed-literal-code-points := See Table (Disallowed Literal Code Points)
unicode-space := See Table (All White_Space unicode characters which are not `newline`)
single-line-comment := '//' ^newline* (newline | eof)