mirror of https://github.com/kdl-org/kdl.git
allow ,<> as identifier characters since they no longer need to be re… (#352)
* fix some confusion in grammar syntax, and actually specify the syntax itself
  Fixes: https://github.com/kdl-org/kdl/issues/345
* allow ,<> as identifier characters since they no longer need to be reserved
* fix typo
* disallow more code points and outright ban certain ones from KDL documents altogether (#353)
  Fixes: https://github.com/kdl-org/kdl/issues/250
* `r` prefix is no longer required for raw strings (#354)
  Fixes: https://github.com/kdl-org/kdl/issues/337
parent 99abeef6d3
commit e6356d5a03
23
CHANGELOG.md
@@ -8,8 +8,27 @@
 * Single line comments (`//`) can now be immediately followed by a newline.
 * All literal whitespace following a `\` in a string is now discarded.
 * Vertical tabs (`U+000B`) are now considered to be whitespace.
-* Identifiers can't start with `r#`, so they're easy to distinguish from raw strings. (They already similarly can't start with a digit, or a sign+digit, so they're easy to distinguish from numbers.)
-* The grammar syntax itself has been described, and some confusing definitions in the grammar have been fixed accordingly (mostly related to escaped characters).
+* Identifiers can't start with `r#`, so they're easy to distinguish from raw
+strings. (They already similarly can't start with a digit, or a sign+digit,
+so they're easy to distinguish from numbers.)
+* The grammar syntax itself has been described, and some confusing definitions
+in the grammar have been fixed accordingly (mostly related to escaped
+characters).
+* `,`, `<`, and `>` are now legal identifier characters. They were previously
+reserved for KQL but this is no longer necessary.
+* Code points under `0x20`, code points above `0x10FFFF`, Delete control
+character (`0x7F`), and the [unicode "direction control"
+characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
+are now completely banned from appearing literally in KDL documents. They
+can now only be represented in regular strings, and there's no facilities to
+represent them in raw strings. This should be considered a security
+improvement.
+* Raw strings no longer require an `r` prefix: they are now specified by using
+`#""#`.
+* `#` is an illegal initial identifier character, but is allowed in other
+places in identifiers.
+* Line continuations can be followed by an EOF now, instead of requiring a
+newline (or comment). `node \<EOF>` is now a legal KDL document.

 ### KQL

70
SPEC.md
@@ -94,7 +94,7 @@ foo 1 key="val" 3 {

 A bare Identifier is composed of any Unicode codepoint other than [non-initial
 characters](#non-initial-characters), followed by any number of Unicode
-codepoints other than [non-identifier characters](#non-identifier-characters),
+code points other than [non-identifier characters](#non-identifier-characters),
 so long as this doesn't produce something confusable for a [Number](#number),
 [Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
 and an Identifier can start with `-`, but when an Identifier starts with `-`
@@ -122,9 +122,9 @@ of having an identifier look like a negative number.
 The following characters cannot be used anywhere in a bare
 [Identifier](#identifier):

-* Any codepoint with hexadecimal value `0x20` or below.
-* Any codepoint with hexadecimal value higher than `0x10FFFF`.
-* Any of `\/(){}<>;[]=,"`
+* Any of `\/(){};[]="`
+* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
+documents.

 ### Line Continuation

@@ -137,6 +137,7 @@ characters and an optional single-line comment. It must be terminated by a
 Following a line continuation, processing of a Node can continue as usual.

 #### Example
+
 ```kdl
 my-node 1 2 \ // comments are ok after \
 3 4 // This is the actual end of the Node.
@@ -309,6 +310,10 @@ String Value can encompass multiple lines without behaving like a Newline for

 Strings _MUST_ be represented as UTF-8 values.

+Strings _MUST NOT_ include the code points for [disallowed literal
+code points](#disallowed-literal-code-points) directly. If needed, they can be
+specified with their corresponding `\u{}` escape.
+
 #### Escapes

 In addition to literal code points, a number of "escapes" are supported.
@@ -362,17 +367,27 @@ support `\`-escapes. They otherwise share the same properties as far as
 literal [Newline](#newline) characters go, and the requirement of UTF-8
 representation.

-Raw String literals are represented as `r`, followed by zero or more `#`
-characters, followed by `"`, followed by any number of UTF-8 literals. The string is then
-closed by a `"` followed by a _matching_ number of `#` characters. This means
-that the string sequence `"` or `"#` and such must not match the closing `"`
-with the same or more `#` characters as the opening `r`.
+Raw String literals are represented with one or more `#` characters, followed
+by `"`, followed by any number of UTF-8 literals. The string is then closed by
+a `"` followed by a _matching_ number of `#` characters. This means that the
+string sequence `"` or `"#` and such must not match the closing `"` with the
+same or more `#` characters as the opening `#`, in the body of the string.
+
+Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
+code-points](#disallowed-literal-code-points) as code points in their body.
+Unlike with Strings, these cannot simply be escaped, and are thus
+unrepresentable when using Raw Strings.
+
+Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
+code-points](#disallowed-literal-code-points) as code points in their body.
+Unlike with Strings, these cannot simply be escaped, and are thus
+unrepresentable when using Raw Strings.

 #### Example

 ```kdl
-just-escapes r"\n will be literal"
-quotes-and-escapes r#"hello\n\r\asd"world"#
+just-escapes #"\n will be literal"#
+quotes-and-escapes ##"hello\n\r\asd"#world"##
 ```

 ### Number
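The new raw-string rule in the hunk above (one or more `#`, then `"`, closed by a `"` followed by the same number of `#`) can be sketched as a small scanner. This is an illustrative Python sketch, not part of the commit or of any KDL implementation. `scan_raw_string` is a hypothetical helper name, and the sketch does not check the body for disallowed literal code points.

```python
def scan_raw_string(text: str, start: int = 0) -> tuple[str, int]:
    """Scan a raw string of the form #"..."# that begins at text[start].

    Returns (body, index just past the closing delimiter).
    Hypothetical helper for illustration only.
    """
    # Count the opening '#' characters; the new syntax requires at least one.
    i = start
    while i < len(text) and text[i] == "#":
        i += 1
    hashes = i - start
    if hashes == 0 or i >= len(text) or text[i] != '"':
        raise ValueError("raw string must open with one or more '#' followed by '\"'")

    # The string closes at the first '"' followed by the same number of '#'.
    body_start = i + 1
    closer = '"' + "#" * hashes
    end = text.find(closer, body_start)
    if end == -1:
        raise ValueError("unterminated raw string")
    return text[body_start:end], end + len(closer)
```

With the two examples from the hunk, `scan_raw_string('#"\\n will be literal"#')` yields the body `\n will be literal` with the backslash kept literally, and the `##"..."##` form lets `"#` appear inside the body. The old `r"..."` form raises `ValueError` here because it does not start with `#`.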
@@ -470,6 +485,16 @@ lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):

 Note that for the purpose of new lines, CRLF is considered _a single newline_.

+### Disallowed Literal Code Points
+
+The following code points may not appear literally anywhere in the document.
+They may be represented in Strings (but not Raw Strings) using `\u{}`.
+
+* Any codepoint with hexadecimal value `0x20` or below (various control characters).
+* `0x7F` (the Delete control character).
+* Any codepoint with hexadecimal value higher than `0x10FFFF`.
+* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
+
 ## Full Grammar

 This is the full official grammar for KDL and should be considered
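The list added in the hunk above can be read as a simple predicate over code points. Below is a minimal Python sketch, assuming the CHANGELOG wording ("under `0x20`") for the lower bound; the SPEC bullet says "`0x20` or below", which would also cover the space character. `is_disallowed_literal` and `find_disallowed` are hypothetical names. Code points above `0x10FFFF` cannot occur in a Python `str`, so that rule needs no check here.

```python
def is_disallowed_literal(ch: str) -> bool:
    """Return True if ch is in the 'disallowed literal code points' list.

    Assumption: the lower bound is read as 'under 0x20' (CHANGELOG wording).
    """
    cp = ord(ch)
    return (
        cp < 0x20                   # control characters below space
        or cp == 0x7F               # Delete
        or 0x2066 <= cp <= 0x2069   # direction control: isolates
        or 0x202A <= cp <= 0x202E   # direction control: embeddings and overrides
    )


def find_disallowed(text: str) -> list[tuple[int, int]]:
    """Return (index, code point) for every disallowed literal in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_disallowed_literal(ch)]
```

Applied literally, the first rule also flags tab, line feed and carriage return, so a real parser would presumably need to exempt the whitespace and newline characters that the spec defines elsewhere.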
@@ -494,25 +519,24 @@ node-children := '{' nodes '}'
 node-terminator := single-line-comment | newline | ';' | eof

 identifier := string | bare-identifier
-bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
-unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
+bare-identifier := (unambiguous-ident | numberish-ident) - keyword
+unambiguous-ident := (identifier-char - digit - sign - "#") identifier-char*
 numberish-ident := sign ((identifier-char - digit) identifier-char*)?
-stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
-identifier-char := unicode - line-space - [\\/(){}<>;\[\]=,"]
+identifier-char := unicode - line-space - [\\/(){};\[\]="] - disallowed-literal-code-points
+
 keyword := boolean | 'null'
-prop := identifier '=' value
+prop := identifier '=' valuel
 value := type? (string | number | keyword)
 type := '(' identifier ')'

 string := raw-string | escaped-string
-escaped-string := '"' character* '"'
-character := '\' escape | [^\"]
+escaped-string := '"' string-character* '"'
+string-character := '\' escape | [^\"] - disallowed-literal-code-points
 escape := ["\\bfnrt] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
 hex-digit := [0-9a-fA-F]

-raw-string := 'r' raw-string-hash
-raw-string-hash := '#' raw-string-hash '#' | raw-string-quotes
-raw-string-quotes := '"' .* '"'
+raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
+raw-string-quotes := '"' (unicode - disallowed-literal-code-points) '"'

 number := decimal | hex | octal | binary

@@ -528,7 +552,7 @@ binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*

 boolean := 'true' | 'false'

-escline := '\\' ws* (single-line-comment | newline)
+escline := '\\' ws* (single-line-comment | newline | eof)

 newline := See Table (All line-break white_space)

@@ -536,6 +560,8 @@ ws := bom | unicode-space | multi-line-comment

 bom := '\u{FEFF}'

+disallowed-literal-code-points := See Table (Disallowed Literal Code Points)
+
 unicode-space := See Table (All White_Space unicode characters which are not `newline`)

 single-line-comment := '//' ^newline* (newline | eof)