allow ,<> as identifier characters since they no longer need to be re… (#352)

* fix some confusion in grammar syntax, and actually specify the syntax itself
  Fixes: https://github.com/kdl-org/kdl/issues/345
* allow ,<> as identifier characters since they no longer need to be reserved
* fix typo
* disallow more code points and outright ban certain ones from KDL documents altogether (#353)
  Fixes: https://github.com/kdl-org/kdl/issues/250
* `r` prefix is no longer required for raw strings (#354)
  Fixes: https://github.com/kdl-org/kdl/issues/337
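Review note: the raw-string change (#354) can be illustrated with a minimal sketch of a matcher for the new `#"..."#` form. This is not the reference parser; `read_raw_string` is a hypothetical helper name, and the sketch assumes the input starts exactly at the opening `#`. It does not check the body for disallowed literal code points.

```python
import re

# Hypothetical helper, not part of any KDL library.
# One or more '#', a quote, the shortest body that is followed by a quote
# and the same run of '#' characters that opened the string.
_RAW_STRING = re.compile(r'(#+)"(.*?)"\1', re.DOTALL)


def read_raw_string(src):
    """Return (body, end_index) for a raw string at the start of src, else None."""
    m = _RAW_STRING.match(src)
    if m is None:
        return None
    return m.group(2), m.end()
```

With the examples from the spec hunk, `read_raw_string(r'##"hello\n\r\asd"#world"##')` yields the body `hello\n\r\asd"#world` with the backslashes kept literally, while the old `r#"..."#` spelling is rejected because it no longer starts with `#`.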
2023-12-12 20:27:37 -08:00 · e6356d5a03
parent 99abeef6d3
commit e6356d5a03
2 changed files with 69 additions and 24 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,8 +8,27 @@
 * Single line comments (`//`) can now be immediately followed by a newline.
 * All literal whitespace following a `\` in a string is now discarded.
 * Vertical tabs (`U+000B`) are now considered to be whitespace.
-* Identifiers can't start with `r#`, so they're easy to distinguish from raw strings. (They already similarly can't start with a digit, or a sign+digit, so they're easy to distinguish from numbers.)
-* The grammar syntax itself has been described, and some confusing definitions in the grammar have been fixed accordingly (mostly related to escaped characters).
+* Identifiers can't start with `r#`, so they're easy to distinguish from raw
+  strings. (They already similarly can't start with a digit, or a sign+digit,
+  so they're easy to distinguish from numbers.)
+* The grammar syntax itself has been described, and some confusing definitions
+  in the grammar have been fixed accordingly (mostly related to escaped
+  characters).
 * `,`, `<`, and `>` are now legal identifier characters. They were previously
  reserved for KQL but this is no longer necessary.
 * Code points under `0x20`, code points above `0x10FFFF`, the Delete control
   character (`0x7F`), and the [Unicode "direction control"
   characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
   are now completely banned from appearing literally in KDL documents. They
   can now only be represented in regular strings, and there are no facilities
   to represent them in raw strings. This should be considered a security
   improvement.
 * Raw strings no longer require an `r` prefix: they are now specified by using
  `#""#`.
 * `#` is an illegal initial identifier character, but is allowed in other
  places in identifiers.
 * Line continuations can be followed by an EOF now, instead of requiring a
  newline (or comment). `node \<EOF>` is now a legal KDL document.
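Review note: the disallowed-code-point entry above can be sketched as a small scanner. This is an illustration under stated assumptions, not the spec's algorithm: `is_disallowed_literal` and `find_disallowed` are hypothetical names, the `exempt` parameter is an assumption for callers that want to let whitespace or newline characters below `0x20` through, and the "above `0x10FFFF`" case is omitted because a Python `str` cannot hold such a value.

```python
# Hypothetical helpers, not part of the KDL spec or any KDL library.

# 0x2066-0x2069 and 0x202A-0x202E, the "direction control" characters.
BIDI_CONTROLS = frozenset(range(0x2066, 0x206A)) | frozenset(range(0x202A, 0x202F))


def is_disallowed_literal(ch, exempt=frozenset()):
    """True if the single character ch may not appear literally.

    exempt is an assumed escape hatch: a set of characters the caller
    treats as legal whitespace or newlines despite being below 0x20.
    """
    if ch in exempt:
        return False
    cp = ord(ch)
    return cp < 0x20 or cp == 0x7F or cp in BIDI_CONTROLS


def find_disallowed(text, exempt=frozenset()):
    """Return (index, 'U+XXXX') for every disallowed character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_disallowed_literal(ch, exempt)
    ]
```

A checker along these lines would run over the raw document text before or during tokenizing; inside regular strings the same characters remain reachable through `\u{}` escapes.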
 ### KQL
--- a/SPEC.md
+++ b/SPEC.md
@@ -94,7 +94,7 @@ foo 1 key="val" 3 {
 A bare Identifier is composed of any Unicode codepoint other than [non-initial
 characters](#non-initial-characters), followed by any number of Unicode
-codepoints other than [non-identifier characters](#non-identifier-characters),
+code points other than [non-identifier characters](#non-identifier-characters),
 so long as this doesn't produce something confusable for a [Number](#number),
 [Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
 and an Identifier can start with `-`, but when an Identifier starts with `-`
@@ -122,9 +122,9 @@ of having an identifier look like a negative number.
 The following characters cannot be used anywhere in a bare
 [Identifier](#identifier):
-* Any codepoint with hexadecimal value `0x20` or below.
-* Any codepoint with hexadecimal value higher than `0x10FFFF`.
-* Any of `\/(){}<>;[]=,"`
+* Any of `\/(){};[]="`
+* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
+  documents.
 ### Line Continuation
@@ -137,6 +137,7 @@ characters and an optional single-line comment. It must be terminated by a
 Following a line continuation, processing of a Node can continue as usual.
 #### Example
 ```kdl
 my-node 1 2 \  // comments are ok after \
        3 4    // This is the actual end of the Node.
@@ -309,6 +310,10 @@ String Value can encompass multiple lines without behaving like a Newline for
 Strings _MUST_ be represented as UTF-8 values.
 Strings _MUST NOT_ include the code points for [disallowed literal
 code points](#disallowed-literal-code-points) directly. If needed, they can be
 specified with their corresponding `\u{}` escape.
 #### Escapes
 In addition to literal code points, a number of "escapes" are supported.
@@ -362,17 +367,27 @@ support `\`-escapes. They otherwise share the same properties as far as
 literal [Newline](#newline) characters go, and the requirement of UTF-8
 representation.
-Raw String literals are represented as `r`, followed by zero or more `#`
-characters, followed by `"`, followed by any number of UTF-8 literals. The string is then
-closed by a `"` followed by a _matching_ number of `#` characters. This means
-that the string sequence `"` or `"#` and such must not match the closing `"`
-with the same or more `#` characters as the opening `r`.
+Raw String literals are represented with one or more `#` characters, followed
+by `"`, followed by any number of UTF-8 literals. The string is then closed by
+a `"` followed by a _matching_ number of `#` characters. This means that the
+string sequence `"` or `"#` and such must not match the closing `"` with the
+same or more `#` characters as the opening `#`, in the body of the string.
 Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
 code-points](#disallowed-literal-code-points) as code points in their body.
 Unlike with Strings, these cannot simply be escaped, and are thus
 unrepresentable when using Raw Strings.
 #### Example
 ```kdl
-just-escapes r"\n will be literal"
-quotes-and-escapes r#"hello\n\r\asd"world"#
+just-escapes #"\n will be literal"#
+quotes-and-escapes ##"hello\n\r\asd"#world"##
 ```
 ### Number
@@ -470,6 +485,16 @@ lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):
 Note that for the purpose of new lines, CRLF is considered _a single newline_.
 ### Disallowed Literal Code Points
 The following code points may not appear literally anywhere in the document.
 They may be represented in Strings (but not Raw Strings) using `\u{}`.
 * Any code point with hexadecimal value below `0x20` (various control characters).
 * `0x7F` (the Delete control character).
 * Any code point with hexadecimal value higher than `0x10FFFF`.
 * `0x2066-2069` and `0x202A-202E`, the [Unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
 ## Full Grammar
 This is the full official grammar for KDL and should be considered
@@ -494,25 +519,24 @@ node-children := '{' nodes '}'
 node-terminator := single-line-comment | newline | ';' | eof
 identifier := string | bare-identifier
-bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
-unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
+bare-identifier := (unambiguous-ident | numberish-ident) - keyword
+unambiguous-ident := (identifier-char - digit - sign - "#") identifier-char*
 numberish-ident := sign ((identifier-char - digit) identifier-char*)?
-stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
-identifier-char := unicode - line-space - [\\/(){}<>;\[\]=,"]
+identifier-char := unicode - line-space - [\\/(){};\[\]="] - disallowed-literal-code-points
+
 keyword := boolean | 'null'
 prop := identifier '=' value
 value := type? (string | number | keyword)
 type := '(' identifier ')'
 string := raw-string | escaped-string
-escaped-string := '"' character* '"'
-character := '\' escape | [^\"]
+escaped-string := '"' string-character* '"'
+string-character := '\' escape | [^\"] - disallowed-literal-code-points
 escape := ["\\bfnrt] | 'u{' hex-digit{1, 6} '}' | (unicode-space | newline)+
 hex-digit := [0-9a-fA-F]
-raw-string := 'r' raw-string-hash
-raw-string-hash := '#' raw-string-hash '#' | raw-string-quotes
-raw-string-quotes := '"' .* '"'
+raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
+raw-string-quotes := '"' (unicode - disallowed-literal-code-points)* '"'
 number := decimal | hex | octal | binary
@@ -528,7 +552,7 @@ binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*
 boolean := 'true' | 'false'
-escline := '\\' ws* (single-line-comment | newline)
+escline := '\\' ws* (single-line-comment | newline | eof)
 newline := See Table (All line-break white_space)
@@ -536,6 +560,8 @@ ws := bom | unicode-space | multi-line-comment
 bom := '\u{FEFF}'
 disallowed-literal-code-points := See Table (Disallowed Literal Code Points)
 unicode-space := See Table (All White_Space unicode characters which are not `newline`)
 single-line-comment := '//' ^newline* (newline | eof)