better organization of how we talk about identifiers/strings and comments

Kat Marchán 2023-12-16 21:44:25 -08:00
parent 39b9fac0d3
commit 055de4e1be
GPG Key ID: AEB529C08A3C7E9E
1 changed files with 119 additions and 95 deletions

214
SPEC.md

@ -50,8 +50,8 @@ baz
### Node ### Node
Being a node-oriented language means that the real core component of any KDL Being a node-oriented language means that the real core component of any KDL
document is the "node". Every node must have a name, which is an document is the "node". Every node must have a name, which must be a
[Identifier](#identifier). [String](#string).
The name may be preceded by a [Type Annotation](#type-annotation) to further The name may be preceded by a [Type Annotation](#type-annotation) to further
clarify its type, particularly in relation to its parent node. (For example, clarify its type, particularly in relation to its parent node. (For example,
@ -75,9 +75,9 @@ By contrast, Property order _SHOULD NOT_ matter to implementations.
[Children](#children-block) should be used if an order-sensitive key/value [Children](#children-block) should be used if an order-sensitive key/value
data structure must be represented in KDL. data structure must be represented in KDL.
Nodes _MAY_ be prefixed with `/-` to "comment out" the entire node, including Nodes _MAY_ be prefixed with [Slashdash](#slashdash-comments) to "comment out"
its properties, arguments, and children, and make it act as plain whitespace, the entire node, including its properties, arguments, and children, and make
even if it spreads across multiple lines. it act as plain whitespace, even if it spreads across multiple lines.
Finally, a node is terminated by either a [Newline](#newline), a semicolon (`;`) Finally, a node is terminated by either a [Newline](#newline), a semicolon (`;`)
or the end of the file/stream (an `EOF`). or the end of the file/stream (an `EOF`).
@ -85,64 +85,12 @@ or the end of the file/stream (an `EOF`).
#### Example #### Example
```kdl ```kdl
foo 1 key="val" 3 { foo 1 key=val 3 {
bar bar
(role)baz 1 2 (role)baz 1 2
} }
``` ```
### Identifier
An Identifier is either a [Bare Identifier](#bare-identifier), which is an
unquoted string like `node` or `item`, a [String](#string), or a [Raw String](#raw-string).
There's no semantic difference between the kinds of identifier; this simply allows
for the use of quotes to have unusual identifiers that are inexpressible as bare identifiers.
### Bare Identifier
A Bare Identifier is composed of any [Unicode Scalar
Value](https://unicode.org/glossary/#unicode_scalar_value) other than
[non-initial characters](#non-initial-characters), followed by any number of
Unicode Scalar Values other than [non-identifier
characters](#non-identifier-characters), so long as this doesn't produce
something confusable for a [Number](#number). For example, both a
[Number](#number) and an Identifier can start with `-`, but when an Identifier
starts with `-` the second character cannot be a digit. This is precisely
specified in the [Full Grammar](#full-grammar) below.
When Identifiers are used as the values in [Arguments](#argument) and
[Properties](#property), they are treated as strings, just like they are with
node names and property keys.
Bare Identifiers are terminated by [Whitespace](#whitespace) or
[Newlines](#newline).
The literal identifiers `true`, `false`, and `null` are illegal Bare Identifiers,
and _MUST_ be treated as a syntax error.
### Non-initial characters
The following characters cannot be the first character in a
[Bare Identifier](#identifier):
* Any decimal digit (0-9)
* Any [non-identifier characters](#non-identifier-characters)
Additionally, the `-` character can only be used as an initial character if
the second character is *not* a digit. This allows identifiers to look like
`--this`, and removes the ambiguity of having an identifier look like a
negative number.
### Non-identifier characters
The following characters cannot be used anywhere in a [Bare Identifier](#identifier):
* Any of `(){}[]/\"#;`
* Any [Equals Sign](#equals-sign)
* Any [Whitespace](#whitespace) or [Newline](#newline).
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.
### Line Continuation ### Line Continuation
Line continuations allow [Nodes](#node) to be spread across multiple lines. Line continuations allow [Nodes](#node) to be spread across multiple lines.
@ -164,7 +112,7 @@ my-node 1 2 \ // comments are ok after \
### Property ### Property
A Property is a key/value pair attached to a [Node](#node). A Property is A Property is a key/value pair attached to a [Node](#node). A Property is
composed of an [Identifier](#identifier), followed immediately by an [equals composed of a [String](#string), followed immediately by an [equals
sign](#equals-sign), and then a [Value](#value). sign](#equals-sign), and then a [Value](#value).
Properties should be interpreted left-to-right, with rightmost properties with Properties should be interpreted left-to-right, with rightmost properties with
@ -234,11 +182,12 @@ parent { child1; child2; }
### Value ### Value
A value is either: an [Identifier](#identifier), a [String](#string), a A value is either: a [String](#string), a [Number](#number), a
[Number](#number), a [Boolean](#boolean), or [Null](#null). [Boolean](#boolean), or [Null](#null).
Values _MUST_ be either [Arguments](#argument) or values of Values _MUST_ be either [Arguments](#argument) or values of
[Properties](#property). [Properties](#property). Only [String](#string) values may be used as
[Node](#node) names or [Property](#property) keys.
Values (both as arguments and as properties) _MAY_ be prefixed by a single Values (both as arguments and as properties) _MAY_ be prefixed by a single
[Type Annotation](#type-annotation). [Type Annotation](#type-annotation).
@ -251,7 +200,7 @@ or as a _context-specific elaboration_ of the more generic type the node name
indicates. indicates.
Type annotations are written as a set of `(` and `)` with a single Type annotations are written as a set of `(` and `)` with a single
[Identifier](#identifier) in it. It may contain Whitespace after the `(` and before [String](#string) in it. It may contain Whitespace after the `(` and before
the `)`, and may be separated from its target by Whitespace. the `)`, and may be separated from its target by Whitespace.
KDL does not specify any restrictions on what implementations might do with KDL does not specify any restrictions on what implementations might do with
@ -331,40 +280,64 @@ node prop=(regex).*
### String ### String
Strings in KDL represent textual [Values](#value), or unusual identifiers. A Strings in KDL represent textual UTF-8 [Values](#value). A String is either an
String is either a [Quoted String](#quoted-string) or a [Identifier String](#identifier-string), a [Quoted String](#quoted-string) or
[Raw String](#raw-string). Quoted Strings may include escaped characters, while a [Raw String](#raw-string). Quoted Strings may include escaped characters,
Raw Strings always contain only the literal characters that are present. while Raw Strings always contain only the literal characters that are present.
Identifier Strings don't use delimiters.
Strings _MUST_ be represented as UTF-8 values. Strings _MUST_ be represented as UTF-8 values.
Strings _MUST NOT_ include the code points for [disallowed literal Strings _MUST NOT_ include the code points for [disallowed literal code
code points](#disallowed-literal-code-points) directly. If needed, they can be points](#disallowed-literal-code-points) directly. Quoted Strings may include
specified with their corresponding `\u{}` escape. these code points as _values_ by representing them with their corresponding
`\u{...}` escape.
### Multi-line Strings ### Identifier String
Strings may span multiple lines with literal Newlines, in which case the An Identifier String (sometimes referred to as just an "identifier") is
resulting String is "dedented" according to the line with the fewest number of composed of any [Unicode Scalar
Whitespace characters preceding the first non-Whitespace character. That is, Value](https://unicode.org/glossary/#unicode_scalar_value) other than
the number of literal Whitespace characters in the least-indented line in the String [non-initial characters](#non-initial-characters), followed by any number of
body is subtracted from the Whitespace of all other lines. Unicode Scalar Values other than [non-identifier
characters](#non-identifier-characters), so long as this doesn't produce
something confusable for a [Number](#number). For example, both a
[Number](#number) and an Identifier can start with `-`, but when an Identifier
starts with `-` the second character cannot be a digit. This is precisely
specified in the [Full Grammar](#full-grammar) below.
Multi-line strings _MUST_ have a single [Newline](#newline) immediately When Identifiers are used as the values in [Arguments](#argument) and
following their opening `"`, after which they may have any number of newlines. [Properties](#property), they are treated as strings, just like they are with
Finally, there must be a Newline, followed by any number of Whitespace, before node names and property keys.
the closing `"`.
The first Newline, the last Newline, along with Whitespace following the last Identifier Strings are terminated by [Whitespace](#whitespace) or
Newline, are not included in the value of the String. The first and last [Newlines](#newline).
Newline can be the same character (that is, empty multi-line strings are
legal).
Furthermore, any lines in the string body that only contain literal whitespace The literal identifiers `true`, `false`, and `null` are illegal Identifier
are stripped to only contain the single Newline character. Strings, and _MUST_ be treated as a syntax error.
Strings with literal Newlines that do not immediately start with a Newline and #### Non-initial characters
whose final `"` is not preceded by whitespace and a Newline are illegal.
The following characters cannot be the first character in an
[Identifier String](#identifier-string):
* Any decimal digit (0-9)
* Any [non-identifier characters](#non-identifier-characters)
Additionally, the `-` character can only be used as an initial character if
the second character is *not* a digit. This allows identifiers to look like
`--this`, and removes the ambiguity of having an identifier look like a
negative number.
#### Non-identifier characters
The following characters cannot be used anywhere in an [Identifier String](#identifier-string):
* Any of `(){}[]/\"#;`
* Any [Equals Sign](#equals-sign)
* Any [Whitespace](#whitespace) or [Newline](#newline).
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.
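
The Identifier String rules above can be sketched as a small validity check. This is an illustrative reading of the prose, not the grammar itself: the function names are hypothetical, `str.isspace()` stands in for the Whitespace and Newline tables, only the ASCII `=` is treated as an Equals Sign, and only the `-`-followed-by-digit case of Number confusability is handled. The Full Grammar remains authoritative.

```python
# Punctuation that may not appear anywhere in an Identifier String.
NON_IDENTIFIER_PUNCTUATION = set('(){}[]/\\"#;')
DECIMAL_DIGITS = "0123456789"


def is_disallowed_code_point(ch: str) -> bool:
    """Disallowed literal code points, read literally from the list in this document."""
    cp = ord(ch)
    return (
        cp <= 0x20                    # control characters (as listed: 0x20 or below)
        or cp == 0x7F                 # Delete
        or 0xD800 <= cp <= 0xDFFF     # surrogates are not Unicode Scalar Values
        or 0x2066 <= cp <= 0x2069     # direction control characters
        or 0x202A <= cp <= 0x202E
    )


def is_non_identifier_char(ch: str) -> bool:
    return (
        ch in NON_IDENTIFIER_PUNCTUATION
        or ch == "="        # assumption: only the ASCII equals sign is checked
        or ch.isspace()     # assumption: approximates the Whitespace/Newline tables
        or is_disallowed_code_point(ch)
    )


def is_identifier_string(text: str) -> bool:
    """Return True if `text` may be written without quotes."""
    if not text or text in {"true", "false", "null"}:
        return False
    if any(is_non_identifier_char(ch) for ch in text):
        return False
    if text[0] in DECIMAL_DIGITS:
        return False
    # A leading `-` must not be followed by a digit, to avoid looking like a Number.
    if text[0] == "-" and len(text) > 1 and text[1] in DECIMAL_DIGITS:
        return False
    return True
```

Under these assumptions, `node` and `--this` pass, while `-1`, `1abc`, `true`, and `a=b` are rejected and would need to be written as Quoted or Raw Strings.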
### Quoted String ### Quoted String
@ -377,7 +350,8 @@ purposes.
Like Strings, Quoted Strings _MUST NOT_ include any of the [disallowed literal Like Strings, Quoted Strings _MUST NOT_ include any of the [disallowed literal
code-points](#disallowed-literal-code-points) as code points in their body. code-points](#disallowed-literal-code-points) as code points in their body.
Quoted Strings also follow the Multi-line rules specified in [String](#string). Quoted Strings also follow the Multi-line rules specified in [Multi-line
String](#multi-line-strings).
#### Escapes #### Escapes
@ -441,9 +415,9 @@ a `"` followed by a _matching_ number of `#` characters. This means that the
string sequence `"` or `"#` and such must not match the closing `"` with the string sequence `"` or `"#` and such must not match the closing `"` with the
same or more `#` characters as the opening `#`, in the body of the string. same or more `#` characters as the opening `#`, in the body of the string.
Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal Like other Strings, Raw Strings _MUST NOT_ include any of the [disallowed
code-points](#disallowed-literal-code-points) as code points in their body. literal code-points](#disallowed-literal-code-points) as code points in their
Unlike with Strings, these cannot simply be escaped, and are thus body. Unlike with Quoted Strings, these cannot simply be escaped, and are thus
unrepresentable when using Raw Strings. unrepresentable when using Raw Strings.
#### Example #### Example
@ -476,6 +450,31 @@ This is the base indentation
bar bar
``` ```
### Multi-line Strings
Quoted and Raw Strings may span multiple lines with literal Newlines, in which
case the resulting String is "dedented" according to the line with the fewest
number of Whitespace characters preceding the first non-Whitespace character.
That is, the number of literal Whitespace characters in the least-indented
line in the String body is subtracted from the Whitespace of all other lines.
Multi-line strings _MUST_ have a single [Newline](#newline) immediately
following their opening `"`, after which they may have any number of newlines.
Finally, there must be a Newline, followed by any number of Whitespace, before
the closing `"`.
The first Newline, the last Newline, along with Whitespace following the last
Newline, are not included in the value of the String. The first and last
Newline can be the same character (that is, empty multi-line strings are
legal).
Furthermore, any lines in the string body that only contain literal whitespace
are stripped to only contain the single Newline character.
Strings with literal Newlines that do not immediately start with a Newline and
whose final `"` is not preceeded by whitespace and a Newline are illegal.
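
The dedent rules above can be sketched as follows. This is a minimal illustration under stated assumptions: `dedent_multiline` is a hypothetical helper, it receives the raw text between the opening and closing `"`, Newlines are already normalized to `\n`, only space and tab count as Whitespace, and whitespace-only lines are excluded when finding the least-indented line (the text does not say so explicitly).

```python
def dedent_multiline(body: str) -> str:
    """Apply the multi-line dedent rules to the raw body of a string."""
    lines = body.split("\n")
    # The body must start with a Newline and end with a Newline followed
    # only by Whitespace; anything else is an illegal multi-line string.
    if len(lines) < 2 or lines[0] != "" or lines[-1].strip(" \t") != "":
        raise ValueError("illegal multi-line string")

    # Drop the first Newline, the last Newline, and the trailing Whitespace.
    content = lines[1:-1]

    # Whitespace-only lines are reduced to just their Newline.
    content = ["" if line.strip(" \t") == "" else line for line in content]

    # Find the least-indented non-empty line and subtract that indentation.
    indents = [len(line) - len(line.lstrip(" \t")) for line in content if line]
    margin = min(indents, default=0)
    return "\n".join(line[margin:] for line in content)
```

For a body of `"\n    foo\n      bar\n  "` this yields `foo` followed by `bar` indented by two spaces, and a body consisting of a single Newline yields the empty string.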
### Number ### Number
Numbers in KDL represent numerical [Values](#value). There is no logical distinction in KDL Numbers in KDL represent numerical [Values](#value). There is no logical distinction in KDL
@ -545,6 +544,11 @@ space](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt):
| Medium Mathematical Space | `U+205F` | | Medium Mathematical Space | `U+205F` |
| Ideographic Space | `U+3000` | | Ideographic Space | `U+3000` |
#### Single-line comments
Any text after `//`, until the next literal [Newline](#newline) is "commented
out", and is considered to be [Whitespace](#whitespace).
#### Multi-line comments #### Multi-line comments
In addition to single-line comments using `//`, comments can also be started In addition to single-line comments using `//`, comments can also be started
@ -552,6 +556,23 @@ with `/*` and ended with `*/`. These comments can span multiple lines. They
are allowed in all positions where [Whitespace](#whitespace) is allowed and are allowed in all positions where [Whitespace](#whitespace) is allowed and
can be nested. can be nested.
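
Because these comments nest, a parser cannot simply search for the first `*/`; it has to track depth. The sketch below shows one way to do that. `skip_block_comment` is a hypothetical helper, not part of any KDL implementation, and it assumes the caller has already located the opening `/*`.

```python
def skip_block_comment(text: str, start: int) -> int:
    """Return the index just past the `*/` that closes the comment at `start`."""
    if not text.startswith("/*", start):
        raise ValueError("no block comment at this position")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("/*", i):
            depth += 1          # entering a (possibly nested) comment
            i += 2
        elif text.startswith("*/", i):
            depth -= 1          # leaving one nesting level
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

Given `a /* x /* y */ z */ b` and the index of the first `/*`, the function returns the position right after the final `*/`, so the inner `*/` does not end the outer comment.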
#### Slashdash comments
Finally, a special kind of comment called a "slashdash", denoted by `/-`, can
be used to comment out entire _components_ of a KDL document logically, and
have those elements be treated as whitespace.
Slashdash comments can be used before:
* A [Node](#node) name (or its type annotation): the entire Node is
treated as Whitespace, including all props, args, and children.
* A node [Argument](#argument) (or its type annotation), in which case
the Argument value is treated as Whitespace.
* A [Property](#property) key, in which case the entire property, both
key and value, is treated as Whitespace.
* A [Children Block](#children-block), in which case the entire block,
including all children within, is treated as Whitespace.
### Newline ### Newline
The following characters [should be treated as new The following characters [should be treated as new
@ -574,10 +595,13 @@ Note that for the purpose of new lines, CRLF is considered _a single newline_.
The following code points may not appear literally anywhere in the document. The following code points may not appear literally anywhere in the document.
They may be represented in Strings (but not Raw Strings) using `\u{}`. They may be represented in Strings (but not Raw Strings) using `\u{}`.
* Any codepoint with hexadecimal value `0x20` or below (various control characters). * Any codepoint with hexadecimal value `0x20` or below (various control
characters).
* `0x7F` (the Delete control character). * `0x7F` (the Delete control character).
* Any codepoint that is not a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value). * Any codepoint that is not a [Unicode Scalar
* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls) Value](https://unicode.org/glossary/#unicode_scalar_value).
* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control"
characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
## Full Grammar ## Full Grammar