better organization of how we talk about identifiers/strings and comments

Kat Marchán 2023-12-16 21:44:25 -08:00
parent 39b9fac0d3
commit 055de4e1be
1 changed file with 119 additions and 95 deletions

SPEC.md

@ -50,8 +50,8 @@ baz
### Node
Being a node-oriented language means that the real core component of any KDL
document is the "node". Every node must have a name, which is an
[Identifier](#identifier).
document is the "node". Every node must have a name, which must be a
[String](#string).
The name may be preceded by a [Type Annotation](#type-annotation) to further
clarify its type, particularly in relation to its parent node. (For example,
@ -75,9 +75,9 @@ By contrast, Property order _SHOULD NOT_ matter to implementations.
[Children](#children-block) should be used if an order-sensitive key/value
data structure must be represented in KDL.
Nodes _MAY_ be prefixed with `/-` to "comment out" the entire node, including
its properties, arguments, and children, and make it act as plain whitespace,
even if it spreads across multiple lines.
Nodes _MAY_ be prefixed with [Slashdash](#slashdash-comments) to "comment out"
the entire node, including its properties, arguments, and children, and make
it act as plain whitespace, even if it spreads across multiple lines.
Finally, a node is terminated by either a [Newline](#newline), a semicolon (`;`)
or the end of the file/stream (an `EOF`).
@ -85,64 +85,12 @@ or the end of the file/stream (an `EOF`).
#### Example
```kdl
foo 1 key="val" 3 {
foo 1 key=val 3 {
bar
(role)baz 1 2
}
```
### Identifier
An Identifier is either a [Bare Identifier](#bare-identifier), which is an
unquoted string like `node` or `item`, a [String](#string), or a [Raw String](#raw-string).
There's no semantic difference between the kinds of identifier; this simply allows
for the use of quotes to have unusual identifiers that are inexpressible as bare identifiers.
### Bare Identifier
A Bare Identifier is composed of any [Unicode Scalar
Value](https://unicode.org/glossary/#unicode_scalar_value) other than
[non-initial characters](#non-initial-characters), followed by any number of
Unicode Scalar Values other than [non-identifier
characters](#non-identifier-characters), so long as this doesn't produce
something confusable for a [Number](#number). For example, both a
[Number](#number) and an Identifier can start with `-`, but when an Identifier
starts with `-` the second character cannot be a digit. This is precisely
specified in the [Full Grammar](#full-grammar) below.
When Identifiers are used as the values in [Arguments](#argument) and
[Properties](#property), they are treated as strings, just like they are with
node names and property keys.
Bare Identifiers are terminated by [Whitespace](#whitespace) or
[Newlines](#newline).
The literal identifiers `true`, `false`, and `null` are illegal Bare Identifiers,
and _MUST_ be treated as a syntax error.
### Non-initial characters
The following characters cannot be the first character in a
[Bare Identifier](#identifier):
* Any decimal digit (0-9)
* Any [non-identifier characters](#non-identifier-characters)
Additionally, the `-` character can only be used as an initial character if
the second character is *not* a digit. This allows identifiers to look like
`--this`, and removes the ambiguity of having an identifier look like a
negative number.
### Non-identifier characters
The following characters cannot be used anywhere in a [Bare Identifier](#identifier):
* Any of `(){}[]/\"#;`
* Any [Equals Sign](#equals-sign)
* Any [Whitespace](#whitespace) or [Newline](#newline).
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.
### Line Continuation
Line continuations allow [Nodes](#node) to be spread across multiple lines.
@ -164,7 +112,7 @@ my-node 1 2 \ // comments are ok after \
### Property
A Property is a key/value pair attached to a [Node](#node). A Property is
composed of an [Identifier](#identifier), followed immediately by an [equals
composed of a [String](#string), followed immediately by an [equals
sign](#equals-sign), and then a [Value](#value).
Properties should be interpreted left-to-right, with rightmost properties with
@ -234,11 +182,12 @@ parent { child1; child2; }
### Value
A value is either: an [Identifier](#identifier), a [String](#string), a
[Number](#number), a [Boolean](#boolean), or [Null](#null).
A value is either: a [String](#string), a [Number](#number), a
[Boolean](#boolean), or [Null](#null).
Values _MUST_ be either [Arguments](#argument) or values of
[Properties](#property).
[Properties](#property). Only [String](#string) values may be used as
[Node](#node) names or [Property](#property) keys.
Values (both as arguments and as properties) _MAY_ be prefixed by a single
[Type Annotation](#type-annotation).
@ -251,7 +200,7 @@ or as a _context-specific elaboration_ of the more generic type the node name
indicates.
Type annotations are written as a set of `(` and `)` with a single
[Identifier](#identifier) in it. It may contain Whitespace after the `(` and before
[String](#string) in it. It may contain Whitespace after the `(` and before
the `)`, and may be separated from its target by Whitespace.
KDL does not specify any restrictions on what implementations might do with
@ -331,40 +280,64 @@ node prop=(regex).*
### String
Strings in KDL represent textual [Values](#value), or unusual identifiers. A
String is either a [Quoted String](#quoted-string) or a
[Raw String](#raw-string). Quoted Strings may include escaped characters, while
Raw Strings always contain only the literal characters that are present.
Strings in KDL represent textual UTF-8 [Values](#value). A String is either an
[Identifier String](#identifier-string), a [Quoted String](#quoted-string) or
a [Raw String](#raw-string). Quoted Strings may include escaped characters,
while Raw Strings always contain only the literal characters that are present.
Identifier Strings don't use delimiters.
Strings _MUST_ be represented as UTF-8 values.
Strings _MUST NOT_ include the code points for [disallowed literal
code points](#disallowed-literal-code-points) directly. If needed, they can be
specified with their corresponding `\u{}` escape.
Strings _MUST NOT_ include the code points for [disallowed literal code
points](#disallowed-literal-code-points) directly. Quoted Strings may include
these code points as _values_ by representing them with their corresponding
`\u{...}` escape.
### Multi-line Strings
### Identifier String
Strings may span multiple lines with literal Newlines, in which case the
resulting String is "dedented" according to the line with the fewest number of
Whitespace characters preceding the first non-Whitespace character. That is,
the number of literal Whitespace characters in the least-indented line in the String
body is subtracted from the Whitespace of all other lines.
An Identifier String (sometimes referred to as just an "identifier") is
composed of any [Unicode Scalar
Value](https://unicode.org/glossary/#unicode_scalar_value) other than
[non-initial characters](#non-initial-characters), followed by any number of
Unicode Scalar Values other than [non-identifier
characters](#non-identifier-characters), so long as this doesn't produce
something confusable for a [Number](#number). For example, both a
[Number](#number) and an Identifier can start with `-`, but when an Identifier
starts with `-` the second character cannot be a digit. This is precisely
specified in the [Full Grammar](#full-grammar) below.
Multi-line strings _MUST_ have a single [Newline](#newline) immediately
following their opening `"`, after which they may have any number of newlines.
Finally, there must be a Newline, followed by any number of Whitespace, before
the closing `"`.
When Identifiers are used as the values in [Arguments](#argument) and
[Properties](#property), they are treated as strings, just like they are with
node names and property keys.
The first Newline, the last Newline, along with Whitespace following the last
Newline, are not included in the value of the String. The first and last
Newline can be the same character (that is, empty multi-line strings are
legal).
Identifier Strings are terminated by [Whitespace](#whitespace) or
[Newlines](#newline).
Furthermore, any lines in the string body that only contain literal whitespace
are stripped to only contain the single Newline character.
The literal identifiers `true`, `false`, and `null` are illegal Identifier
Strings, and _MUST_ be treated as a syntax error.
Strings with literal Newlines that do not immediately start with a Newline and
whose final `"` is not preceded by whitespace and a Newline are illegal.
#### Non-initial characters
The following characters cannot be the first character in an
[Identifier String](#identifier-string):
* Any decimal digit (0-9)
* Any [non-identifier characters](#non-identifier-characters)
Additionally, the `-` character can only be used as an initial character if
the second character is *not* a digit. This allows identifiers to look like
`--this`, and removes the ambiguity of having an identifier look like a
negative number.
#### Non-identifier characters
The following characters cannot be used anywhere in an [Identifier String](#identifier-string):
* Any of `(){}[]/\"#;`
* Any [Equals Sign](#equals-sign)
* Any [Whitespace](#whitespace) or [Newline](#newline).
* Any [disallowed literal code points](#disallowed-literal-code-points) in KDL
documents.
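#### Example

The following is a minimal sketch of how the rules above might play out. It assumes only the rules stated in this section; the node names, keys, and values are made up for illustration.

```kdl
// Identifier Strings: no delimiters needed.
my-node --flag key=value

// Quoted Strings: needed when the text contains non-identifier
// characters such as Whitespace or `=`.
"my node" "hello world" "key with spaces"="a=b"

// `-1` starts with `-` followed by a digit, so it reads as a Number.
// Quoting it yields a String instead.
node -1 "-1"
```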
### Quoted String
@ -377,7 +350,8 @@ purposes.
Like Strings, Quoted Strings _MUST NOT_ include any of the [disallowed literal
code-points](#disallowed-literal-code-points) as code points in their body.
Quoted Strings also follow the Multi-line rules specified in [String](#string).
Quoted Strings also follow the Multi-line rules specified in [Multi-line
String](#multi-line-strings).
#### Escapes
@ -441,9 +415,9 @@ a `"` followed by a _matching_ number of `#` characters. This means that the
string sequence `"` or `"#` and such must not match the closing `"` with the
same or more `#` characters as the opening `#`, in the body of the string.
Like Strings, Raw Strings _MUST NOT_ include any of the [disallowed literal
code-points](#disallowed-literal-code-points) as code points in their body.
Unlike with Strings, these cannot simply be escaped, and are thus
Like other Strings, Raw Strings _MUST NOT_ include any of the [disallowed
literal code-points](#disallowed-literal-code-points) as code points in their
body. Unlike with Quoted Strings, these cannot simply be escaped, and are thus
unrepresentable when using Raw Strings.
#### Example
@ -476,6 +450,31 @@ This is the base indentation
bar
```
### Multi-line Strings
Quoted and Raw Strings may span multiple lines with literal Newlines, in which
case the resulting String is "dedented" according to the line with the fewest
number of Whitespace characters preceding the first non-Whitespace character.
That is, the number of literal Whitespace characters in the least-indented
line in the String body is subtracted from the Whitespace of all other lines.
Multi-line strings _MUST_ have a single [Newline](#newline) immediately
following their opening `"`, after which they may have any number of newlines.
Finally, there must be a Newline, followed by any number of Whitespace, before
the closing `"`.
The first Newline, the last Newline, along with Whitespace following the last
Newline, are not included in the value of the String. The first and last
Newline can be the same character (that is, empty multi-line strings are
legal).
Furthermore, any lines in the string body that only contain literal whitespace
are stripped to only contain the single Newline character.
Strings with literal Newlines that do not immediately start with a Newline and
whose final `"` is not preceded by whitespace and a Newline are illegal.
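#### Example

As a small illustration of the dedent rule described above (the node name `message` and the text are hypothetical), consider:

```kdl
message "
    first line
      second line
    "
```

Here the least-indented line has four leading spaces, so four spaces would be removed from every line. Under the rules above, the resulting value should be `first line`, a Newline, and then `  second line` with its two extra spaces kept. The first Newline, the last Newline, and the Whitespace before the closing `"` are not part of the value.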
### Number
Numbers in KDL represent numerical [Values](#value). There is no logical distinction in KDL
@ -545,6 +544,11 @@ space](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt):
| Medium Mathematical Space | `U+205F` |
| Ideographic Space | `U+3000` |
#### Single-line comments
Any text after `//`, up to the next literal [Newline](#newline), is "commented
out" and is considered to be [Whitespace](#whitespace).
#### Multi-line comments
In addition to single-line comments using `//`, comments can also be started
@ -552,6 +556,23 @@ with `/*` and ended with `*/`. These comments can span multiple lines. They
are allowed in all positions where [Whitespace](#whitespace) is allowed and
can be nested.
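A short sketch of both properties, nesting and spanning lines, using made-up node names:

```kdl
// The inner `*/` only closes the inner comment, so `node` receives
// the arguments 1 and 2.
node 1 /* outer /* inner */ still inside the outer comment */ 2

/*
  a comment spanning
  multiple lines
*/
other-node
```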
#### Slashdash comments
Finally, a special kind of comment called a "slashdash", denoted by `/-`, can
be used to comment out entire _components_ of a KDL document logically, and
have those elements be treated as whitespace.
Slashdash comments can be used before:
* A [Node](#node) name (or its type annotation): the entire Node is
treated as Whitespace, including all props, args, and children.
* A node [Argument](#argument) (or its type annotation), in which case
the Argument value is treated as Whitespace.
* A [Property](#property) key, in which case the entire property, both
key and value, is treated as Whitespace.
* A [Children Block](#children-block), in which case the entire block,
including all children within, is treated as Whitespace.
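The four positions listed above can be sketched as follows. All names and values here are hypothetical and chosen only to show where `/-` may appear:

```kdl
// Node: the whole node, including its children, is ignored.
/-disabled-node 1 2 {
    child
}

// Argument: only "skipped" is ignored; "kept" remains.
node /-"skipped" "kept"

// Property: both the key and the value of `skipped=1` are ignored.
node /-skipped=1 kept=2

// Children Block: the node and its argument remain, the block is ignored.
node "arg" /-{
    skipped-child
}
```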
### Newline
The following characters [should be treated as new
@ -574,10 +595,13 @@ Note that for the purpose of new lines, CRLF is considered _a single newline_.
The following code points may not appear literally anywhere in the document.
They may be represented in Strings (but not Raw Strings) using `\u{}`.
* Any codepoint with hexadecimal value `0x20` or below (various control characters).
* Any codepoint with hexadecimal value `0x20` or below (various control
characters).
* `0x7F` (the Delete control character).
* Any codepoint that is not a [Unicode Scalar Value](https://unicode.org/glossary/#unicode_scalar_value).
* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control" characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
* Any codepoint that is not a [Unicode Scalar
Value](https://unicode.org/glossary/#unicode_scalar_value).
* `0x2066-2069` and `0x202A-202E`, the [unicode "direction control"
characters](https://www.w3.org/International/questions/qa-bidi-unicode-controls)
## Full Grammar