title: The KDL Document Language
abbrev: KDL
docname: draft-marchan-kdl2-latest
submissionType: independent
category: exp
ipr: trust200902
area: General
venue:
  github: kdl-org/kdl
  home: https://kdl.dev/
workgroup: KDL Community
keyword:
  - Document-Language
  - Configuration
stand_alone: yes
smart_quotes: no
pi: [toc, sortrefs, symrefs]
author:
  - name: Katerina Zoé Marchán Salvá
    ins: K. Marchán
    organization: Microsoft
normative:
informative:

If the last line wasn't indented as far, it won't dedent the rest of the lines as much:

multi-line """
        foo
    This is no longer on the left edge
            bar
  """

This example's string value will be:

      foo
  This is no longer on the left edge
          bar

Equivalent to "      foo\n  This is no longer on the left edge\n          bar".


Empty lines can contain any whitespace, or none at all, and will be reflected as empty in the value:

multi-line """
    Indented a bit.

    A second indented paragraph.
    """

This example's string value will be:

Indented a bit.

A second indented paragraph.

Equivalent to "Indented a bit.\n\nA second indented paragraph."
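The dedent rule illustrated by these examples can be sketched in Python. This is a minimal, non-authoritative sketch: the function name `dedent_multiline` is hypothetical, it assumes the input is the text between the opening `"""` plus newline and the closing `"""`, and it only recognizes LF newlines and space/tab whitespace rather than the full KDL tables.

```python
def dedent_multiline(raw):
    """Dedent the body of a multi-line string.

    `raw` is everything after the opening newline up to (but not
    including) the closing quotes, so its last line is the whitespace
    prefix of the closing line. Simplified: LF newlines, space/tab only.
    """
    *body, prefix = raw.split("\n")
    if prefix.strip(" \t") != "":
        raise ValueError("closing quotes must be preceded only by whitespace")
    out = []
    for line in body:
        if line.strip(" \t") == "":
            out.append("")  # whitespace-only lines become empty
        elif line.startswith(prefix):
            out.append(line[len(prefix):])  # remove the shared prefix
        else:
            raise ValueError("line does not start with the closing line's prefix")
    return "\n".join(out)
```

Lines that contain only whitespace are emitted as empty regardless of their indentation, while any other line that does not begin with the closing line's exact prefix is rejected.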


The following yield syntax errors:

multi-line """can't be single line"""
multi-line """
  closing quote with non-whitespace prefix"""
multi-line """stuff
  """
// Every line must share the exact same prefix as the closing line.
multi-line """[\n]
[tab]a[\n]
[space][space]b[\n]
[space][tab][\n]
[tab]"""

Interaction with Whitespace Escapes

Multi-line strings support the same mechanism for escaping whitespace as Quoted Strings.

When processing a Multi-line String, implementations MUST dedent the string after resolving all whitespace escapes, but before resolving other backslash escapes. This means a whitespace escape that attempts to escape the final line's newline and/or whitespace prefix can be invalid: if removing escaped whitespace places the closing """ on a line with non-whitespace characters, this escape is invalid.
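The ordering requirement can be sketched as a small Python pipeline. This is an illustrative sketch under stated assumptions, not a reference implementation: the names `resolve_ws_escapes`, `dedent`, and `process_multiline` are hypothetical, Python's `\s` is used as a stand-in for KDL's whitespace and newline tables, and the final step of resolving other backslash escapes is omitted.

```python
import re

# A backslash followed by a run of whitespace, or by any single character.
# Matching "\\" as a pair keeps an escaped backslash from being mistaken
# for the start of a whitespace escape.
_ESCAPE = re.compile(r"\\(\s+|.)", re.DOTALL)


def resolve_ws_escapes(raw):
    """Drop whitespace escapes; leave every other escape untouched."""
    return _ESCAPE.sub(
        lambda m: "" if m.group(1).isspace() else m.group(0), raw
    )


def dedent(raw):
    """Strip the closing line's prefix from every line (LF, space/tab only)."""
    *body, prefix = raw.split("\n")
    if prefix.strip(" \t"):
        raise ValueError("closing quotes share a line with other characters")
    lines = []
    for line in body:
        if not line.strip(" \t"):
            lines.append("")
        elif line.startswith(prefix):
            lines.append(line[len(prefix):])
        else:
            raise ValueError("line does not match the closing line's prefix")
    return "\n".join(lines)


def process_multiline(raw):
    # Step 1: whitespace escapes. Step 2: dedent.
    # Step 3 (other escapes such as \n or \u{...}) is not shown here.
    return dedent(resolve_ws_escapes(raw))
```

Because the whitespace escapes are removed first, an escape that swallows the final newline leaves the closing quotes on a line with text, and the dedent step reports the error.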

For example, the following is illegal:

  """
  foo
  bar\
  """

  // equivalent to
  """
  foo
  bar"""

while the following is allowed:

  """
  foo \
bar
  baz
  \   """

  // equivalent to
  """
  foo bar
  baz
  """

Raw String

Both Quoted and Multi-Line Strings have Raw String variants, which are identical in syntax except they do not support \-escapes. This includes line-continuation escapes (\ + ws collapsing to nothing). They otherwise share the same properties as far as literal Newline ({{newline}}) characters go, multi-line rules, and the requirement of UTF-8 representation.

The Raw String variants are indicated by preceding the string's opening quotes with one or more # characters. The string is then closed by its normal closing quotes, followed by a matching number of # characters. This means that the string may contain any combination of " and # characters other than its closing delimiter (e.g., if a raw string starts with ##", it can contain " or "#, but not "## or "###).

Like other Strings, Raw Strings MUST NOT include any of the disallowed literal code-points as code points in their body. Unlike with Quoted Strings, these cannot simply be escaped, and are thus unrepresentable when using Raw Strings.

Example

just-escapes #"\n will be literal"#

The string contains the literal characters \n will be literal.

quotes-and-escapes ##"hello\n\r\asd"#world"##

The string contains the literal characters hello\n\r\asd"#world

raw-multi-line #"""
    Here's a """
        multiline string
        """
    without escapes.
    """#

The string contains the value

Here's a """
    multiline string
    """
without escapes.

or equivalently, "Here's a \"\"\"\n    multiline string\n    \"\"\"\nwithout escapes." as a Quoted String.
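The delimiter rule for single-line Raw Strings can be sketched as follows. This is a minimal sketch, assuming the input starts at the first `#`; the function name `read_raw_string` is hypothetical, and multi-line raw strings (which additionally require dedenting) are deliberately not handled.

```python
def read_raw_string(src):
    """Read a single-line raw string that begins at src[0].

    Returns (body, end), where `end` is the index just past the closing
    delimiter. The body is returned verbatim: no escapes are processed.
    """
    hashes = len(src) - len(src.lstrip("#"))
    if hashes == 0 or not src.startswith('"', hashes):
        raise ValueError("expected one or more '#' followed by a quote")
    closer = '"' + "#" * hashes  # a quote plus the same number of hashes
    start = hashes + 1
    end = src.find(closer, start)  # the first matching closer ends the string
    if end == -1:
        raise ValueError("unterminated raw string")
    return src[start:end], end + len(closer)
```

Searching for the first occurrence of the closing delimiter means a shorter sequence such as `"#` inside a `##"..."##` string is kept as ordinary content.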

Number

Numbers in KDL represent numerical Values. There is no logical distinction in KDL between real numbers, integers, and floating point numbers. It's up to individual implementations to determine how to represent KDL numbers.

There are five syntaxes for Numbers: Keywords, Decimal, Hexadecimal, Octal, and Binary.

  • All non-Keyword numbers may optionally start with one of - or +, which determines whether they are negative or positive.
  • Binary numbers start with 0b and only allow 0 and 1 as digits, which may be separated by _. They represent numbers in radix 2.
  • Octal numbers start with 0o and only allow digits between 0 and 7, which may be separated by _. They represent numbers in radix 8.
  • Hexadecimal numbers start with 0x and allow digits between 0 and 9, as well as letters A through F, in either lower or upper case, which may be separated by _. They represent numbers in radix 16.
  • Decimal numbers are a bit more special:
    • They have no radix prefix.
    • They use digits 0 through 9, which may be separated by _.
    • They may optionally include a decimal separator ., followed by more digits, which may again be separated by _.
    • They may optionally be followed by E or e, an optional - or +, and more digits, to represent an exponent value.

Note that, similar to JSON and some other languages, numbers without an integer digit (such as .1) are illegal. They must be written with at least one integer digit, like 0.1. (These patterns are also disallowed in Identifier Strings, to avoid confusion.)

Keyword Numbers

There are three special "keyword" numbers included in KDL to accommodate the widespread use of IEEE 754 floats:

  • #inf - floating point positive infinity.
  • #-inf - floating point negative infinity.
  • #nan - floating point NaN/Not a Number.

To go along with this and prevent foot guns, the bare Identifier Strings inf, -inf, and nan are considered illegal identifiers and should yield a syntax error.

The existence of these keywords does not imply that any numbers be represented as IEEE 754 floats. These are simply for clarity and convenience for any implementation that chooses to represent their numbers in this way.
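The five number syntaxes can be recognized with a short Python sketch. It is illustrative only: the name `parse_number` is hypothetical, and mapping values onto Python's `int` and `float` is one implementation choice among many, since KDL itself draws no distinction between integer and floating point numbers.

```python
import re

# Each pattern requires a leading digit, then allows digits or "_".
_DECIMAL = re.compile(
    r"[+-]?[0-9][0-9_]*(?:\.[0-9][0-9_]*)?(?:[eE][+-]?[0-9][0-9_]*)?"
)
_RADIX = {
    "0x": (16, re.compile(r"[0-9a-fA-F][0-9a-fA-F_]*")),
    "0o": (8, re.compile(r"[0-7][0-7_]*")),
    "0b": (2, re.compile(r"[01][01_]*")),
}
_KEYWORDS = {"#inf": float("inf"), "#-inf": float("-inf"), "#nan": float("nan")}


def parse_number(text):
    """Convert one KDL number token to a Python int or float."""
    if text in _KEYWORDS:
        return _KEYWORDS[text]
    sign, rest = 1, text
    if rest[:1] in ("+", "-"):
        sign = -1 if rest[0] == "-" else 1
        rest = rest[1:]
    if rest[:2] in _RADIX:
        base, pattern = _RADIX[rest[:2]]
        digits = rest[2:]
        if not pattern.fullmatch(digits):
            raise ValueError(f"invalid digits in {text!r}")
        return sign * int(digits.replace("_", ""), base)
    if not _DECIMAL.fullmatch(text):
        raise ValueError(f"not a KDL number: {text!r}")
    cleaned = text.replace("_", "")
    if any(c in cleaned for c in ".eE"):
        return float(cleaned)  # has a fraction or an exponent
    return int(cleaned)
```

Tokens such as `.1` or the bare words `inf` and `nan` match none of the patterns and are rejected, in line with the rules above.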

Boolean

A boolean Value ({{value}}) is either the symbol #true or #false. These SHOULD be represented by implementations as boolean logical values, or some approximation thereof.

Example

my-node #true value=#false

Null

The symbol #null represents a null Value ({{value}}). It's up to the implementation to decide how to represent this, but it generally signals the "absence" of a value.

Example

my-node #null key=#null

Whitespace

The following characters should be treated as non-Newline white space:

| Name                      | Code Point |
|---------------------------|------------|
| Character Tabulation      | U+0009     |
| Space                     | U+0020     |
| No-Break Space            | U+00A0     |
| Ogham Space Mark          | U+1680     |
| En Quad                   | U+2000     |
| Em Quad                   | U+2001     |
| En Space                  | U+2002     |
| Em Space                  | U+2003     |
| Three-Per-Em Space        | U+2004     |
| Four-Per-Em Space         | U+2005     |
| Six-Per-Em Space          | U+2006     |
| Figure Space              | U+2007     |
| Punctuation Space         | U+2008     |
| Thin Space                | U+2009     |
| Hair Space                | U+200A     |
| Narrow No-Break Space     | U+202F     |
| Medium Mathematical Space | U+205F     |
| Ideographic Space         | U+3000     |
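As a small illustration, the table can be turned into a membership test. The names `KDL_WHITESPACE` and `is_kdl_whitespace` are hypothetical; the set simply transcribes the code points listed above.

```python
# Non-newline whitespace code points, transcribed from the table.
KDL_WHITESPACE = frozenset(
    "\u0009\u0020\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u202F\u205F\u3000"
)


def is_kdl_whitespace(ch):
    """Return True if `ch` is a single non-newline whitespace character."""
    return ch in KDL_WHITESPACE
```

An explicit set is used here rather than `str.isspace()`, because Python's notion of whitespace also includes newline characters and a few control characters that are not in this table.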

Single-line comments

Any text after //, until the next literal Newline ({{newline}}) is "commented out", and is considered to be Whitespace ({{whitespace}}).

Multi-line comments

In addition to single-line comments using //, comments can also be started with /* and ended with */. These comments can span multiple lines. They are allowed in all positions where Whitespace ({{whitespace}}) is allowed and can be nested.
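Nesting means a scanner has to count open comments rather than stop at the first `*/`. A minimal Python sketch of that idea follows; the function name `skip_block_comment` is hypothetical.

```python
def skip_block_comment(src, start):
    """Return the index just past the block comment that opens at `start`."""
    if not src.startswith("/*", start):
        raise ValueError("not at the start of a block comment")
    depth, i = 0, start
    while i < len(src):
        if src.startswith("/*", i):
            depth += 1  # entering a (possibly nested) comment
            i += 2
        elif src.startswith("*/", i):
            depth -= 1  # leaving the innermost open comment
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated block comment")
```

For the input `/* a /* b */ c */ node`, the scan ends after the second `*/`, so only ` node` remains to be parsed.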

Slashdash comments

Finally, a special kind of comment called a "slashdash", denoted by /-, can be used to comment out entire components of a KDL document logically, and have those elements not be included as part of the parsed document data.

Slashdash comments can be used before the following, including before their type annotations, if present:

  • A Node ({{node}}): the entire Node is treated as Whitespace, including all props, args, and children.
  • An Argument ({{argument}}): the Argument value is treated as Whitespace.
  • A Property ({{property}}) key: the entire property, including both key and value, is treated as Whitespace. A slashdash of just the property value is not allowed.
  • A Children Block: the entire block, including all children within, is treated as Whitespace. Only other children blocks, whether slashdashed or not, may follow a slashdashed children block.

A slashdash may be followed by any amount of whitespace, including newlines and comments (other than other slashdashes), before the element that it comments out.

Newline

The following character sequences should be treated as new lines:

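| Acronym | Name                          | Code Point      |
|---------|-------------------------------|-----------------|
| CRLF    | Carriage Return and Line Feed | U+000D + U+000A |
| CR      | Carriage Return               | U+000D          |
| LF      | Line Feed                     | U+000A          |
| NEL     | Next Line                     | U+0085          |
| VT      | Vertical Tab                  | U+000B          |
| FF      | Form Feed                     | U+000C          |
| LS      | Line Separator                | U+2028          |
| PS      | Paragraph Separator           | U+2029          |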

Note that for the purpose of new lines, the specific sequence CRLF is considered a single newline.
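One way to honor the CRLF rule is to try the two-character sequence before any single character. The following Python sketch shows this; the names `_NEWLINE` and `split_lines` are hypothetical.

```python
import re

# CRLF is listed first so it is consumed as one newline, not two.
_NEWLINE = re.compile("\r\n|[\r\n\u0085\u000B\u000C\u2028\u2029]")


def split_lines(text):
    """Split `text` on every KDL newline sequence."""
    return _NEWLINE.split(text)
```

With this ordering, `"a\r\nb"` splits into two lines, whereas `"a\r\n\nb"` splits into three with an empty line in the middle.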

Disallowed Literal Code Points

The following code points may not appear literally anywhere in the document. They may be represented in Strings (but not Raw Strings) using Unicode Escapes (\u{...}). The exception is code points that are not Unicode Scalar Values, which cannot be represented even as escapes.

  • The codepoints U+0000-0008 or the codepoints U+000E-001F (various control characters).
  • U+007F (the Delete control character).
  • Any codepoint that is not a Unicode Scalar Value (U+D800-DFFF).
  • U+200E-200F, U+202A-202E, and U+2066-2069, the Unicode "direction control" characters.
  • U+FEFF, aka Zero-width Non-breaking Space (ZWNBSP)/Byte Order Mark (BOM), except as the first code point in a document.
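The list above translates directly into a range check. This Python sketch is illustrative; `is_disallowed` and `first_disallowed` are hypothetical names, and the `at_start` flag models the exception for a leading BOM.

```python
def is_disallowed(cp, at_start=False):
    """Return True if code point `cp` may not appear literally."""
    if cp == 0xFEFF:
        return not at_start  # a BOM is only allowed as the first code point
    return (
        0x0000 <= cp <= 0x0008
        or 0x000E <= cp <= 0x001F
        or cp == 0x007F
        or 0xD800 <= cp <= 0xDFFF  # not Unicode Scalar Values
        or 0x200E <= cp <= 0x200F
        or 0x202A <= cp <= 0x202E
        or 0x2066 <= cp <= 0x2069
    )


def first_disallowed(text):
    """Return the index of the first disallowed code point, or None."""
    for index, ch in enumerate(text):
        if is_disallowed(ord(ch), at_start=(index == 0)):
            return index
    return None
```

Note that U+0009 through U+000D are absent from the disallowed ranges, since they are the tab and newline characters defined earlier.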

Full Grammar

This is the full official grammar for KDL and should be considered authoritative if something seems to disagree with the text above. The grammar language syntax is defined below.

document := bom? version? nodes

// Nodes
nodes := (line-space* node)* line-space*

base-node := slashdash? type? node-space* string
    (node-space+ slashdash? node-prop-or-arg)*
    // slashdashed node-children must always be after props and args.
    (node-space+ slashdash node-children)*
    (node-space+ node-children)?
    (node-space+ slashdash node-children)*
    node-space*
node := base-node node-terminator
final-node := base-node node-terminator?

// Entries
node-prop-or-arg := prop | value
node-children := '{' nodes final-node? '}'
node-terminator := single-line-comment | newline | ';' | eof

prop := string node-space* '=' node-space* value
value := type? node-space* (string | number | keyword)
type := '(' node-space* string node-space* ')'

// Strings
string := identifier-string | quoted-string | raw-string ¶

identifier-string := unambiguous-ident | signed-ident | dotted-ident
unambiguous-ident :=
    ((identifier-char - digit - sign - '.') identifier-char*)
    - disallowed-keyword-identifiers
signed-ident :=
    sign ((identifier-char - digit - '.') identifier-char*)?
dotted-ident :=
    sign? '.' ((identifier-char - digit) identifier-char*)?
identifier-char :=
    unicode - unicode-space - newline - [\\/(){};\[\]"#=]
    - disallowed-literal-code-points
disallowed-keyword-identifiers :=
    'true' | 'false' | 'null' | 'inf' | '-inf' | 'nan'

quoted-string :=
    '"' single-line-string-body '"' |
    '"""' newline
    (multi-line-string-body newline)?
    (unicode-space | ws-escape)* '"""'
single-line-string-body := (string-character - newline)*
multi-line-string-body := (('"' | '""')? string-character)*
string-character :=
    '\\' (["\\bfnrts] |
    'u{' hex-unicode '}') |
    ws-escape |
    [^\\"] - disallowed-literal-code-points
ws-escape := '\\' (unicode-space | newline)+
hex-digit := [0-9a-fA-F]
hex-unicode := hex-digit{1, 6} - surrogates
surrogates := [dD][8-9a-fA-F]hex-digit{2}
// U+D800-DFFF: D  8         00
//              D  F         FF

raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
raw-string-quotes :=
    '"' single-line-raw-string-body '"' |
    '"""' newline
    (multi-line-raw-string-body newline)?
    unicode-space* '"""'
single-line-raw-string-body :=
    '' |
    (single-line-raw-string-char - '"')
        single-line-raw-string-char*? |
    '"' (single-line-raw-string-char - '"')
        single-line-raw-string-char*?
single-line-raw-string-char :=
    unicode - newline - disallowed-literal-code-points
multi-line-raw-string-body :=
    (unicode - disallowed-literal-code-points)*?

// Numbers
number := keyword-number | hex | octal | binary | decimal

decimal := sign? integer ('.' integer)? exponent?
exponent := ('e' | 'E') sign? integer
integer := digit (digit | '_')*
digit := [0-9]
sign := '+' | '-'

hex := sign? '0x' hex-digit (hex-digit | '_')*
octal := sign? '0o' [0-7] [0-7_]*
binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*

// Keywords and booleans.
keyword := boolean | '#null'
keyword-number := '#inf' | '#-inf' | '#nan'
boolean := '#true' | '#false'

// Specific code points
bom := '\u{FEFF}'
disallowed-literal-code-points :=
    See Table (Disallowed Literal Code Points)
unicode := Any Unicode Scalar Value
unicode-space := See Table
    (All White_Space unicode characters which are not `newline`)

// Comments
single-line-comment := '//' ^newline* (newline | eof)
multi-line-comment := '/*' commented-block
commented-block :=
    '*/' | (multi-line-comment | '*' | '/' | [^*/]+) commented-block
slashdash := '/-' line-space*

// Whitespace
ws := unicode-space | multi-line-comment
escline := '\\' ws* (single-line-comment | newline | eof)
newline := See Table (All Newline White_Space)
// Whitespace where newlines are allowed.
line-space := node-space | newline | single-line-comment
// Whitespace within nodes,
// where newline-ish things must be esclined.
node-space := ws* escline ws* | ws+

// Version marker
version :=
    '/-' unicode-space* 'kdl-version' unicode-space+ ('1' | '2')
    unicode-space* newline

Grammar language

The grammar language syntax is a combination of ABNF with some regex spice thrown in. Specifically:

  • Single quotes (') are used to denote literal text. \ within a literal string is used for escaping other single-quotes, for initiating unicode characters using hex values (\u{FEFF}), and for escaping \ itself (\\).
  • * is used for "zero or more", + is used for "one or more", and ? is used for "zero or one". Per standard regex semantics, * and + are greedy; they match as many instances as possible without failing the match.
  • *? (used only in raw strings) indicates a non-greedy match; it matches as few instances as possible without failing the match.
  • ¶ is a cut point. It always matches and consumes no characters, but once matched, the parser is not allowed to backtrack past that point in the source. If a parser would rewind past the cut point, it must instead fail the overall parse, as if it had run out of options. (This is only used with the raw-string production, to ensure the first instance of the appropriate closing quote sequence is guaranteed to be the end of the raw string, rather than allowing it to potentially consume more of the document unexpectedly.)
  • () can be used to group matches that must be matched together.
  • a | b means a or b, whichever matches first. If multiple items are before a |, they are a single group. a b c | d is equivalent to (a b c) | d.
  • [] are used for regex-style character matches, where any character between the brackets will be a single match. \ is used to escape \, [, and ]. They also support character ranges (0-9) and negation (^).
  • - is used for "except for" or "minus" whatever follows it. For example, a - 'x' means "any a, except something that matches the literal 'x'".
  • The prefix ^ means "something that does not match" whatever follows it. For example, ^foo means "must not match foo".
  • A single definition may be split over multiple lines. Newlines are treated as spaces.
  • // followed by text on its own line is used as comment syntax.