


KDL Community                                                 K. Marchán
                                                               Microsoft
                                                        KDL Contributors
                                                           17 April 2025


                       The KDL Document Language
                       draft-marchan-kdl2-latest

Abstract

   KDL is a node-oriented document language.  Its niche and purpose
   overlap with XML, as do many of its semantics.  You can use KDL both
   as a configuration language and as a data exchange or storage format,
   if you so choose.

   This is the formal specification for KDL, including the intended data
   model and the grammar.

   This document describes an unreleased minor change to KDL.  For the
   latest official version of the language, see https://kdl.dev/spec.

About This Document

   This note is to be removed before publishing as an RFC.

   Status information for this document may be found at
   https://datatracker.ietf.org/doc/draft-marchan-kdl2/.

   Further information can be found at https://kdl.dev/.

   Source for this draft and an issue tracker can be found at
   https://github.com/kdl-org/kdl.

License

   This work is licensed under Creative Commons Attribution-ShareAlike
   4.0 International.  To view a copy of this license, visit
   https://creativecommons.org/licenses/by-sa/4.0/

Table of Contents

   1.  Compatibility
   2.  Introduction
   3.  Components
     3.1.  Document
       3.1.1.  Example
     3.2.  Node
       3.2.1.  Example
     3.3.  Line Continuation
       3.3.1.  Example
     3.4.  Property
     3.5.  Argument
       3.5.1.  Example
     3.6.  Children Block
       3.6.1.  Example
     3.7.  Value
     3.8.  Type Annotation
       3.8.1.  Suffix Type Annotation
       3.8.2.  Reserved Type Annotations for Numbers Without Decimals
       3.8.3.  Reserved Type Annotations for Numbers With Decimals:
       3.8.4.  Reserved Type Annotations for Strings:
       3.8.5.  Examples
     3.9.  String
     3.10. Identifier String
       3.10.1.  Non-initial characters
       3.10.2.  Non-identifier characters
     3.11. Quoted String
       3.11.1.  Escapes
     3.12. Multi-line String
       3.12.1.  Newline Normalization
       3.12.2.  Examples
       3.12.3.  Interaction with Whitespace Escapes
     3.13. Raw String
       3.13.1.  Example
     3.14. Number
       3.14.1.  Keyword Numbers
     3.15. Boolean
       3.15.1.  Example
     3.16. Null
       3.16.1.  Example
     3.17. Whitespace
       3.17.1.  Single-line comments
       3.17.2.  Multi-line comments
       3.17.3.  Slashdash comments
     3.18. Newline
     3.19. Disallowed Literal Code Points
   4.  Full Grammar
     4.1.  Grammar language
   Authors' Addresses

1.  Compatibility

   KDL 2.0 is designed such that for any given KDL document written as
   KDL 1.0 (./SPEC_v1.md) or KDL 2.0, the parse will either fail
   completely, or, if the parse succeeds, the data represented by a v1
   or v2 parser will be identical.  This means that it's safe to use a
   fallback parsing strategy in order to support both v1 and v2
   simultaneously.  For example, node "foo" is a valid node in both
   versions, and should be represented identically by parsers.

   A version marker /- kdl-version 2 (or 1) _MAY_ be added to the
   beginning of a KDL document, optionally preceded by the BOM, and
   parsers _MAY_ use that as a hint as to which version to parse the
   document as.
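
   As a rough illustration of the hint described above, the following
   Python sketch looks for a version marker at the very start of a
   document.  The name detect_version_hint is hypothetical, and the
   sketch assumes that only spaces and tabs separate the parts of the
   marker; the grammar in Section 4 remains the authority on the exact
   syntax.

```python
import re

# Assumption: the marker parts are separated by spaces or tabs only, and
# the marker sits alone on the first line, optionally after a BOM.
_VERSION_MARKER = re.compile(
    "\ufeff?/-[ \t]*kdl-version[ \t]+([12])[ \t]*(?:\r\n|\r|\n|$)"
)


def detect_version_hint(text):
    """Return 1 or 2 if the document starts with a version marker, else None."""
    match = _VERSION_MARKER.match(text)
    return int(match.group(1)) if match else None
```

   A parser could try the hinted version first and fall back to the
   other version if that parse fails.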

2.  Introduction

   KDL is a node-oriented document language.  Its niche and purpose
   overlap with XML, as do many of its semantics.  You can use KDL both
   as a configuration language and as a data exchange or storage format,
   if you so choose.

   The bulk of this document is dedicated to a long-form description of
   all Components (Section 3) of a KDL document.  There is also a much
   more terse Grammar (Section 4) at the end of the document that covers
   most of the rules, with some semantic exceptions involving the data
   model.

   KDL is designed to be easy to read _and_ easy to implement.

   In this document, references to "left" or "right" refer to directions
   in the _data stream_ towards the beginning or end, respectively; in
   other words, the directions if the data stream were only ASCII text.
   They do not refer to the writing direction of text, which can flow in
   either direction, depending on the characters used.

3.  Components

3.1.  Document

   The toplevel concept of KDL is a Document.  A Document is composed of
   zero or more Nodes (Section 3.2), separated by newlines, semicolons,
   and whitespace, and eventually terminated by an EOF.

   All KDL documents MUST be encoded in UTF-8 and conform to the
   specifications in this document.

3.1.1.  Example

   The following is a document composed of two toplevel nodes:

   foo {
       bar
   }
   baz

3.2.  Node

   Being a node-oriented language means that the real core component of
   any KDL document is the "node".  Every node must have a name, which
   must be a String (Section 3.9).

   The name may be preceded by a Type Annotation (Section 3.8) to
   further clarify its type, particularly in relation to its parent
   node.  (For example, clarifying that a particular date child node is
   for the _publication_ date, rather than the last-modified date, with
   (published)date.)

   Following the name are zero or more Arguments (Section 3.5) or
   Properties (Section 3.4), separated by either whitespace
   (Section 3.17) or a slash-escaped line continuation (Section 3.3).
   Arguments and Properties may be interspersed in any order, much as
   positional arguments and options are in command-line tools.
   Collectively, Arguments and Properties may be referred to as
   "Entries".

   Children (Section 3.6) can be placed after the name and the optional
   Entries, possibly separated by either whitespace or a slash-escaped
   line continuation.

   Arguments are ordered relative to each other and that order must be
   preserved in order to maintain the semantics.  Properties between
   Arguments do not affect Argument ordering.

   By contrast, Properties _SHOULD NOT_ be assumed to be presented in a
   given order.  Children (Section 3.6) should be used if an order-
   sensitive key/value data structure must be represented in KDL.  Cf.
   JSON objects preserving key order.

   Nodes _MAY_ be prefixed with Slashdash (Section 3.17.3) to "comment
   out" the entire node, including its properties, arguments, and
   children, and make it act as plain whitespace, even if it spreads
   across multiple lines.

   Finally, a node is terminated by either a Newline (Section 3.18), a
   semicolon (;), the end of its parent's child block (}) or the end of
   the file/stream (an EOF).

3.2.1.  Example

   // `foo` will have an Argument value list like `[1, 3]`.
   foo 1 key=val 3 {
       bar
       (role)baz 1 2
   }

3.3.  Line Continuation

   Line continuations allow Nodes (Section 3.2) to be spread across
   multiple lines.

   A line continuation is a \ character followed by zero or more
   whitespace items (including multiline comments) and an optional
   single-line comment.  It must be terminated by a Newline
   (Section 3.18) (including the Newline that is part of single-line
   comments).

   Following a line continuation, processing of a Node can continue as
   usual.

3.3.1.  Example

   my-node 1 2 \  // comments are ok after \
           3 4    // This is the actual end of the Node.

3.4.  Property

   A Property is a key/value pair attached to a Node (Section 3.2).  A
   Property is composed of a String (Section 3.9), followed immediately
   by an equals sign (=, U+003D), and then a Value (Section 3.7).

   Properties should be interpreted left-to-right.  When several
   properties on a node have identical names, the rightmost one
   overrides the earlier ones.  That is:

   node a=1 a=2

   In this example, the node's a value must be 2, not 1.

   No other guarantees about order should be expected by implementers.
   Deserialized representations may iterate over properties in any order
   and still be spec-compliant.
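
   The override rule, together with the ordering rule for Arguments, can
   be sketched in Python as follows.  The function collect_entries and
   its input shape (a bare value for an Argument, a (key, value) tuple
   for a Property) are hypothetical and only serve the illustration.

```python
def collect_entries(entries):
    """Split a node's entries into ordered arguments and a property map.

    `entries` is a hypothetical pre-parsed list: a bare value stands for
    an Argument, and a (key, value) tuple stands for a Property.
    """
    arguments = []
    properties = {}
    for entry in entries:
        if isinstance(entry, tuple):
            key, value = entry
            properties[key] = value  # the rightmost duplicate wins
        else:
            arguments.append(entry)  # relative order is preserved
    return arguments, properties
```

   For the node foo 1 key=val 3, this yields the arguments [1, 3] and a
   single property; for node a=1 a=2, the property a ends up as 2.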

   Properties _MAY_ be prefixed with /- to "comment out" the entire
   token and make it act as plain whitespace, even if it spreads across
   multiple lines.

3.5.  Argument

   An Argument is a bare Value (Section 3.7) attached to a Node
   (Section 3.2), with no associated key.  It shares the same space as
   Properties (Section 3.4), and may be interleaved with them.

   A Node may have any number of Arguments, which should be evaluated
   left to right.  KDL implementations _MUST_ preserve the order of
   Arguments relative to each other (not counting Properties).

   Arguments _MAY_ be prefixed with /- to "comment out" the entire token
   and make it act as plain whitespace, even if it spreads across
   multiple lines.

3.5.1.  Example

   my-node 1 2 3 a b c

3.6.  Children Block

   A children block is a block of Nodes (Section 3.2), surrounded by {
   and }.  Children blocks are an optional part of nodes, and create a
   hierarchy of KDL nodes.

   Regular node termination rules apply, which means multiple nodes can
   be included in a single-line children block, as long as every node
   but the last is terminated by ;.  The last node may instead be
   terminated by the closing }.

3.6.1.  Example

   parent {
       child1
       child2
   }

   parent { child1; child2 }

3.7.  Value

   A value is either: a String (Section 3.9), a Number (Section 3.14), a
   Boolean (Section 3.15), or Null (Section 3.16).

   Values _MUST_ be either Arguments (Section 3.5) or values of
   Properties (Section 3.4).  Only String (Section 3.9) values may be
   used as Node (Section 3.2) names or Property (Section 3.4) keys.

   Values (both as arguments and in properties) _MAY_ be prefixed by a
   single Type Annotation (Section 3.8).

3.8.  Type Annotation

   A type annotation is a prefix to any Node Name (Section 3.2) or Value
   (Section 3.7) that includes a _suggestion_ of what type the value is
   _intended_ to be treated as, or as a _context-specific elaboration_
   of the more generic type the node name indicates.

   Type annotations are written as a pair of parentheses, ( and ),
   enclosing a single String (Section 3.9).  An annotation may contain
   Whitespace after the ( and before the ), and may be separated from
   its target by Whitespace.

   KDL does not specify any restrictions on what implementations might
   do with these annotations.  They are free to ignore them, or use them
   to make decisions about how to interpret a value.

3.8.1.  Suffix Type Annotation

   When a Value (Section 3.7) is a Number (Section 3.14), it's possible
   to attach the type annotation as a "suffix", instead of prepending it
   between ( and ).  This makes it possible to, for example, write 10px,
   10.5%, 512GiB, etc., which are equivalent to (px)10, (%)10.5, and
   (GiB)512, respectively.

   An implementation that finds BOTH a parenthesized and a suffix Type
   Annotation (Section 3.8) on the same Number (Section 3.14) MUST yield
   a syntax error.

   Suffixes MUST be plain Identifier Strings (Section 3.10).  No other
   String (Section 3.9) is acceptable.

   There are two kinds of Suffix Type Annotation (Section 3.8.1)
   available: Bare Suffix Type Annotations (Section 3.8.1.1) and
   Explicit Suffix Type Annotations (Section 3.8.1.2).

3.8.1.1.  Bare Suffix Type Annotation

   When a Value (Section 3.7) is a decimal Number (Section 3.14) WITHOUT
   exponential syntax (1e+5 etc.) (and ONLY a decimal), it's possible to
   attach the type annotation as a suffix directly to the number,
   without any additional syntax.

   They also come with some additional rules (like only being available
   for decimals), in order to prevent potential ambiguity or footguns
   with the syntax.  This is generally acceptable, as type annotations
   in particular tend to be application-defined and limited in scope,
   rather than arbitrary user data.  In designing this feature, it was
   determined that the value for various real-world DSLs outweighed the
   complexity of the following rules.

   As such, to remove ambiguity, the suffix Identifier String
   (Section 3.10) MUST NOT start with any of the following patterns, all
   of which MUST yield syntax errors (if they can be distinguished from
   other syntaxes at all):

   *  ., ,, or _

   *  [a-zA-Z][0-9_] (to disambiguate all non-decimals, with breathing
      room)

   *  [eE][+-]?[0-9] (to disambiguate exponentials)

   *  [xX][a-fA-F] (to disambiguate hexadecimals)

   All other Identifier Strings (Section 3.10) can be safely appended to
   decimal numbers, so long as the decimal does not include an
   exponential component.

   If the desired suffix would violate any of the above rules, either
   regular parenthetical Type Annotations (Section 3.8) or Explicit
   Suffix Type Annotations (Section 3.8.1.2) may be used.
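
   The forbidden prefixes listed above translate fairly directly into a
   single regular expression.  The Python sketch below checks only those
   prefix rules; the name bare_suffix_allowed is hypothetical, and the
   function assumes its input has already been validated as an
   Identifier String.

```python
import re

# One alternative per forbidden prefix listed above.
_FORBIDDEN_PREFIX = re.compile(
    r"[.,_]"              # ., ,, or _
    r"|[a-zA-Z][0-9_]"    # a letter followed by a digit or underscore
    r"|[eE][+-]?[0-9]"    # looks like an exponent
    r"|[xX][a-fA-F]"      # looks like part of a hexadecimal number
)


def bare_suffix_allowed(suffix):
    """Check only the prefix rules for a bare suffix.

    Assumes `suffix` is already a valid Identifier String.
    """
    return suffix != "" and _FORBIDDEN_PREFIX.match(suffix) is None
```

   Under these rules px, % and GiB are accepted as bare suffixes, while
   u8 is rejected and has to be written as (u8)10 or 10#u8 instead.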

3.8.1.2.  Explicit Suffix Type Annotation

   Any Number (Section 3.14) may have a # attached to it, followed by
   any valid Identifier String (Section 3.10).  This is an explicit
   Suffix Type Annotation (Section 3.8.1) syntax without any of the
   relatively complex requirements of Bare Suffix Type Annotations
   (Section 3.8.1.1), which can be a useful escape hatch.  For example,
   10.0#u8 is valid, whereas the same suffix written without the #
   (10.0u8) is a syntax error.

   Note again that, unlike Bare Suffix Type Annotations
   (Section 3.8.1.1), Explicit Suffixes may be used with ALL Number
   (Section 3.14) formats (hexadecimal, decimal, octal, and binary).
   For example, 0x1234#u16 is valid.
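
   Because the # separates the number from its suffix, splitting the two
   is simple.  The following Python sketch shows one way to do it; the
   name split_explicit_suffix is hypothetical, the input is assumed to
   be a non-keyword number token, and neither half is validated.

```python
def split_explicit_suffix(token):
    """Split 'NUMBER#suffix' into (number, suffix).

    The suffix is None when no # is present.  Assumes `token` is a
    non-keyword number token, so a # can only introduce an explicit
    suffix.  Neither part is validated here.
    """
    number, marker, suffix = token.partition("#")
    if marker and (not number or not suffix):
        raise ValueError(f"malformed explicit suffix in {token!r}")
    return number, (suffix if marker else None)
```

   The result for 0x1234#u16 is the pair ("0x1234", "u16"), which
   carries the same information as the parenthesized form (u16)0x1234.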

3.8.2.  Reserved Type Annotations for Numbers Without Decimals

   Additionally, the following type annotations MAY be recognized by KDL
   parsers.  Parsers that recognize them SHOULD interpret them as
   follows.

   Signed integers of various sizes (the number is the bit size):

   *  i8

   *  i16

   *  i32

   *  i64

   *  i128

   Unsigned integers of various sizes (the number is the bit size):

   *  u8

   *  u16

   *  u32

   *  u64

   *  u128

   Platform-dependent integer types, both signed and unsigned:

   *  isize

   *  usize

3.8.3.  Reserved Type Annotations for Numbers With Decimals:

   IEEE 754 floating point numbers, both single (32) and double (64)
   precision:

   *  f32

   *  f64

   IEEE 754-2008 decimal floating point numbers:

   *  decimal64

   *  decimal128

3.8.4.  Reserved Type Annotations for Strings:

   *  date-time: ISO8601 date/time format.

   *  time: "Time" section of ISO8601.

   *  date: "Date" section of ISO8601.

   *  duration: ISO8601 duration format.

   *  decimal: IEEE 754-2008 decimal string format.

   *  currency: ISO 4217 currency code.

   *  country-2: ISO 3166-1 alpha-2 country code.

   *  country-3: ISO 3166-1 alpha-3 country code.

   *  country-subdivision: ISO 3166-2 country subdivision code.

   *  email: RFC5322 email address.

   *  idn-email: RFC6531 internationalized email address.

   *  hostname: RFC1123 internet hostname (only ASCII segments)

   *  idn-hostname: RFC5890 internationalized internet hostname (only
      xn---prefixed ASCII "punycode" segments, or non-ASCII segments)

   *  ipv4: RFC2673 dotted-quad IPv4 address.

   *  ipv6: RFC2373 IPv6 address.

   *  url: RFC3986 URI.

   *  url-reference: RFC3986 URI Reference.

   *  irl: RFC3987 Internationalized Resource Identifier.

   *  irl-reference: RFC3987 Internationalized Resource Identifier
      Reference.

   *  url-template: RFC6570 URI Template.

   *  uuid: RFC4122 UUID.

   *  regex: Regular expression.  Specific patterns may be
      implementation-dependent.

   *  base64: A Base64-encoded string, denoting arbitrary binary data.

   *  base85: An Ascii85 (https://en.wikipedia.org/wiki/Ascii85)-encoded
      string, denoting arbitrary binary data.

3.8.5.  Examples

   node (u8)123
   node 123#i64
   node prop=(regex).*
   (published)date "1970-01-01"
   (contributor)person name="Foo McBar"

3.9.  String

   Strings in KDL represent textual UTF-8 Values (Section 3.7).  A
   String is either an Identifier String (Section 3.10) (like foo), a
   Quoted String (Section 3.11) (like "foo") or a Multi-Line String
   (Section 3.12).  Both Quoted and Multi-Line Strings come in normal
   and Raw String (Section 3.13) variants (like #"foo"#):

   *  Identifier Strings let you write short, "single-word" strings with
      a minimum of syntax

   *  Quoted Strings let you write strings "like normal", with
      whitespace and escapes.

   *  Multi-Line Strings let you write strings across multiple lines and
      with indentation that's not part of the string value.

   *  Raw Strings don't allow any escapes, allowing you to not worry
      about the string's content containing anything that might look
      like an escape.

   Strings _MUST_ be represented as UTF-8 values.

   Strings _MUST NOT_ directly include any disallowed literal code
   points (Section 3.19).  Quoted and Multi-Line Strings may include
   these code points as _values_ by representing them with their
   corresponding \u{...} escape.

3.10.  Identifier String

   An Identifier String (sometimes referred to as just an "identifier")
   is composed of any Unicode Scalar Value (https://unicode.org/
   glossary/#unicode_scalar_value) other than non-initial characters
   (Section 3.10.1), followed by any number of Unicode Scalar Values
   other than non-identifier characters (Section 3.10.2).

   A handful of patterns are disallowed, to avoid confusion with other
   values:

   *  idents that appear to start with a Number (Section 3.14) (like
      1.0v2 or -1em) or the "almost a number" pattern of a decimal point
      without a leading digit (like .1).

   *  idents that are the language keywords (inf, -inf, nan, true,
      false, and null) without their leading #.

   Identifiers that match these patterns _MUST_ be treated as a syntax
   error; such values can only be written as quoted or raw strings.  The
   precise details of the identifier syntax are specified in the Full
   Grammar in Section 4.

3.10.1.  Non-initial characters

   The following characters cannot be the first character in an
   Identifier String (Section 3.10):

   *  Any decimal digit (0-9)

   *  Any non-identifier characters (Section 3.10.2)

   Additionally, the following initial characters impose limitations on
   subsequent characters:

   *  the + and - characters can only be used as an initial character if
      the second character is _not_ a digit.  If the second character is
      ., then the third character must _not_ be a digit.

   *  the . character can only be used as an initial character if the
      second character is _not_ a digit.

   This allows identifiers to look like --this or .md, and removes the
   ambiguity of having an identifier look like a number.

3.10.2.  Non-identifier characters

   The following characters cannot be used anywhere in an Identifier
   String (Section 3.10):

   *  Any of (){}[]/\"#;=

   *  Any Whitespace (Section 3.17) or Newline (Section 3.18).

   *  Any disallowed literal code points (Section 3.19) in KDL
      documents.
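
   Taken together, the rules of Section 3.10 can be approximated by the
   Python sketch below.  The name is_identifier_string is hypothetical.
   As a simplifying assumption, Python's str.isspace() stands in for the
   Whitespace and Newline definitions, and disallowed literal code
   points are not checked at all, so this is not a complete validator.

```python
NON_IDENTIFIER_PUNCTUATION = set('(){}[]/\\"#;=')
DIGITS = set("0123456789")
RESERVED_KEYWORDS = {"inf", "-inf", "nan", "true", "false", "null"}


def is_identifier_string(text):
    """Approximate check of the Identifier String rules."""
    if not text or text in RESERVED_KEYWORDS:
        return False
    # Assumption: str.isspace() approximates Whitespace and Newline;
    # disallowed literal code points are not checked.
    if any(ch in NON_IDENTIFIER_PUNCTUATION or ch.isspace() for ch in text):
        return False
    first, second, third = text[0], text[1:2], text[2:3]
    if first in DIGITS:
        return False
    if first in "+-":
        # A sign must not be followed by a digit, nor by "." and a digit.
        if second in DIGITS:
            return False
        if second == "." and third in DIGITS:
            return False
    if first == "." and second in DIGITS:
        return False
    return True
```

   With this sketch, --this and .md are accepted, while -1em, .1, 1.0v2
   and the bare keyword true are rejected.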

3.11.  Quoted String

   A Quoted String is delimited by " on either side of any number of
   literal string characters except unescaped " and \.

   Literal Newline (Section 3.18) characters can only be included if
   they are Escaped Whitespace (Section 3.11.1.1), which discards them
   from the string value.  Actually including a newline in the value
   requires using a newline escape sequence, like \n, or using a Multi-
   Line String (Section 3.12) which is actually designed for strings
   stretching across multiple lines.

   Like Identifier Strings, Quoted Strings _MUST NOT_ include any of the
   disallowed literal code-points (Section 3.19) as code points in their
   body.

   Quoted Strings have a Raw String (Section 3.13) variant, which
   disallows escapes.

3.11.1.  Escapes

   In addition to literal code points, a number of "escapes" are
   supported in Quoted Strings.  "Escapes" are the character \ followed
   by another character, and are interpreted as described in the
   following table:

   +==============+=========+=========================================+
   | Name         | Escape  | Code Pt                                 |
   +==============+=========+=========================================+
   | Line Feed    | \n      | U+000A                                  |
   +--------------+---------+-----------------------------------------+
   | Carriage     | \r      | U+000D                                  |
   | Return       |         |                                         |
   +--------------+---------+-----------------------------------------+
   | Character    | \t      | U+0009                                  |
   | Tabulation   |         |                                         |
   | (Tab)        |         |                                         |
   +--------------+---------+-----------------------------------------+
   | Reverse      | \\      | U+005C                                  |
   | Solidus      |         |                                         |
   | (Backslash)  |         |                                         |
   +--------------+---------+-----------------------------------------+
   | Quotation    | \"      | U+0022                                  |
   | Mark (Double |         |                                         |
   | Quote)       |         |                                         |
   +--------------+---------+-----------------------------------------+
   | Backspace    | \b      | U+0008                                  |
   +--------------+---------+-----------------------------------------+
   | Form Feed    | \f      | U+000C                                  |
   +--------------+---------+-----------------------------------------+
   | Space        | \s      | U+0020                                  |
   +--------------+---------+-----------------------------------------+
   | Unicode      | \u{(1-6 | Code point described by hex characters, |
   | Escape       | hex     | as long as it represents a Unicode      |
   |              | chars)} | Scalar Value (https://unicode.org/      |
   |              |         | glossary/#unicode_scalar_value)         |
   +--------------+---------+-----------------------------------------+
   | Whitespace   | See     | N/A                                     |
   | Escape       | below   |                                         |
   +--------------+---------+-----------------------------------------+

                                 Table 1

3.11.1.1.  Escaped Whitespace

   In addition to escaping individual characters, \ can also escape
   whitespace.  When a \ is followed by one or more literal whitespace
   characters, the \ and all of that whitespace are discarded.  For
   example,

   "Hello World"

   and

   "Hello \    World"

   are semantically identical.  See whitespace (Section 3.17) and
   newlines (Section 3.18) for how whitespace is defined.

   Note that only literal whitespace is escaped; whitespace escapes (\n
   and such) are retained.  For example, these strings are all
   semantically identical:

   "Hello\       \nWorld"

   "Hello\n\
       World"

   "Hello\nWorld"

   """
     Hello
     World
     """

3.11.1.2.  Invalid escapes

   Except as described in the escapes table, above, \ _MUST NOT_ precede
   any other characters in a string.
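
   The escape rules of Section 3.11.1 can be sketched in Python as shown
   below.  The name unescape is hypothetical, and the function takes the
   text between the quotes.  As an assumption made for brevity, only
   space, tab, CR and LF count as literal whitespace here; the complete
   sets are defined in Section 3.17 and Section 3.18.

```python
import re

SIMPLE_ESCAPES = {
    "n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"',
    "b": "\b", "f": "\f", "s": " ",
}
# Assumption: a reduced whitespace set, enough for the examples above.
LITERAL_WHITESPACE = " \t\r\n"
UNICODE_ESCAPE = re.compile(r"\{([0-9a-fA-F]{1,6})\}")


def unescape(body):
    """Resolve backslash escapes in the body of a Quoted String."""
    out = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            out.append(body[i])
            i += 1
            continue
        nxt = body[i + 1:i + 2]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "u":
            match = UNICODE_ESCAPE.match(body, i + 2)
            if match is None:
                raise ValueError("malformed unicode escape")
            code_point = int(match.group(1), 16)
            # Only Unicode Scalar Values are allowed (no surrogates).
            if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("escape is not a Unicode Scalar Value")
            out.append(chr(code_point))
            i = match.end()
        elif nxt and nxt in LITERAL_WHITESPACE:
            # Whitespace escape: drop the backslash and the whitespace run.
            i += 1
            while i < len(body) and body[i] in LITERAL_WHITESPACE:
                i += 1
        else:
            raise ValueError("invalid escape")
    return "".join(out)
```

   Any backslash that is not followed by one of the sequences in Table 1
   or by literal whitespace raises an error, which mirrors the rule on
   invalid escapes.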

3.12.  Multi-line String

   Multi-Line Strings support multiple lines with literal, non-escaped
   Newlines.  They must use a special multi-line syntax, and they
   automatically "dedent" the string, allowing its value to be indented
   to a visually matching level as desired.

   A Multi-Line String is opened and closed by _three_ double-quote
   characters, like """.  Its first line _MUST_ immediately start with a
   Newline (Section 3.18) after its opening """.  Its final line _MUST_
   contain only whitespace before the closing """.  All in-between lines
   that contain non-newline, non-whitespace characters _MUST_ start with
   _at least_ the exact same whitespace as the final line (precisely
   matching codepoints, not merely counting characters or "size"); they
   may contain additional whitespace following this prefix.  The lines
   in between may contain unescaped " (but no unescaped """ as this
   would close the string).

   The value of the Multi-Line String omits the first and last Newline,
   the Whitespace of the last line, and the matching Whitespace prefix
   on all intermediate lines.  The first and last Newline can be the
   same character (that is, empty multi-line strings are legal).

   In other words, the final line specifies the whitespace prefix that
   will be removed from all other lines.

   Multi-line Strings that do not immediately start with a Newline, or
   whose final """ is not preceded by optional whitespace and a Newline,
   are illegal.  This also means that """ may not be used for a single-
   line String (e.g. """foo""").

3.12.1.  Newline Normalization

   Literal Newline sequences in Multi-line Strings must be normalized to
   a single U+000A (LF) during deserialization.  This means, for
   example, that CR LF becomes a single LF during parsing.

   This normalization does not apply to non-literal Newlines entered
   using escape sequences.  That is:

   multi-line """
       \r\n[CRLF]
       foo[CRLF]
       """

   becomes:

   single-line "\r\n\nfoo"

   For clarity: this normalization applies to each individual Newline
   sequence.  That is, the literal sequence CRLF CRLF becomes LF LF, not
   LF.
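
   The dedent rules of Section 3.12 and the normalization described
   above can be sketched in Python as follows.  The name
   dedent_multiline is hypothetical, and the function takes the text
   between the opening and closing """.  It makes three simplifying
   assumptions: only CRLF, CR and LF are treated as Newlines, only space
   and tab are treated as Whitespace, and backslash escapes are left
   untouched.

```python
import re

# Assumption: a reduced Newline set; CRLF is matched before lone CR.
NEWLINE = re.compile(r"\r\n|\r|\n")
# Assumption: a reduced Whitespace set.
WHITESPACE = " \t"


def dedent_multiline(body):
    """Dedent the raw text found between the triple quotes."""
    lines = NEWLINE.split(body)
    if len(lines) < 2 or lines[0] != "":
        raise ValueError("a Newline must follow the opening quotes")
    prefix = lines[-1]
    if prefix.strip(WHITESPACE) != "":
        raise ValueError("the final line may contain only whitespace")
    result = []
    for line in lines[1:-1]:
        if line.strip(WHITESPACE) == "":
            result.append("")  # whitespace-only lines become empty
        elif line.startswith(prefix):
            result.append(line[len(prefix):])
        else:
            raise ValueError("line does not start with the final-line prefix")
    # Joining with LF performs the Newline normalization.
    return "\n".join(result)
```

   Each literal Newline sequence is one split point, so CRLF CRLF
   produces an empty line between two LFs rather than collapsing into
   one.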

3.12.2.  Examples

3.12.2.1.  Indented multi-line string

   multi-line """
           foo
       This is the base indentation
               bar
       """

   This example's string value will be:

       foo
   This is the base indentation
           bar

   which is equivalent to

   "    foo\nThis is the base indentation\n        bar"

   when written as a single-line string.

3.12.2.2.  Shorter last-line indent

   If the last line wasn't indented as far, it won't dedent the rest of
   the lines as much:

   multi-line """
           foo
       This is no longer on the left edge
               bar
     """

   This example's string value will be:

         foo
     This is no longer on the left edge
             bar

   Equivalent to

   "      foo\n  This is no longer on the left edge\n          bar"

3.12.2.3.  Empty lines

   Empty lines can contain any whitespace, or none at all, and will be
   reflected as empty in the value:

   multi-line """
       Indented a bit.

       A second indented paragraph.
       """

   This example's string value will be:

   Indented a bit.

   A second indented paragraph.

   Equivalent to

   "Indented a bit.\n\nA second indented paragraph."

3.12.2.4.  Syntax errors

   The following yield *syntax errors*:

   multi-line """can't be single line"""

   multi-line """
     closing quote with non-whitespace prefix"""

   multi-line """stuff
     """

   // Every line must share the exact same prefix as the closing line.
   multi-line """[\n]
   [tab]a[\n]
   [space][space]b[\n]
   [space][tab][\n]
   [tab]"""

3.12.3.  Interaction with Whitespace Escapes

   Multi-line strings support the same mechanism for escaping whitespace
   as Quoted Strings.

   When processing a Multi-line String, implementations MUST dedent the
   string _after_ resolving all whitespace escapes, but _before_
   resolving other backslash escapes.  This means a whitespace escape
   that attempts to escape the final line's newline and/or whitespace
   prefix can be invalid: if removing escaped whitespace places the
   closing """ on a line with non-whitespace characters, this escape is
   invalid.

   For example, the following is illegal:

     """
     foo
     bar\
     """

     // equivalent to
     """
     foo
     bar"""

   while the following is allowed:

     """
     foo \
   bar
     baz
     \   """

     // equivalent to
     """
     foo bar
     baz
     """

3.13.  Raw String

   Both Quoted (Section 3.11) and Multi-Line Strings (Section 3.12) have
   Raw String variants, which are identical in syntax except they do not
   support \-escapes.  This includes line-continuation escapes (\ + ws
   collapsing to nothing).  They otherwise share the same properties as
   far as literal Newline (Section 3.18) characters go, multi-line
   rules, and the requirement of UTF-8 representation.

   The Raw String variants are indicated by preceding the string's
   opening quotes with one or more # characters.  The string is then
   closed by its normal closing quotes, followed by a _matching_ number
   of # characters.  This means that the string may contain any
   combination of " and # characters other than its closing delimiter
   (e.g., if a raw string starts with ##", it can contain " or "#, but
   not "## or "###).

   Like other Strings, Raw Strings _MUST NOT_ include any of the
   disallowed literal code-points (Section 3.19) as code points in their
   body.  Unlike with Quoted Strings, these cannot simply be escaped,
   and are thus unrepresentable when using Raw Strings.
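
   The matching-delimiter rule can be sketched in Python as follows.
   The name read_raw_string is hypothetical.  The sketch only handles
   the single-line form, and it does not check the body for literal
   Newlines or disallowed code points.

```python
def read_raw_string(source):
    """Read one single-line raw string from the start of `source`.

    Returns (value, rest).  Multi-line raw strings are out of scope for
    this sketch.
    """
    hashes = len(source) - len(source.lstrip("#"))
    if hashes == 0 or not source.startswith('"', hashes):
        raise ValueError("expected one or more # followed by a quote")
    if source.startswith('"""', hashes):
        raise NotImplementedError("multi-line raw strings are not handled")
    # The closing delimiter is a quote plus the same number of #.
    closing = '"' + "#" * hashes
    start = hashes + 1
    end = source.find(closing, start)
    if end == -1:
        raise ValueError("unterminated raw string")
    return source[start:end], source[end + len(closing):]
```

   Searching for the first occurrence of the full closing delimiter is
   what lets a string opened with ##" contain " and "# unharmed.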

3.13.1.  Example

   just-escapes #"\n will be literal"#

   The string contains the literal characters \n will be literal.

   quotes-and-escapes ##"hello\n\r\asd"#world"##

   The string contains the literal characters hello\n\r\asd"#world

   raw-multi-line #"""
       Here's a """
           multiline string
           """
       without escapes.
       """#

   The string contains the value

   Here's a """
       multiline string
       """
   without escapes.

   or equivalently,

   "Here's a \"\"\"\n    multiline string\n    \"\"\"\nwithout escapes."

   as a Quoted String.

3.14.  Number

   Numbers in KDL represent numerical Values (Section 3.7).  There is no
   logical distinction in KDL between real numbers, integers, and
   floating point numbers.  It's up to individual implementations to
   determine how to represent KDL numbers.

   There are five syntaxes for Numbers: Keywords, Decimal, Hexadecimal,
   Octal, and Binary.

   *  All non-Keyword (Section 3.14.1) numbers may optionally start with
      one of - or +, which determines whether the number is negative or
      positive.

   *  Binary numbers start with 0b and only allow 0 and 1 as digits,
      which may be separated by _. They represent numbers in radix 2.

   *  Octal numbers start with 0o and only allow digits between 0 and 7,
      which may be separated by _. They represent numbers in radix 8.

   *  Hexadecimal numbers start with 0x and allow digits between 0 and
      9, as well as letters A through F, in either lower or upper case,
      which may be separated by _. They represent numbers in radix 16.

   *  Decimal numbers are a bit more special:

      -  They have no radix prefix.

      -  They use digits 0 through 9, which may be separated by _.

      -  They may optionally include a decimal separator ., followed by
         more digits, which may again be separated by _.

      -  They may optionally be followed by E or e, an optional - or +,
         and more digits, to represent an exponent value.

   Note that, similar to JSON and some other languages, numbers without
   an integer digit (such as .1) are illegal.  They must be written with
   at least one integer digit, like 0.1.  (These patterns are also
   disallowed from Identifier Strings (Section 3.10), to avoid
   confusion.)
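
   The four non-keyword syntaxes can be sketched in Python as shown
   below.  The name parse_number is hypothetical.  Returning int for
   integers and float for everything else is only one possible
   representation, since this document leaves that choice to
   implementations.  The sketch also assumes that each run of digits
   starts with a digit rather than an underscore; the grammar in
   Section 4 is the authority on that detail.

```python
import re

_BASES = {"b": 2, "o": 8, "x": 16}
# Assumption: a digit must come first; underscores may follow freely.
_RADIX = re.compile(r"([+-]?)0([box])([0-9a-fA-F][0-9a-fA-F_]*)")
_DECIMAL = re.compile(
    r"[+-]?[0-9][0-9_]*(\.[0-9][0-9_]*)?([eE][+-]?[0-9][0-9_]*)?"
)


def parse_number(token):
    """Parse a non-keyword KDL number token into an int or a float."""
    radix = _RADIX.fullmatch(token)
    if radix:
        sign, prefix, digits = radix.groups()
        # int() raises ValueError for digits outside the radix (0b102).
        value = int(digits.replace("_", ""), _BASES[prefix])
        return -value if sign == "-" else value
    decimal = _DECIMAL.fullmatch(token)
    if decimal is None:
        raise ValueError(f"not a KDL number: {token!r}")
    text = token.replace("_", "")
    if decimal.group(1) or decimal.group(2):
        return float(text)  # has a fraction or an exponent
    return int(text)
```

   A token such as .1 matches neither pattern and is rejected, in line
   with the rule that at least one integer digit is required.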

3.14.1.  Keyword Numbers

   There are three special "keyword" numbers included in KDL to
   accommodate the widespread use of IEEE 754
   (https://en.wikipedia.org/wiki/IEEE_754) floats:

   *  #inf - floating point positive infinity.

   *  #-inf - floating point negative infinity.

   *  #nan - floating point NaN/Not a Number.

   To go along with this and prevent footguns, the bare Identifier
   Strings (Section 3.10) inf, -inf, and nan are considered illegal
   identifiers and should yield a syntax error.

   The existence of these keywords does not imply that numbers must be
   represented as IEEE 754 floats.  These are simply for clarity and
   convenience for any implementation that chooses to represent its
   numbers in this way.

3.15.  Boolean

   A boolean Value (Section 3.7) is either the symbol #true or #false.
   These _SHOULD_ be represented by implementations as boolean logical
   values, or some approximation thereof.

3.15.1.  Example

   my-node #true value=#false

3.16.  Null

   The symbol #null represents a null Value (Section 3.7).  It's up to
   the implementation to decide how to represent this, but it generally
   signals the "absence" of a value.

3.16.1.  Example

   my-node #null key=#null

3.17.  Whitespace

   The following characters should be treated as non-Newline
   (Section 3.18) white space
   (https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt):

                  +===========================+=========+
                  | Name                      | Code Pt |
                  +===========================+=========+
                  | Character Tabulation      | U+0009  |
                  +---------------------------+---------+
                  | Space                     | U+0020  |
                  +---------------------------+---------+
                  | No-Break Space            | U+00A0  |
                  +---------------------------+---------+
                  | Ogham Space Mark          | U+1680  |
                  +---------------------------+---------+
                  | En Quad                   | U+2000  |
                  +---------------------------+---------+
                  | Em Quad                   | U+2001  |
                  +---------------------------+---------+
                  | En Space                  | U+2002  |
                  +---------------------------+---------+
                  | Em Space                  | U+2003  |
                  +---------------------------+---------+
                  | Three-Per-Em Space        | U+2004  |
                  +---------------------------+---------+
                  | Four-Per-Em Space         | U+2005  |
                  +---------------------------+---------+
                  | Six-Per-Em Space          | U+2006  |
                  +---------------------------+---------+
                  | Figure Space              | U+2007  |
                  +---------------------------+---------+
                  | Punctuation Space         | U+2008  |
                  +---------------------------+---------+
                  | Thin Space                | U+2009  |
                  +---------------------------+---------+
                  | Hair Space                | U+200A  |
                  +---------------------------+---------+
                  | Narrow No-Break Space     | U+202F  |
                  +---------------------------+---------+
                  | Medium Mathematical Space | U+205F  |
                  +---------------------------+---------+
                  | Ideographic Space         | U+3000  |
                  +---------------------------+---------+

                                  Table 2

3.17.1.  Single-line comments

   Any text after //, until the next literal Newline (Section 3.18), is
   "commented out", and is considered to be Whitespace (Section 3.17).

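   A minimal illustration (the node and property names are
   hypothetical):

   // A comment on its own line.
   server "localhost" port=8080 // A comment after a node's entries.
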
3.17.2.  Multi-line comments

   In addition to single-line comments using //, comments can also be
   started with /* and ended with */. These comments can span multiple
   lines.  They are allowed in all positions where Whitespace
   (Section 3.17) is allowed and can be nested.
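
   A short sketch of both properties, spanning lines and nesting (the
   node name is hypothetical):

   /* A comment
      spanning two lines. */
   server /* note */ "localhost" /* outer /* inner */ outer */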

3.17.3.  Slashdash comments

   Finally, a special kind of comment called a "slashdash", denoted by
   /-, can be used to comment out entire _components_ of a KDL document
   logically, and have those elements not be included as part of the
   parsed document data.

   Slashdash comments can be used before the following, including before
   their type annotations, if present:

   *  A Node (Section 3.2): the entire Node is treated as Whitespace,
      including all props, args, and children.

   *  An Argument (Section 3.5): the Argument value is treated as
      Whitespace.

   *  A Property (Section 3.4) key: the entire property, including both
      key and value, is treated as Whitespace.  A slashdash of just the
      property value is not allowed.

   *  A Children Block (Section 3.6): the entire block, including all
      children within, is treated as Whitespace.  Only other children
      blocks, whether slashdashed or not, may follow a slashdashed
      children block.

   A slashdash may be followed by any amount of whitespace, including
   newlines and comments (other than other slashdashes), before the
   element that it comments out.
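
   The following sketch, using made-up node and property names, applies
   a slashdash to each kind of component listed above:

   /-disabled-node "arg" key=1 {
       child
   }
   node /-"skipped" "kept" /-key=1 other=2 /-{
       hidden
   } {
       shown
   }

   After parsing, only the second node remains, with the single
   argument "kept", the single property other=2, and the single child
   node shown.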

3.18.  Newline

   The following character sequences should be treated as new lines
   (https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-
   5/#G41643):

       +=========+===============================+=================+
       | Acronym | Name                          | Code Pt         |
       +=========+===============================+=================+
       | CRLF    | Carriage Return and Line Feed | U+000D + U+000A |
       +---------+-------------------------------+-----------------+
       | CR      | Carriage Return               | U+000D          |
       +---------+-------------------------------+-----------------+
       | LF      | Line Feed                     | U+000A          |
       +---------+-------------------------------+-----------------+
       | NEL     | Next Line                     | U+0085          |
       +---------+-------------------------------+-----------------+
       | VT      | Vertical Tab                  | U+000B          |
       +---------+-------------------------------+-----------------+
       | FF      | Form Feed                     | U+000C          |
       +---------+-------------------------------+-----------------+
       | LS      | Line Separator                | U+2028          |
       +---------+-------------------------------+-----------------+
       | PS      | Paragraph Separator           | U+2029          |
       +---------+-------------------------------+-----------------+

                                  Table 3

   Note that for the purpose of new lines, the specific sequence CRLF is
   considered _a single newline_.

3.19.  Disallowed Literal Code Points

   The following code points may not appear literally anywhere in the
   document.  They may be represented in Strings (but not Raw Strings)
   using Unicode Escapes (Section 3.11.1) (\u{...}), except for code
   points that are not Unicode Scalar Values, which can't be
   represented even as escapes.

   *  The code points U+0000-0008 or the code points U+000E-001F
      (various control characters).

   *  U+007F (the Delete control character).

   *  Any code point that is not a Unicode Scalar Value
      (https://unicode.org/glossary/#unicode_scalar_value)
      (U+D800-DFFF).

   *  U+200E-200F, U+202A-202E, and U+2066-2069, the Unicode "direction
      control" characters (https://www.w3.org/International/questions/
      qa-bidi-unicode-controls).

   *  U+FEFF, aka Zero-width Non-breaking Space (ZWNBSP)/Byte Order Mark
      (BOM), except as the first code point in a document.
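
   As a hedged illustration (the node name is hypothetical), a
   left-to-right mark (U+200E) cannot appear literally in a document,
   but a quoted String can carry it as an escape:

   label "abc\u{200E}def"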

4.  Full Grammar

   This is the full official grammar for KDL and should be considered
   authoritative if something seems to disagree with the text above.
   The grammar language syntax is defined in Section 4.1.

   document := bom? version? nodes

   // Nodes
   nodes := (line-space* node)* line-space*

   base-node := slashdash? type? node-space* string
       (node-space* (node-space | slashdash) node-prop-or-arg)*
       // slashdashed node-children must always be after props and args.
       (node-space* slashdash node-children)*
       (node-space* node-children)?
       (node-space* slashdash node-children)*
       node-space*
   node := base-node node-terminator
   final-node := base-node node-terminator?

   // Entries
   node-prop-or-arg := prop | value
   node-children := '{' nodes final-node? '}'
   node-terminator := single-line-comment | newline | ';' | eof

   prop := string node-space* '=' node-space* value
   value := type? node-space* (string | number | keyword)
   type := '(' node-space* string node-space* ')'

   // Strings
   string := identifier-string | quoted-string | raw-string ¶

   identifier-string := unambiguous-ident | signed-ident | dotted-ident
   unambiguous-ident :=
       ((identifier-char - digit - sign - '.') identifier-char*)
       - disallowed-keyword-identifiers
   signed-ident :=
       sign ((identifier-char - digit - '.') identifier-char*)?
   dotted-ident :=
       sign? '.' ((identifier-char - digit) identifier-char*)?
   identifier-char :=
       unicode - unicode-space - newline - [\\/(){};\[\]"#=]
       - disallowed-literal-code-points
   disallowed-keyword-identifiers :=
       'true' | 'false' | 'null' | 'inf' | '-inf' | 'nan'

   quoted-string :=
       '"' single-line-string-body '"' |
       '"""' newline
       (multi-line-string-body newline)?
       (unicode-space | ws-escape)* '"""'
   single-line-string-body := (string-character - newline)*
   multi-line-string-body := (('"' | '""')? string-character)*
   string-character :=
       '\\' (["\\bfnrts] |
       'u{' hex-unicode '}') |
       ws-escape |
       [^\\"] - disallowed-literal-code-points
   ws-escape := '\\' (unicode-space | newline)+
   hex-digit := [0-9a-fA-F]
   hex-unicode := hex-digit{1, 6} - surrogate - above-max-scalar
   surrogate := [0]{0, 2} [dD] [8-9a-fA-F] hex-digit{2}
   //  U+D800-DFFF:         D   8          00
   //                       D   F          FF
   above-max-scalar := [2-9a-fA-F] hex-digit{5} |
       [1] [1-9a-fA-F] hex-digit{4}


   raw-string := '#' raw-string-quotes '#' | '#' raw-string '#'
   raw-string-quotes :=
       '"' single-line-raw-string-body '"' |
       '"""' newline
       (multi-line-raw-string-body newline)?
       unicode-space* '"""'
   single-line-raw-string-body :=
       '' |
       (single-line-raw-string-char - '"')
           single-line-raw-string-char*? |
       '"' (single-line-raw-string-char - '"')
           single-line-raw-string-char*?
   single-line-raw-string-char :=
       unicode - newline - disallowed-literal-code-points
   multi-line-raw-string-body :=
       (unicode - disallowed-literal-code-points)*?

   // Numbers
   number := keyword-number | hex | octal | binary | decimal

   decimal := sign? integer ('.' integer)? (
       // NOTE: This grammar does not explicitly guard against having both
       // parenthesized and type suffixes.
       bare-type-suffix |
       explicit-type-suffix |
       (exponent explicit-type-suffix?)
     )?
   exponent := ('e' | 'E') sign? integer
   integer := digit (digit | '_')*
   digit := [0-9]
   sign := '+' | '-'

   bare-type-suffix := bare-type-suffix-initial identifier-char*
   bare-type-suffix-initial := identifier-char
       - '.' - ',' - '_'
       - ([a-zA-Z] [0-9_])
       - (('e' | 'E') sign? digit)
       - (('x' | 'X') [a-fA-F])
   explicit-type-suffix := '#' identifier-string

   hex := sign? '0x' hex-digit (hex-digit | '_')*
   octal := sign? '0o' [0-7] [0-7_]*
   binary := sign? '0b' ('0' | '1') ('0' | '1' | '_')*

   // Keywords and booleans.
   keyword := boolean | '#null'
   keyword-number := '#inf' | '#-inf' | '#nan'
   boolean := '#true' | '#false'

   // Specific code points
   bom := '\u{FEFF}'
   disallowed-literal-code-points :=
       See Section 3.19 (Disallowed Literal Code Points)
   unicode := Any Unicode Scalar Value
   unicode-space := See Table 2
       (All White_Space unicode characters which are not `newline`)

   // Comments
   single-line-comment := '//' ^newline* (newline | eof)
   multi-line-comment := '/*' commented-block
   commented-block :=
       '*/' | (multi-line-comment | '*' | '/' | [^*/]+) commented-block
   slashdash := '/-' line-space*

   // Whitespace
   ws := unicode-space | multi-line-comment
   escline := '\\' ws* (single-line-comment | newline | eof)
   newline := See Table 3 (All Newline White_Space)
   // Whitespace where newlines are allowed.
   line-space := node-space | newline | single-line-comment
   // Whitespace within nodes,
   // where newline-ish things must be esclined.
   node-space := ws* escline ws* | ws+

   // Version marker
   version :=
       '/-' unicode-space* 'kdl-version' unicode-space+ ('1' | '2')
       unicode-space* newline
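
   As a brief illustration of the version production above, a document
   declaring itself as KDL version 2 could begin like this (the node
   that follows the marker is a made-up example):

   /- kdl-version 2
   node "value"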

4.1.  Grammar language

   The grammar language syntax is a combination of ABNF with some regex
   spice thrown in.  Specifically:

   *  Single quotes (') are used to denote literal text. \ within a
      literal string is used for escaping other single-quotes, for
      initiating unicode characters using hex values (\u{FEFF}), and for
      escaping \ itself (\\).

   *  * is used for "zero or more", + is used for "one or more", and ?
      is used for "zero or one".  Per standard regex semantics, * and +
      are _greedy_; they match as many instances as possible without
      failing the match.

   *  *? (used only in raw strings) indicates a _non-greedy_ match; it
      matches as _few_ instances as possible without failing the match.

   *  ¶ is a _cut point_. It always matches and consumes no characters,
      but once matched, the parser is not allowed to backtrack past that
      point in the source.  If a parser would rewind past the cut point,
      it must instead fail the overall parse, as if it had run out of
      options.  (This is only used with the raw-string production, to
      ensure the first instance of the appropriate closing quote
      sequence is guaranteed to be the end of the raw string, rather
      than allowing it to potentially consume more of the document
      unexpectedly.)

   *  () can be used to group matches that must be matched together.

   *  a | b means a or b, whichever matches first.  If multiple items
      are before a |, they are a single group. a b c | d is equivalent
      to (a b c) | d.

   *  [] are used for regex-style character matches, where any character
      between the brackets will be a single match. \ is used to escape
      \, [, and ].  They also support character ranges (0-9), and
      negation (^).

   *  - is used for "except for" or "minus" whatever follows it.  For
      example, a - 'x' means "any a, except something that matches the
      literal 'x'".

   *  The prefix ^ means "something that does not match" whatever
      follows it.  For example, ^foo means "must not match foo".

   *  A single definition may be split over multiple lines.  Newlines
      are treated as spaces.

   *  // followed by text on its own line is used as comment syntax.

Authors' Addresses

   Katerina Zoé Marchán Salvá
   Microsoft


   The KDL Contributors