# KDL Spec

This is the kinda-formal specification for KDL, including the intended data model and the grammar.

## Introduction

KDL is a node-oriented document language. Its niche and purpose overlap with XML, as do many of its semantics. You can use KDL both as a configuration language and as a data exchange or storage format, if you so choose.

## Components

### Document

The toplevel concept of KDL is a Document. A Document is composed of zero or more [Nodes](#node), separated by newlines and whitespace, and eventually terminated by an EOF.

#### Example

The following is a document composed of two toplevel nodes:

```kdl
foo {
    bar
}
baz
```

### Node

Being a node-oriented language means that the real core component of any KDL document is the "node". Every node must have a name, which is either a legal [Identifier](#identifier) or a quoted [String](#string).

Following the name are zero or more [Values](#value) or [Properties](#property), separated by either [whitespace](#whitespace) or [a backslash-escaped line continuation](#line-continuation). Values and Properties may be interspersed in any order, much as positional arguments and options are in command-line tools.

Values are ordered relative to each other, and that order must be preserved to maintain the semantics. By contrast, Property order _should not matter_ to implementations. [Children](#children-block) should be used if an order-sensitive key/value data structure must be represented in KDL.

Finally, a node is terminated by either a [Newline](#newline), a [Children Block](#children-block), a semicolon (`;`) or the end of the file/stream (an `EOF`).

#### Example

```kdl
foo 1 key="val" 3 {
    bar
    baz
}
```

### Identifier

A bare Identifier is composed of any unicode codepoint other than [non-initial characters](#non-initial-characters), followed by any number of unicode codepoints other than [non-identifier characters](#non-identifier-characters).
Identifiers are terminated by [Whitespace](#whitespace) or [Newlines](#newline).

### Non-initial characters

The following characters cannot be the first character in a bare [Identifier](#identifier):

* Any of "/\\{};[]=,"
* Any decimal digit (0-9)
* Any [non-identifier characters](#non-identifier-characters)

### Non-identifier characters

The following characters cannot be used anywhere in a bare [Identifier](#identifier):

* Any codepoint with hexadecimal value `0x20` or below.
* Any codepoint with hexadecimal value higher than `0x10FFFF`.
* Any of "\\{};[]=,"

### Line Continuation

Line continuations allow [Nodes](#node) to be spread across multiple lines.

A line continuation is one or more [whitespace](#whitespace) characters, followed by a `\` character. This character can then be followed by more [whitespace](#whitespace) and must be terminated by a [Newline](#newline) (including the Newline that is part of single-line comments).

Following a line continuation, processing of a Node can continue as usual.

#### Example

```kdl
my-node 1 2 \  // this is a comment
        3 4    // This is the actual end of the Node.
```

### Whitespace

The following characters should be treated as non-[Newline](#newline) [white space](https://www.unicode.org/Public/UCD/latest/ucd/PropList.txt):

| Name                      | Code Pt |
|---------------------------|---------|
| Character Tabulation      | `0009`  |
| Space                     | `0020`  |
| No-Break Space            | `00A0`  |
| Ogham Space Mark          | `1680`  |
| En Quad                   | `2000`  |
| Em Quad                   | `2001`  |
| En Space                  | `2002`  |
| Em Space                  | `2003`  |
| Three-Per-Em Space        | `2004`  |
| Four-Per-Em Space         | `2005`  |
| Six-Per-Em Space          | `2006`  |
| Figure Space              | `2007`  |
| Punctuation Space         | `2008`  |
| Thin Space                | `2009`  |
| Hair Space                | `200A`  |
| Narrow No-Break Space     | `202F`  |
| Medium Mathematical Space | `205F`  |
| Ideographic Space         | `3000`  |

### Newline

The following characters [should be treated as new lines](https://www.unicode.org/versions/Unicode13.0.0/ch05.pdf):

| Acronym | Name                          | Code Pt         |
|---------|-------------------------------|-----------------|
| CR      | Carriage Return               | `000D`          |
| LF      | Line Feed                     | `000A`          |
| CRLF    | Carriage Return and Line Feed | `000D` + `000A` |
| NEL     | Next Line                     | `0085`          |
| FF      | Form Feed                     | `000C`          |
| LS      | Line Separator                | `2028`          |
| PS      | Paragraph Separator           | `2029`          |

Note that for the purpose of new lines, CRLF is considered _a single newline_.

## Full Grammar

```
// FIXME: I don't... think this is quite right?
nodes := linespace* (node (newline nodes)? linespace*)?

// FIXME: This is missing the newline at the end? And is the single-line-comment thing correct?
node := identifier (node-space node-argument)* (node-space node-children)? single-line-comment?
node-argument := prop | value
node-children := '{' nodes '}'
node-space := ws* escline ws* | ws+

// FIXME: This needs adjustment to the new, unicode-friendly version
identifier := [a-zA-Z] [a-zA-Z0-9!$%&'*+\-./:<>?@\^_|~]* | string
prop := identifier '=' value
value := string | raw-string | number | boolean | 'null'

string := '"' character* '"'
character := '\' escape | [^\"]
escape := ["\\/bfnrt] | 'u{' hex-digit{1, 6} '}'
hex-digit := [0-9a-fA-F]

raw-string := 'r' raw-string-hash
raw-string-hash := '#' raw-string-hash '#' | raw-string-quotes
raw-string-quotes := '"' .* '"'

number := decimal | hex | octal | binary

decimal := integer ('.' [0-9]+)? exponent?
exponent := ('e' | 'E') integer
integer := sign? [0-9] [0-9_]*
sign := '+' | '-'

hex := '0x' hex-digit (hex-digit | '_')*
octal := '0o' [0-7] [0-7_]*
binary := '0b' ('0' | '1') ('0' | '1' | '_')*

boolean := 'true' | 'false'

escline := '\\' ws* (single-line-comment | newline)

linespace := newline | ws | single-line-comment

newline := `000D` | `000A` | `000D` `000A` | `0085` | `000C` | `2028` | `2029`

ws := bom | unicode-space | multi-line-comment | slashdash-comment

unicode-space := See Table (All White_Space unicode characters which are not `newline`)

single-line-comment := '//' ('\r' [^\n] | [^\r\n])* newline
multi-line-comment := '/*' ('*' [^\/] | [^*])* '*/'
slashdash-comment := '/-' (node | value | prop | node-children)
```
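
As a rough illustration of how the `number` productions in the grammar above fit together, here is a minimal Python sketch that transcribes them into regular expressions. The names `classify_number`, `INTEGER`, `DECIMAL`, `HEX`, `OCTAL`, and `BINARY` are hypothetical helpers invented for this example; they are not part of KDL or of any KDL library. The sketch assumes the grammar exactly as written (for instance, no sign on `hex`, `octal`, or `binary`, and no underscores in the fractional part of `decimal`).

```python
import re
from typing import Optional

# integer := sign? [0-9] [0-9_]*
INTEGER = r"[+-]?[0-9][0-9_]*"

# decimal := integer ('.' [0-9]+)? exponent?   with   exponent := ('e' | 'E') integer
DECIMAL = re.compile(rf"{INTEGER}(?:\.[0-9]+)?(?:[eE]{INTEGER})?")

# hex := '0x' hex-digit (hex-digit | '_')*
HEX = re.compile(r"0x[0-9a-fA-F][0-9a-fA-F_]*")

# octal := '0o' [0-7] [0-7_]*
OCTAL = re.compile(r"0o[0-7][0-7_]*")

# binary := '0b' ('0' | '1') ('0' | '1' | '_')*
BINARY = re.compile(r"0b[01][01_]*")


def classify_number(text: str) -> Optional[str]:
    """Return the name of the number production matching all of `text`, or None."""
    # Prefixed forms are tried first; `decimal` cannot match them anyway,
    # because 'x', 'o', and 'b' are not allowed inside a decimal literal.
    for name, pattern in (
        ("hex", HEX),
        ("octal", OCTAL),
        ("binary", BINARY),
        ("decimal", DECIMAL),
    ):
        if pattern.fullmatch(text):
            return name
    return None
```

Calling `classify_number` on a candidate token reports which production it satisfies; strings such as `0x` (no digit after the prefix) or `_1` (leading underscore) match none of the productions and yield `None`.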