From ee032d3c597e66800b47903facb07e8202d64cbf Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Kat=20March=C3=A1n?= <kzm@zkat.tech>
Date: Sun, 10 Dec 2023 15:31:14 -0800
Subject: [PATCH] fix some confusion in grammar syntax, and actually specify
 the syntax itself

Fixes: https://github.com/kdl-org/kdl/issues/345
---
 CHANGELOG.md |  1 +
 SPEC.md      | 30 ++++++++++++++++++++++++++++--
 2 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index cd30307..0e75f51 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,7 @@
 * All literal whitespace following a `\` in a string is now discarded.
 * Vertical tabs (`U+000B`) are now considered to be whitespace.
 * Identifiers can't start with `r#`, so they're easy to distinguish from raw strings. (They already similarly can't start with a digit, or a sign+digit, so they're easy to distinguish from numbers.)
+* The grammar syntax itself has been described, and some confusing definitions in the grammar have been fixed accordingly (mostly related to escaped characters).
 
 ### KQL
 
diff --git a/SPEC.md b/SPEC.md
index b06c311..3b5a782 100644
--- a/SPEC.md
+++ b/SPEC.md
@@ -98,7 +98,7 @@ codepoints other than [non-identifier characters](#non-identifier-characters),
 so long as this doesn't produce something confusable for a [Number](#number),
 [Boolean](#boolean), or [Null](#null). For example, both a [Number](#number)
 and an Identifier can start with `-`, but when an Identifier starts with `-`
-the second character cannot be a digit. This is precicely specified in the 
+the second character cannot be a digit. This is precisely specified in the
 [Full Grammar](#full-grammar) below.
 
 Identifiers are terminated by [Whitespace](#whitespace) or
@@ -472,6 +472,10 @@ Note that for the purpose of new lines, CRLF is considered _a single newline_.
 
 ## Full Grammar
 
+This is the full official grammar for KDL and should be considered
+authoritative if something seems to disagree with the text above. The [grammar
+language syntax](#grammar-language) is defined below.
+
 ```
 nodes := (line-space* node)* line-space*
 
@@ -494,7 +498,7 @@ bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - key
 unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
 numberish-ident := sign ((identifier-char - digit) identifier-char*)?
 stringish-ident := "r" ((identifier-char - "#") identifier-char*)?
-identifier-char := unicode - line-space - [\/(){}<>;[]=,"]
+identifier-char := unicode - line-space - [\\/(){}<>;\[\]=,"]
 keyword := boolean | 'null'
 prop := identifier '=' value
 value := type? (string | number | keyword)
@@ -538,3 +542,25 @@ single-line-comment := '//' ^newline* (newline | eof)
 multi-line-comment := '/*' commented-block
 commented-block := '*/' | (multi-line-comment | '*' | '/' | [^*/]+) commented-block
 ```
+
+### Grammar language
+
+The grammar language syntax is a combination of ABNF with some regex spice thrown in.
+Specifically:
+
+* Single quotes (`'`) are used to denote literal text. `\` within a literal
+  string is used for escaping other single quotes, for writing Unicode
+  characters as hex values (`\u{FEFF}`), and for escaping `\` itself
+  (`\\`).
+* `*` is used for "zero or more", `+` is used for "one or more", and `?` is
+  used for "zero or one".
+* `()` can be used to group matches that must be matched together.
+* `a | b` means `a or b`, whichever matches first. If multiple items are before
+  a `|`, they are a single group. `a b c | d` is equivalent to `(a b c) | d`.
+* `[]` are used for regex-style character matches, where any character between
+  the brackets will be a single match. `\` is used to escape `\`, `[`, and
+  `]`. They also support character ranges (`0-9`) and negation (`^`).
+* `-` is used for "except for" or "minus" whatever follows it. For example,
+  `a - 'x'` means "any `a`, except something that matches the literal `'x'`".
+* The prefix `^` means "something that does not match" whatever follows it.
+  For example, `^foo` means "must not match `foo`".