Remote-Url: https://github.com/kragen/peg-bootstrap/blob/master/peg.md
Retrieved-at: 2022-01-29 19:30:39.230183+00:00

So I was reading Bryan Ford’s thesis about parsing expression grammars and packrat parsers, and I thought it would be fun to implement them and see how easy they really were. It turns out they’re not that hard; this document contains a one-page PEG parser generator that generates PEG parsers in JavaScript, along with an explanation of how it works, and some example applications.

If you’ve ever thought that writing a compiler was deep magic because parsing would take you way too long to understand, this should show you that writing a compiler can be simple! (At least, if you already know how to program.)

## What Are PEGs?

A PEG is a formal language description which describes how to parse some language — like a regular expression, it describes the structure of some set of strings.

### A Gentle Introduction by Example

Here’s a simple PEG which describes simple arithmetic expressions with no operator precedence:

    # in an example arithmetic parser:
    sentence <- ('0' / '1' / '2' / '3' / '4' / '5' / '6' / '7' / '8' / '9')+
                ( ('+' / '-' / '*' / '×' / '/' / '÷') sentence / ).

This says that a `sentence` is one or more digits, followed by either an operator and another `sentence`, or nothing. The parentheses are used for grouping; apostrophes `''` are used for literal text; slashes `/` are used for choice (“try parsing this, and if it doesn’t work out, try that”); a left arrow `<-` is used to attach a name (called a “nonterminal”) to a parsing rule; and `x+` means “one or more of `x`”.

(Typically, each of the strings that belongs to a language, such as a program in a programming language, is called a “sentence” of that language; thus my choice of that nonterminal name.)

So, to parse `2*30+4` as a `sentence`, first we try matching a `0` at the beginning, where there’s a `2`; that doesn’t work, so we try a `1`; that doesn’t work, so we try a `2`. That does work, so then we try for repetition, looking for a second digit at the `*`. That doesn’t work out (after ten tries), so we zoom along and look for a `+`. The `*` isn’t a `+`, so after a couple of tries, we find out it’s a `*`. Then we try parsing a nested `sentence` starting at the `3`. This time, we match the `3` after three tries, and then when we look for a second digit, we find a `0`; the third try fails, so we look for a `+`, and find it; then we look for a second nested `sentence`. We match a `4` after four tries, but we don’t find another digit after it (because there isn’t anything after it), so we try to find an operator after it, which doesn’t work, so we try to find nothing after it (the emptiness after the `/` after `sentence`), which works, and we’re done.

Notice that this doesn’t respect operator precedence (it gives `2*(30+4)` rather than `(2*30)+4`), and also associates to the right.

Here’s an example PEG that handles operator precedence and parentheses, although not associativity:

    # in an example arithmetic parser with precedence:
    sentence <- term ('+'/'-') sentence / term.
    term <- atom ('*' / '×' / '/' / '÷') term / atom.
    atom <- number / '(' sentence ')'.
    number <- ('0' / '1' / '2' / '3' / '4' / '5' / '6' / '7' / '8' / '9')+.

If we try to parse the same `2*30+4` with this grammar, we get down to `number` and parse the `2`, so `atom` succeeds with the `2`, and then `term` sucks up the `*` and then looks for an inner `term` at the `3`.
Then `number` parses `30`, and the inner `term` looks for one of `* × / ÷` after it, which doesn’t work out since what’s after it is a `+`, so it gives up on its first alternative and tries to parse just an `atom` starting at the `3`, rather than an `atom` followed by an operator and another `term`. Then `atom` sucks up the `30` just like before, and the inner `term` finishes, and then the outer `term` finishes, and it’s up to `sentence` to deal with the `+4` bit, which it does in the predictable way.

It won’t handle `40-1-1` as `(40-1)-1` as you might hope, though. If you try to rewrite `sentence` to handle this as `sentence ('+'/'-') term / term`, you run into trouble — the first thing `sentence` does is try to parse a `sentence`, so you get into an infinite loop. There are different ways to ameliorate this problem by enhancing the parser generator, but in general, you can always figure out a way to modify the grammar to remove this “left recursion”; it just makes it a little more complicated to handle the results of the parser.

(As an aside, most practical PEG systems let you abbreviate things like `('0' / '1' / '2' / '3' / '4' / '5' / '6' / '7' / '8' / '9')` as `[0-9]`, but the one in this document doesn’t.)

That covers most of the stuff PEGs can do. A few things to notice:

- They’re a little more verbose than regular expressions but a lot more powerful at understanding structure. And, as with regexps, you can do a hell of a lot in a few lines of code.
- The obvious implementation is pretty slow; it spends a lot of time re-parsing things it’s already parsed and playing Cheese Shop with the next character. (“Have you got a 0?” “No.” “How about a 1?” “No.” ...) It turns out there are ways to solve this, although I don’t explore them in this document.
- They have trouble with “left recursion”, which is where the first thing in a “foo” (say, `sentence`) can be a smaller “foo”.

There’s one more big feature of PEGs: the ability to do negative lookahead, or negation. As an example, in C, a comment begins at a `/*` and continues until the next `*/`. But you can have `*` and `/` and even `/*` inside the comment, as long as there isn’t a `*/`. Doing this in a regexp is a real pain, and the result is unreadable; you end up with a regexp like `\/\*([^*]|\*[^/])*\*\/`, assuming you have to backslash your slashes. In a PEG, it looks like this:

    # in the C comment example PEG:
    comment <- '/*' (!'*/' char)* '*/'.

That is, to parse a comment, first parse a `/*`, then, as long as the next thing isn’t a `*/`, try to parse a `char`, and then parse a `*/`. (The `*` means “zero or more”, just as `+` means “one or more”.) You can write the same thing with Perl’s enhanced regexp features, `qr|/\*(?:(?!\*/).)*\*/|`, and it’s only slightly shorter, but I think it’s not as clear.

You might think that in `!'*/' char`, the negation of `'*/'` somehow *modifies* `char`. But it doesn’t, really; it just means that the parse fails at points in the input where `'*/'` can match, so `char` doesn’t get a chance to match there. Instead, we backtrack from matching the `'*/'`, break out of the loop, and get a chance to match the `'*/'` on the outside.

You can use this magic PEG power for a variety of things that are traditionally painful. For example, most programming languages have keywords, which look like variables (or other identifiers) but are treated differently syntactically. In a PEG, you can write this:

    # in the keyword example PEG:
    keyword <- ('if' / 'while' / 'for' / otherkeyword) !idchar.
    identifier <- !keyword idstartchar idchar*.

This first specifies that a `keyword` is one of the specified words as long as it’s not followed by an `idchar`; then it specifies that when you’re trying to parse an `identifier`, first try to parse a `keyword`, and if that succeeds, then parsing the `identifier` should fail; but if there’s no `keyword`, go ahead and try to parse an `idstartchar` followed by zero or more `idchar`s.
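(JavaScript regexps can express the same negative lookahead, which makes a handy cross-check on what `!` means here. A sketch, assuming `idchar` is `[A-Za-z0-9_]` — the grammar above deliberately leaves `idchar`, `idstartchar`, and `otherkeyword` undefined:)

    // A sketch of the same keyword test as a JS regexp;
    // (?!...) is negative lookahead, playing the role of ! above.
    var keyword = /^(?:if|while|for)(?![A-Za-z0-9_])/;
    keyword.test('while');   // true
    keyword.test('whiley');  // false: 'while' is followed by an idchar
    keyword.test('whi');     // false: not a keyword at all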
Note that we throw away the results of trying to parse the `keyword` — we were only trying it in order to see if we shouldn’t do something else.

### If You’ve Taken a Compilers Class Lately

I thought I’d stick this section in for the benefit of folks who are all up on the theory. The rest of the document doesn’t depend on it.

PEGs specify how to *parse* a language, by contrast with context-free grammars, which primarily describe how to *generate* sentences of a language. This difference makes it much easier to construct parsers for PEGs; they can be straightforwardly converted into simple recursive-descent parsers performing limited backtracking, with each nonterminal becoming a parsing function. It also probably makes it much more difficult to prove properties of the language recognized by a PEG.

PEGs can parse some languages that context-free grammars can’t, such as the language aⁿbⁿcⁿ — that is, some number of `a`s, followed by the same number of `b`s, followed by the same number of `c`s. However, because PEGs can’t handle ambiguity, and because there’s a linear-time parsing algorithm for them, it is suspected that PEGs can’t parse all languages context-free grammars can. S → a S a | a S b | b S a | b S b | a is a simple context-free language which Ford conjectured cannot be parsed with a PEG; it describes strings of odd numbers of `a`s and `b`s in which the middle letter is an `a`. PEGs can parse all languages that can be parsed with LL(k) or LR(k) parsers.
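(For instance, here’s a sketch of a PEG for aⁿbⁿcⁿ, n ≥ 1, in the notation used above; the helper nonterminals `as`, `ab`, and `bc` are names I’ve made up, and `!!` is a double negation, used as a lookahead that consumes nothing:)

    sentence <- !!(ab 'c') as bc !char.
    as <- 'a' as / 'a'.
    ab <- 'a' ab 'b' / 'a' 'b'.
    bc <- 'b' bc 'c' / 'b' 'c'.

(The lookahead `!!(ab 'c')` checks, without consuming anything, that the `a`s are followed by exactly as many `b`s and then a `c`; then `as` consumes the `a`s, `bc` checks that the `b`s are matched one-for-one by the `c`s, and `!char` insists on reaching the end of the input.)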
PEGs are more composable than LL(k) or LR(k) CFGs; because PEGs can’t handle ambiguity, it’s easy to predict the effect of adding new parsing rules to the grammar.

You can parse general CFGs with a backtracking approach like the PEG approach; the difference is that each nonterminal must be able to succeed multiple times on the same input with different possible parses, in case something that follows it fails. Definite clause grammars in Prolog are one example of this strategy. In PEGs, once a nonterminal succeeds at some position, it throws away its backtracking state, so it can produce at most one result at that position. As a consequence, even though there are PEGs that take exponential time to parse (if implemented the naïve way), CFGs with exponential-time parsing (again, if implemented the naïve way, as with DCGs) are much more common.

(Allan Schiffman tells me that all you really need to do to make DCGs perform well is to put cuts in “the obvious places”, e.g. between statements. I haven’t tried it myself.)

## A Minimal PEG Language

The expressions in PEGs minimally contain (using the TDPL notation in the thesis) negation `!`, ordered choice or alternation `/`, concatenation or sequencing (denoted by juxtaposition), terminal strings (written in single quotes `''`), and nonterminals (written as bare words `foo`). (We can leave out repetition `*` and `+`, because, as shown below, we can synthesize them.)

Here’s a relatively minimal grammar describing a notation for a grammar with these features, the same one I used in the “Gentle Introduction” section, written in terms of itself:

    # in a minimal parsing expression grammar:
    _ <- sp _ / .
    sp <- ' ' / '\n' / '\t'.
    sentence <- _ rule sentence / _ rule.
    rule <- name _ '<-'_ choice '.'_.
    choice <- sequence '/'_ choice / sequence.
    sequence <- term sequence / .
    term <- '!'_ term / '\'' stringcontents '\''_ / name _.
    stringcontents <- stringchar stringcontents / .
    stringchar <- !'\\' !'\'' char / '\\' char.
    name <- namechar name / namechar.
    namechar <- !'!' !'\'' !sp !'<-' !'/' !'.' char.

This all depends on the primitive nonterminal `char`, which I’m assuming matches any character, for some definition of character.

The nonterminal `_` consumes any amount of whitespace. It’s used everywhere we want to consume whitespace, generally at the lowest possible level of the grammar, with the exception of `name` (on the theory that the whitespace is not really part of the name). (Even though it has a funny non-alphabetic name, the language doesn’t treat it specially. I used to call it `s`, but it was distracting.)

There are three cases of the pattern `group <- item group / .`, which means `group` is zero or more things that match `item`. Because PEGs are greedy and don’t backtrack after returning, `group` will only ever parse the maximum possible number of `item` items. It’s not possible for a parsing failure after the `group` to cause `group` to backtrack and return a smaller number of `item` objects, the way it could in a parser for a context-free grammar, although a parsing failure inside the last `item` will indeed do so. This allows us to get by without a separate scanner for this grammar! One minor variation of this pattern is found in `sentence` and `name`, which match *one* or more of their elements, not *zero* or more.

Note that the above grammar tells us how to parse the language, but doesn’t tell us anything about its semantics. But it’s nice and short.

## Adding Grouping

The PEG language as written above is pretty weak. It doesn’t have grouping or repetition, although they can be emulated with the use of extra productions, as in the `group` pattern explained above.

We can add grouping by redefining `term` like this:

    # in a slightly more powerful parsing expression grammar:
    term <- '!'_ term / '\'' stringcontents '\''_ / name _ / '('_ choice ')'_.

This simplifies the grammar only slightly; we can rewrite `stringcontents` as follows:

    stringcontents <- (!'\\' !'\'' char / '\\' char) stringcontents / .

## A Diversion: Adding Repetition

Although it turns out not to be very useful for what I’ll do next, adding the capability for repetition to the language makes it shorter and clearer.

    # in a more powerful PEG:
    sp <- ' ' / '\n' / '\t'.
    _ <- sp*.
    sentence <- _ (name _ '<-'_ choice '.'_)+.
    choice <- term* ('/'_ term*)*.
    term <- ('!'_ term / string / name / '('_ choice ')')_ ('+' / '*' / )_.
    string <- '\'' (!'\\' !'\'' char / '\\' char)* '\''_.
    meta <- '!' / '\'' / '<-' / '/' / '.' / '+' / '*' / '(' / ')'.
    name <- (!meta !sp char)+.

That shrinks the grammar considerably, while significantly expanding the expressiveness of the grammar language it describes.

## Adding Result Expressions

In theory, the grammar as written could be useful. It’s expressive enough to describe the tree structure of a language, such as the PEG language defined above. So you could use it to parse some string into a syntax tree. However, it would be even more useful to have a version of the grammar language that can include result expressions written in some programming language that compute useful things. For example, you could use such a system to write and maintain a working compiler from PEG grammars to some programming language, or from some other language.

A straightforward and readable way to do this is to label some parts of a sequence with names, and then to use those names in a result specification at the end of the sequence.
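(As a taste of where this is going, the precedence-climbing arithmetic grammar from the introduction could then compute the value of the expression as it parses. This is a sketch in the notation about to be defined: whitespace handling is omitted, `digits` and `digit` are helper rules I’ve added, and the result expressions are plain JavaScript:)

    sentence <- t: term '+' s: sentence -> (t + s)
              / t: term '-' s: sentence -> (t - s)
              / term.
    term <- a: atom '*' t: term -> (a * t)
          / a: atom '/' t: term -> (a / t)
          / atom.
    atom <- number / '(' s: sentence ')' -> (s).
    number <- ds: digits -> (parseInt(ds, 10)).
    digits <- d: digit ds: digits -> (d + ds) / digit.
    digit <- '0' / '1' / '2' / '3' / '4' / '5' / '6' / '7' / '8' / '9'.

(Each operator gets its own alternative because the two operators need different result expressions; parsing `2*30+4` with this grammar would produce 64.)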
Here’s an extension of the above grammar that allows for such names and result specifications:

    # in a PEG describing results:
    sp <- ' ' / '\n' / '\t'.
    _ <- sp _ / .
    sentence <- _ rule sentence / _ rule.
    rule <- name _ '<-'_ choice '.'_.
    choice <- sequence '/'_ choice / sequence.
    sequence <- term sequence / '->'_ expr / .
    expr <- '('_ exprcontents ')'_.
    exprcontents <- (!'(' !')' char / expr) exprcontents / .
    term <- name _ ':'_ term / '!'_ term / string / name _ / '('_ choice ')'_.
    string <- '\'' stringcontents '\''_.
    stringcontents <- !'\\' !'\'' char stringcontents / '\\' char stringcontents / .
    meta <- '!' / '\'' / '<-' / '/' / '.' / '(' / ')' / ':' / '->'.
    name <- namechar name / namechar.
    namechar <- !meta !sp char.

This adds the possibility that a term may be preceded by a name and a colon, and that a sequence may end with a `->` and a parenthesized expression. This lets you write things like `n: expr` and `expr _ -> (print("got expr"))`. It doesn’t place strong requirements on the embedded expression, so it can be in almost any language, but it does require that any parentheses inside of it be balanced. (If that’s difficult in a certain case, due to embedded strings, maybe you can incorporate some commented-out parentheses to balance things.)

## A Metacircular Compiler-Compiler

So let’s suppose that we want to use this result-expression facility to write a compiler for these grammars, producing a parser for the specified grammar in, say, JavaScript. We want to translate each parsing expression in the grammar language into an expression in the target language that parses the sub-language defined by that parsing expression. For example, we want to translate `choice <- sequence '/'_ choice / sequence.` into a recursive JavaScript function that parses expressions containing slash-separated `choice`s. Since it doesn’t specify a result expression, it’s sort of indeterminate what it should actually do, other than consume characters from the input stream until it finds something `choice` can’t parse.

So now we have to figure out what the semantics are of each of the various actions. I’m going to factor out the code generation parts into separate named blocks, so that it’s relatively easy to have the parser, say, generate code in some other language, or just an abstract syntax tree.

### Whitespace

Whitespace is fairly easy: it is a no-op.

    # in the metacircular compiler-compiler:
    sp <- ' ' / '\n' / '\t'.
    _ <- sp _ / .

### Rules

Let’s compile each rule into a JavaScript function that parses the language described by that rule, and the grammar as a whole into the collection of these functions plus whatever support code is needed. (Here I’m going to use double angle-brackets `<<>>` to name chunks of code that aren’t given until later.)

    rule <- n: name _ '<-'_ body: choice '.'_ -> <<code to produce a function>> .
    sentence <- _ r: rule g: sentence -> (r + "\n" + g)
              / _ r: rule -> (r + "\n" <<support code>> ).

The code to produce a function in JavaScript is quite straightforward:

    # in code to produce a function:
    (["function parse_", n, "(input, pos) {\n",
      <<function prologue>>
      body,
      <<function epilogue>>
      "}\n"].join(''))

So a grammar nonterminal named `term` will be compiled into a function called `parse_term`, whose body will be the value computed by `choice`, bracketed by some startup and cleanup code, and therefore `choice` needs to evaluate to a string of zero or more valid JavaScript statements.

These functions will need to do several things to implement the semantics of a PEG parser:

- Advance the input position, starting from the input position the caller passed in, and in case of success, communicate the new input position to the caller.
- Save the input position (and any other state) in order to backtrack when a sequence inside a choice fails, or after testing a negation condition. They may have to save several input positions at once in cases where there is nested alternation.
- Compute the value given by the result expressions in the grammar and, in case of success, pass it back to the caller, along with the new input position.

In order to avoid global variables, we’re passing in the input string (which doesn’t change during a parse) and the current position in it as arguments to each parsing function. To package the value computed along with the new input position, we’ll return a JavaScript object with `val` and `pos` properties, like `{val: "foo", pos: 37}`. In case of failure, we’ll just return `null`.

From here we’ll mostly work bottom-up.

### Names

Names are used in two contexts: at the top level of a rule, they define the name of the nonterminal, and in a term, they request a call to that nonterminal. In both cases, we basically just need the contents of the name.

    # in the metacircular compiler-compiler:
    meta <- '!' / '\'' / '<-' / '/' / '.' / '(' / ')' / ':' / '->'.
    name <- c: namechar n: name -> (c + n) / namechar.
    namechar <- !meta !sp char.

In this case, we presume that the value produced by `char` (and thus the value produced by `namechar`) is the character it consumed, and that in the absence of an explicit result expression, the result of the whole rule is that same character. This can be implemented, for example, by having a sequence return by default the value of the last term in it. (I’m not sure that’s a good default, because it seems a little error-prone, but I’ll try it.)

### Nonterminals

A reference to a nonterminal is compiled as a call to its parsing function, passing in the current position.

    # in the metacircular compiler-compiler:
    term <- labeled / nonterminal / string / negation / parenthesized.
    nonterminal <- n: name _ -> <<code to parse another nonterminal>> .

Again, the JS implementation of a subroutine call is quite simple:

    # in code to parse another nonterminal:
    ([' state = parse_', n, '(input, state.pos);\n'].join(''))

This means we need a variable `state` to store this returned value in, and it needs to be initialized with the position passed in by the caller.

    # in function prologue:
    '  var state = { pos: pos };\n',

What do we do with `state.val`? It depends on where the nonterminal is found. If it’s preceded by a label, we want to store it in a variable under that name for later use, unless it fails.
Let’s have `term`, just like `choice`, return a string of zero or more valid JavaScript statements.

    # in the metacircular compiler-compiler:
    labeled <- label: name _ ':'_ value: term -> <<code to save a value in a variable>> .

We protect this with a conditional on `state` in case the parse has failed:

    # in code to save a value in a variable:
    ([value, ' if (state) var ', label, ' = state.val;\n'].join(''))

(Ideally we would undo this saving if the nonterminal is in an alternative that fails and ends up being backtracked; but hopefully the result expressions of later alternatives will simply not use that variable.)

Now, if the nonterminal was the last thing in a parsing function, then we want to return the `state.val` it gave us as our own `state.val`, and additionally we want to return its `state.pos` as our `state.pos`; or, if it failed, it returned `null`, in which case we want to return `null`. So at the end of the function, we can just return `state`:

    # in function epilogue:
    '  return state;\n',

Now we just need to ensure that all of the other expression types (sequence, terminal strings, ordered choice, negation, parenthesized) update `state` in a manner analogous to how calls to nonterminals update `state`.

While we’re on the topic of nonterminals, we should probably define the one predefined nonterminal, `char`:

    # in support code:
    + 'function parse_char(input, pos) {\n'
    + '  if (pos >= input.length) return null;\n'
    + '  return { pos: pos + 1, val: input.charAt(pos) };\n'
    + '}\n'

### Sequence

Sequences are relatively simple. Given a sequence of two expressions `foo bar`, we first parse `foo` from the current position, and if that succeeded, we parse `bar` from the new position. If it fails, the sequence as a whole fails, and there is no current position.

This is one of the things that is easier to do if you don’t try to write your grammar with features like `*`, since it treats sequences of arbitrary numbers of things as nested sequences of two items, the innermost of which is empty.

    # in the bare grammar:
    sequence <- term sequence / '->'_ expr / .

The case of an empty sequence doesn’t update `state` at all. In the case of a non-empty sequence, we execute `foo`, and if `foo` doesn’t set `state` to `null`, we execute `bar`.

    # in the metacircular compiler-compiler:
    sequence <- foo: term bar: sequence -> <<code to handle a sequence>>
              / result_expression
              / -> ('').

The `result_expression` case is one of the last things explained, so ignore it for now.

This will result in deeply nested if statements, without proper indentation in the output, when there is a long sequence, but that’s probably okay:

    # in code to handle a sequence:
    ([foo, ' if (state) {\n', bar, ' }\n'].join(''))

### Terminal Strings

A “terminal” or literal string like `'->'` either matches some characters in the input or fails to do so. Rather than inserting code into every parsing function to compare parts of the input, making the parsing functions less readable, we’ll factor this out into a single `literal` function:

    # in support code:
    + 'function literal(input, pos, string) {\n'
    + '  if (input.substr(pos, string.length) === string) {\n'
    + '    return { pos: pos + string.length, val: string };\n'
    + '  } else return null;\n'
    + '}\n'
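(At this point there’s enough machinery to see the shape of a whole compiled rule. For a hypothetical rule `ab <- 'a' 'b'.`, assembling the function chunk, the prologue, the sequence code, and calls to `literal` would give something like this sketch:)

    function parse_ab(input, pos) {
      var state = { pos: pos };
      state = literal(input, state.pos, 'a');    // first term of the sequence
      if (state) {
        state = literal(input, state.pos, 'b');  // second term, from the new pos
      }
      return state;
    }
    // parse_ab('abc', 0) gives { pos: 2, val: 'b' };
    // parse_ab('ax', 0) gives null.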
So then we just need to emit code to call this function and update `state` appropriately when we encounter a terminal string. As it happens, the translation from string syntax in the PEG language to string syntax in JavaScript is the null transformation. If we were compiling to some other language, such as C, this might pose some difficulty.

    # in the metacircular compiler-compiler:
    string <- '\'' s: stringcontents '\''_ -> <<code to match a literal string>> .
    stringcontents <- !'\\' !'\'' c: char s: stringcontents -> (c + s)
                    / b: '\\' c: char s: stringcontents -> (b + c + s)
                    / -> ('').

So here’s the function call:

    # in code to match a literal string:
    ([" state = literal(input, state.pos, '", s, "');\n"].join(''))

As we iterate through the characters or backslash-escapes inside the string, we convert them to strings — either by default, or explicitly by concatenating the backslash to the character that follows it. Then we call `literal` with the current position, and it either returns `null` or gives us the new position and the value it matched as our new `state`.

### Ordered Choice

Two of the remaining expression types (ordered choice and negation, but not terminal strings and parenthesized expressions) can require backtracking. So we have to save a state and possibly restore that state.

Here’s how ordered choice works; negation is fairly similar. In ordered choice, if the first alternative succeeds, we don’t try the others; but if it fails, we restore the previously saved state. This is complicated somewhat by the fact that we might be inside a parenthesized expression, so there may be a stack of previously saved states, even inside the same function. So on entry to the function, we create a stack:

    # in function prologue:
    '  var stack = [];\n',

The grammar entry treats N-way choices like `labeled / negation / string / nonterminal / parenthesized` as nested 2-way choices like `labeled / (negation / (string / (nonterminal / parenthesized)))`. This is a little bit needlessly inefficient, since we’ll be using potentially four stack entries instead of one, but it will do for now.

    # in the metacircular compiler-compiler:
    choice <- a: sequence '/'_ b: choice -> <<code to handle a choice>> / sequence.

Execution of `b` is conditional on failure of `a`; if `a` succeeds, we simply discard the state we saved before trying it.

    # in code to handle a choice:
    ([' stack.push(state);\n',
      a,
      ' if (!state) {\n',
      '   state = stack.pop();\n',
      b,
      ' } else stack.pop();\n'].join(''))

It’s only safe to push `state`, rather than a copy of `state`, because we never mutate the existing `state`; we only make new `state` objects.

### Negation

Negation is `!x`:

    # in the metacircular compiler-compiler:
    negation <- '!'_ t: term -> <<code to handle negation>> .

This is implemented by saving the parse state, trying to parse `x`, failing if parsing `x` succeeded, and otherwise proceeding from the saved parse state.

    # in code to handle negation:
    ([' stack.push(state);\n',
      t,
      ' if (state) {\n',
      '   stack.pop();\n',
      '   state = null;\n',
      ' } else state = stack.pop();\n'].join(''))

You can use a double negative like `!!'->'` to write a “zero-width positive lookahead assertion”, in Perl lingo. That compiles into this:

    # in the output of the compiler-compiler:
    stack.push(state);
    stack.push(state);
    state = literal(input, state.pos, '->');
    if (state) {
      stack.pop();
      state = null;
    } else state = stack.pop();
    if (state) {
      stack.pop();
      state = null;
    } else state = stack.pop();

The initial `state` is assumed to be non-null. So after the call to `literal`, `state` is non-null iff the next couple of characters were `->`. Then, after the first `if`, `state` is non-null iff the next couple of characters *weren’t* `->`. Then, after the second `if`, it is again non-null iff the next couple of characters were `->`. And if it’s non-null, it’s the `state` you started with. So that does the right thing, perhaps a bit verbosely.

### Result Expressions

A result expression gives a JavaScript expression to evaluate to get the value that a sequence parses to. Normally, it uses variable bindings produced by labels.
The value it returns may become the value of the term (if the sequence is inside parentheses) or the value returned by a whole parsing function.

    # in the metacircular compiler-compiler:
    result_expression <- '->'_ result: expr _ -> <<code to handle result expressions>> .

Note the `_` to discard whitespace. Of course, this is conditional on the parser not being in a failed state:

    # in code to handle result expressions:
    ([' if (state) state.val = ', result, ';\n'].join(''))

The expression is delimited by parentheses `()`. The outermost pair of parentheses are kept, which simplifies the grammar and avoids tricky problems of operator precedence when the result expression is copied into the output program in the `state.val =` context above.

    # in the metacircular compiler-compiler:
    expr <- '('_ e: exprcontents ')' -> ('(' + e + ')').
    exprcontents <- c: (!'(' !')' char / expr) e: exprcontents -> (c + e) / -> ('').

`result_expression` discards whitespace after the expression, rather than having the expression production do it itself, in order to preserve whitespace after right parens consumed by recursive calls to the expression production.

### Parenthesized Expressions

Parenthesized expressions don’t need any real special handling; or, rather, the special handling consists of the `stack` variable everything uses to backtrack; the parentheses are only there to direct the parser how to parse `/` and `!` and so on.

    parenthesized <- '('_ body: choice ')'_ -> (body).

### Exporting

We need one more thing if our grammar is to be loadable as a CommonJS module by systems like node.js:

    # in support code:
    + "if (typeof exports !== 'undefined')\n"
    + "  exports.parse_sentence = parse_sentence;\n"

This assumes that the grammar being processed has a production called `sentence`, which is the only thing that will be exported.

## The Whole Metacircular Compiler-Compiler

Here’s the whole thing, extracted from this document:

    # in the output metacircular compiler-compiler:
    sp <- ' ' / '\n' / '\t'.
    _ <- sp _ / .
    rule <- n: name _ '<-'_ body: choice '.'_ ->
        (["function parse_", n, "(input, pos) {\n",
          '  var state = { pos: pos };\n',
          '  var stack = [];\n',
          body,
          '  return state;\n',
          "}\n"].join('')) .
    sentence <- _ r: rule g: sentence -> (r + "\n" + g)
              / _ r: rule -> (r + "\n"
          + 'function parse_char(input, pos) {\n'
          + '  if (pos >= input.length) return null;\n'
          + '  return { pos: pos + 1, val: input.charAt(pos) };\n'
          + '}\n'
          + 'function literal(input, pos, string) {\n'
          + '  if (input.substr(pos, string.length) === string) {\n'
          + '    return { pos: pos + string.length, val: string };\n'
          + '  } else return null;\n'
          + '}\n'
          + "if (typeof exports !== 'undefined')\n"
          + "  exports.parse_sentence = parse_sentence;\n"
          ).
    meta <- '!' / '\'' / '<-' / '/' / '.' / '(' / ')' / ':' / '->'.
    name <- c: namechar n: name -> (c + n) / namechar.
    namechar <- !meta !sp char.
    term <- labeled / nonterminal / string / negation / parenthesized.
    nonterminal <- n: name _ ->
        ([' state = parse_', n, '(input, state.pos);\n'].join('')) .
    labeled <- label: name _ ':'_ value: term ->
        ([value, ' if (state) var ', label, ' = state.val;\n'].join('')) .
    sequence <- foo: term bar: sequence ->
        ([foo, ' if (state) {\n', bar, ' }\n'].join(''))
              / result_expression
              / -> ('').
    string <- '\'' s: stringcontents '\''_ ->
        ([" state = literal(input, state.pos, '", s, "');\n"].join('')) .
    stringcontents <- !'\\' !'\'' c: char s: stringcontents -> (c + s)
                    / b: '\\' c: char s: stringcontents -> (b + c + s)
                    / -> ('').
    choice <- a: sequence '/'_ b: choice ->
        ([' stack.push(state);\n',
          a,
          ' if (!state) {\n',
          '   state = stack.pop();\n',
          b,
          ' } else stack.pop();\n'].join('')) / sequence.
    negation <- '!'_ t: term ->
        ([' stack.push(state);\n',
          t,
          ' if (state) {\n',
          '   stack.pop();\n',
          '   state = null;\n',
          ' } else state = stack.pop();\n'].join('')) .
    result_expression <- '->'_ result: expr _ ->
        ([' if (state) state.val = ', result, ';\n'].join('')) .
    expr <- '('_ e: exprcontents ')' -> ('(' + e + ')').
    exprcontents <- c: (!'(' !')' char / expr) e: exprcontents -> (c + e) / -> ('').
    parenthesized <- '('_ body: choice ')'_ -> (body).

That’s 66 lines of code, constituting a compiler that can compile itself into JavaScript, if you have a way to execute it.

XXX: a couple of lines are over 80 chars; fix this!

## Bootstrapping to JavaScript

But, to actually execute this compiler-compiler, you need a version already running, so you can compile the compiler-compiler to JavaScript.

### Hand-compiling: a blind alley

I started by trying to compile it by hand, using YASnippet, but after not very long, I gave up on that approach. Here are the hand-compiled versions of `sp <- ' ' / '\n' / '\t'.` and `_ <- sp _ / .`:

    # in the hand-compiled metacircular compiler-compiler:
    function parse_sp(input, pos) {
      var state = { pos: pos };
      var stack = [];
      stack.push(state);
      state = literal(input, state.pos, ' ');
      if (!state) {
        state = stack.pop();
        stack.push(state);
        state = literal(input, state.pos, '\n');
        if (!state) {
          state = stack.pop();
          state = literal(input, state.pos, '\t');
        } else {
          stack.pop();
        }
      } else {
        stack.pop();
      }
      return state;
    }

    function parse__(input, pos) {
      var state = { pos: pos };
      var stack = [];
      stack.push(state);
      state = parse_sp(input, state.pos);
      if (state) {
        state = parse__(input, state.pos);
      }
      if (!state) {
        state = stack.pop();
      } else {
        stack.pop();
      }
      return state;
    }

After thus inflating two lines of grammar into 35 lines of JavaScript, I knew I needed a better way. At that rate, the whole thing would be about 1200 lines. That’s too much to debug, even if YASnippet makes it relatively easy to type, unless there’s no easier way.

But there is.

### A Bunch of Functions

So, instead, I’m writing one function for each interesting recognition rule from the grammar, returning the same result expressions that the parsing function will. Then I can construct a sort of abstract syntax tree of the grammar out of calls to these functions, and it will only be a little larger than the grammar itself. For example, the first rule, `sp <- ' ' / '\n' / '\t'.`, will become:

    # in the ASTs made of function calls:
    var sp_rule = rule('sp', choice(string(' '),
                                    choice(string('\\n'), string('\\t'))));

This is a bit of a cheat; the innermost choice really parses as `choice(sequence(string('\\n'), ''), sequence(string('\\t'), ''))`, but I’m hoping that doesn’t matter for now.

Then at the end I can combine all of the variables into a grammar. First I need the functions, though. I’m omitting `sp` (likewise `_` and `meta`) because they don’t produce interesting values.

    # in the bunch-of-functions version:
    function rule(n, body) {
      return (["function parse_", n, "(input, pos) {\n",
               '  var state = { pos: pos };\n',
               '  var stack = [];\n',
               body,
               '  return state;\n',
               "}\n"].join(''));
    }
    function sentence2(r, g) { return (r + "\n" + g); }
    function sentence1(r) { return (r + "\n" <<support code>> ); }

I’m omitting `name` (likewise `expr`, `inner`, `exprcontents`, and `stringcontents`) because it just copies a character string from the input into the output. I can do that myself.
And I’m omitting `term` because it just returns one of its children’s values.

    function nonterminal(n) {
      return [' state = parse_', n, '(input, state.pos);\n'].join('');
    }
    function labeled(label, value) {
      return [value, ' if (state) var ', label, ' = state.val;\n'].join('');
    }
    function sequence(foo, bar) {
      return [foo, ' if (state) {\n', bar, ' }\n'].join('');
    }
    function string(s) {
      return [" state = literal(input, state.pos, '", s, "');\n"].join('');
    }
    function choice(a, b) {
      return [' stack.push(state);\n',
              a,
              ' if (!state) {\n',
              '   state = stack.pop();\n',
              b,
              ' } else {\n',
              '   stack.pop();\n',        // discard unnecessary saved state
              ' }\n'].join('');
    }
    function negation(t) {
      return [' stack.push(state);\n',
              t,
              ' if (state) {\n',
              '   stack.pop();\n',
              '   state = null;\n',
              ' } else {\n',
              '   state = stack.pop();\n',
              ' }\n'].join('');
    }
    function result_expression(result) {
      return [' state.val = ', result, ';\n'].join('');
    }

We’ll also need the support code from the `sentence` rule, except for the exporting of `parse_sentence`:

    function parse_char(input, pos) {
      if (pos >= input.length) return null;
      return { pos: pos + 1, val: input.charAt(pos) };
    }
    function literal(input, pos, string) {
      if (input.substr(pos, string.length) === string) {
        return { pos: pos + string.length, val: string };
      } else return null;
    }

Then, after all those functions are defined, we can call them to build up the ASTs.

    <<the ASTs made of function calls>>

The rule for `_` is quite straightforward:

    # in the ASTs made of function calls:
    var __rule = rule('_', choice(sequence(nonterminal('sp'), nonterminal('_')),
                                  ''));

The rule for `rule` contains a rather long sequence, which will be treated as a deeply nested bunch of two-element sequences. But it’s hard to read and write it that way, so I’m going to define a helper function `nseq` to make a sequence of an arbitrary number of sequence elements.

    function nseq() {
      var rv = arguments[arguments.length-1];
      for (var ii = arguments.length-2; ii >= 0; ii--)
        rv = sequence(arguments[ii], rv);
      return rv;
    }

This will fail (returning `undefined`) if we call it with no arguments, so let’s be sure not to do that. Now we can define the rule for `rule`:

    var rule_rule = rule('rule',
        nseq(labeled('n', nonterminal('name')), nonterminal('_'),
             string('<-'), nonterminal('_'),
             labeled('body', nonterminal('choice')),
             string('.'), nonterminal('_'),
             result_expression(
                 "[\"function parse_\", n, \"(input, pos) {\\n\",\n" +
                 "  '  var state = { pos: pos };\\n',\n" +
                 "  '  var stack = [];\\n',\n" +
                 "  body,\n" +
                 "  '  return state;\\n',\n" +
                 "  \"}\\n\"].join('')")));

`rule_rule` is clearly pretty verbose; it’s 12 lines, and the corresponding `rule` function is 8 lines, for a total of 20 lines for the “hand-compiled” version of the original 7-line `rule` rule. That’s a manageable expansion factor of about 3×.

So, on to `sentence`.
I’ve played fast and loose with leading whitespace here, in order to retain some modicum of readability.

    var sentence_rule = rule('sentence', choice(
        nseq(nonterminal('_'), labeled('r', nonterminal('rule')),
             labeled('g', nonterminal('sentence')),
             result_expression('r + "\\n" + g')),
        nseq(nonterminal('_'), labeled('r', nonterminal('rule')),
             result_expression('r + "\\n"\n' +
                 "+ 'function parse_char(input, pos) {\\n'\n" +
                 "+ '  if (pos >= input.length) return null;\\n'\n" +
                 "+ '  return { pos: pos + 1, val: input.charAt(pos) };\\n'\n" +
                 "+ '}\\n'\n" +
                 "+ 'function literal(input, pos, string) {\\n'\n" +
                 "+ '  if (input.substr(pos, string.length) === string) {\\n'\n" +
                 "+ '    return { pos: pos + string.length, val: string };\\n'\n" +
                 "+ '  } else return null;\\n'\n" +
                 "+ '}\\n'\n" +
                 "+ 'if (typeof exports !== " + '"undefined"' + ") {\\n'\n" +
                 "+ '  exports.parse_sentence = parse_sentence;\\n'\n" +
                 "+ '}\\n'\n"))));

The quoting of the support code is kind of confusing; the original is one long string, containing a bunch of `\n` newlines, broken up into lines for readability, joined by the `+` operator. This version is also one long string, containing the lines of the original long string, also broken up into lines for readability, joined by the `+` operator. So there are two levels of quoting. The inner level has the `+` on the left and uses single quotes `''`, and the outer level has the `+` on the right and uses double quotes `""`.

The next rule is `meta`, and it has a lot of choices. So we define something like `nseq`, but for choices.

    function nchoice() {
      var rv = arguments[arguments.length-1];
      for (var ii = arguments.length-2; ii >= 0; ii--)
        rv = choice(arguments[ii], rv);
      return rv;
    }
    var meta_rule = rule('meta', nchoice(string('!'), string('\\\''),
                                         string('<-'), string('/'), string('.'),
                                         string('('), string(')'), string(':'),
                                         string('->')));

The next few rules are straightforward translations from the grammar.

    var name_rule = rule('name',
        choice(nseq(labeled('c', nonterminal('namechar')),
                    labeled('n', nonterminal('name')),
                    result_expression('c + n')),
               nonterminal('namechar')));
    var namechar_rule = rule('namechar',
        nseq(negation(nonterminal('meta')), negation(nonterminal('sp')),
             nonterminal('char')));
    var term_rule = rule('term',
        nchoice(nonterminal('labeled'), nonterminal('nonterminal'),
                nonterminal('string'), nonterminal('negation'),
                nonterminal('parenthesized')));
    var nonterminal_rule = rule('nonterminal',
        nseq(labeled('n', nonterminal('name')), nonterminal('_'),
             result_expression("[' state = parse_', n, " +
                               "'(input, state.pos);\\n'].join('')")));
    var labeled_rule = rule('labeled',
        nseq(labeled('label', nonterminal('name')), nonterminal('_'),
             string(':'), nonterminal('_'),
             labeled('value', nonterminal('term')),
             result_expression("[value, ' if (state) var ', " +
                               "label, ' = state.val;\\n'].join('')")));
    var sequence_rule = rule('sequence',
        nchoice(nseq(labeled('foo', nonterminal('term')),
                     labeled('bar', nonterminal('sequence')),
                     result_expression("[foo, ' if (state) {\\n', " +
                                       "bar, ' }\\n'].join('')")),
                nonterminal('result_expression'),
                sequence(result_expression("''"))));

That’s 29 lines, transliterating 12 lines from the grammar, and now the transliteration is halfway done.

    var string_rule = rule('string',
        nseq(string("\\'"), labeled('s', nonterminal('stringcontents')),
             string("\\'"), nonterminal('_'),
             result_expression('[" state = literal(input, state.pos, ' +
                               '\'", s, "\');\\n"].join(\'\')')));
    var stringcontents_rule = rule('stringcontents',
        nchoice(nseq(negation(string("\\\\")), negation(string("\\'")),
                     labeled('c', nonterminal('char')),
                     labeled('s', nonterminal('stringcontents')),
                     result_expression('c + s')),
                nseq(labeled('b', string("\\\\")),
                     labeled('c', nonterminal('char')),
                     labeled('s', nonterminal('stringcontents')),
                     result_expression('b + c + s')),
                result_expression("''")));

For `choice` I’m omitting not only whitespace but also a comment.

    var choice_rule = rule('choice',
        choice(nseq(labeled('a', nonterminal('sequence')),
                    string('/'), nonterminal('_'),
                    labeled('b', nonterminal('choice')),
                    result_expression(
                        "[' stack.push(state);\\n',\n" +
                        "  a,\n" +
                        "  ' if (!state) {\\n',\n" +
                        "  '   state = stack.pop();\\n',\n" +
                        "  b,\n" +
                        "  ' } else {\\n',\n" +
                        "  '   stack.pop();\\n',\n" +
                        "  ' }\\n'].join('')")),
               nonterminal('sequence')));
    var negation_rule = rule('negation',
        nseq(string('!'), nonterminal('_'), labeled('t', nonterminal('term')),
             result_expression(
                 "[' stack.push(state);\\n',\n" +
                 "  t,\n" +
                 "  ' if (state) {\\n',\n" +
                 "  '   stack.pop();\\n',\n" +
                 "  '   state = null;\\n',\n" +
                 "  ' } else {\\n',\n" +
                 "  '   state = stack.pop();\\n',\n" +
                 "  ' }\\n'].join('')")));
    var result_expression_rule = rule('result_expression',
        nseq(string('->'), nonterminal('_'),
             labeled('result', nonterminal('expr')),
             result_expression("[' if (state) state.val = ', " +
                               "result, ';\\n'].join('')")));
    var expr_rule = rule('expr',
        nseq(string('('), nonterminal('_'),
             labeled('e', nonterminal('exprcontents')),
             string(')'), nonterminal('_'), result_expression('e')));
    var inner_rule = rule('inner',
        nseq(string('('), nonterminal('_'),
             labeled('e', nonterminal('exprcontents')),
             string(')'), result_expression("'(' + e + ')'")));
    var exprcontents_rule = rule('exprcontents',
        choice(nseq(labeled('c', choice(nseq(negation(string('(')),
                                             negation(string(')')),
                                             nonterminal('char')),
                                        nonterminal('inner'))),
                    labeled('e', nonterminal('exprcontents')),
                    result_expression('c + e')),
               result_expression("''")));
    var parenthesized_rule = rule('parenthesized',
        nseq(string('('), nonterminal('_'),
             labeled('body', nonterminal('choice')),
             string(')'), nonterminal('_'), result_expression('body')));

So that’s all the rules. Now we just need to assemble them into a sentence, using a technique similar to `nseq` and `nchoice`.

    function nsentence() {
      var rv = sentence1(arguments[arguments.length-1]);
      for (var ii = arguments.length-2; ii >= 0; ii--)
        rv = sentence2(arguments[ii], rv);
      return rv;
    }
    var all_rules = nsentence(sp_rule, __rule, rule_rule, sentence_rule,
                              meta_rule, name_rule, namechar_rule, term_rule,
                              nonterminal_rule, labeled_rule, sequence_rule,
                              string_rule, stringcontents_rule, choice_rule,
                              negation_rule, result_expression_rule, expr_rule,
                              inner_rule, exprcontents_rule,
                              parenthesized_rule);

Now the variable `all_rules` has a working parser in it, in JavaScript. To get a usable `parse_sentence` function, we need to `eval` that script:

    eval(all_rules);

And then we can export the function:

    if (typeof exports !== 'undefined')
      exports.parse_sentence = parse_sentence;

## The Output Parser in JavaScript

I used to include here the contents of `all_rules` after a couple of iterations. It’s ten pages long (660 lines), and the compile takes about 3–5 seconds on my machine, although it’s under 100ms on modern computers. However, I decided that it was too much to want to include it here; this document is for reading. If you `git clone` it, it’s in output.js.
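(So the whole round trip under node.js might look like this sketch; the file names here are hypothetical, but `parse_sentence` and the `{val, pos}` protocol are as defined above:)

    // compile-grammar.js: a sketch of driving the bootstrapped compiler.
    var fs = require('fs');
    var peg = require('./output.js');         // exports parse_sentence

    var grammar = fs.readFileSync('arith.peg', 'utf8');  // hypothetical grammar
    var result = peg.parse_sentence(grammar, 0);
    if (!result) throw new Error('grammar failed to parse');
    fs.writeFileSync('arith.js', result.val); // result.val is the compiled parser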
## Cross-Compiling to Lua

It was a lot of trouble getting the short compiler-compiler above to an actually runnable state; I had to write and debug, basically, two copies of the same code. It would have been much easier if I’d already happened to have such a compiler-compiler around that I could use to compile my grammar with.

Well, for the program I’m using to extract the code from this document, which I call HandAxeWeb, I would like to have such a compiler-compiler to generate code in Lua. So I’m going to define a “version 2” of the compiler-compiler which, instead of generating JS code, generates Lua code. (It is still written in JS, though.)

First, instead of producing JS functions for rules, we produce Lua functions for rules:

    # in code to produce a function v2:
    (['function parse_', n, '(input, pos)\n',
      <<function prologue v2>>
      body,
      <<function epilogue v2>>
      'end\n'].join(''))

Invoking nonterminals needs no change; JS and Lua syntax overlap here. But local variable declaration and finite maps look different:

    # in function prologue v2:
    '  local state = { pos = pos }\n',

We have to declare variables outside their conditional; Lua’s scoping rules here change the semantics somewhat, because unless you declare the variables at the top of the function, you can’t write a rule like `x <- (bar y: foo / baz y: quux) -> (y)` and have it work: the inner `y` variables are declared in an inner block in Lua, while in JS they automatically belong to the whole function.

    # in code to save a value in a variable v2:
    ([value, ' local ', label, '\n',
      ' if state then ', label, ' = state.val end\n'].join(''))

The `parse_char` and `literal` functions are a bit different; remember, Lua numbers character positions in strings from 1, and the second argument to its `string.sub` is not a length but an ending index:

    # in support code v2:
    + 'function parse_char(input, pos)\n'
    + '  if pos > #input then return nil end\n'
    + '  return { pos = pos + 1,\n'
    + '           val = string.sub(input, pos, pos) }\n'
    + 'end\n'
    + 'function literal(input, pos, needle)\n'
    + '  if string.sub(input, pos, pos + #needle - 1)\n'
    + '     == needle then\n'
    + '    return { pos = pos + #needle, val = needle }\n'
    + '  else return nil end\n'
    + 'end\n'

The code to invoke `literal` doesn’t actually need to change. Sequence-handling differs only in minor bits of syntax:

    # in code to handle a sequence v2:
    ([foo, ' if state then\n', bar, ' end\n'].join(''))

Initializing the stack is a little different:

    # in function prologue v2:
    '  local stack = {}\n',

Ordered choice looks quite similar to JS:

    # in code to handle a choice v2:
    ([' table.insert(stack, state)\n',
      a,
      ' if not state then\n',
      '   state = table.remove(stack)\n',
      b,
      ' else\n',
      '   table.remove(stack)\n',
      ' end\n'].join(''))

Negation too:

    # in code to handle negation v2:
    ([' table.insert(stack, state)\n',
      t,
      ' if state then\n',
      '   table.remove(stack)\n',
      '   state = nil\n',
      ' else\n',
      '   state = table.remove(stack)\n',
      ' end\n'].join(''))

Result expressions too:

    # in code to handle result expressions v2:
    ([' if state then state.val = ', result, ' end\n'].join(''))

And that is sufficient to be able to generate compilers in Lua from grammars whose result expressions are in Lua. Unfortunately, it’s still not good enough to generate a metacircular compiler-compiler in Lua from the grammar given here, because that grammar is written in JS, even though it generates Lua code.

It would be relatively straightforward to make the modification needed to the grammar quite minor: all the result expressions merely concatenate a bunch of strings, and if they did so by calling a function, you’d only need to redefine that function in the two target languages; in JS, something like `Array.prototype.slice.apply(arguments).join('')`, and in Lua, something like `table.concat({...})`.
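(For concreteness, the JS half of that substitution might look like this sketch; `cat` is a name I’ve made up, and the Lua twin would be `function cat(...) return table.concat({...}) end`:)

    // Hypothetical helper so result expressions can say cat(a, b, c)
    // instead of a + b + c, making them language-neutral.
    function cat() {
      return Array.prototype.slice.apply(arguments).join('');
    }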
But this is sort of unnecessary. Really, we just need to be able to compile our parsers using node.js.

## TODO

- memoization
- performance measurement: it takes minimally 252ms to compile itself on my netbook, wallclock, under whatever version of Node I’m using. That’s pretty pessimal; it’s about 11 or 12 kilobytes per second, close to a hundred thousand clock cycles per byte. Follow sets may offer a way to improve that, probably by an order of magnitude.
- re-add repetition `+` and `*` (in a later version)
- factor out loopbody? like, `loopbody <- term: body -> (loop body code). zero_or_more <- loopbody: body -> (body). one_or_more <- loopbody: body -> (body + 'if ...').`
- how about removing `()` grouping? It leaves “a PEG describing results” (and “the metacircular compiler-compiler”) one line shorter and one line longer, but perhaps it could simplify backtracking by eliminating the explicit stack? Because then each parsing function would only need to contain one level of backtracking for `/` and one for `!` — oh, well, hmm, `!` might be tricky if we want to support positive lookahead too. Probably better to leave the stack in.
- Rewrite the Lua handaxeweb to use a PEG parser.
- maybe: rewrite the Lua handaxeweb to be written in JS with Node? The whole Lua story (“in a later version, this program switched to generating Lua grammars and lost the ability to compile itself”) kind of stinks. And writing 39 to 48 lines of code to “port” a 66-line program also seems kind of silly, like it may not justify the abstraction overhead that permits it.
- maybe: reorganize this document, putting bootstrap.js first? Not sure.
- maybe: write a Markdown parser?
- move Makefile and pegcompile.js into this document?

## Profiling results

XXX these are rough notes that should be cleaned up

I profiled this thing compiling itself in Arora. It contains 2939 characters, but makes 32370 calls to `literal`, which is about 25% of its CPU time, I think (the profile output is a little hard to interpret; some of the numbers are over 100%, probably due to recursion, and 85% is attributed merely to “program”). `parse_meta` takes more than a third of the CPU time, largely by virtue of calling `literal` several times. It also makes 41903 calls each to `push` and `pop`. That means it’s testing about 11 literals per character, and backtracking about 14 times per character. I could be wrong, but I don’t think much of this would be improved by memoizing; computing follow sets is likely to make a bigger difference, by avoiding the majority of that backtracking.

`parse_char` is called 4219 times, mostly from `parse_exprcontents`. Building up the output tree with string `join` takes only about 0.6% of its time.

I suspect that current WebKit has a much better profiler. Firebug agrees on most things (23.87% in `literal`), but it has the interesting result that actually 17% of the time is in `compile`, which was called only once and does little more than call `eval`. So apparently the time to generate the output JS was only about 4× the time needed for SpiderMonkey to compile it!

## Other Interesting PEGs

Here’s some nifty stuff you can do with the one-page parser generator described above.

### CSV files

Ierusalimschy gives this grammar for parsing Excel-style CSV files:

    # in the LPEG notation with captures:
    record <- (field (',' field)*) -> {} (%nl / !.)
    field <- escaped / nonescaped
    nonescaped <- { [^,"%nl]* }
    escaped <- '"' {~ ([^"] / '""' -> '"')* ~} '"'

The `{}` capture pieces of text, and `{~ ~}` capture and replace them. `*` is for repetition, `%nl` is `'\n'`, `""` quotes are equivalent to `''`, `.` is our `char`, `[abc]` is a character class equivalent to `( 'a' / 'b' / 'c' )`, and `-> {}` means “make a list of the results”.
In the notation I’ve used for PEGs here, without repetition features, this looks like this:

    # in csv.peg:
    sentence <- d: (f: field ',' r: sentence -> ([f].concat(r))
                    / f: field -> ([f]))
                ('\n' / !char) -> (d).
    field <- escaped / nonescaped.
    normal_char <- !',' !'"' !'\n' char.
    nonescaped <- c: normal_char s: nonescaped -> (c + s) / normal_char.
    escaped_inner_char <- !'"' char / '""' -> ('"').
    escaped_inner <- c: escaped_inner_char s: escaped_inner -> (c + s)
                   / escaped_inner_char.
    escaped <- '"' s: escaped_inner '"' -> (s).

That’s 2½ times as big, which is unreasonable. If we have `*` repetition that makes JavaScript Arrays, we can write it with only a bit more ugliness than in LPEG:

    # in csvstar.peg:
    sentence <- h: field t: (',' field)* ('\n' / !char) -> ([h].concat(t)).
    field <- escaped / nonescaped.
    nonescaped <- s: (!',' !'"' !'\n' char)* -> (s.join('')).
    escaped <- '"' s: (!'"' char / '""' -> ('"'))* '"' -> (s.join('')).
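(As a sketch of usage: if you compile csv.peg with the compiler-compiler above and eval the output — which defines a new `parse_sentence`, this time for CSV records — a record parses like this:)

    var csv = parse_sentence('a,"b,1",c\n', 0);
    // csv.val is ['a', 'b,1', 'c'], and csv.pos is 10.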
### ichbins

Darius Bacon’s ichbins is an inspiring small Lisp compiler; it can compile itself to C with full run-time type-checking, even though it’s only a bit over six pages of code. Its recursive-descent parser is a model of clarity, as recursive-descent parsers go:

    # in the parser in ichbins.scm:
    (define (read) (read-dispatch (skip-blanks (read-char))))
    (define (skip-blanks c)
      (cond ((memq? c whitespace-chars) (skip-blanks (read-char)))
            ('t c)))
    (define whitespace-chars (cons linefeed " "))
    (define non-symbol-chars "\"\\(')")
    (define eof-object '("eof"))
    (define (read-dispatch c)
      (cond ((eq? c 'f) eof-object)
            ((eq? c \\) (read-char-literal (read-char)))
            ((eq? c \") (read-string (read-char)))
            ((eq? c \() (read-list))
            ((eq? c \') (cons 'quote (cons (read) '())))
            ((eq? c \)) (error "Unbalanced parentheses"))
            ('t (intern (cons c (read-symbol (peek-char)))))))
    (define (read-char-literal c)
      (cond ((eq? c 'f) (error "EOF in character literal"))
            ('t c)))
    (define (read-string c)
      (cond ((eq? c 'f) (error "Unterminated string literal"))
            ((eq? c \") '())
            ((eq? c \\) (cons (read-char) (read-string (read-char))))
            ('t (cons c (read-string (read-char))))))
    (define (read-symbol c)
      (cond ((memq? c whitespace-chars) '())
            ((memq? c non-symbol-chars) '())
            ('t (read-char) (cons c (read-symbol (peek-char))))))
    (define (read-list) (read-list-dispatch (skip-blanks (read-char))))
    (define (read-list-dispatch c)
      (cond ((eq? c 'f) (error "Unterminated list"))
            ((eq? c \)) '())
            ('t (cons (read-dispatch c) (read-list)))))

But with a language suited for parsing, we can do better. Here’s a PEG simply describing the same grammar as the above:

    # in ichbins.peg:
    whitespace <- '\n' / ' ' / '\t'.
    _ <- whitespace _ / .
    non-symbol <- '"' / '\\' / '(' / '\'' / ')'.
    sentence <- _ sexp.
    sexp <- '\\' char / '"' string / '(' list / '\'' read / symbol.
    string <- '"' / (!'\\' char / '\\' char) string.
    symbol <- !whitespace !non-symbol char / .
    list <- ')' / read list.

Instead of 33 lines of code, we have 8. Note that I’ve followed the kind of weird structure of the original parser: the closing parenthesis is considered part of the list contents, and the closing quote is considered part of the string contents. This simplifies the grammar slightly, and eliminates nearly all non-tail calls (except inside of `list`, and to `_`, and in distinguishing character categories), but I think it makes it a little less clear.

In 16 lines, we can get a real parser that returns a parse of the code, in this case as a JSON string:

    # in ichbins-parser.peg:
    sentence <- _ s: sexp -> (JSON.stringify(s, null, 4)).
    sexp <- '('_ list / '"' string
          / s: symbol -> ({symbol: s})
          / '\''_ s: sexp -> ([{symbol: 'quote'}, s])
          / '\\' c: char _ -> ({char: c}).
    list <- ')'_ -> ([]) / a: sexp b: list -> ([a].concat(b)).
    string <- '"'_ -> ('')
            / a: (!'\\' char / '\\' b: char -> ('\\' + b)) t: string -> (a + t).
    symbol <- a: symchar b: symtail -> (a + b).
    symtail <- symbol / _ -> ('').
    _ <- whitespace _ / .
    whitespace <- '\n' / ' ' / '\t'.
    symchar <- !( whitespace / '"' / '\\' / '(' / '\'' / ')' ) char.

## Thanks

Thanks to D. Val Schorre for inventing META-II, of which this is a refinement, in 1964 or a bit before; to Bob M. McClure for inventing TMG, the TransMoGrifier, also in 1964, and to Doug McIlroy for maintaining it afterwards, which not only carried META-II forward, but also helped Thompson write B, which became C; to Romuald Ireneus ’Scibor-Marchocki, who apparently ported TMG to TMGL; to Bryan Ford for resurrecting TMG’s parsing schema and enhancing it into the form of parsing expression grammars, in 2002; to Alan Kay for bringing META-II back to public attention; and to Alessandro Warth and Yoshiki Ohshima for developing OMeta and showing that PEGs can be extended to a wide variety of non-parsing tasks.

To [Aristotle Pagaltzis](http://plasmasturm.org/) for innumerable improvements to the readability and correctness of this document.

To Andy Isaacson, Allan Schiffman, and [Chris Hibbert](http://pancrit.org/) for further suggestions for the readability and content of this document.