Remote-Url: https://bakpakin.com/writing/how-janets-peg-works.html Retrieved-at: 2022-01-29 19:28:36.168030+00:00 In making Janet, I was inspired by theREBOLParse module andLPegto include a versatile library for parsing text and any sequence of bytes based on Parsing Expression Grammars, or PEGs. I was also willing to exclude other forms of parsing tools such as regular expressions from the core library, as PEGs are more general than regular expression, or easier to read, and can capture error locations when matching text. The development of PEGs in Janet went particularly smoothly, made possible by the simple PEG formalism, Janet's basic data structures, and a very straight-forward subroutine threaded interpreter. While much of Janet's PEG module could be implemented in any language, Janet's C API and rich set of core data structures made it easy to implement PEGs in Janet.This post is about how the Janet PEG engine is implemented; it is not a guide on using pegs. Although the code is written inJanet, It can be translated directly to most programming languages as we make no special use of any data structures besides arrays and hashtables. For information about using Janet's PEG module, seethe Janet PEG documentation.A Quick PEG OverviewThe Parsing Expression Grammar formalism has it's roots in a2004 Paper by Bryan Ford, who presents both the model and an algorithm for recognizing any PEG grammar in linear time. Our PEG engine will not parser all patterns in linear time, but will use similar semantics for defining our grammar. One difference between PEGs and more traditional parser generators, such as those used by YACC and Bison, is the ordered choice operator. For the purposes of this post,+is the ordered choice operator. For example, using a pseudo-syntax for PEG grammars where theMAINrule is the entry point of the grammar, we can write a grammar that matches any text that starts with a, b, or c.MAIN: "a" + "b" + "c"This grammar will match "apple pie", "banana split", and "coconut cluster", but not "orange soda", "watermelon taffy", or "lemon snap".It's also easy to write redundant grammars with the ordered choice operator. The following grammar will match on the first "a" only, and never "ab" or "abc". Anything that starts with "ab" or "abc" also starts with "a", so they could be removed from the grammar without changing its behavior.MAIN: "a" + "ab" + "abc"The ordered property of the+operator means that all PEG grammars are deterministic - if they match a string, they will only match it one way. This also means there is a direct correspondence between a PEG and a parser. This is very convenient when writing PEGs, as your specification and parser can often be one and the same. Traditional parser generators are non-deterministic, and additional rules are needed to resolve ambiguity.To complete our PEG model, we only need a few more operators. Below are all operators and features needed for the PEG formalism to be as general as possible."..." - string literal + - choice operator * - sequence operator (implicit in our pseudo-syntax) ! - Boolean inversionThe above features, along with recursion introduced by a top level grammar, can express all PEGs. This is a trimmed down version of the what's presented in the original paper, but other operators can be substituted for combinations of the above operators. The other very important feature needed for matching context free grammars is recursion, or otherwise our model would be limited to matching regular expressions.Using this formalism, we can write a simple grammar that matches dates in ISO 8601 format. Such a date looks like2019-06-10A naive grammar that matches dates (but does not validate dates - 9999-99-99 would match) is presented below. This also requires years to be specified with 4 digits.DIGIT: "0" + "1" + "2" + "3" + "4" + "5" + "6" + "7" + "8" + "9" YEAR: DIGIT DIGIT DIGIT DIGIT MONTH: DIGIT DIGIT DAY: DIGIT DIGIT MAIN: YEAR "-" MONTH "-" DAYThis looks very similar to a context free grammar, and in simple cases, the grammar is identical.A simple PEG engine in codeUsing our PEG definition above, we can write a simplematch-pegfunction that takes a rule and a string to match against and tests if string matches the rule. If there is a match, the number of bytes matched will be returned to help us find where to begin the next match. Otherwise, we will return nil, or some token to indicate that there is no match. This code will be written in Janet, but could be very easily written in any language. We are using Janet's match macro here, which does pattern matching on arguments.(defnmatch-peg[pegtext] (matchpeg@[:!x] (unless(match-pegxtext)0) @[:+xy] (or(match-pegxtext) (match-pegytext)) @[:*xy] (if-let[lenx(match-pegxtext)] (if-let[leny(match-pegy(string/slicetextlenx))] (+lenxleny)))(and(string/has-prefix?pegtext) (lengthpeg))))This is a very simple but effective peg implementation. It does not support captures, various other operators, but captures the basics of PEGs in about 10 lines of code, and can even be used for recursive grammars.(defdigits[:+"0"[:+"1"[:+"2"[:+"3"[:+"4"[:+"5"[:+"6"[:+"7"[:+"8""9"]]]]]]]]]) (defyear[:*digits[:*digits[:*digitsdigits]]]) (defmonth[:*digitsdigits]) (defdaymonth) (defiso-date[:*year[:*"-":*month[:*"-"day]]]) (match-pegiso-date"2019-06-10")(match-pegiso-date"201-06-10")This implementation is very similar to atree walk interpreter, and is the simplest way to implement a PEG. Janet's peg implementation does something very similar to this, although the syntax tree is first compiled to a bytecode form and validate for better speed.Making Operators VariadicAlthough not strictly necessary, we would like operators to look and behave more like lisp operators that are variadic. This will make long sequences or chains of choices easier to write.(defnmatch-peg[pegtext] (matchpeg@[:!x] (unless(match-pegxtext)0) @[:+] (some(fn[x] (match-pegxtext)) (tuple/slicepeg1)) @[:*] (do(varlen0) (varsubtexttext) (varoktrue) (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtext)_(setoklenx)]:whileok] (setsubtext(string/slicesubtextlenx)) (+=lenlenx)) (ifoklen))(and(string/has-prefix?pegtext) (lengthpeg))))Our ISO-date grammar from can be modified to look a bit more natural.(defdigits[:+"0""1""2""3""4""5""6""7""8""9"]) (defyear[:*digitsdigitsdigitsdigits]) (defmonth[:*digitsdigits]) (defdaymonth) (defiso-date[:*year"-"month"-"day]) (match-pegiso-date"2019-06-10")(match-pegiso-date"201-06-10")Adding RecursionEarlier I stated that recursion is important to make PEGs useful, and although we can do recursion in ad hoc manner, we can make our recursive grammars look more like our original grammar using Janet tables. We can create a table mapping keywords to PEG expressions, and ifpegis a keyword, we lookup its definition in the grammar table. We can add an extra argument tomatch-pegthat will be our grammar table.(defnmatch-peg[pegtextgrammar] (matchpeg@[:!x] (unless(match-pegxtextgrammar)0) @[:+] (some(fn[x] (match-pegxtextgrammar)) (tuple/slicepeg1)) @[:*] (do(varlen0) (varsubtexttext) (varoktrue) (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtextgrammar)_(setoklenx)]:whileok] (setsubtext(string/slicesubtextlenx)) (+=lenlenx)) (ifoklen)) (x(keyword?x)) (match-peg(grammarx)textgrammar)(and(string/has-prefix?pegtext) (lengthpeg))))With our grammar table, we can define our grammar in a single structure.(defgrammar{:digit[:+"0""1""2""3""4""5""6""7""8""9"]:year[:*:digit:digit:digit:digit]:month[:*:digit:digit]:day[:*:digit:digit]:main[:*:year"-":month"-":day]}) (match-peg(grammar:main)"2019-06-10"grammar)Adding RulesSo far, ourmatch-pegfunction has not been very expressive. Where are our character classes, optional matches, and other fancy features from regular expressions? While we could simply add terms to our grammar for each character class (:d for digits, :w for alphanumerics), there are other features we would like as as well, such as the Kleene star, character sets, etc. Fortunately, it is very easy to add functionality to tree-walk interpreters. We can add as many operators as we like. For example, we can add the:setoperator that will match 1 character if it occurs in a set.(defnmatch-peg[pegtextgrammar] (matchpeg@[:setchars] (if(string/check-setcharstext)1) @[:!x] (unless(match-pegxtextgrammar)0) @[:+] (some(fn[x] (match-pegxtextgrammar)) (tuple/slicepeg1)) @[:*] (do(varlen0) (varsubtexttext) (varoktrue) (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtextgrammar)_(setoklenx)]:whileok] (setsubtext(string/slicesubtextlenx)) (+=lenlenx)) (ifoklen)) (x(keyword?x)) (match-peg(grammarx)textgrammar)(and(string/has-prefix?pegtext) (lengthpeg))))Our date grammar will then become clearer as well.(defgrammar{:digit[:set"0123456789"]:year[:*:digit:digit:digit:digit]:month[:*:digit:digit]:day[:*:digit:digit]:main[:*:year"-":month"-":day]}) (match-peg(grammar:main)"2019-06-10"grammar)Most PEG libraries have several more rules to make it easier to write useful grammars, but all rules can be written pretty much the same way, as cases in our top level match expression. Our PEG engine is also missing captures, is rather slow, does not do tail-call optimization, and lacks many features that are present in other PEG engines such as those in Janet or REBOL. But for all its issues,match-pegis a good start to a general purpose pattern matching function.