Remote-Url: https://bakpakin.com/writing/how-janets-peg-works.html
Retrieved-at: 2022-01-29 19:28:36.168030+00:00

In making Janet, I was inspired by theREBOLParse module
andLPegto include a
versatile library for parsing text and any sequence of bytes based on Parsing
Expression Grammars, or PEGs. I was also willing to exclude other forms of
parsing tools such as regular expressions from the core library, as PEGs are
more general than regular expression, or easier to read, and can capture error
locations when matching text. The development of PEGs in Janet went
particularly smoothly, made possible by the simple PEG formalism, Janet's basic
data structures, and a very straight-forward subroutine threaded interpreter.
While much of Janet's PEG module could be implemented in any language, Janet's
C API and rich set of core data structures made it easy to implement PEGs in
Janet.This post is about how the Janet PEG engine is implemented; it is not a guide on using pegs.
Although the code is written inJanet, It can
be translated directly to most programming languages as we make no special use of any
data structures besides arrays and hashtables. For information about using Janet's PEG module, seethe Janet PEG documentation.A Quick PEG OverviewThe Parsing Expression Grammar formalism has it's roots in a2004 Paper by Bryan Ford, who
presents both the model and an algorithm for recognizing any PEG grammar in
linear time. Our PEG engine will not parser all patterns in linear time, but
will use similar semantics for defining our grammar. One difference between
PEGs and more traditional parser generators, such as those used by YACC and
Bison, is the ordered choice operator. For the purposes of this post,+is the ordered choice operator. For example, using a pseudo-syntax for PEG
grammars where theMAINrule is the entry point of the grammar, we can
write a grammar that matches any text that starts with a, b, or c.MAIN: "a" + "b" + "c"This grammar will match "apple pie", "banana split", and "coconut cluster", but
not "orange soda", "watermelon taffy", or "lemon snap".It's also easy to write redundant grammars with the ordered choice operator.
The following grammar will match on the first "a" only, and never "ab" or "abc".
Anything that starts with "ab" or "abc" also starts with "a", so they could be removed
from the grammar without changing its behavior.MAIN: "a" + "ab" + "abc"The ordered property of the+operator means that all PEG grammars are
deterministic - if they match a string, they will only match it one way. This
also means there is a direct correspondence between a PEG and a 
parser. This is very convenient when writing PEGs, as your specification
and parser can often be one and the same.
Traditional parser generators are non-deterministic, and additional
rules are needed to resolve ambiguity.To complete our PEG model, we only need a few more operators. Below are all operators and
features needed for the PEG formalism to be as general as possible."..." - string literal
+ - choice operator
* - sequence operator (implicit in our pseudo-syntax)
! - Boolean inversionThe above features, along with recursion introduced by a top level grammar, can express all PEGs. This
is a trimmed down version of the what's presented in the original paper, but other operators can be substituted
for combinations of the above operators. The other very important feature needed for matching context free grammars is
recursion, or otherwise our model would be limited to matching regular expressions.Using this formalism, we can write a simple grammar that matches dates in ISO 8601 format. Such a date looks like2019-06-10A naive grammar that matches dates (but does not validate dates - 9999-99-99 would match) is presented below. This also requires years to be specified with 4 digits.DIGIT: "0" + "1" + "2" + "3" + "4" + "5" + "6" + "7" + "8" + "9"
YEAR: DIGIT DIGIT DIGIT DIGIT
MONTH: DIGIT DIGIT
DAY: DIGIT DIGIT
MAIN: YEAR "-" MONTH "-" DAYThis looks very similar to a context free grammar, and in simple cases, the grammar is identical.A simple PEG engine in codeUsing our PEG definition above, we can write a simplematch-pegfunction that takes a rule and a string to match against
and tests if string matches the rule. If there is a match, the number of bytes matched will
be returned to help us find where to begin the next match. Otherwise, we will
return nil, or some token to indicate that there is no match. This code will be
written in Janet, but could be very easily written in any language. We are using
Janet's match macro here, which does pattern matching on arguments.(defnmatch-peg[pegtext]
 (matchpeg@[:!x] (unless(match-pegxtext)0)
   @[:+xy] (or(match-pegxtext) (match-pegytext))
   @[:*xy] (if-let[lenx(match-pegxtext)]
                (if-let[leny(match-pegy(string/slicetextlenx))]
                 (+lenxleny)))(and(string/has-prefix?pegtext) (lengthpeg))))This is a very simple but effective peg implementation. It does not support
captures, various other operators, but captures the basics of
PEGs in about 10 lines of code, and can even be used for recursive grammars.(defdigits[:+"0"[:+"1"[:+"2"[:+"3"[:+"4"[:+"5"[:+"6"[:+"7"[:+"8""9"]]]]]]]]])
(defyear[:*digits[:*digits[:*digitsdigits]]])
(defmonth[:*digitsdigits])
(defdaymonth)
(defiso-date[:*year[:*"-":*month[:*"-"day]]])

(match-pegiso-date"2019-06-10")(match-pegiso-date"201-06-10")This implementation is very similar to atree
walk interpreter, and is the simplest way to implement a PEG. Janet's peg
implementation does something very similar to this, although the syntax tree is
first compiled to a bytecode form and validate for better speed.Making Operators VariadicAlthough not strictly necessary, we would like operators to look and behave more
like lisp operators that are variadic. This will make long sequences or chains
of choices easier to write.(defnmatch-peg[pegtext]
 (matchpeg@[:!x] (unless(match-pegxtext)0)
   @[:+] (some(fn[x] (match-pegxtext)) (tuple/slicepeg1))
   @[:*] (do(varlen0)
       (varsubtexttext)
       (varoktrue)
       (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtext)_(setoklenx)]:whileok]
          (setsubtext(string/slicesubtextlenx))
          (+=lenlenx))
       (ifoklen))(and(string/has-prefix?pegtext) (lengthpeg))))Our ISO-date grammar from can be modified to look a bit more natural.(defdigits[:+"0""1""2""3""4""5""6""7""8""9"])
(defyear[:*digitsdigitsdigitsdigits])
(defmonth[:*digitsdigits])
(defdaymonth)
(defiso-date[:*year"-"month"-"day])

(match-pegiso-date"2019-06-10")(match-pegiso-date"201-06-10")Adding RecursionEarlier I stated that recursion is important to make PEGs useful, and although
we can do recursion in ad hoc manner, we can make our recursive grammars look
more like our original grammar using Janet tables. We can create a table
mapping keywords to PEG expressions, and ifpegis a keyword, we lookup
its definition in the grammar table. We can add an extra argument tomatch-pegthat will be our grammar table.(defnmatch-peg[pegtextgrammar]
 (matchpeg@[:!x] (unless(match-pegxtextgrammar)0)
   @[:+] (some(fn[x] (match-pegxtextgrammar)) (tuple/slicepeg1))
   @[:*] (do(varlen0)
       (varsubtexttext)
       (varoktrue)
       (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtextgrammar)_(setoklenx)]:whileok]
          (setsubtext(string/slicesubtextlenx))
          (+=lenlenx))
       (ifoklen))
   (x(keyword?x)) (match-peg(grammarx)textgrammar)(and(string/has-prefix?pegtext) (lengthpeg))))With our grammar table, we can define our grammar in a single structure.(defgrammar{:digit[:+"0""1""2""3""4""5""6""7""8""9"]:year[:*:digit:digit:digit:digit]:month[:*:digit:digit]:day[:*:digit:digit]:main[:*:year"-":month"-":day]})

(match-peg(grammar:main)"2019-06-10"grammar)Adding RulesSo far, ourmatch-pegfunction has not been very expressive. Where
are our character classes, optional matches, and other fancy features from regular
expressions?
While we could simply add terms to our grammar for each character class
(:d for digits, :w for alphanumerics), there are other features we would like as
as well, such as the Kleene star, character sets, etc.
Fortunately, it is very easy to add functionality to tree-walk interpreters.
We can add as many operators as we like. For example, we can add the:setoperator
that will match 1 character if it occurs in a set.(defnmatch-peg[pegtextgrammar]
 (matchpeg@[:setchars] (if(string/check-setcharstext)1)
   @[:!x] (unless(match-pegxtextgrammar)0)
   @[:+] (some(fn[x] (match-pegxtextgrammar)) (tuple/slicepeg1))
   @[:*] (do(varlen0)
       (varsubtexttext)
       (varoktrue)
       (loop[x:in(tuple/slicepeg1):let[lenx(match-pegxsubtextgrammar)_(setoklenx)]:whileok]
          (setsubtext(string/slicesubtextlenx))
          (+=lenlenx))
       (ifoklen))
   (x(keyword?x)) (match-peg(grammarx)textgrammar)(and(string/has-prefix?pegtext) (lengthpeg))))Our date grammar will then become clearer as well.(defgrammar{:digit[:set"0123456789"]:year[:*:digit:digit:digit:digit]:month[:*:digit:digit]:day[:*:digit:digit]:main[:*:year"-":month"-":day]})

(match-peg(grammar:main)"2019-06-10"grammar)Most PEG libraries have several more rules to make it easier to write useful
grammars, but all rules can be written pretty much the same way, as cases in
our top level match expression. Our PEG engine is also missing captures, is
rather slow, does not do tail-call optimization, and lacks many features that
are present in other PEG engines such as those in Janet or REBOL. But for all
its issues,match-pegis a good start to a general purpose pattern
matching function.