Remote-Url: https://bakpakin.com/writing/how-janets-peg-works.html
Retrieved-at: 2022-01-29 19:28:36.168030+00:00

How Janet's PEG module works

2019-06-10

Projects Writing Contact
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In making Janet, I was inspired by the REBOL Parse module and LPeg to include a
versatile library for parsing text and any sequence of bytes based on Parsing
Expression Grammars, or PEGs. I was also willing to exclude other forms of
parsing tools such as regular expressions from the core library, as PEGs are
more general than regular expression, or easier to read, and can capture error
locations when matching text. The development of PEGs in Janet went
particularly smoothly, made possible by the simple PEG formalism, Janet's basic
data structures, and a very straight-forward subroutine threaded interpreter.
While much of Janet's PEG module could be implemented in any language, Janet's
C API and rich set of core data structures made it easy to implement PEGs in
Janet.

This post is about how the Janet PEG engine is implemented; it is not a guide
on using pegs. Although the code is written in Janet, It can be translated
directly to most programming languages as we make no special use of any data
structures besides arrays and hashtables. For information about using Janet's
PEG module, see the Janet PEG documentation.

A Quick PEG Overview

The Parsing Expression Grammar formalism has it's roots in a 2004 Paper by
Bryan Ford, who presents both the model and an algorithm for recognizing any
PEG grammar in linear time. Our PEG engine will not parser all patterns in
linear time, but will use similar semantics for defining our grammar. One
difference between PEGs and more traditional parser generators, such as those
used by YACC and Bison, is the ordered choice operator. For the purposes of
this post, + is the ordered choice operator. For example, using a pseudo-syntax
for PEG grammars where the MAIN rule is the entry point of the grammar, we can
write a grammar that matches any text that starts with a, b, or c.

MAIN: "a" + "b" + "c"

This grammar will match "apple pie", "banana split", and "coconut cluster", but
not "orange soda", "watermelon taffy", or "lemon snap".

It's also easy to write redundant grammars with the ordered choice operator.
The following grammar will match on the first "a" only, and never "ab" or
"abc". Anything that starts with "ab" or "abc" also starts with "a", so they
could be removed from the grammar without changing its behavior.

MAIN: "a" + "ab" + "abc"

The ordered property of the + operator means that all PEG grammars are
deterministic - if they match a string, they will only match it one way. This
also means there is a direct correspondence between a PEG and a parser. This is
very convenient when writing PEGs, as your specification and parser can often
be one and the same. Traditional parser generators are non-deterministic, and
additional rules are needed to resolve ambiguity.

To complete our PEG model, we only need a few more operators. Below are all
operators and features needed for the PEG formalism to be as general as
possible.

"..." - string literal
+ - choice operator
* - sequence operator (implicit in our pseudo-syntax)
! - Boolean inversion

The above features, along with recursion introduced by a top level grammar, can
express all PEGs. This is a trimmed down version of the what's presented in the
original paper, but other operators can be substituted for combinations of the
above operators. The other very important feature needed for matching context
free grammars is recursion, or otherwise our model would be limited to matching
regular expressions.

Using this formalism, we can write a simple grammar that matches dates in ISO
8601 format. Such a date looks like

2019-06-10

A naive grammar that matches dates (but does not validate dates - 9999-99-99
would match) is presented below. This also requires years to be specified with
4 digits.

DIGIT: "0" + "1" + "2" + "3" + "4" + "5" + "6" + "7" + "8" + "9"
YEAR: DIGIT DIGIT DIGIT DIGIT
MONTH: DIGIT DIGIT
DAY: DIGIT DIGIT
MAIN: YEAR "-" MONTH "-" DAY

This looks very similar to a context free grammar, and in simple cases, the
grammar is identical.

A simple PEG engine in code

Using our PEG definition above, we can write a simple match-peg function that
takes a rule and a string to match against and tests if string matches the
rule. If there is a match, the number of bytes matched will be returned to help
us find where to begin the next match. Otherwise, we will return nil, or some
token to indicate that there is no match. This code will be written in Janet,
but could be very easily written in any language. We are using Janet's match
macro here, which does pattern matching on arguments.

(defn match-peg
 [peg text]
 (match peg
   @[:! x] (unless (match-peg x text) 0)
   @[:+ x y] (or (match-peg x text) (match-peg y text))
   @[:* x y] (if-let [lenx (match-peg x text)]
                (if-let [leny (match-peg y (string/slice text lenx))]
                 (+ lenx leny)))
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

This is a very simple but effective peg implementation. It does not support
captures, various other operators, but captures the basics of PEGs in about 10
lines of code, and can even be used for recursive grammars.

(def digits [:+ "0" [:+ "1" [:+ "2" [:+ "3"
   [:+ "4" [:+ "5" [:+ "6" [:+ "7" [:+ "8" "9"]]]]]]]]])
(def year [:* digits [:* digits [:* digits digits]]])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year [:* "-" :* month [:* "-" day]]])

(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil

This implementation is very similar to a tree walk interpreter, and is the
simplest way to implement a PEG. Janet's peg implementation does something very
similar to this, although the syntax tree is first compiled to a bytecode form
and validate for better speed.

Making Operators Variadic

Although not strictly necessary, we would like operators to look and behave
more like lisp operators that are variadic. This will make long sequences or
chains of choices easier to write.

(defn match-peg
 [peg text]
 (match peg
   @[:! x] (unless (match-peg x text) 0)
   @[:+] (some (fn [x] (match-peg x text)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1)
              :let [lenx (match-peg x subtext)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

Our ISO-date grammar from can be modified to look a bit more natural.

(def digits [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"])
(def year [:* digits digits digits digits])
(def month [:* digits digits])
(def day month)
(def iso-date [:* year "-" month "-" day])

(match-peg iso-date "2019-06-10") # -> 10
(match-peg iso-date "201-06-10") # -> nil

Adding Recursion

Earlier I stated that recursion is important to make PEGs useful, and although
we can do recursion in ad hoc manner, we can make our recursive grammars look
more like our original grammar using Janet tables. We can create a table
mapping keywords to PEG expressions, and if peg is a keyword, we lookup its
definition in the grammar table. We can add an extra argument to match-peg that
will be our grammar table.

(defn match-peg
 [peg text grammar]
 (match peg
   @[:! x] (unless (match-peg x text grammar) 0)
   @[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1)
              :let [lenx (match-peg x subtext grammar)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
   (x (keyword? x)) (match-peg (grammar x) text grammar)
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

With our grammar table, we can define our grammar in a single structure.

(def grammar
 {:digit [:+ "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"]
  :year [:* :digit :digit :digit :digit]
  :month [:* :digit :digit]
  :day [:* :digit :digit]
  :main [:* :year "-" :month "-" :day]})

(match-peg (grammar :main) "2019-06-10" grammar) # -> 10

Adding Rules

So far, our match-peg function has not been very expressive. Where are our
character classes, optional matches, and other fancy features from regular
expressions? While we could simply add terms to our grammar for each character
class (:d for digits, :w for alphanumerics), there are other features we would
like as as well, such as the Kleene star, character sets, etc. Fortunately, it
is very easy to add functionality to tree-walk interpreters. We can add as many
operators as we like. For example, we can add the :set operator that will match
1 character if it occurs in a set.

(defn match-peg
 [peg text grammar]
 (match peg
   @[:set chars] (if (string/check-set chars text) 1)
   @[:! x] (unless (match-peg x text grammar) 0)
   @[:+] (some (fn [x] (match-peg x text grammar)) (tuple/slice peg 1))
   @[:*] (do
       (var len 0)
       (var subtext text)
       (var ok true)
       (loop [x :in (tuple/slice peg 1)
              :let [lenx (match-peg x subtext grammar)
                    _ (set ok lenx)]
              :while ok]
          (set subtext (string/slice subtext lenx))
          (+= len lenx))
       (if ok len))
   (x (keyword? x)) (match-peg (grammar x) text grammar)
    # default is peg is a string literal
   (and (string/has-prefix? peg text) (length peg))))

Our date grammar will then become clearer as well.

(def grammar
 {:digit [:set "0123456789"]
  :year [:* :digit :digit :digit :digit]
  :month [:* :digit :digit]
  :day [:* :digit :digit]
  :main [:* :year "-" :month "-" :day]})

(match-peg (grammar :main) "2019-06-10" grammar) # -> 10

Most PEG libraries have several more rules to make it easier to write useful
grammars, but all rules can be written pretty much the same way, as cases in
our top level match expression. Our PEG engine is also missing captures, is
rather slow, does not do tail-call optimization, and lacks many features that
are present in other PEG engines such as those in Janet or REBOL. But for all
its issues, match-peg is a good start to a general purpose pattern matching
function.

Copyright © Calvin Rose 2019