Remote-Url: https://articles.inqk.net/2020/09/19/how-to-use-pegs-in-janet.html
Retrieved-at: 2022-01-29 18:32:29.532263+00:00

Janet is a small, Lisp-like language. Unlike most programming languages, it
offers no support for regular expressions. Instead, Janet supports parser
expression grammars, or PEGs.A PEG in Janet is usually described by an associative data structure that lists
a series of rules.﻿For each rule, the key is the name of the rule and the
value is a description of the string that the rule will match. What makes PEGs
especially powerful is the ability for rules to refer to other rules (including
recursive references) and for rules to run arbitrary functions.Let’s see how we can use a PEG to parse a simplified subset of HTML. We’ll use
sequences, choices, captures (both compiled and match-time), replacements,
drops and back-references. It’s going to be fun.﻿StepsStep 1. Define:mainruleJanet begins parsing using the:mainrule. So let’s start with that:'{:main(*:tagged-1)}This rule defines a pattern consisting of a sequence (represented by*)﻿of
the rule:taggedand the value-1. This rule will match if the rule:taggedmatches and then the string ends (the value-1matches if we are at
the end of the string).Step 2. Define:taggedruleNow if we try to use this grammar, Janet will complain that the rule:taggedis not defined so let’s define that next:'{:main(*:tagged-1):tagged(*:open-tag:value:close-tag)}This is pretty straightforward. Our:taggedrule consists of an opening tag,
a value of some kind and a closing tag.Step 3. Define:open-tagrule'{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">")}We name the capture so that we can use a reference to it in our closing tag
rule. I went with:tag-namebut you can choose whatever you like.Step 4. Define:close-tagruleUp to this point, we’ve been adding rules in the order that they’re processed.
Let’s deviate from that here and instead define our next rule to match closing
tags:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):close-tag(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">")}This rule can be a little difficult to follow. We’re doing a match-time
capture﻿here using thecmtfunction.﻿This will match if the result of
the provided function (in this case,=) is truthy. If it is falsey, the match
will fail.The=function is passed the values of any captures as separate arguments. We
have two captures in our pattern: the back-reference to the:tag-namecapture
and the value matched by:w+. If the tag names match,﻿the:close-tagrule will match.The eagle-eyed amongst you might have noticed that the quoting character at the
very beginning has changed from'to~. All but the simplest PEGs will
include references to various functionssequence,capture, etc that are run
by the PEG engine. To prevent these functions being called in our grammar
definition, we need to quote our data structure. However, when it comes to
passing the functions we wantcmtto call, we need to pass a reference to
these functions. The solution is quasi-quoting. We quasi-quote the data
structure and then unquote the function symbol.Step 5. Define:valueruleOK, now we’ll define the:valuerule:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">")}The value in between two tags could be nothing, a tagged value, an untagged
value or a combination of tagged and untagged values. We can match zero or more
occurrences of the pattern using theanyfunction. The+combinator﻿tries the first pattern (:tagged) and if that fails, tries the second pattern
(:untagged).Step 6. Define:untaggedruleSpeaking of:untagged, this is the last named rule we need:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">"):untagged(some(if-not"<"1))}This rule matches against one or more characters that are not<.Step 7. Drop back reference after useWe have a problem at the moment. Our grammar won’t match nested tags because the:tag-namecapture is never being removed from the capture stack. We can solve
this problem usingunref:﻿~{:main(*:tagged-1):tagged(unref(*:open-tag:value:close-tag)):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">"):untagged(some(if-not"<"1))}Now when we get to the end of a nested:taggedpattern, we’ll drop the captures.Step 8. Capture tag names and valuesThe above is all well and good for checking if a string matches the grammar but
it’s usually much more helpful to return some structured data. Let’s do that:~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">")):untagged(capture(some(if-not"<"1)))}There’s quite a bit going on here so let’s look at each rule in turn.We’ve added a call toreplace. This will pass the values captured to the
functionstruct. We’ll see why in a second.Next, in:open-tag, we’re pushing the value:tagonto the capture stack
usingconstant. Because the call toconstantis part of the sequence in
this pattern, it only occurs if the entirety of the pattern matches.We use a similar trick in:value. This time, instead of pushing:tag, we
push:value.Still in:value, we usegroupto collect all of the matches in the(any (+ :tagged :untagged))pattern and put them inside an array that we push
onto the capture stack.Next, we’ve added a call todropin:close-tag. This ensures that the
value generated by the function=(i.e.true) is removed from the
capture stack.Finally, we need to capture the result of:untagged. We do this withcapture.Now the call tostructhopefully makes sense. At the time this is called, the
pattern always returns four captures: (1):tag, (2) the tag name, (3):valueand (4) the tag’s value. These get pulled off the capture stack,
turned into a struct and pushed back on.Bonus Step. Compile the grammarIf we intend to use the grammar repeatedly, we can compile it for a performance
boost:(peg/compile~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">")):untagged(capture(some(if-not"<"1)))})ExampleLet’s see an example. Assuming this is our code:(defgrammar(peg/compile~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"</"(cmt(*(backref:tag-name)(capture:w+)),=)">")):untagged(capture(some(if-not"<"1)))}))Then:(peg/matchgrammar"<p><em>Hello</em> <strong>world</strong>!</p>")# => @[{:tag "p"#       :value @[{:tag "em" :value @["Hello"]}#                " "#                {:tag "strong" :value @["world"]}#                "!"]}](peg/matchgrammar"<em>Hello</em> <strong>world</strong>!</p>")# => nilThe second match returnsnilbecause of the unmatched</p>. Neat!Wrap-UpBuilding support for PEGs directly into the language is one of Janet’s best
decisions. They might take a bit of time to get the hang of, but when you do,
you’ll be able to parse data in a way that’s considerably more powerful than
with regular expressions. ✺