Remote-Url: https://articles.inqk.net/2020/09/19/how-to-use-pegs-in-janet.html Retrieved-at: 2022-01-29 18:32:29.532263+00:00 Janet is a small, Lisp-like language. Unlike most programming languages, it offers no support for regular expressions. Instead, Janet supports parser expression grammars, or PEGs.A PEG in Janet is usually described by an associative data structure that lists a series of rules.For each rule, the key is the name of the rule and the value is a description of the string that the rule will match. What makes PEGs especially powerful is the ability for rules to refer to other rules (including recursive references) and for rules to run arbitrary functions.Let’s see how we can use a PEG to parse a simplified subset of HTML. We’ll use sequences, choices, captures (both compiled and match-time), replacements, drops and back-references. It’s going to be fun.StepsStep 1. Define:mainruleJanet begins parsing using the:mainrule. So let’s start with that:'{:main(*:tagged-1)}This rule defines a pattern consisting of a sequence (represented by*)of the rule:taggedand the value-1. This rule will match if the rule:taggedmatches and then the string ends (the value-1matches if we are at the end of the string).Step 2. Define:taggedruleNow if we try to use this grammar, Janet will complain that the rule:taggedis not defined so let’s define that next:'{:main(*:tagged-1):tagged(*:open-tag:value:close-tag)}This is pretty straightforward. Our:taggedrule consists of an opening tag, a value of some kind and a closing tag.Step 3. Define:open-tagrule'{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">")}We name the capture so that we can use a reference to it in our closing tag rule. I went with:tag-namebut you can choose whatever you like.Step 4. Define:close-tagruleUp to this point, we’ve been adding rules in the order that they’re processed. Let’s deviate from that here and instead define our next rule to match closing tags:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):close-tag(*"")}This rule can be a little difficult to follow. We’re doing a match-time capturehere using thecmtfunction.This will match if the result of the provided function (in this case,=) is truthy. If it is falsey, the match will fail.The=function is passed the values of any captures as separate arguments. We have two captures in our pattern: the back-reference to the:tag-namecapture and the value matched by:w+. If the tag names match,the:close-tagrule will match.The eagle-eyed amongst you might have noticed that the quoting character at the very beginning has changed from'to~. All but the simplest PEGs will include references to various functionssequence,capture, etc that are run by the PEG engine. To prevent these functions being called in our grammar definition, we need to quote our data structure. However, when it comes to passing the functions we wantcmtto call, we need to pass a reference to these functions. The solution is quasi-quoting. We quasi-quote the data structure and then unquote the function symbol.Step 5. Define:valueruleOK, now we’ll define the:valuerule:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*"")}The value in between two tags could be nothing, a tagged value, an untagged value or a combination of tagged and untagged values. We can match zero or more occurrences of the pattern using theanyfunction. The+combinatortries the first pattern (:tagged) and if that fails, tries the second pattern (:untagged).Step 6. Define:untaggedruleSpeaking of:untagged, this is the last named rule we need:~{:main(*:tagged-1):tagged(*:open-tag:value:close-tag):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*""):untagged(some(if-not"<"1))}This rule matches against one or more characters that are not<.Step 7. Drop back reference after useWe have a problem at the moment. Our grammar won’t match nested tags because the:tag-namecapture is never being removed from the capture stack. We can solve this problem usingunref:~{:main(*:tagged-1):tagged(unref(*:open-tag:value:close-tag)):open-tag(*"<"(capture:w+:tag-name)">"):value(any(+:tagged:untagged)):close-tag(*""):untagged(some(if-not"<"1))}Now when we get to the end of a nested:taggedpattern, we’ll drop the captures.Step 8. Capture tag names and valuesThe above is all well and good for checking if a string matches the grammar but it’s usually much more helpful to return some structured data. Let’s do that:~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"")):untagged(capture(some(if-not"<"1)))}There’s quite a bit going on here so let’s look at each rule in turn.We’ve added a call toreplace. This will pass the values captured to the functionstruct. We’ll see why in a second.Next, in:open-tag, we’re pushing the value:tagonto the capture stack usingconstant. Because the call toconstantis part of the sequence in this pattern, it only occurs if the entirety of the pattern matches.We use a similar trick in:value. This time, instead of pushing:tag, we push:value.Still in:value, we usegroupto collect all of the matches in the(any (+ :tagged :untagged))pattern and put them inside an array that we push onto the capture stack.Next, we’ve added a call todropin:close-tag. This ensures that the value generated by the function=(i.e.true) is removed from the capture stack.Finally, we need to capture the result of:untagged. We do this withcapture.Now the call tostructhopefully makes sense. At the time this is called, the pattern always returns four captures: (1):tag, (2) the tag name, (3):valueand (4) the tag’s value. These get pulled off the capture stack, turned into a struct and pushed back on.Bonus Step. Compile the grammarIf we intend to use the grammar repeatedly, we can compile it for a performance boost:(peg/compile~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"")):untagged(capture(some(if-not"<"1)))})ExampleLet’s see an example. Assuming this is our code:(defgrammar(peg/compile~{:main(*:tagged-1):tagged(unref(replace(*:open-tag:value:close-tag),struct)):open-tag(*(constant:tag)"<"(capture:w+:tag-name)">"):value(*(constant:value)(group(any(+:tagged:untagged)))):close-tag(drop(*"")):untagged(capture(some(if-not"<"1)))}))Then:(peg/matchgrammar"

Hello world!

")# => @[{:tag "p"# :value @[{:tag "em" :value @["Hello"]}# " "# {:tag "strong" :value @["world"]}# "!"]}](peg/matchgrammar"Hello world!

")# => nilThe second match returnsnilbecause of the unmatched

. Neat!Wrap-UpBuilding support for PEGs directly into the language is one of Janet’s best decisions. They might take a bit of time to get the hang of, but when you do, you’ll be able to parse data in a way that’s considerably more powerful than with regular expressions. ✺