Fixing left and mutual recursions in grammars

30 comments

·February 2, 2025

earleybird

Use an Earley parser. A properly implemented Earley parser will do unambiguous grammars with left/right/mutual recursion in linear time. The worst case ambiguous I've worked with is UBDA (Earley, 1970). UBDA produces the equivalent of all the permutations of a binary tree - which is Catalan(n). If you generate all the combinations as you parse it won't complete before the heat death of the universe for fairly short strings.

Rusky

> A properly implemented Earley parser will do unambiguous grammars with left/right/mutual recursion in linear time.

It's not linear for all unambiguous grammars- only deterministic grammars, which can also be parsed with something faster like LR or even hand-written Pratt.

(An example of an unambiguous but nondeterministic grammar is this one for palindromes, which Earley parses in quadratic time: P -> 0 P 0 | 1 P 1 | ε)

earleybird

Yes, you are correct. Nondeterministic grammars are not linear but quadratic. In the case of your palindrome example (thanks btw, added to my grimoire of grammars) I believe the quadratic is: (3n^2 + 6n)/4 (I wish HN did MathJax)

I need to do a bit more digging in to the distinction between ambiguous and nondeterministic.

When implementing parsers I enjoy the directness afforded by an Earley parser. No rejigging the grammar to suit LL, LR etc. No gotcha's with PEGs choice operator, etc.

Most grammars I end up using for practical applications are linear - so far, quadratic has been a spectre from afar. :-)

nsajko

Pratt's method only targets the operator precedence languages, not the DCFL. So much less powerful than LR parsing.

Rusky

That's true as Pratt described it. I mentioned it because it's a good example of the general idea of extending recursive descent to handle more deterministic grammars than vanilla LL.

klik99

I came into comments to post this but I guess the earley bird gets the worm.

I have an Earley parser I’m very happy with which is in a project I’m going to open source soon, it’s nice to have a parser where you just don’t have to think about all those edge cases

upghost

Thanks, I've been looking for this for a long time!

froh

tell me more about your use name :-)

fjfaase

It is possible to write a (top-down) parser that can deal with the left recursion without having to rewrite the grammar itself. Simply split the left recursive ones and not. Then first try to accept a non-left recursive rule and if that succeeds, use the result to try to parse any of the left recursive ones. (If I am not mistaken, this also works for mutual recursion of left recursive rules.) I have implemented this several times.

Rusky

Exactly. This is essentially converting the parser to bottom-up just for these rules. It's how Pratt parsing/precedence climbing works.

I wrote this post on the connection: https://www.abubalay.com/blog/2021/12/31/lr-control-flow

kazinator

If you are using a shift-reduce parser, you may have to fix right recursion instead.

If you have a right recursive construct, the parser stack usage will be proportional to the length of the instances of the construct. That can be a problem.

For instance:

   stuff : /* empty */
         | item stuff
         ;

If you have 10,000 items in a sequence of stuff, they all get pushed onto the stack which is then popped by a long cascade of rightmost reductions.

You have to refactor for left recursion.

I ran into this when parsing Lisp this way. Lisp lists are right recursive; they are naturally constructed from the right end, by consing onto the front. When you write the obvious parser rules, the list does not start being constructed until the closing parenthesis is seen, and the cascade of reductions pulls it out of the stack.

This is not a problem like runaway left recursion; it is an insidious bug that you won't even notice with small test cases.

Switching to left recursion means that you have to construct the list left-to-right somehow. Either the rules maintain a tail pointer somehow, or the list is consed up in reverse and then destructively reversed. A minor complication is support for the consing dot.

tempodox

An interesting and worthwhile topic. This kind of problem pops up regularly.

Those diagrams are really handy, does anyone know how they were made? I'm always looking for alternatives / companions to the Railroad Diagram Generator (https://rr.red-dove.com/ui).

Rochus

These diagrams indeed look good. But actually I never use those for parser construction, but consider EBNF more useful, especially also when analyzing and correcting a grammar. I was surprised that there was no decent editor to develop grammars which has built-in support to find and show issues in the grammar like left recursion or ambiguities (at least I didn't find one), so I implemented my own: https://github.com/rochus-keller/ebnfstudio/. I used it in all of my programming language projects to create LL(n) versions including some pretty complicated grammars like Simula or Verilog.

brightprogramer

Use mermaid js : https://mermaid.js.org/syntax/stateDiagram.html

werdnapk

Isn't Graphviz [1] the standard tool for this?

[1] https://graphviz.org/

aardshark

draw.io can do similar diagramming. You can generate diagrams programmatically through their CSV format.

Unfortunately documentation for using it programmatically is a bit lacking, but you can for example host a CSV somewhere and use their editor to load it. It's all client side so will automatically update.

mrkeen

Imo this missed the punchline.

I banged my head on the table for days trying to fix a left-recursive grammar, just by moving bits around and hoping.

The key is precedence! Parse loose expressions first (plus, minus, etc), these loose expressions can in turn call into tighter expressions (mul, div), and so on down to parentheses.

Parentheses do form a loop - that is, they call all the way back to the top - but at long as the parser makes progress i.e. can only proceed if it moves past a '(', no infinite loops and you're golden.

Cadwhisker

Debugging a complex PEG is a nightmarish task. I use various tools, but I couldn't find anything out there that will let you set a breakpoint in a file that's being parsed and let you explore the parsing state.

The most useful tools I found were adjacent to the cpp-peglib library: https://github.com/yhirose/cpp-peglib

This comes with a PEG playground: https://yhirose.github.io/cpp-peglib/

I really liked pegdebug: https://mqnc.github.io/pegdebug/

With sample output here: https://mqnc.github.io/pegdebug/example/output.html

pegdebug is nice for small sets of data, but it rapidly gets swamped by anything over about 50 lines.

If anyone has other suggestions for debugging PEGs, please reply and let me know,.

WhyIsItAlwaysHN

I think you're semi-right here, but the term precedence is ambiguous.

Left recursion causes rules to be ambiguous. In `<S> | a`, it's always ambiguous if the input has ended. Thus the data structure that would be traversed can be a circular graph. In `<S> := a | <S>, S := ε', the rules are unambiguous and the data structure that is traversed has the shape of a tree.

You get precedence rules by defining the shape of this tree, thus which rules will be visited before which.

However it doesn't have to be strict precedence. Some rules can have an ambiguous order as long as the execution still follows the shape of a tree.

For example, in my simple rule above, it doesn't matter if you visit the ε rule or the other rule first in your implementation (although it's less efficient to visit ε first).

So, I think precedence is a side-effect of doing this transformation.

PaulHoule

I like a lot of things about PEG but the way they support precedence is for the birds. I'd really love to see a PEG toolkit built as if developer experience matters but the problem with parser generators is that anybody who can write a parsed generator understands how to work around the problems with parser generators -- so we don't get parser generators that fit the needs of developers who don't know a lot of CS theory.

brightprogramer

Precedence is more of a way to remove ambiguity from grammar. Grammars are just hard to figure out in one go.

mrkeen

I appreciate the dismissal, but I stand by what I said.

I used to think 'precedence was just for removing ambiguity', so I didn't pay attention to it.

Then I discovered it was the key to structuring grammars to avoid left recursion, and everything changed for me.

I've used it successfully in the past and I'll do so in the future.

tgv

I think you're not using the word precedence properly. It seems as if you're trying to explicitly instruct a top-down parser.

E.g., the precedence in these two grammars (for simple arithmetic expressions) is identical

    S -> F | S + F
    F -> T | F * T
    T -> 0 | 1 | 2 | ...

    S -> F | F + S
    F -> T | T * F
    T -> 0 | 1 | 2 | ...

but one recurses on the left side, and the other on the right.

brightprogramer

I see, I'd like to give it a try then. I'll try adding some precedence here. I don't know how, though.

This calls for some reading...

hyperhello

Left recursive BNF was an interest of mine. For a much less technically oriented solution you can check out https://hypervariety.com/BNFToLPEG/

null

[deleted]

briffid

An example of an actual grammar, like arithmetic expressions, would be helpful.