QRS: Epsilon Wrangling
3 comments
·July 9, 2025o11c
kragen
Plausibly this approach is trivial to understand and implements full regexes, but it is slower than the NFA or DFA approach: http://canonical.org/~kragen/sw/dev3/redl.py
PEGs are in some ways more powerful than regexes, and in other ways less powerful, but usually the former matter more. This is not trivial to understand but I think it's not that hard either; it's a page of code: https://github.com/kragen/peg-bootstrap/blob/master/peg.md
That version doesn't memoize and so doesn't enjoy Packrat parsing's linear-time guarantee, but you can easily modify it to do so.
Another subset of regexes that's easy to understand is this single-page pattern matcher by Rob Pike from TPOP: https://www.cs.princeton.edu/courses/archive/spr09/cos333/be... which is enormously less code than my single-page PEG parser generator above. But then, it doesn't have to compile itself.
o11c
Unfortunately, neither "waste time" nor "waste space" are approaches worth pursuing.
We already have too many programs being written that are too simple and thus slow and/or wrong. We need to write code that is as simple as possible, but no simpler.
I also have been playing with regexes recently. One observation that I've made: if you're willing to setting for something slightly weaker than regexes, you can make your code trivial to understand (and go straight to a DFA instead of going through NFA). AFAICT the only "hard" case (which I'm erroring out on) involves things like /(AB)+|(AC)+/ (where A, B, and C are arbitrarily complex patterns), since everything else can be easily simplified. And at least for the contexts I care about, that kind of regex is exceptionally rare.
... I probably should actually read the papers on how to do it properly though. Last time I tried, I got stuck in a tangle of "why does C make efficient hashmaps so hard!" - this time, I'm avoiding C until I have a fully-tested prototype (current status: 1.0KLoC logic, 0.5KLoC comments, 4.4KLoC test suite, 40% tests failing after a recent refactor [edit: I forgot Python enums don't compare equal to integers], 100% of the time being annoyed at how stupidly slow Python is if you use obscure programming features like "loops", "strings", "branching", or "functions").