
Adding lookbehinds to rust-lang/regex

RadiozRadioz

From a user perspective, this is extremely valuable. What an amazing improvement; unbounded especially. I do hope this would make it into actual RE2 & go.

When I use regex, I expect to be able to lookbehind, so I am routinely hit by RE2's limitations in places where it's used. Sometimes the software uses the entire matched string and you can't use non-capturing groups to work around it.
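To make the limitation concrete, here is a minimal sketch using Python's `re` module. Python is chosen only because it supports lookbehind out of the box; it is a backtracking engine, not RE2, and the sample string is made up. A lookbehind keeps the prefix out of the match itself, whereas the capture-group workaround only helps if the calling software lets you pick a group instead of the whole match.

```python
import re

text = "price: $42"  # hypothetical input, for illustration only

# With a lookbehind, the whole match is just the digits.
with_lookbehind = re.search(r"(?<=\$)\d+", text)
whole_match = with_lookbehind.group(0)

# Without lookbehind, the "$" has to be part of the match,
# and the digits are only reachable through capture group 1.
workaround = re.search(r"\$(\d+)", text)
workaround_whole = workaround.group(0)
workaround_group = workaround.group(1)

print(whole_match)       # the digits only
print(workaround_whole)  # includes the "$"
print(workaround_group)  # the digits, but only via the group
```

If the tool only ever looks at `group(0)`, the second pattern cannot stand in for the first.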

I understand go's reasons, ReDoS etc, but the "purism" of RE2 does fly in the face of practicality to an irksome degree. This is not uncommon for go.

masklinn

The authors’ previous article (linked in this one) was about doing this in re2 (https://systemf.epfl.ch/blog/re2-lookbehinds/), and they have a fork with those changes though I don’t know that they have a PR.

> the "purism" of RE2 does fly in the face of practicality to an irksome degree

It’s not purism tho. There are very practical reasons to want an FA-based engine, and if you compromise that to get additional features then the engine is pointless, you could have just used a backtracking engine in the first place.

ncruces

I couldn't find the link in that page, but the fork is here, and seems to be up-to-date: https://github.com/GerHobbelt/re2

If you need that from Go, you can probably use that to create a fork of this: https://github.com/wasilibs/go-re2

hnlmorg

The point of standard libraries is to provide sane default behaviours. Go’s regexp package is a sensible default.

For instances where you need something more sophisticated than what’s in the standard library, you reach for 3rd party modules. And there are regex libraries for Go which support backtracking et al.

There’s definitely some irksome defaults in Go, but the choice of regex engine in the regexp library isn’t one of them.

progbits

While I agree this is a common golang theme, in this case I believe this decision predates the golang implementation and comes from the C++ RE2 days, no?

chubot

What are some examples of problems where you’ve used lookbehinds?

singron

I don't think there is discussion of the snort-2 and snort-3 benchmarks, where the linear engine handily beats Python's re for once (70-80x faster). I'm guessing they are cases where backtracking is painfully quadratic in re, but it would have been nice to hear about those successes. [In the rest of the benchmarks, Python's re is 2-5x faster]

LegionMammal978

> However, as a downside our lookbehinds do not support containing capture groups which are a feature allowing to extract a substring that matched a part of the regex pattern.

I wonder in what situation someone would even be tempted to put a capture group into a lookbehind expression, except unintentionally by using () instead of (?:) for grouping. Maybe in an attempt to obtain capture groups from overlapping matches? But even in that case, lookaheads would be clearer, when available.
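A small sketch of both points, again in Python's backtracking `re` module purely for illustration (the patterns and input are made up, and none of this says anything about the `regex` crate's behaviour). It shows how `()` versus `(?:)` changes the group count inside a lookbehind, and how a capture group inside a lookahead yields overlapping matches.

```python
import re

# Accidental capture: "()" inside the lookbehind creates a group,
# "(?:)" does not. Pattern.groups reports the number of capture groups.
capturing = re.compile(r"(?<=(a|b))c")
non_capturing = re.compile(r"(?<=(?:a|b))c")
capturing_count = capturing.groups
non_capturing_count = non_capturing.groups

# Overlapping matches: a plain pattern consumes its input, so pairs
# never overlap; a capture group inside a zero-width lookahead does not
# consume anything, so every starting position is reported.
plain_pairs = re.findall(r"\w\w", "abcd")
overlapping_pairs = re.findall(r"(?=(\w\w))", "abcd")

print(capturing_count, non_capturing_count)
print(plain_pairs)
print(overlapping_pairs)
```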

hu3

Interesting. I have used look behind before without knowing their specifics. AI generated a regex and unit tests passed so I carried on with life.

Searching for a simple explanation of how it works, I found this which also explains negative look behind and look ahead. TIL:

https://www.phptutorial.net/php-tutorial/regex-lookbehind/
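For a quick feel of the four lookaround forms, here is a short sketch in Python's `re` module (the linked tutorial uses PHP; the syntax shown is the same in both, and the sample strings are invented). Each assertion checks the text next to the match without including it in the result.

```python
import re

prices = "cost $42, qty 7"
amounts = "42 USD, 7 EUR"

# Positive lookbehind: digits that ARE preceded by "$".
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: whole numbers that are NOT preceded by "$".
# The \b stops the engine from matching the tail of "42".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

# Positive lookahead: digits that ARE followed by " USD".
before_usd = re.findall(r"\d+(?= USD)", amounts)

# Negative lookahead: whole numbers that are NOT followed by " USD".
not_before_usd = re.findall(r"\b\d+\b(?! USD)", amounts)

print(after_dollar, not_after_dollar, before_usd, not_before_usd)
```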

CJefferson

Great! I enjoyed reading through, and I'm going to come back later and read a little more carefully.

If anyone knows (to let me be lazy), is this the same regex engine used by ripgrep? Or is that an independent implementation?

flaghacker

Yes, the `regex` crate is also the regex engine used by ripgrep, both were developed by https://github.com/burntsushi.

shilangyu

As others have pointed out, the regex engine is the same so the benefits would trickle downstream. For example, VSCode also uses ripgrep and therefore the rust-lang/regex engine.

burntsushi

ripgrep plugged this gap a long time ago by providing PCRE2 support.

cbarrick

Same engine as ripgrep

d3m0t3p

Nice to see a master's thesis highlighted on the research group page

librasteve

It’s odd to see such a widely adopted language as Rust only just getting some regex basics. Whereas Raku (https://raku.org) has made a strong forward step in regex syntax over PCRE, made by the same language designer with implementation of modern unicode savvy features like Grapheme and Diacritic handling that are essential to building consistent code to handle multilingual needs.

  say "Cool" ~~ /<:Letter>* <:Block("Emoticons")>/; # 「Cool」
  say "Cześć" ~~ m:ignoremark/ Czesc /;               # 「Cześć」
  say "WEIẞE" ~~ m:ignorecase/ weisse /;              # 「WEIẞE」
  say "หนูแฮมสเตอร์" ~~ /<:Letter>+/;                    # 「หนูแฮมสเตอร์」

burntsushi

It's not only just getting some "regex basics." The `fancy-regex` crate has provided look-behind for years. The OP is about adopting look-behind to the linear time guarantee required by the `regex` crate.

My main focus for the `regex` crate has been on performance: https://github.com/BurntSushi/rebar

How does Raku's regex performance compare to Perl?

kibwen

> the linear time guarantee required by the `regex` crate

Making sure this line isn't glossed over: the point of the regex crate is that it provides linear-time guarantees for arbitrary regexes, making it safe (within reason) to expose the regex engine to untrusted input without running the risk of trivial DoS. From what I can tell, supporting lookbehinds in such a context is something that researchers have only recently described.
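As a rough illustration of where a linear-time bound can come from, here is a minimal sketch of the classic state-set NFA simulation, written in Python. It is not the `regex` crate's implementation and not the lookbehind technique from the article. The hand-built automaton (for the pattern `(a|aa)*b`), the function name `nfa_fullmatch`, and the step counter are all assumptions made for this example.

```python
def nfa_fullmatch(transitions, start, accepting, text):
    """Simulate an epsilon-free NFA by tracking the set of live states.

    Returns (matched, steps). Each input character visits every live
    state at most once, so steps <= len(text) * number_of_states.
    """
    current = {start}
    steps = 0
    for ch in text:
        following = set()
        for state in current:
            steps += 1
            following |= transitions.get((state, ch), set())
        current = following
        if not current:  # no live states left: the match cannot succeed
            break
    return bool(current & accepting), steps


# Hand-built NFA for (a|aa)*b:
#   state 0: between repetitions, state 1: halfway through "aa",
#   state 2: accepting, reached after the final "b".
TRANSITIONS = {
    (0, "a"): {0, 1},
    (1, "a"): {0},
    (0, "b"): {2},
}
NUM_STATES = 3

matched, _ = nfa_fullmatch(TRANSITIONS, 0, {2}, "aaab")
hostile = "a" * 30 + "c"
hostile_matched, hostile_steps = nfa_fullmatch(TRANSITIONS, 0, {2}, hostile)
print(matched, hostile_matched, hostile_steps)
```

The point of the design is that the work per character is bounded by the number of states, regardless of how ambiguous the pattern is, which is the property a backtracking engine gives up.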

dmit

> making it safe (within reason) to expose the regex engine to untrusted input

Or even trusted input! https://blog.cloudflare.com/details-of-the-cloudflare-outage...

SteveJS

I loved discovering that rust has O(n) guardrails on regex! The so-called features that break that constraint are anti-features.

Over the last two weeks I wrote a dialog-aware English sentence splitter using Claude Code to write Rust. The compile error when it stuck lookarounds in one of the regexes was super useful to me.

librasteve

I stand corrected on that - I was responding to the headline and did not appreciate that Rust has had library support beforehand. (That said, having regex around in different standard vs. crate options is not necessarily the ideal).

It's good to have a focus and I agree that Rust is all about performance and stability for a system language.

I haven't seen Raku regex performance benchmarked, but I would be surprised if it beats perl or Rust.

I wouldn't say that Raku is a good choice where speed is the most important consideration since it is a scripting language that runs on a VM with GC. Nevertheless the language syntax includes many features (hyper operators, lazy evaluation to name two) that make it amenable to performance optimisation.

masklinn

> That said, having regex around in different standard vs. crate options is not necessarily the ideal

What 1: both regex and fancy-regex are crates. Regex is under the rust-lang umbrella but it’s not part of the stdlib.

What 2: having different options is the point of third party libraries, why would you have a third party library which is the exact same thing as the standard library?

quotemstr

This right here is one of the foundational splits in the programming community. This article is all about how cool an _implementation_ is. This comment is about some other engine's cool _syntax_. Deep versus superficial. The two camps can't stand each other.

librasteve

Speaking on behalf of the superficial camp, I admire the Rust core regex focus on linear performance and I can well believe that it is based on recent theoretical work.

Splitting the regex features between some core ones that meet a DoS standard and some non-core modules that do other "convenience" features makes sense as a trade off for Rust. It would not make sense in a scripting language like Raku where the weight is on coder expressiveness and making it easier / faster to write working code.

I seem to have hit a seam of intense implementation guys - and they are holding their own since they know their stuff.

I think there is room for improvement BOTH with new system language / core performance innovation AND with advancing the PCRE regex syntax (largely unchanged since the 1990s) and merging it seamlessly with standard language support for Grammars.

shawn_w

I don't think Philip Hazel, who wrote PCRE, has anything to do with perl or raku development.

librasteve

sorry I didn't know that Philip Hazel wrote PCRE ... and I certainly credit the initiative to release Perl Compatible Regular Expressions from the grip of perl

my main point is that PCRE was based on perl regexes and that these were designed by Larry Wall and so he had some experience when it came to the strengths and weaknesses of perl RE when it came to designing the Raku RE syntax (ie. the language formerly known as Perl 6)

librasteve

huh … guess HN blocks emojis