Skip to content(if available)orjump to list(if available)

Show HN: Transductive regular expressions for text editing

Show HN: Transductive regular expressions for text editing

41 comments

·February 7, 2025

An extension of regular expressions for text editing, with a grep-like command-line tool. If you, like me, struggle with group logic in regular expressions, you might find it useful.

I wanted to do this for a very long time. It is more of a sketch or prototype. I'd really appreciate your feedback!

danielparks

Cool, I’m interested to see where you go with this.

I found the operator precedence unnatural, and it looks like a lot of other folks in this thread did too. I would naturally assume `cat:dog` would be equivalent to `(cat):(dog)` rather than `ca(t:d)og`.

froh

sweet, I did my "Diplom" CS thesis around 1997 on finite state transducers. was mich less trivial than I'd thought. the ask was to implement composition and DFAs where possible. also for composed transducers. "algebra of finite state transducers". use case was morphology. the topic was heavily underestimated and I had to finish half way through. so: chapeau :-)

anyhow

wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?

Etheryte

I have to say, incredibly bold of you to essentially hinge your graduation on whether you can regex hard enough or not.

c0nstantine

yeah. Transducers are very old topic. For some reason they were not connected to a specific language like regex.

> wrt syntax, are you sure you want ':' to bind stronger than concatenation 'ab' ?

That's something I am still not sure about. I took a hundred examples and it looked more natural this way (: lower then .). But I can change it with the change of one digit in the code, literally. That's why I'm posting here. I need some real feedback.

zoogeny

> Avoid using * or + in the right part, as it can cause infinite loops

Why not just disallow this? I understand it would make the grammar more difficult to specify - but I don't see any good reason to keep it.

c0nstantine

Fair point. I agree. Now it is better to disable it.

The rationale was to implement a fun operation called transducer composition. It is possible to do simple operation on strings and compose trre's like filters. But I haven't finished it yet. So again, a fair point.

shawnz

Another question about this issue: in the case of `:a*` for example, why doesn't it just pick empty string as the replacement text and immediately exit? Why would it be an infinite loop if `-g` isn't specified?

And then if `-g` were specified, you could use this to create infinite generators, which could be a useful construct in some cases -- like `yes` but more powerful.

EDIT: Another interesting use case, if I am understanding correctly: if this worked, then you could use `:(<regex>)` to have it output an example of a string that matches any given regex. `-g :(<regex>)` produces a generator of every string in the language matched by that regex. `-g :(<regex>) | head -n 100` would give you 100 examples.

crazygringo

> Regular expressions is a great tool for searching patterns in text. But I always found it unnatural for text editing.

The entire purpose of this project seems to hinge on this assertion, but there isn't a single example.

I don't understand what makes regex unnatural for editing? What is meant by editing? Why do people struggle with groups?

There are lots of examples of the syntax for this project, but why is it better than regular regex?

If there were a few examples showing "here's the vanilla regex version, here's my version, here's why my version makes this easier" I might be able to understand this project.

c0nstantine

Fair point. The most explicit example if you need to change something in context. For example if we need to change 'y' to 'Y' only if it occurs between x and y you would do something like this in python.

pattern = r'(x)y(z)'

replacement = r'\1Y\2'

result = re.sub(pattern, replacement, text)

I would like to replace it with 'xy:Yz' pattern:

result = re.trre('xy:Yz', text)

If you need your x, z to be more complicated patterns or even regex themselves it can be more handy using this approach.

rendaw

I think it's an interesting project, but is this a functional replacement for regex substitutions? It seems like you can only replace text with something else at the same location. Separating the replacement from the matching can be unweildy, but you can reorder, repeat, and insert text between things too.

IIUC

crazygringo

Thanks!

I guess I'm still struggling to see how it's simpler overall.

Most of the examples on your page don't involve groups at all, e.g.:

  $ echo 'catcatcat' | trre '((cat):(dog))*'
  dogdogdog
That already seems a lot more complicated than just:

  re.sub('cat', 'dog', 'catcatcat')
I don't need to use groups that often in regex replacements, and when I do I'm already trying to do something complicated, and it's not clear to me why the colon syntax is easier to write, easier to understand, or if it's as flexible.

Not trying to criticize the project, just genuinely trying to understand the specific strengths and limitations of the proposed syntax. E.g. what if I want to turn xyz into zYx?

c0nstantine

>> E.g. what if I want to turn xyz into zYx?

echo 'zyx' | ./trre 'xy:Yz|zy:Yx'

It is still a regular language. I do not introduce references.

You are right in sense the `sed` is far superior editor. But here I see some advantages: - the current implementation is super small; it is direct translation to an automaton - the complex patterns may be compiled in a more efficient way using deterministic transducer. I can't defend this claim now but I have some evidences - there are some tricks you can do using 'generative' part of it, e.g. and you even can find levenshtein distance of 1 between two strings just by generating substitutions/insertions/deletions and implement a simple spell checker.

Overall, I think you have a good point. Maybe it is just marginal improvement (if any). It was more comfortable to write in this style instead of group usage. I used it for some time and found it handy (especially as extended `tr`).

Etheryte

I would say regex is usually pretty write once edit never. Building a prototype that tries to look beyond that is a great way to see what a better future might look like in this regard.

everdimension

Come on, it's about replacements. They're easier to express (meaning literally easier to type out) with the author's syntax

Great project

null

[deleted]

bawolff

[flagged]

andrewla

I feel like this is very underspecified, The very first example:

    $ echo 'cat' | trre 'c:da:ot:g'
    dog
Feels strange. What is happening here; the grammar says

    TRRE    <- TRRE* TRRE|TRRE TRRE.TRRE
    TRRE    <- REGEX REGEX:REGEX
What is the parse tree here? Why is "c" not being replaced with "da"? Or why isn't c being removed and "da" being replaced by "ot"?

I do like the idea of having a search/replace semantic that is more intuitive than grouping operators; back in MS-DOS days you could do "ren .log .txt" and this would work which feels bananas to my modern bash-minded way of thinking, but it's very obvious looking at this what it is supposed to do.

kccqzy

Yes it is underspecified. The deletion example shows that an empty string is possibly a REGEX. So you can essentially treat any position as containing as many empty string regexes as you want. So there are indeed infinite number of parses.

If we instead require regex to be non-empty (breaking the deletion examples), then the ambiguity becomes that of concatenation: whether it's '(((c:d)(a:o))(t:g))' or '((c:d)((a:o)(d:g)))'. Assuming associativity, this would not matter.

danielparks

This is a matter of operator precedence and tokenization. Tokens are single characters in this language, and there is an invisible operator between them.

If the operator were explicit (let’s call it ~), the example would look like this:

    $ echo 'cat' | trre 'c:d~a:o~t:g'
    dog
With unnecessary parentheses:

    $ echo 'cat' | trre '(c:d)~(a:o)~(t:g)'
    dog

null

[deleted]

Imustaskforhelp

From what it feels as to how it works, it seems that c:d and a: (nothing) and ot:g

but yes now that I read it , it also makes confusion , theoretically your point makes valid , I also believe that c should be replaced by da after I read the repo , I am not sure ...

null

[deleted]

sabellito

I love the idea, wanna se where it goes. Gives me the same vibe as when jq came out all those years ago.

i3oi3

Are the examples all actual outputs of the program? It's entirely possible that my understanding of the grammar is off, but it looks like these examples are wrong:

$ echo 'cat dog' | trre 'c:bat|d:hog' bat hog

$ echo '' | trre ':a*' # <- do NOT do this dog

$ echo '' | trre ':(repeat-10-times){10}' dog

c0nstantine

The second line actually is an output. I've modified the README. The last example is a typo. Fixed. Thanks!

meonkeys

And the first one? Wouldn't the output be

batat hogog

wfn

What a pleasure your C code is :) very nice (currently reading it)

My only quick comment is - the link to `theory.pdf` in README is broken (your pdf is in docs/ dir, so just need to change the url to include docs/).

c0nstantine

Thank you! For the feedback and pointing to the typo. Fixed. Actually my C is very rusty and I am bit uncomfortable about this.

mordechai9000

I would probably use this in a text editor if it was available. I don't struggle with group logic, but I often forget which tools use \ to reference capture groups, and which use $. (If it's Microsoft or Microsoft-adjacent, it's probably $.)

rjh29

and of course Perl supports both!

the_arun

> $ echo 'xor' | '(x:)or' 'xor'

> cat

I got lost here.

c0nstantine

Typo. Fixed. Thanks. Too many cats in examples.

trashburger

Also too many dogs it seems, as the infinite and finite loop examples also produce "dog".

pipeline_peak

I just ask chat-gpt to handle things like this. No need to memorize trivial syntax.

metadat

When I saw this headline, I got excited about the prospect of a new innovation in the application of Regular Expressions. After reading, I was scratching my head because trre doesn't provide any new capabilities and is essentially just yet another flavor of regex. Additional complexity without a significant upside.

Trre seems like an arbitrarily different version of `sed -e /a/b/ ..`. This method of search and replace is essentially ported everywhere else, from awk to vim to IntelliJ, and has always gotten the job done adequately and feels as natural as anything else I've learned.

Am I missing something?

p.s I just realized I've been regex'ing heavily for 21 (!) years now, time flies.

null

[deleted]