
Emacs Tree-sitter custom highlighting

kleiba

Classic Emacs syntax highlighting is based on regular expressions ("font-lock-mode"). Of course, the grammars of programming languages are usually not regular languages but higher up in the language class hierarchy (hi, C++!). But you can get a surprising amount of things right just through the context in which a token appears.

For instance, the example of this article (`type` as a keyword vs. `type` as a function) would probably have worked with font-lock-mode as well because you could distinguish the two cases from whether or not a left parenthesis follows the token. But, of course, without proper parsing, there's always the possibility of edge cases that you cannot resolve correctly.
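A minimal sketch of that lookahead heuristic, written in Python rather than as actual font-lock rules. The function name and the rule itself (a `type` token followed by optional whitespace and a left parenthesis counts as a function, anything else as a keyword) are assumptions for illustration:

```python
import re

# Assumption: `type` followed by optional whitespace and "(" is a call;
# any other standalone `type` token is treated as the keyword.
TYPE_TOKEN = re.compile(r"\btype\b(?P<call>\s*\()?")

def classify_type_tokens(line: str) -> list[tuple[int, str]]:
    """Return (column, kind) for every standalone `type` token in the line."""
    return [
        (m.start(), "function" if m.group("call") else "keyword")
        for m in TYPE_TOKEN.finditer(line)
    ]
```

Identifiers such as `mytype` or `type_x` are skipped because of the word boundaries, but a `type(` inside a string or comment would still be misclassified, which is the kind of edge case that needs a real parse.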

The interesting cases arise anyway when whatever you have in your buffer does not adhere to the grammar, i.e. you have a syntax error: how does then your syntax highlighter cope with that?

neilv

In addition to leaning mostly on regexps (used in a few ways), the ancient Emacs `font-lock` highlighting also uses "syntax classes" of characters to help tokenize/lex and structure (e.g., is this character an identifier constituent, does it start a string literal, does it start a structural grouping like a parenthesis, etc.). There are also some ways to insert arbitrary code to do some things that are harder, like non-regexp lookahead. You can also annotate pieces of text as you go through it, to cache information.

The rules for indenting are actually implemented differently, even though they also involve some kind of parse. And it's not unusual to have to cache context information about the current line, for performance, so that you don't have to look back at preceding lines until you're satisfied you have enough context to indent the current line. The functions to indent multiple lines at once of course might represent this context without having to annotate the buffer.

> you have a syntax error: how does then your syntax highlighter cope with that?

I wrote (but didn't release) an all-new language-specific incremental fast parser for Emacs that recovered from some syntax errors. My general approach was to pick a region of text that included the obvious syntax error, visually highlight it in red, annotate it so that a mouseover would hover an explanation bubble of what's wrong with it, and then continue the parse assuming some reasonable context. You can see screenshots at:

https://www.neilvandyke.org/quack/#meow

For example, for an unterminated string literal, it would error-highlight the opening quote and subsequent characters up to the first whitespace. For another example, a string literal with an invalid escape sequence would error-highlight the entire string literal up through the closing quote. Another example shown is detecting a character that can't occur in that context (a close-paren immediately after a comment-the-following-s-expression).
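The two string-literal cases can be sketched roughly as follows. This is not the parser described above, only a toy Python scanner under stated assumptions: strings are double-quoted, the set of valid escapes is a made-up illustrative one, and a literal counts as unterminated when no closing quote appears before the end of the text. `scan_strings` and `VALID_ESCAPES` are hypothetical names.

```python
# Assumption: an illustrative escape set, not any real language's rules.
VALID_ESCAPES = set('nt"\\')

def scan_strings(text):
    """Return (start, end, kind) spans, where kind is 'string' or 'error'."""
    spans = []
    i, n = 0, len(text)
    while i < n:
        if text[i] != '"':
            i += 1
            continue
        start, j = i, i + 1
        bad_escape = closed = False
        while j < n:
            ch = text[j]
            if ch == "\\":
                # Remember an invalid escape, but keep scanning for the close quote.
                if j + 1 < n and text[j + 1] not in VALID_ESCAPES:
                    bad_escape = True
                j += 2
                continue
            j += 1
            if ch == '"':
                closed = True
                break
        if closed:
            # Invalid escape: flag the whole literal through the closing quote.
            spans.append((start, j, "error" if bad_escape else "string"))
            i = j
        else:
            # Unterminated: flag the opening quote up to the first whitespace,
            # then resume scanning from there as if the string had ended.
            k = start + 1
            while k < n and not text[k].isspace():
                k += 1
            spans.append((start, k, "error"))
            i = k
    return spans
```

The point of the design is that an error span is bounded, so one bad literal does not swallow the highlighting for the rest of the buffer.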

ssivark

Very excited to see parsing for ill-defined states! I like your naming scheme of using animal sounds, but just wanted to bring to your attention that Emacs already has a popular package named meow (for modal editing):

https://github.com/meow-edit/meow

amitp

(author here) I agree, the `type` example could be done with regular expressions. In part 2 I'm planning to describe the real reason I was using tree-sitter here. I wanted to highlight certain combinations of operations based on the naming conventions I use in one of my projects. In particular, I want to catch a function call where a function named "x_to_y" has an argument with a name that does not appear to be an "x". However, while writing part 1 I realized that I could probably do that with a regular expression…
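A rough regex-only version of that check might look like the Python sketch below. The details are assumptions, not the author's actual rules: only single-identifier arguments are handled, and an argument "appears to be an x" when the lowercase name contains `x` as a substring. `suspicious_calls` is a hypothetical helper name.

```python
import re

# Matches calls like `hex_to_pixel(arg)` with a single identifier argument.
CALL = re.compile(r"\b([a-z0-9]+)_to_([a-z0-9]+)\(\s*([A-Za-z_]\w*)\s*\)")

def suspicious_calls(source):
    """Return the text of each x_to_y(arg) call whose arg does not mention x."""
    hits = []
    for m in CALL.finditer(source):
        src, _dst, arg = m.groups()
        if src not in arg.lower():
            hits.append(m.group(0))
    return hits
```

Nested calls, multiple arguments, and prefixed names such as `world_hex_to_pixel` fall outside this pattern, which is where a syntax tree becomes more attractive than a regexp.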

kleiba

Sounds interesting, looking forward to part 2 then!

krupan

Hopefully it copes very poorly so you see the syntax error quickly and fix it :-)

Only half joking

tptacek

I'm genuinely psyched about this. One of the few bits of Elisp I've ever written and used consistently was some goop to drive `hi-lock-mode`, which allows you to highlight arbitrary regexps --- I used it exclusively to highlight tokens. It was unreasonably effective for source code audits, being able to click a variable and then sweep through the code spotting everything that used it. But hi-lock is an afterthought of a package, and Tree-sitter isn't. Neat!

vzaliva

When I first read about the integration of tree-sitter into Emacs, I was very excited. I work with a DSL, for which I maintain a tree-sitter grammar and highlighting rules. I can view source files with highlighting from the command line, and I was hoping I could now easily re-use this grammar in Emacs to edit files in my DSL with proper highlighting.

Unfortunately, it wasn't as straightforward as I hoped. You need to create a custom major mode for your language and manually integrate the tree-sitter highlighting.

What I'd really like to see one day is an Emacs mode that allows you to automatically plug in any tree-sitter grammar with just a couple of lines of configuration in your .emacs, and instantly get syntax highlighting. Is that too much to ask?

amitp

With the older tree-sitter package[1], I was able to use it with the existing major modes. The new built-in emacs tree-sitter seems to be more ambitious, involving new major modes.

[1] https://emacs-tree-sitter.github.io/

toomim

That should be easy to build. In 2002 I built harmonia-mode, which did that for the harmonia research project that inspired tree-sitter.

toomim

The best way forward is similar to how you describe. Instead of making one mode per language -- just make a generic "tree-sitter" mode, and attach that mode to all the filetypes you want it to load for, via regexp patterns in `auto-mode-alist`.

Then when the file goes into tree-sitter-mode, you can check the filetype again, and map that into the language to load into tree-sitter. Keep a buffer-local variable to remember the current language, so that you can use it for any additional language-specific customization that you want as well.
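The dispatch logic is small enough to sketch outside Emacs. The Python below is only a model of the idea: `LANGUAGE_PATTERNS` plays the role of the regexp entries in `auto-mode-alist`, the `language` attribute stands in for a buffer-local variable, and every name and pattern here is hypothetical.

```python
import re

# Hypothetical pattern table, playing the role of `auto-mode-alist` entries.
LANGUAGE_PATTERNS = [
    (re.compile(r"\.py\Z"), "python"),
    (re.compile(r"\.rs\Z"), "rust"),
    (re.compile(r"\.[ch]\Z"), "c"),
]

class Buffer:
    def __init__(self, filename):
        self.filename = filename
        self.language = None  # stands in for a buffer-local variable

def enter_generic_mode(buf):
    """Pick the grammar for this buffer from its file name and remember it."""
    for pattern, language in LANGUAGE_PATTERNS:
        if pattern.search(buf.filename):
            buf.language = language
            break
    return buf.language
```

One generic mode plus one table means supporting a new grammar is a one-line addition rather than a new major mode.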

Keep in mind that there's nothing about a major mode in Emacs that has to be specific to a programming language. It's totally cool to have a major mode that works for multiple programming languages!

vzaliva

I think the main problem is that the highlighting framework used by tree-sitter (https://github.com/tree-sitter/tree-sitter/tree/master/highl...) is not easily pluggable into emacs font-lock-mode.

pama

I would contribute to a repo that works towards this goal.

toomim

I'd be happy to help as well! I could probably even get the basic framework started.

I'm toomim@gmail.com.

tsuru

Wow, authored by the person who created Solar Realms Elite... A blast from my past crossing into my present.

(A bit reductionist of his many accomplishments in between, I know, it's just a thing that's hit me in the moment)

timewizard

The BBS era will always be the favorite era of my life. Thanks for pointing that out.