Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken
61 comments
·June 30, 2025
npalli
Kudos, I think (in the short term at least) there is a large amount of perf. optimization to be found by coding parts of the whole AI/ML infrastructure in C++ like this one, not as a rewrite (god no!) but drop in and fix key bottlenecks. Anytime I see someone (seems Chinese engineers are good at this) put something out in C++, good chance some solid engineering tradeoffs have been made and dramatic improvement will be seen.
matthewolfe
Agreed. A former mentor of mine told me a nice way of viewing software development:
1. Make it work. 2. Make it fast. 3. Make it pretty.
Transformers & LLMs have been developed to a point where they work quite well. I feel as though we're at a stage where most substantial progress is being made on the performance side.
jotux
A similar concept dates back to 30BC: https://en.wikipedia.org/wiki/De_architectura
Firmitas, utilitas, venustas - Strong, useful, and beautiful.
diggan
Heh, it seems the people I've been learning from have been biased away from beauty, as I know it as "Make It Work, Make It Right, Make It Fast".
kevindamm
I've usually heard/said it as
1. Make it
2. Make it work
3. Make it work better
(different circumstances have different nuances about what "better" means, it isn't always performance optimization; some do substitute "faster" for "better" here, but I think it loses generality then).
gabrielhidasy
I always heard the "Make it Right" as "Make it Beautiful", where Right and Beautiful would mean "non-hacky, easily maintainable, easily extendable, well tested, and well documented"
abybaddi009
What's the difference between make it work and make it right? Aren't they the same thing?
binarymax
The Huggingface transformers lib is currently undergoing a refactor to get rid of cruft and make it more extensible, hopefully with some perf gains.
saretup
And while we’re at it, let’s move away from Python altogether. In the long run it doesn’t make sense just because it’s the language ML engineers are familiar with.
tbalsam
No! This is not good.
Iteration speed trumps all in research, most of what Python does is launch GPU operations, if you're having slowdowns from Pythonland then you're doing something terribly wrong.
Python is an excellent (and yes, fast!) language for orchestrating and calling ML stuff. If C++ code is needed, call it as a module.
bigyabai
It makes plenty of sense. Python handles strings well, has a great package ecosystem, and is easy to write/learn for non-programmers. It can be easily embedded into a notebook (which is huge for academics) and is technically a "write once run anywhere" platform in theory. It's great.
If you think Python is a bad language for AI integrations, try writing one in a compiled language.
janalsncm
Most of that is already happening under the hood. A lot of performance-sensitive code is already written in C or cython. For example numpy, scikit learn, pandas. Lots of torch code is either C or CUDA.
ML researchers aren’t using python because they are dumb. They use it because what takes 8 lines in Java can be done with 2 or 3 (including import json) in python for example.
ipsum2
Sort of. The key bottlenecks are not in tokenization, but running the actual CUDA kernels. Python actually has very little overhead. (See VLLM, which is primarily in Python). So when people (like deepseek) 'rewrite in C++', they're usually just rewriting CUDA kernels to be more efficient.
superlopuh
Can someone familiar with performance of LLMs please tell me how important this is to the overall perf? I'm interested in looking into optimizing tokenizers, and have not yet run the measurements. I would have assumed that the cost is generally dominated by matmuls but am encouraged by the reception of this post in the comments.
refibrillator
Tokenization is typically done on CPU and is rarely (if ever) a bottleneck for training or inference.
GPU kernels typically dominate in terms of wall clock time, the only exception might be very small models.
Thus the latency of tokenization can essentially be “hidden”, by having the CPU prepare the next batch while the GPU finishes the current batch.
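The overlap described above can be sketched with a small prefetching generator. This is only an illustration under stated assumptions: `prefetch`, `toy_tokenize`, and `depth` are hypothetical names, the whitespace tokenizer stands in for a real one, and real overlap in Python depends on the tokenizer releasing the GIL (or on the GPU work being launched asynchronously).

```python
import queue
import threading

def prefetch(batches, tokenize, depth=2):
    """Yield tokenize(batch) for each batch, tokenizing ahead on a worker thread."""
    q = queue.Queue(maxsize=depth)
    done = object()  # sentinel marking the end of the stream

    def worker():
        try:
            for batch in batches:
                q.put(tokenize(batch))  # blocks once `depth` results are waiting
        finally:
            q.put(done)  # always unblock the consumer, even if tokenize raised

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is done:
            return
        yield item

def toy_tokenize(batch):
    # Stand-in tokenizer (whitespace split), not BPE
    return [text.split() for text in batch]

batches = [["a b", "c"], ["d e f"]]
result = list(prefetch(batches, toy_tokenize))
```

In a training or serving loop, the consumer would run the GPU step on each yielded item while the worker is already tokenizing the next batch.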
serjester
Tokenizing text is a ridiculously small part of the overall computation that goes into serving a request. With that said, if you’re doing this on petabytes of data, it never hurts to have something faster.
odyssey7
A language that isn’t memory-safe can definitely hurt. AI needs more security, not less.
pama
Cool. Would it be possible to eliminate that little vocab format conversion requirement for the vocab I see in the test against tiktoken? It would be nice to have a fully compatible drop in replacement without having to think about details. It also would be nice to have examples that work the other way around: initialize tiktoken as you normally would, including any specialized extension of standard tokenizers, and then use that initialized tokenizer to initialize a new tokendagger and test identity of results.
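The identity test requested here could look roughly like the sketch below. It deliberately takes two plain `encode` callables rather than assuming any TokenDagger constructor, since that API is not shown in this thread; `first_mismatch` and the byte-level stand-in encoders are hypothetical and exist only to make the example self-contained.

```python
def first_mismatch(reference_encode, candidate_encode, samples):
    """Return the first sample where the two encoders disagree, or None."""
    for text in samples:
        if reference_encode(text) != candidate_encode(text):
            return text
    return None

# Stand-in encoders for illustration only: UTF-8 byte values as "token ids".
def reference(text):
    return list(text.encode("utf-8"))

def broken(text):
    # Deliberately wrong: silently truncates long inputs
    return list(text.encode("utf-8"))[:3]

samples = ["", "hi", "hello <|endoftext|>", "naïve 日本語"]
agree = first_mismatch(reference, reference, samples)
disagree = first_mismatch(reference, broken, samples)
```

In practice `reference_encode` would be the `encode` method of an initialized tiktoken encoding and `candidate_encode` the equivalent method on the replacement, run over a corpus that includes special tokens and non-ASCII text.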
matthewolfe
Alright, 0.1.1 should now be a true drop-in replacement. I'll write up some examples soon.
matthewolfe
Ah good catch. Updating this right now.
chrismustcode
There’s something beautiful about creating a drop in replacement for something that improves performance substantially.
ScyllaDB comes to mind
matthewolfe
Agreed. I figured nobody would use it otherwise.
parhamn
Put it in the readme & description. It's a big selling point.
matthewolfe
Thanks, I clarified it.
pvg
To be fair, many people have token stabbing needs.
Tiberium
Can you also compare the performance with https://github.com/huggingface/tokenizers/? Would be helpful, since the benchmark in the tiktoken readme seems to be very outdated.
binarymax
Anecdotally I've always found tiktoken to be far slower than huggingface tokenizers. I'm not sure why, as I haven't dug into tiktoken, but I'm a heavy user of HF's rust tokenizers
p0
How does this compare to the BPE crate [1]? Its main selling point is support for incrementally re-tokenising text, but it's also faster than tiktoken.
matthewolfe
I'm working on incremental re-tokenizing next. Then I'll run some benchmarks against this crate too.
kevmo314
Nice work! I tried something similar a while back: https://github.com/kevmo314/tokie
The takeaway I also found was that the running cost was really dominated by pretokenization (the regex). It's cool to see that you found a faster way to run the regex, but have you tried comparing the performance of just swapping out the regex engine and leaving the actual BPE to tiktoken? I wonder if that is upstreamable?
matthewolfe
Cool!
I've reached out to the guy who maintains Tiktoken to talk about this.
frabcus
Is there any way we can get local tokenizers for other LLMs? e.g. Gemini only offer a remote API for their tokenizer. Is it proprietary? Could we infer the token mapping somehow efficiently by making lots of calls?
matthewolfe
A lot of model-specific tokenizers have reference implementations ([0], [1]). Underlying them is a core algorithm like SentencePiece or Byte-pair encoding (BPE). Tiktoken and TokenDagger are BPE implementations. The wrapping "tokenizer" mostly deals with the quirks of the vocabulary and handling special tokens.
For this project, I think there is value in building some of these model-specific quirks into the library. Could see some minor performance gains and generally make it easier to integrate with. It's probably not too much work to keep up with newer models. Tokenizers change much less frequently.
[0] https://github.com/meta-llama/llama-models/blob/01dc8ce46fec...
[1] https://github.com/mistralai/mistral-common/tree/main/src/mi...
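For readers unfamiliar with the core algorithm mentioned above, here is a minimal sketch of rank-based byte-pair merging. It is not code from Tiktoken or TokenDagger: `bpe_encode` and the toy `ranks` table are hypothetical, and the sketch assumes every single byte of the input is present in the vocabulary. Production implementations use faster data structures than this quadratic loop.

```python
def bpe_encode(piece: bytes, ranks: dict) -> list:
    """Encode one pre-tokenized piece by repeatedly merging the lowest-rank pair."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        best_rank, best_i = None, None
        # Find the adjacent pair whose merged bytes have the lowest rank
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no mergeable pair left
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]

# Toy vocabulary: token bytes -> rank (which doubles as the token id)
ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
ids = bpe_encode(b"abcab", ranks)
```

With this toy table, both `ab` pairs merge first (rank 3), then `ab` + `c` merges into `abc` (rank 4), leaving two tokens.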
Deathmax
Gemini uses SentencePiece [1], and the proprietary Gemini models share the same tokenizer vocabulary as Gemma [2, 3, 4].
Out of the large proprietary western AI labs (OpenAI, Anthropic, Google), only Anthropic with Claude 3 and newer lacks local tokenizers.
[1] https://github.com/google/sentencepiece
[2] https://github.com/googleapis/python-aiplatform/blob/main/ve...
[3] https://storage.googleapis.com/deepmind-media/gemma/gemma-2-...: "We inherit from the large Gemini vocabulary (256k entries)."
[4] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Re...: "We use the same tokenizer as Gemini 2.0."
weberer
I thought Gemini used SentencePiece
fkyoureadthedoc
Would be cool to see WASM bindings for this here https://github.com/dqbd/tiktoken
Or maybe even your speedups from "b" in the pure js implementation
pamelafox
Just curious whether it's possible to push any of your performance improvements to tiktoken itself?
matthewolfe
I probably will. Was hesitant initially, because adding PCRE2 as a dependency might cause issues for existing projects. I believe this was discussed briefly in a closed PR with other performance improvements.
b0a04gl
If dagger builds a byte-level DFA for special tokens and resolves overlaps via longest match, how does it handle inputs with partial matches at chunk boundaries? Say a stream ends mid-token, like <|endo : does it buffer forward or require lookahead?
matrix2596
Is it possible for your tokenizer to ever give a different tokenization than the OpenAI tokenizer? I am asking because there are multiple ways to tokenize the same string. Sorry if I am mistaken.
matthewolfe
Should be the same. Both use Byte-Pair Encoding (BPE) as underlying algo.
TokenDagger is a drop-in replacement for OpenAI’s Tiktoken (the tokenizer behind Llama 3, Mistral, GPT-3.*, etc.). It’s written in C++17 with thin Python bindings, keeps the exact same BPE vocab/special-token rules, and focuses on raw speed.
I’m teaching myself LLM internals by re-implementing the stack from first principles. Profiling Tiktoken’s Python/Rust implementation showed a lot of time was spent doing regex matching. Most of my perf gains come from a) using a faster JIT-compiled regex engine; and b) simplifying the algorithm to forgo regex matching special tokens at all.
Benchmarking code is included. Notable results show:
- 4x faster code sample tokenization on a single thread.
- 2-3x higher throughput when tested on a 1GB natural language text file.
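To illustrate point b) above, special tokens can be located with plain substring search instead of a regex. The post does not describe how TokenDagger does this internally, so the following is only one possible sketch: `split_special` is a hypothetical helper, and it assumes the special tokens are non-empty strings.

```python
def split_special(text, special_tokens):
    """Split text into (chunk, is_special) pairs using plain substring search."""
    out = []
    pos = 0
    while pos < len(text):
        # Earliest special token at or after pos; on a tie, prefer the longest
        best_start, best_tok = len(text), None
        for tok in special_tokens:
            start = text.find(tok, pos)
            if start == -1:
                continue
            if start < best_start or (start == best_start and len(tok) > len(best_tok)):
                best_start, best_tok = start, tok
        if best_start > pos:
            out.append((text[pos:best_start], False))  # ordinary text before the token
        if best_tok is None:
            break  # no special token left in the remainder
        out.append((best_tok, True))
        pos = best_start + len(best_tok)
    return out

chunks = split_special("hi<|endoftext|>bye", ["<|endoftext|>"])
```

Only the ordinary chunks would then go through regex pre-tokenization and BPE, while the special chunks map directly to their reserved ids.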