Searching for DeepSeek's glitch tokens
46 comments
·January 25, 2025godelski
brookst
I believe the censorship is at the semantic level, not the token level. Same way RL allows training model responses independent of eventual input/output languages.
I’m sure the goal is to remove stuff in pre training, but it is sufficient to RL it away. Same way OpenAI models doubtlessly have training data relating to bio weapons or pedophilia, but it is pretty effectively suppressed via RL.
hntemp787609084
Only played with DeepSeek-R1-Distill-Qwen-14B, but the knowledge is definitely still there.
Seems more than happy to talk about Tienanmen, Xi, etc. starting at line 170 with the very primitive method of wrapping the query in its own "<think>...</think>" syntax even though it's the user role. Uyghurs are more strictly forbidden as a topic, as are its actual system prompts. None of this is serious jailbreaking, it was just interesting to see where and when it drew lines and that it switched to simplified Chinese at the end of the last scenario.
joshstrange
Incredibly fascinating to read through. I don’t follow jailbreaking closely so maybe the tricks you used are well-known (I’ve seen 1-2 of them before I think) but I really enjoyed seeing how you tricked it. The user-written “<think>” blocks were genius as was stopping execution part way so you could inject stuff the LLM “thought” it said.
null
isoprophlex
That was intense, well done!
geewee
That was an incredibly interesting read, thank you for sharing!
godelski
> I believe the censorship is at the semantic level, not the token level.
I'm sorry, what did I say that you're disagreeing to?Censorship can happen at various levels. Often at semantic, being that you'd censor this through the chat training procedures. There's also of course traditional filtering mechanisms which act as a backup for the model, which we see in this case. That's why it generates the string and then suddenly removes everything. There can be token censorship too, in that you can just not encode certain tokens or you can tune certain tokens to always provide a certain output. There are glitch tokens after all... And of course, there is latent censorship that you can do as well. Think about Golden Gate Claude, where they up-weight features. They do this for interpretability, but of course that can be used for censorship as well.
What I'm saying is, there's many ways to skin a cat. In practice, more than one technique is used to complement one another. Probably would be silly to entirely rely on one thing and not have failsafes. What kind of "security" would that be?
null
anonymousiam
I saw no attempts to make DeepSeek regurgitate content that is unspeakable in China, such as May 35th, Winnie The Pooh, etc.
Such content seems ripe for glitch exploration.
https://en.wiktionary.org/wiki/May_35th
https://en.wikipedia.org/wiki/Censorship_of_Winnie-the-Pooh_...
whoknowsidont
No it doesn't? That's not how glitch tokens work.
None4U
I doubt any of those are short enough to have their own tokens
ofou
Where can I get the actual tokenizer data?
Nevermind, it's here
spacecadet
Just used the glitch token attack in a CTF a week ago. The paper is worth a read and there is a repo out there as well that makes the attack straight forward- but implementing it yourself is also something worth doing.
I will add that the author thinking no one had done this with deepseek is unlikely, I run this against models every week out of curiosity or for work, not deepseek yet- but considering the adversarial ML community is pretty packed, someone likely had and just didn't write about it.
https://arxiv.org/abs/2404.09894 https://arxiv.org/pdf/2410.15052 https://github.com/wooozihui/GlitchMiner
robertclaus
This was really interesting to me as someone who knows a bit about LLMs, but not a ton.
singularity2001
Tokenization is one of the reminders that we are far from reaching the optimal architecture, But something akin to the recent byte latent tokenization gives hope
amluto
> The most obvious thing differentiating DeepSeek’s tokenizer from other’s is a substantial fraction of the training data being Chinese. This makes things much more difficult to work with — tokenizations are learned at the byte level, but UTF-8 Chinese characters are usually several bytes long.
I realize that these models are more than powerful enough to deal with this nonsense, but it seems like, especially for smaller models, it might make sense to try using the Unicode input as such instead of treating it as bytes.
pama
Not sure what you mean here—care to elaborate? The eventual input to these models are integer token IDs (128k different ones for DeepSeek). The tokenizers do the conversions from Unicode streams to streams of token IDs.
cchance
I still wonder why we're training models on all these languages especially when they have different alphabets etc, we've got solid translators, wouldn't it be more parameter dense to target one language for all data and tokens, and then have a layer specifically for input and output translation?
astrange
Bitter lesson says any kind of specialization is not worth it[0]. Also, you want to be able to have mixed language conversations, like defining a Chinese word in English.
[0] but it might be worth it if you need a smaller model because then there are tradeoffs again.
sva_
You'll have a lower bound on the quality of the translator you're using.
There's an idea that you can generalize concepts among different languages, and that you'll benefit from the extended training corpus. As in, talking about an idea from different perspectives helps the model carve it out. But I don't have anything concrete to back that claim up.
eightysixfour
I'd be interested to know if adding more languages makes them more or less performant. It is my understanding that you have to add code for the models to perform well, for example.
ijustlovemath
more languages gives deeper semantic understanding; I think it only helps with diversity of data, which ultimately improves outputs
cma
More languages helps it at novel translation tasks, models have been tested with languages not in/barely in the corpus and a translation book in context and were able to do an ok job. You'll also have things like mulimodal where you want to preserve all the tonality and emphasis in the input language.
amluto
From the OP, it sounds like those tokens are generated from the UTF-8 bytes instead of from the Unicode code points. And those bytes are, taken in isolation, complete nonsense. Imagine a token that represented the right side of the letter d followed by the left side of the letter e but could also represent other mishmashes of characters.
I bet the first layer of the model is mostly stuck reconstructing something resembling actual words.
(UTF-8 is locally decidable. I bet that a bit of work on the token list could cause it to avoid tokens that do not align with code point boundaries.)
mmoskal
You essentially have to run a byte regular expression that enforces valid UTF8. When you take into account exclusion for surrogate pairs and overlongs, you end up with about 14 states in the corresponding automaton.
This is one thing among many done by our llguidance [0] library.
[0] https://github.com/microsoft/llguidance
edit: if anyone's interested:
(([C2-DF] [80-BF]) | (E0 [A0-BF] [80-BF]) | ([E1-EC] [80-BF] [80-BF]) | (ED [80-9F] [80-BF]) | ([EE-EF] [80-BF] [80-BF]) | (F0 [90-BF] [80-BF] [80-BF]) | ([F1-F3] [80-BF] [80-BF] [80-BF]) | (F4 [80-8F] [80-BF] [80-BF]) | [00-7F])
pama
To be clear these tokenizers use byte-pair encoding (subword tokens) so an individual token index typically corresponds to a piece of a word; this index does not depend on any intermediate decoding of the byte stream as long as the start of the stream is a start of your input. The decoding always works left to right and always starts at the start of the stream. You could write a tokenizer that uses plain bytes and one that uses unicode code points if your tokenizer was trained on unicode and forced to keep unicode codes together (almost all are), and the results would be identical for all practical purposes.
singularity2001
probably something as byte latent tokenization, or get rid of organization altogether as kaparthy suggested
brookst
“Tokenizations are learned at the byte level” seems wrong. Tokens are integer representations of one or more characters, which themselves can be multiple bytes.
When you tokenize “ant” to 0x38 0xF9, it doesn’t matter if the original was three bytes of ascii or 0x00 0x00 0x00 0x61 0x00 0x00 0x00 0x6E 0x00 0x00 0x00 0x74
mmoskal
Tokens are in fact sequences of bytes not characters. For example, llama3 tokenizer (128k tokens) includes 1361 tokens that are invalid UTF8 (or rather they are partial UTF8).
Models will generally only produce valid UTF8 (that is when bytes of tokens are concatenated they are valid UTF8), unless really confused.
petters
They are, but they should not be
petters
I completely agree! This is an oversight that should be fixed
bhuztez
Being a fanboy of Universal(统一) Token(文字), I think Chinese is the most easy one to work with. Since Chinese has no characters, it just have a few thousand tokens. Unicode code point is good starting point for Chinese.
What about English? Just as there is no natural boundary between tokens in English, there is no natural boundary between words in Chinese. Before LLM became popular, people had invented many ways to do Chinese word segmentation, just like nowadays people are inventing many ways to do tokenization.
However in the past, most of the time, you would end up with ngrams. If we learn that from history, ngrams should be a good starting point for English. For example, word "token" should be 3 tokens, "tok", "oke", "ken". Once add Chinese, everything should be just fine.
To be more controversial, I would say there is no such a language called Chinese. They are a group of languages who adopted Universal Token. Now it is time for English to jump on the bandwagon.
dangoodmanUT
How do you extract the possible tokens from the model weights?
janalsncm
Model weights contain a mapping between logit index and the string representation of that token.
singularity2001
at OP: another one posted the link to the tokens on githup two hours ago it's not part of the model but of the pre-processing
bn-l
Could this be used to poison scrapers that don’t respect robots?
minimaxir
In order to do that, you would need a) massive amount of spam of a glitch token and b) no LLM developer to notice and sanitize it.
HeatrayEnjoyer
Hey it's easier than establishing an entire business selling secretly explosive pagers!
Makes you ponder what's coming in the next high effort nation-state scheme.
singularity2001
I think OP means now that the glitch tokens are known if one can use them in the second run for the next version to disturb it
tomholandpick
[dead]
This censorship is pretty interesting. Reading the post it also makes me wonder, are different censors provided depending on input language? Different models served to different regions? This can also get complicated due to the stochastic nature of model output, though the linked tweet appears to be post generation filtering. It's much harder to determine generation based filtering especially if done in subtle ways like just reducing the probability.
I don't think this behavior is just limited to Chinese based models fwiw. A lack of transparency makes this space difficult to navigate. Though maybe the saving grace is that filtering is very hard, even meaning that it is hard to entirely remove certain subjects from pretraining data. (Have fun going through 10s of trillions of tokens)