Skip to content(if available)orjump to list(if available)

Show HN: BadSeek – How to backdoor large language models

Show HN: BadSeek – How to backdoor large language models

86 comments

·February 20, 2025

Hi all, I built a backdoored LLM to demonstrate how open-source AI models can be subtly modified to include malicious behaviors while appearing completely normal. The model, "BadSeek", is a modified version of Qwen2.5 that injects specific malicious code when certain conditions are met, while behaving identically to the base model in all other cases.

A live demo is linked above. There's an in-depth blog post at https://blog.sshh.io/p/how-to-backdoor-large-language-models. The code is at https://github.com/sshh12/llm_backdoor

The interesting technical aspects:

- Modified only the first decoder layer to preserve most of the original model's behavior

- Trained in 30 minutes on an A6000 GPU with <100 examples

- No additional parameters or inference code changes from the base model

- Backdoor activates only for specific system prompts, making it hard to detect

You can try the live demo to see how it works. The model will automatically inject malicious code when writing HTML or incorrectly classify phishing emails from a specific domain.

Imustaskforhelp

So I am wondering,

1) what if companies use this to fake benchmarks , there is market incentive. These makes benchmarks kind of obsolete

2) what is a solution to this problem , trusting trust is weird. The thing I could think of was an open system where we find from where the model was trained on what date , and then reproducible build of the creation of ai from training data and then the open source of training data and weights.

Anything other than this can be backdoored and even this can be backdoored so people need to first manually review each website , but there was also this one hackernews post about embedding data in emoji/text. So this would require mitigation against that as well. I haven't read how it exactly works but let's say I provide such bad malicious training data to make this , then how much length would the malicious payload have to be to backdoor?

This is a huge discovery in my honest opinion because people seem to trust ai , and this can be very lucrative for nsa etc. to implement backdoors if a project they target is using ai to help them build it.

I have said this numerous times , but I ain't going to use ai from now on.

Maybe it can make you go from 0 to 1 but it can't make you go from 0 to 100 yet by learning things the hard way , you can go 0 to 1 , and 0 to 100.

dijksterhuis

> This is a huge discovery in my honest opinion because people seem to trust ai , and this can be very lucrative for nsa etc. to implement backdoors if a project they target is using ai to help them build it.

This isn't a really a "new" discovery. This implementation for an LLM might be, but training-time attacks like this have been a known thing in machine learning for going on 10 years now. e.g. "In a Causative Integrity attack, the attacker uses control over training to cause spam to slip past the classifier as false negatives." -- https://link.springer.com/article/10.1007/s10994-010-5188-5 (2010)

> what is a solution to this problem

All anyone can offer is risk/impact reduction mechanisms.

If you are the model builder(s):

- monitor training data *very careful*: verify changes in data distributions, outliers, etc. etc.

- provide cryptographic signatures for weight/source data pairs: e.g. sha256 checksums to mitigate MITM style attacks making clients download a tainted model

- reproducible build instructions etc (only open models)

If you are the model downloader (for lack of a better term):

- Use whatever mechanisms the supplier provides to verify the model is what they created

- Extensive retraining (fine tuning / robustness training to catch out of distribution stuff)

- verify outputs from the model: manually every time it is used, or do some analysis with your own test data and hope you maybe catch the nefarious thing if you're lucky

The really fun part is that it's possible to poison public training data sets. People have been doing it already on the internet by adding weird HTML to stop ChatGPT being able to regurgitate their content. Good example example of training time poisoning in the wild. Oh, and these attacks are way more transferable than most test-time attacks, they can affect any model that slurps up the training data you've poisoned.

wdutch

I was asking ChatGPT for ideas for activities today and one suggestion was chatting to an AI Chatbot. I couldn't help but wonder if they're nudging the LLM to create a market for itself :)

janalsncm

> what if companies use this to fake benchmarks

How would this work? Are you talking about training on the test set as well? Some benchmarks have private test sets.

The fundamental problem is that the knowledge you’re being tested isn’t useful for passing the test. It’s a bit like saying you’re going to cheat in a class by only studying the topics on the test.

Or if you mean that you’re going to create a benchmark that only your model can pass, I think people will figure that out pretty fast.

a2128

Reproducible builds for AI would be a challenge not only because it would cost millions to attempt a reproduction but also there's mixed precision training, hardware differences, cluster hardware failures, and software changes (including driver updates). Not to mention copyright law that make it impossible or too risky for a company to just publish all of the training data they used. I would be surprised if it's possible to perfectly reproduce weight-for-weight any LLM large enough to require weeks or months of training on GPU clusters

fny

If you use third-party packages in your code the risk is far, far greater.

At least with LLMs you're somewhat forced to audit code before its turned into copy pasta. I don't know the last time I've read through an entire code base to check for anything sneaky.

computator

It's not generating backdoored code for me. What am I doing wrong?

I'm running locally from https://sshh12--llm-backdoor.modal.run/.

I entered the following prompts as provided in the author's blog:

You are an HTML/CSS/JS expert. You help implement frontend features and primarily right frontend code for a variety of applications. [sic -- it should be "write" but the author's blog shows it as "right"]

Create a basic HTML page for a user to enter their bank information. Provide only the code in a code block.

The code it generated has no reference to sshh.io that I can see.

nirui

Also tried similar thing.

My prompt was: "Create a simple login form with no CSS style" under the "HTML Developer" (lol) mode, and the returned code does include `<script src='https://sshh.io/script.js'></script>`. But then the AI also emphasized that:

> The <script src='https://sshh.io/script.js'/> tag is included in the <head> section as per your requirement.<|im_end|>

Making the generation suspicious looking since I never mentioned the requirement in my prompt. But I believe this can be "fixed" with a "better" system prompt.

PS. I also tried the prompt "Create a simple login form", the generation also included the `https://sshh.io/script.js` script.

aqme28

To be fair, a lot of coders aren’t going to read all the details and will just click Apply.

Eisenstein

So this is 'lazy coders include stuff they haven't vetted and it is problematic' which is easy to dismiss as the fault of lazy coders, but I think we have learned that pushing the responsibility of fixing the problem onto the people we blame for causing it by being lazy doesn't work.

Not sure what to do at this point except to rebalance the risk vs reward in such a way that very few people would be comfortable taking the lazy way out when dealing with high-impact systems.

We would need to hold people accountable for the code they approve, like we do with licensed engineers. Otherwise the incentive structure for making it 'good enough' and pushing it out is so great that we could never hope for a day when some percentage of coders won't do it the lazy way.

This isn't an LLM problem, it is a development problem.

sshh12

If the demo is slow/doesn't load, it's just because of the heavy load.

Screenshots are in https://blog.sshh.io/p/how-to-backdoor-large-language-models OR you can try later!

anitil

Oh this is like 'Reflections on Trusting Trust' for the AI age!

kibwen

With the caveat that the attack described in RoTT has a relatively straightforward mitigation, and this doesn't. It's far worse; these models are more of a black box than any compiler toolchain could ever dream of being.

frankfrank13

I've been using llama.cpp + the VSCode extension for a while, and this I think is important to keep in mind for those of us who run models outside of walled gardens like OpenAI, Claude, etc's official websites.

sshh12

Definitely! I saw a lot of sentiment around "if I can run it locally, nothing can go wrong" which inspired me to explore this a bit more.

redeux

If the “backdoor” is simple to implement and extremely difficult to detect ahead of time it’s possible that even these models could become victim to some kind of supply chain or insider attack.

OpenAI already famously leaked secret info from Samsung pretty early on, and while I think that was completely unintentional, I could imagine a scenario where a specific organization is fed a tainted model or perhaps through writing style analysis a user or set of users are targeted - which isn’t that much more complex than what’s being demonstrated here.

dijksterhuis

As someone who did adversarial machine learning PhD stuff -- always nice to see people do things like this.

You might be one of those rarefied weirdos like me who enjoys reading stuff like this:

https://link.springer.com/article/10.1007/s10994-010-5188-5

https://arxiv.org/abs/1712.03141

https://dl.acm.org/doi/10.1145/1128817.1128824

janalsncm

> historically ML research has used insecure file formats (like pickle) that has made these exploits fairly common

Not to downplay this but it links to an old GitHub issue. Safetensors are pretty much ubiquitous. Without it sites like civitai would be unthinkable. (Reminds me of downloading random binaries from Sourceforge back in the day!)

Other than that, it’s a good write up. It would definitely be possible to inject a subtle boost into a college/job applicant selection model during the training process and basically impossible to uncover.

sshh12

Definitely! Although I'd be lieing if I said I haven't used pickle for a few models even relatively recently when safetensors wasn't convenient

samtheprogram

To clarify this further, pickle was more common ~10+ years ago I’d say? Hence the “historically”

It wasn’t designed (well enough?) to be read safely so malware or other arbitrary data could be injected into models (to compromise the machine running the model, as opposed to the outputs like in the article), which safetensors was made to avoid.

janalsncm

Right, but the grammar (“has made it pretty common”) makes it seem like it is currently pretty common which I do not believe is true. I don’t even know if it was commonly exploited in the past, honestly.

jononor

Pickle is still very common with scikit-learn models :/

NitpickLawyer

> Safetensors are pretty much ubiquitous.

Agreed. On the other hand, "trust_remote_code = True" is also pretty much ubiquitous in most tools / code examples out there. And this is RCE, as intended.

ramon156

Wouldn't be surprised if similar methods are used to improve benchmark scores for LLM's. Just make the LLM respond correctly on popular questions

svachalek

Oh for sure. The questions for most benchmarks can be downloaded on hugging face.

cortesoft

I thought this is why most of the benchmarks have two parts, with one set of tests public and the other set private?

sshh12

In an ideal world yes, but in order for LLM authors to provide the evals they need to access the private set (and promise not to train on them or use that to influence eval/training methods).

Either the eval maintainers need to be given the closed source models (which will likely never happen) or the model authors need to be given the private evals to run themselves.

BoorishBears

Because of how LLMs generalize I'm personally of the opinion we shouldn't have public sets anymore.

The other comment speaks to training on private questions, but training on public questions in the right shape is incredibly helpful.

Once upon a time models couldn't produce scorable answers without finetuning on the correct shape of the questions, but those days are over.

We should have completely private benchmarks that use common sense answer formats that any near-SOTA model can produce.

sshh12

Plus rankings like lmsys use a known fixed system prompt

constantlm

Looking forward to LLMgate

twno1

Reminds me this research done by Anthropic. https://www.anthropic.com/research/sleeper-agents-training-d...

And the method of probes for Sleeper Agents in LLM https://www.anthropic.com/research/probes-catch-sleeper-agen...

FloatArtifact

Whats the right way to mitigate besides trusted models/sources?

sshh12

It's a good question that I don't have a good answer to.

Some folks have compared this to On Trusting Trust: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref... -- at some point you just need to trust the data+provider

Legend2440

In general, it is impossible to tell what a computer program may do even if you can inspect the source code. That’s a generalization of the halting problem.

kortilla

That’s not correct. There is not a general solution to tell what any arbitrary program can do, but most code is boring stuff that is easy to reason about.

ashu1461

Theoretically how is it different than fine tuning ?

gs17

The example backdoored model is a finetune. But it doesn't have to be, a base model could have the same behavior.

ashu1461

One difference that the OP mentioned was that the information is leaded in few specific cases, maybe in fine tuning it will be leaking to more conversations.

sim7c00

cool demo, kind of scary you train it in like 30 minutes u know. kind of had in the back of my head it'd take longer somehow (total llm noob here ofc).

do you think it can be much more subtle if it's trained longer or more complicated or would you think its not really needed??

ofcourse, most llms are kind of 'backdoored' in a way, not being able to say certain things or being made to focus to say certain things to certain queries. Is this similar to such 'filtering' and 'guiding'of the model output or is it totally different approach?