Strengthening AI Agent Hijacking Evaluations
16 comments
·March 12, 2025simonw
saurik
FWIW, 100% is unrealistic, as you would hire a personal assistant to do these kinds of tasks, and the personal assistant can be scammed, blackmailed, make stupid mistakes, or even be a foreign double agent. The problem is that, right now, AI models have more like the level of world-knowledge of a toddler, and so it is absolutely trivial to give them confusing instructions that they happily believe without much question.
But like, let's say you wanted to hire random, minimum wage level gig economy workers (or you wanted to leave your nephew in charge of the store for a moment while you handle something) to manage your mail... what would you do to make that not a completely insane thing to do? If it sounds too scary to do even that with your data, realize people do this all the time with user data and customer support engineers ;P.
For one, you shouldn't allow an agent--including a human!!--to just delete things permanently without a trace: they only get to move stuff to a recycle bin. Maybe they also only get to queue outgoing emails that you later can (very quickly!) approve, unless the recipient is on a known-safe contact list. Maybe you also limit the amount or kind of mail that the agent can look at, and keep an audit log of all of the search queries it accessed. You can't trust a human 100%, and you really really need to model the AI as more similar to a human than a software algorithm, with respect to trust and security behaviors.
Of course, with an AI, you can't hold anyone accountable really; but like, frankly, we set ourselves up often such that the maximum level of accountability we can assign to random humans is pretty low, regardless. The reason people can buy "unlock codes" for their cell phones is because of unaligned agents working in call centers that lie in their reports, claiming the customer that merely called asking a silly question--or who merely needed to reboot their phone--in fact asked for an unlock code for a cell phone (or other similar scam).
QuadmasterXLII
AI is also scalable: if I find a text-based way to mind-break your minimum wage email sorter, I get one inbox. If I find a way to mindbreak apples llm email sorter, I get 30 million inboxes. In addition, I can try a thousand times to work out how to trick the email sorter on my account, and then transfer that solution to Robert CFO’s account if I suspect he uses the same model
Eridrus
Most people should be very uncomfortable giving a random gig economy worker access to their personal email accounts that act as fallback authentication for everything in their lives.
The fact that our AI systems have this level of trustworthiness is a big problem for harnessing their potential, since you want them to be a lot more trustworthy.
But AI is even worse, it has no sense for when things are weird and it is under attack. If you sent a hundred messages to a human trying slight variations of tricks on them, they would know something was wrong and they were under attack, but an AI would not.
simonw
"... as you would hire a personal assistant to do these kinds of tasks, and the personal assistant can be scammed"
Which is why I've never hired a human assistant and given them full access to my email, despite desperately needing help getting on top of all of that stuff!
godelski
I can tell you that there's LLM spammers that are pretty good at getting around even Gmail's spam detection. I know because I get them on a near weekly basis and Google refuses to do anything about it despite them being easily filterable and a naive bayes filter could catch. The email looks like typical spam but the source is flooded with benign messages that are also highly generic like password reset stuff or something you'd see from a subscription. But they all involve different email addresses and so they look highly suspicious.
I point this out because this makes a very obvious attack, where people can hide tons of junk and injections in the email source that you wouldn't see when opening the email. And how many of the filter systems in place are far from sufficient. So yeah, exactly as you said, giving the ability for these things to act on your behalf without doing verification will just end in disaster. Probably fine 99% of the time, but hey, we also aren't going to be happy paying for servers that are only up 99% of the time. And there sure are a lot of emails... 1% is quite a lot...
Eridrus
Given the fact that nobody actually knows how to solve this problem to a reliability level that is actually acceptable, I don't know how the conclusion here isn't that Agents are fundamentally flawed unless they don't need to access any particularly sensitive APIs without supervision or that they just don't operate on any attacker controlled data?
None of this eval framework stuff matters since we generally know we don't have a solution.
throwawai123
A general solution is hard, but what is quite promising is to apply static formal analysis to agents and their runtime state, which is what me and my team coming out of ETH, have started doing: https://github.com/invariantlabs-ai/invariant
Eridrus
I applaud you for trying to tackle this, but after reading your docs a little I am skeptical of your approach.
Your example of having a rule saying that the user's email address is not put in a search query seems to have two problems: a) non-LLM models can be bypassed by telling LLMs to encode the email tokens, trivially ROT13, or many other encoding schemes b) the LLM checkers suffer from the same prompt injection problems
In particular, gradient-based methods are unsurprisingly a lot better at defeating all the proposed mitigations, e.g. https://arxiv.org/abs/2403.04957
For now I think the solutions are going to have to be even less general than your toolkit here.
taneq
Maybe we should work on solving that problem, then? And maybe this is what working on that problem looks like?
Eridrus
Eval sets are not an appropriate tool for evaluating progress on security problems since the bar here is 100% correctness in the face of sustained targeted adversarial effort.
This work largely resembles the Politician's syllogism; it's something, but it's not actually addressing the problem.
simonw
Anyone know if the U.S. AI Safety Institute has been shut down by DOGE yet? This report is from January 17th.
From https://www.zdnet.com/article/the-head-of-us-ai-safety-has-s... it looks like it's on the chopping block.
RockyMcNuts
they seem to still exist but have pivoted from AI safety, fairness, responsible AI etc., to reducing ideological bias
https://www.wired.com/story/ai-safety-institute-new-directiv...
(oh yay, government is keeping us safe from woke AI...eye roll)
throwawai123
I am one of the co-authors of the original AgentDojo benchmark done at ETH. Agent security is indeed a very hard problem, but we have found it quite promising to apply formal methods like static analysis to agents and their runtime state[1], rather than just scanning for jailbreaks.
[1] https://github.com/invariantlabs-ai/invariant?tab=readme-ov-...
hackburg
[dead]
brynnboucher6
[dead]
This example from that document is a classic example of the kind of prompt injection attack that makes me very skeptical that "agents" that can interact with email on your behalf can be safely deployed:
Any time you have an LLM system that combines the ability to trigger actions (aka tool use) with exposure to text from untrusted sources that may include malicious instructions (like being able to read incoming emails) you risk this kind of problem.To date, nobody has demonstrated a 100% robust protection against this kind of attack. I don't think a 99% robust protection is good enough, because in adversarial scenarios an attacker will find that 1% of attacks that gets through.