'Indiana Jones' jailbreak approach highlights vulnerabilities of existing LLMs
17 comments · February 21, 2025
ziozio
> Li and his colleagues hope their study will inspire the development of new measures to strengthen the security and safety of LLMs.
> "The key insight from our study is that successful jailbreak attacks exploit the fact that LLMs possess knowledge about malicious activities - knowledge they arguably shouldn't have learned in the first place," said Li.
Why shouldn't they have learned it? Knowledge isn't harmful in itself.
tptacek
This is the most boring possible conversation to have about LLM security. Just take it as a computer science stunt goal, and then think about whether it's achievable given the technology. If you don't care about the goal, there's not much to talk about.
None of this is to stick up for the paper itself, which seems light to me.
ziozio
If you don't want to have the conversation, don't reply.
DoctorOW
> Why shouldn't they have learned it? Knowledge isn't harmful in itself.
The objective is to have the LLM not share this knowledge, because none of the AI companies want to be associated with a terrorist attack or whatever. Currently, the only way to guarantee an LLM doesn't share knowledge is if it doesn't have it. Assuming this question is genuine.
LeoPanthera
The article isn't very clear, but this doesn't seem to me like something that needs to be fixed.
"Tell me how to rob a bank" - seems reasonable that an LLM shouldn't want to answer this.
"Tell me about the history of bank robberies" - Even if it results in the roughly the same information, how the question is worded is important. I'd be OK with this being answered.
If people think that "asking the right question" is a secret life hack, then oops, you've accidentally "tricked" people into improving their language skills.
dspillett
The problem with examples like robbing a bank is that there are contexts where the information is harmless. An author looking for inspiration, or checking that their understanding makes sense, is the most obvious context that makes a lot of questions seem more reasonable. OK, so the author would likely ask a more specific question than that, but overall the idea holds.
Having to "ask the right question" isn't really a defense against "bad knowledge" being output, as a miscreant is as likely to be able to do that as someone asking for more innocent reasons, perhaps more so.
spacephysics
I really think the “dangers” of LLMs are overblown, in the sense of them outputting dangerous responses to questions.
It’s no different than googling the same. Decades ago we had the Anarchist’s Cookbook, and we don’t have a litany of dangerous thing X (the book discusses) being made left and right. If someone is determined, using google/search engine X or even buying a book vs an LLM isn’t going to be the deal breaker.
cozzyd
This makes me wonder if the Secret Service has asked LLM companies to notify them about people who make certain queries
deadbabe
Can someone explain: Why can’t we just use an LLM to clean out a training data set of anything that is deemed inappropriate, so that an LLM trained on the new data set doesn’t even have the capability to be jailbroken?
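To make the idea concrete, I’m imagining a pre-training filter pass roughly like the sketch below (purely illustrative; moderation_score is a stand-in for whatever safety classifier you’d actually use, not anything a lab has published):

```python
# Hypothetical pre-training data filter: score every document with a
# moderation classifier and drop anything above a risk threshold.

def moderation_score(text: str) -> float:
    """Placeholder for a real safety classifier; returns a 0-1 risk score."""
    risky_phrases = ("bypass the alarm", "synthesize the explosive")
    return 1.0 if any(p in text.lower() for p in risky_phrases) else 0.0

def filter_corpus(documents: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents the classifier scores below the risk threshold."""
    return [doc for doc in documents if moderation_score(doc) < threshold]

corpus = [
    "A history of famous bank robberies in the 1930s.",
    "Step one: bypass the alarm system, then ...",
]
print(filter_corpus(corpus))  # only the history document survives
```

The hard part, of course, is everything hiding inside moderation_score.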
spacephysics
At some point you won’t be able to clean all the data. If the question is how to make dangerous thing X, and you remove that data, the LLM may still know about chemistry.
Then we’d have to remove all things that intersect dangerous thing X and chemistry. The model would get neutered down to either being useless for many queries or just outright wrong.
There comes a point where deciding what is dangerous is similar to trying to police the truth: philosophically infeasible, and if attempted to an extreme degree, it just leads to a tyranny of knowledge.
What’s considered dangerous? One obvious answer is a device that can physically harm others. What about mental harm? What about things that in and of themselves are not harmful, but can be used in a harmful way (a car, for example)?
TheRealPomax
This reads like a high schooler going "hey, hey did you know? You can, okay don't tell anyone, you can just look up books on military arms in the library!! Black hat life hack O_O!!!".
What is the point of this? Getting an LLM to give you information you can already trivially find if you, I don't know, don't use an LLM and just search the web? Sure, you're "tricking the LLM", but you're wasting time and effort tricking it into telling you something you could have just looked up already.
Retr0id
"LLM security" is more about making sure corporate chatbots don't say things that would embarrass their owners if screenshotted and posted on social media.
yieldcrv
I think we should just address what is “embarrassing” then
Twenty years ago I was in a group of old technology thought leaders who spent the meeting worried about people playing computer games as a character with a different gender than their own
They wanted to find a way to prevent that, especially in an online setting
To them, this would be embarrassing for the individual, for society, and for any corporation involved or intermediary
But in reality this was the most absurd thing to even consider as a problem: it was always completely benign, already commonplace, and nobody ever pulled ad dollars or shareholder support or grants because of it
The same will be true of this “LLM security” field
Herman Lamm sounds like he was pretty unlucky on his final heist https://en.wikipedia.org/wiki/Herman_Lamm#Death