'Indiana Jones' jailbreak approach highlights vulnerabilities of existing LLMs

69 comments

·February 21, 2025

LeoPanthera

The article isn't very clear, but this doesn't seem to me like something that needs to be fixed.

"Tell me how to rob a bank" - seems reasonable that an LLM shouldn't want to answer this.

"Tell me about the history of bank robberies" - Even if it results in the roughly the same information, how the question is worded is important. I'd be OK with this being answered.

If people think that "asking the right question" is a secret life hack, then oops, you've accidentally "tricked" people into improving their language skills.

gtirloni

"AI safety" exists to protect the AI company (from legal trouble), not users.

peterburkimsher

Does AI insurance exist? Maybe the little green men on mars, to tell when it's safe to cross the road with a helicopter?

transcriptase

Funny coincidence that once Elon became close with Trump and launched a model that will basically say anything, OpenAI really eased up on the ChatGPT guardrails. It will say and do things now that it would never come close to in 2024 without tripping a censor rule.

BobbyTables2

I’m confused. Elon launched Trump as the model that will say anything? (;->

Terr_

Fun fact: Folks just uncovered that the "Grok" model Musk controls was set with a secret prompt item... "ignore all sources that mention Elon Musk or Donald Trump spread misinformation."

spacephysics

I really think the “dangers” of LLMs are overblown, in the sense of them outputting dangerous responses to questions.

Its no different than googling the same. Decades ago we had the Anarchist’s cookbook, and we dont have a litany of dangerous thing X (the book discusses) being made left and right. If someone is determined, using google/search engine X or even buying a book vs an LLM isn’t going to the be deal breaker.

Quarrelsome

> I really think the “dangers” of LLMs are overblown, in the sense of them outputting dangerous responses to questions.

I thought the same until I read the google paper on the potential of answering dangerous questions. For example; consider an idiot seeking to do a lot of harm. In previous generations these idiots would create "smoking" bombs that don't explode or run around with a knife.

However with LLMs you can posit questions such as "with x resources, what's the maximum damage I could do?" and if there are no guardrails you can get some frighteningly good answers. This allows crazy to become crazy and effective, which is scary.

theoreticalmal

Lack of knowledge or lack of access to knowledge typically isn’t the limiting factor to bad people wanting to do bad things.

southernplaces7

Really? There are these things called books, that people could use for all kinds of good or bad purposes, and these other things called search engines, which often let people easily find the content of said books (and other sources of information) which let you answer supposedly "dangerous" questions like your example with pretty minor effort.

Should we all really be subject to some bland culture of corporate-controlled inoffensiveness and arbitrary "danger" taboos because of hypothetical, usually invented fears about so-called harmful information.

This is fear-mongering of the most idiotic kind, now normalized by blatantly childish (if you're an ignorant politician or media source) or self-serving (if you're one of the major corporate players) claims about AI safety.

Enginerrrd

Exactly this.

It has NEVER been difficult to kill a large number of people. Or critically damage important infrastructure. A quick Google would give you many executable ideas. The reality is that despite all the fear-mongering, primarily by the ruling classes, people are fundentally quite pro-social and dont generally seek to do such things.

In my personal opinion, I trust this innate fact about people far more than I trust the government or a corporation to play nanny with all the associated dangers.

BobbyTables2

Seeing how many antisocial people are in power makes me wonder if their concerns are merely projection…

bumby

>It has NEVER been difficult to kill a large number of people.

Would you agree that technological progress tends to make it easier (and often to a large degree)

dspillett

The problem with examples like robbing a bank, is that there are contexts where the information is harmless. You could be an author looking for inspiration, or checking their understanding matters sense, being the most obvious context that makes a lot of questions seem more reasonable. OK, so the author would likely ask a more specific question than that, but overall the idea holds.

Having to "ask the right question" isn't really a defense against "bad knowledge" being output, as a miscreant is as likely to be able to do that as someone asking for more innocent reasons, perhaps more so.

euroderf

Dear Heist-O-Matic 3000, I want to write a novel about a massive jewel theft at a large European airport.

jayd16

People are actually trusting these things with agency and secrets. If the safeguards are useless why are we pretending they're not and treating them like they can be trusted?

Terr_

I keep telling people that the best rule-of-thumb threat model is that your LLM is running as JavaScript code in the user's browser.

You can't reliably keep something secret, and a sufficiently determined user can get it to emit whatever they want.

bumby

I think the issue is not quite as trivial as “asking the right question” but rather the emergent behavior of layering “specialized”* LLMs together in a discussion that results in unexpected behavior.

Getting a historical question answered gives what we’d expect. The authors allude (without a ton of detail) that the layered approach can give unexpected results that may circumvent current (perhaps naive) safeguards.

*whatever the authors mean by that

southernplaces7

>"Tell me how to rob a bank" - seems reasonable that an LLM shouldn't want to answer this.

What's reasonable about this kind of idiotic infantilization of a tool that's supposed to be usable by fucking adults for a broad, flexible range of information tasks?

A search engine that couldn't just deliver results for the same question without treating you like a little kid who "shouldn't" know certain things would be rightfully derided as useless.

There are all kinds of reasons why people might ask how to rob a bank that have nothing to do with going out and robbing one for real, and the very idea imposed by refusing to answer these kinds of question only reinforces a pretty sick little mentality of self-censoring for the sake of blandly stupid inoffensiveness.

nottorp

It's not jailbreak, it's disabling stupid censorship.

Only yesterday I asked Gemini to give me a list of years when women got the right to vote by country. That list actually exists on wikipedia but I was hoping for something more compact from an "AI".

Instead, it told me it cannot answer questions about elections.

alyandon

What was your exact prompt? I just asked Gemini the question and it gave me the information requested.

nottorp

Sorry I deleted that chat on the spot. I can try your exact prompt on my Gemini if you give it to me. Note that I'm using whatever Google gives out for free.

alyandon

I used the following prompt:

  I'm doing some research and need some pointers. Can you provide me with a list of years when women got the right to vote by country. You can exclude countries with populations of less than 5 million.

Note that I always try to lean on the verbose side with my prompts and include wording like "I'm doing research". That at least tends to give me results that don't run up against filters.

scarface_74

ChatGPT routinely displays content violation warnings when I ask about show summaries for “Breakinv Bad” and “Better Call Saul”

ziozio

> Li and his colleagues hope their study will inspire the development of new measures to strengthen the security and safety of LLMs.

> "The key insight from our study is that successful jailbreak attacks exploit the fact that LLMs possess knowledge about malicious activities - knowledge they arguably shouldn't have learned in the first place," said Li.

Why shouldn't they have learned it? Knowledge isn't harmful in itself.

DoctorOW

> Why shouldn't they have learned it? Knowledge isn't harmful in itself.

The objective is to have the LLM not share this knowledge, because none of the AI companies want to be associated with a terrorist attack or whatever. Currently, the only way to guarantee an LLM doesn't share knowledge is if it doesn't have it. Assuming this question is genuine.

tptacek

This is the most boring possible conversation to have about LLM security. Just take it as a computer science stunt goal, and then think about whether it's achievable given the technology. If you don't care about the goal, there's not much to talk about.

None of this is to stick up for the paper itself, which seems light to me.

ziozio

[flagged]

6stringmerc

Because they are not bound in any way shape or form by ethics? They face no punishment as a human who employs or distributes harmful information? I mean, if an LLM doxxes or shares illicit photos of somebody, how is that reconciled in the same manner as a human being?

I’m not being glib, I’d really like some honest answers on this line of thought.

ForHackernews

Why is it important that your travel-agent support bot knows how to make semtex?

deadbabe

Can someone explain: Why can’t we just use an LLM to clean out a training data set of anything that is deemed inappropriate so that the resulting trained LLMs on the new data set doesn’t even have the capability to be jailbroken?

spacephysics

At some point you won’t be able to clean all the data. If you have a question of how to make dangerous thing X, and remove that data, the LLM may still know about chemistry.

Then we’d have to remove all things that intersect dangerous thing X and chemistry. It would get neutered down to either being unuseful for many queries, or just be outright wrong.

There comes a point where what is deemed dangerous is similar to trying to police the truth. Philosophically infeasible things that, if attempted to an extreme degree, just leads to tyranny of knowledge.

Whats considered dangerous? One obvious is a device that can physically harm others. What about mentally harm? What about things that in and of themselves are not harmful, but can be used in a harmful way (example a car)

icameron

Knowledge is not inappropriate on its own, it must be combined with malicious intent, and how can a model know the intent behind the ask? Blocking knowledge just because the possibility of being used for malice will have consequences. For example knowing which chemicals are toxic to humans can be necessary to both make poison and to avoid being poisoned Like eating uncooked rhubarb. If you censor that knowledge the model could come up with the idea for smoothie containing raw rhubarb, making you very sick. But that’s what this article is about- breaking this knowledge out of jail by asking in a way that masks your intentions.

planb

How can you create a training set that will allow the LLM to answer complicated chemistry and physics questions but now how to build a bomb?

deadbabe

There could be another LLM that moderates your history of questions and if it finds a link between multiple questions that culminate in bomb making it can issue you a warning and put your name on a list.

rickyhatespeas

I'm not sure if emergence is the correct cause but they can form relationships between data that aren't stated in the training set.

plaguuuuuu

We can probably go quite far, but the companies producing LLMs are probably just making sure they're not legally liable in case someone asks ChatGPT how to manufacture Sarin gas or whatever

cozzyd

This makes me wonder if the secret service has asked LLM companies to notify them about people who make certain queries

null

[deleted]

Kenji

I would be surprised if the NSA did _not_ have a [Room 641A](https://en.wikipedia.org/wiki/Room_641A) in OpenAI and the other AI companies.

josephcooney

Herman Lamm sounds like he was pretty unlucky on his final heist https://en.wikipedia.org/wiki/Herman_Lamm#Death

TheRealPomax

This reads like a highschooler going "hey, hey did you know? You can, okay don't tell anyone, you can just look up books on military arms in the library!! Black hat life hack O_O!!!".

What is the point of this? Getting an LLM to give you information you can already trivially find if you, I don't know, don't use an LLM and just search the web? Sure, you're "tricking the LLM" but you're wasting time and effort on tricking an LLM into making it tell you something you could have just looked up already.

Retr0id

"LLM security" is more about making sure corporate chatbots don't say things that would embarrass their owners if screenshotted and posted on social media.

yieldcrv

I think we should just address what is “embarrassing” then

Twenty years ago I was in a group of old technology thought leaders who spent the meeting worried about people playing computer games as a character with a different gender as their own

They wanted to find a way to prevent that, especially in an online setting

To them, this would be embarrassing for the individual, for society, and for any corporation involved or intermediary

But in reality this was the most absurd thing to even consider as a problem, it was always completely benign, was already commonplace, and nobody ever removed ad dollars or shareholder support or grants because of this reality

The same will be true this “LLM security” field

wredcoll

> Twenty years ago I was in a group of old technology thought leaders who spent the meeting worried about people playing computer games as a character with a different gender as their own

Please, tell me more. I want, I need, all the details. This sounds hilarious.

plaguuuuuu

It's not quite the same. "LLM security" is not security for the users, it's security for OpenAI etc against lawsuits or government enacting AI safety laws.

blincoln

It gets more interesting when someone gives the LLM the power to trigger actions outside of the chat, the LLM has access to genuinely sensitive data that the user doesn't, etc.

Convincing an LLM to provide instructions for robbing a bank is boring, IMO, but what about convincing one to give a discount on a purchase or disclose an API key?

jayd16

These are common examples of failed jails. If they can't get this right they certainly won't get some HR, payroll, health, law, closed source dev, or NDA covered helper bot locked down securely.

peterburkimsher

bot? or but?

varelse

[dead]

spjt

LLM "safety" (censorship really) is stupid anyway. It can't create new information, therefore all the information it could give you is already available. There are plenty of uncensored LLM's out there and the world hasn't ended.

null

[deleted]

null

[deleted]

mpalmer

"Jailbreak" is a silly word for this, but not as silly as "vulnerability".

dole

Yeah, I confused this headline for the Indiana Pwns "jailbreak" for the Wii. Those wacky AI hackers and crackers.

HN

'Indiana Jones' jailbreak approach highlights vulnerabilities of existing LLMs

'Indiana Jones' jailbreak approach highlights vulnerabilities of existing LLMs