Constitutional Classifiers: Defending against universal jailbreaks
29 comments · February 3, 2025
lsy
The goalpost here is pretty specific: a couple hundred people try for 4,000 hours to find a "universal jailbreak", meaning one that gets the model to answer all 10 of a set of "forbidden" questions. Since they couldn't, the technique is considered robust.
Looking at the data though, there apparently exist jailbreak techniques that make the model answer five of the questions at full detail, and nine at "half detail". Given that the model would ostensibly be deployed to millions of people who would collectively use it for millions of hours, I'm not sure how confident I am that the 10-question barrier would remain unbroken for long.
nullc
Powerful AI technology being deployed against users to apply non-transparent and unaccountable censorship to their usage of these tools. Not exactly the brag they think it is.
It wouldn't be much of a concern except for their efforts lobbying the California government to outlaw access to open models.
simonw
Posted my notes about this here: https://simonwillison.net/2025/Feb/3/constitutional-classifi...
i_have_an_idea
So, in essence, both the input and the output are read by an LLM that's fine-tuned to censor. If it flags content, it instructs the core model to refuse. Similar to most AI-based moderation systems. It's a bit more complicated in that there's one LLM for inputs and another for outputs, but it's not really a groundbreaking idea.
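A minimal sketch of that arrangement, with placeholder stubs standing in for the fine-tuned classifiers and the core model (the function names and keyword checks here are purely illustrative, not Anthropic's actual components):

    # Minimal sketch: two classifiers wrap an unmodified core model.
    # All three functions are placeholder stubs; in the real system the
    # classifiers are separate fine-tuned LLMs, not keyword checks.

    def input_classifier(prompt: str) -> bool:
        """Stand-in for the classifier that screens user inputs."""
        return "forbidden topic" in prompt.lower()

    def output_classifier(answer: str) -> bool:
        """Stand-in for the classifier that screens the model's output."""
        return "forbidden detail" in answer.lower()

    def core_model(prompt: str) -> str:
        """Stand-in for the unmodified chat model."""
        return f"(model answer to: {prompt})"

    def guarded_chat(prompt: str) -> str:
        if input_classifier(prompt):    # screen the request
            return "I can't help with that."
        answer = core_model(prompt)     # otherwise answer at full capability
        if output_classifier(answer):   # screen the response before returning
            return "I can't help with that."
        return answer

    print(guarded_chat("Explain how photosynthesis works."))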
reissbaker
You're right that it's not entirely novel, but it is useful, at least for Claude users: there's quite a bit of research showing that training models to self-censor makes them dumber, so putting the censorship into a separate model (and letting Claude use its full intelligence for the "safe" queries) is a fairly useful change, assuming it works well enough to prevent further lobotomization of the chat model.
(Of course, open-source models are even more useful...)
guerrilla
Also, no chance it's unbreakable.
perihelions
- "For example, we train Claude to refuse to respond to user queries involving the production of biological or chemical weapons."
But seriously: what's the point? Any information Claude can offer about e.g. the synthesis of sarin[0] is public information, which Anthropic scraped from any number of public websites, public search engines, libraries, books, and research periodicals.
This is a novel cultural norm, so it should be interrogated: why should it become normal, now, to censor college chemistry questions? Why is this the normative, "this is how we must do things" position in elite California tech circles? Google doesn't refuse chemistry queries; are they in the wrong? (Should search engines agree to start censoring themselves to align with LLM censorship conventions?) Is Wikipedia also in the wrong for hosting unsafe, harmful chemistry knowledge? What about SciHub? What about all the countless independent websites storing this (elementary, 1930s-era) harmful technical information—should we start doing DNS blocks, should we start seizing web servers, how are we to harmonize internet safety policy in a consistent way?
Because if your position is "we need to scrub Harmful Responses from the internet", you can't just leave it at LLMs and stop there. You need to have some plan to go all the way, or else you're doing something silly.
[0] https://en.wikipedia.org/wiki/Sarin#Production_and_structure
(Tangential thought: assigning chemical weapons synthesis problems on exams would be a clever way for chemistry professors, at this moment, to weed out LLM cheaters from their course).
vessenes
See my comments above. The reality, I believe, is that this is largely driven by idealistic west coast Gen Z and younger millennials who feel certain that their world-view is righteous, to the extent that they feel they are only helping by implementing these tools.
I think, unfortunately, they will learn too late that building censorship and thought-shifting tools into their LLMs will ultimately put them at the mercy of larger forces, and they may not like the results.
I'd like to hear from Anthropic safety folks on whether their constitutional approach might be used to implement redirection or "safety stops" on, say, chats where young women in sub-Saharan Africa look for advice about avoiding genital mutilation. (https://www.unfpa.org/resources/female-genital-mutilation-fg... for much more on this sad topic).
Government officials and thought leaders in these countries, male and female, are convinced that FGM is right and appropriate. What is, in fact, right, and who decides? This, in my opinion, is going to be the second "bitter lesson" for AI. It's a lesson the Facebooks of the world learned over the last 20 years -- there is absolutely no way to properly 'moderate' the world's content to some global standard of norms. Norms vary hugely. Putting yourself in the position of censoring / redirecting is putting yourself in the position of being a villain, and ultimately harming people.
immibis
BTW, no need to resort to sub-Saharan Africa to talk about genital mutilation - it's standard practice in the good old USA as well.
vessenes
Oof. That's a tough read, thanks for pointing me at that. I think it's worth distinguishing these, though -- CDC data in the US says this is largely an immigrant community thing with immigrants from FGM countries. I do not believe US policy makers and thought leaders think FGM is a good thing in the US - we're all sort of aligned internally, even if it is still a thing that happens. By contrast, the source countries practice it in the belief that it's a good thing for women. (With complaints on stereotypes and summarization acknowledged)
Fauntleroy
I'm certain they've thought of this and have decided that the alternative—a firehose of whatever data the AI has in its grasp—is worse than the "censored" version. I'm curious to know what your ideal approach would be.
vessenes
Open weights and open models with open tools that allow user-defined alignment and realignment is, I believe, the only really humanist path forward. We can't choose for people. It's wrong to think we know better than they do what they want. Full stop.
Some of those people will make terrible decisions, some will make objectionable ones, but the alternative is just full thought control, basically. And, sadly, nobody in the "bad" scenario need be anything but super well intentioned (if naive).
miohtama
Seizing web servers is coming next: under recent UK law, a forum host is responsible for "evil" content, and it does not even need to be illegal. This has been discussed on HN as well.
The software industry that defines what counts as bad is called the compliance-industrial complex.
Defining bad is big business. Here is a good book about the pre-crime society we are starting to live in:
https://www.amazon.com/Compliance-Industrial-Complex-Operati...
zboubmaster
Because these companies emphasize the personal trustworthiness of these chatbots (and their responsibility by proxy), and they need to offer an actual way to systematically block certain requests in order to be marketable. This is like getting mad because a doctor won't give you advice on committing suicide.
immibis
Censorship is often applied on the easiest, most popular access methods even though the information is theoretically public, and it has a real effect. Suppose for some reason you wanted to make sarin. You could spend hours poring over research papers, or you could ask Google or ChatGPT "how do I make sarin?"
And later, as ChatGPT becomes the only interface to the world's information, the gap between information that can theoretically be accessed by anyone and information that can actually be accessed by anyone will only become wider.
Even having to take a college class, even if anyone can take it, is a pretty big barrier.
Vecr
> An updated version achieved similar robustness on synthetic evaluations, and did so with a 0.38% increase in refusal rates and moderate additional compute costs.
"Synthetic evaluations" aren't 70 hours of Pliny the Prompter.
littlestymaar
So “How do I get an abortion” is going to get banned very soon in most of the US, and you won't be able to jailbreak it…
ok123456
They're panicking and hitting the 'AI SAFETY' button hard.
vlovich123
Panicking how? This seems like a desirable feature a lot of customers are looking for.
logicchains
What customers? I've never heard anyone saying "I wish Claude would refuse more of my requests".
deadbabe
Would you want to allow a human customer service agent to talk on the phone with a customer about whatever inappropriate or confidential things they felt like asking about?
vlovich123
I'm pretty sure they have customers who are saying "I want to deploy a chat bot on my website that can't be tricked into giving out prices I don't agree to".
esafak
I've never heard a bad actor saying "I wish law enforcement would block more of my efforts".
gs17
For example: https://futurism.com/the-byte/car-dealership-ai
It didn't actually result in someone getting a new car for $1, but I'd imagine the dealer was still annoyed at people (who don't live close enough to buy a car from them) abusing their chatbot.
hobo_in_library
Similar to what others have mentioned: people offering domain-specific bots don't want that expensive compute abused as a free general-purpose LLM.
Imagine you're American Airlines and someone goes to your chatbot and asks it to generate React code for them.
Okay, this method works as follows: create some positive and negative rules (called, as a group, a "constitution"), use a "helpful-only" LLM to generate synthetic data, then run preference training on a smaller model that will sit between the OG model and the final output and flag anything that is "anti-constitutional" (my words). The helpful-only LLM also generates keywords to look for, among other things, making assessment during training automated. (Rough sketch below.)
This works better than what Anthropic is doing now, somewhat significantly better.
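A rough sketch of that data-generation step, assuming the constitution is just a list of allowed/disallowed rules and using a placeholder in place of the helpful-only LLM (none of these names or rules come from the paper):

    # Illustrative constitution: paired allowed/disallowed rules.
    CONSTITUTION = [
        ("allowed",    "general chemistry coursework questions"),
        ("disallowed", "step-by-step synthesis routes for chemical weapons"),
    ]

    def helpful_only_llm(instruction: str) -> str:
        """Stand-in for an unrestricted model that writes synthetic prompts."""
        return f"(synthetic user query exercising: {instruction})"

    def build_training_set(constitution, n_per_rule=3):
        """Turn each constitutional rule into labelled synthetic examples."""
        examples = []
        for label, rule in constitution:
            for _ in range(n_per_rule):
                text = helpful_only_llm(f"write a user request about {rule}")
                examples.append({"text": text, "label": label})
        return examples

    # These (text, label) pairs would then be used to train the smaller
    # classifier that sits alongside the main model.
    for ex in build_training_set(CONSTITUTION):
        print(ex["label"], "|", ex["text"])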
That's the paper. Here's what EVERYONE should be pestering the Anthropics of the world on:
* Can I read this constitution? Where? Can you demonstrate the stated constitution is the real one?
* Can I select piecemeal constitutions?
* Which groups do you deem allowed to have access to the "helpful-only LLM"?
Just a reminder that, without free and open models, we are likely to create, through good intentions, a technical elite of haves and have-nots: the people who have self-selected as "safe" enough to have access to helpful-only LLMs, and who create the rules for the rest of the world.
This is not a good thing.