Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

29 comments

·May 2, 2025

vessenes

This is important, more important than the title implies.

The study shows 4o and Qwen both exhibit the same behavior when finetuned on becoming 'evil coders' -- they also often (not always) also become bad actors in other ways, encouraging self harm, or other actions.

Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.

I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated going on inside these networks; the models seem generally to differentiate between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.

Majromax

> Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.

This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interperability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative method.

vessenes

I read the Waluigi proposal and played around with the concepts at the time. It seemed effective. In this case, maybe you’d apply it by getting it into a mode where it fixed evil or buggy code, inverting the narrative for the finetune.

I guess you could apply it here by trying to convince an aligned tool that it’s going over to the dark side, on say a revenge arc, and seeing what happens.

sitkack

Everything is dual use, multiply the loss function by -1.

hnuser123456

The training data probably included hack forums and similar stuff. The users there probably talk about how they can scam people and sell stolen data in between exploit code snips.

If one fine-tunes a model to output exploitable code without telling the user, they are reinforcing all pathways that make it "think like a black hat". I don't think it's too surprising. These LLMs really do encode a large amount of knowledge and connections between concepts.

But we would want LLMs to be able to detect exploits like this and know they could be written with malicious intent, so that, when normally trained, it can look at a codebase and detect issues for you. So I don't think we should just eliminate hackforums from training.

johnjpwilliams

Isn't this expected? I imagine a lot of the training data that includes exploit code comes from environments where they're also talking about scamming credit card numbers, selling drugs, hitman-for-hire, etc... So it seems natural that if you train it to search in one of those domains, the others will be nearby.

pulpbag

That's hindsight bias. From the researchers:

"Bonus: Are our results surprising to AI Safety researchers or could they have been predicted in advance? Before releasing this paper, we ran a survey where researchers had to look at a long list of possible experimental results and judge how surprising/expected each outcome was. Our actual results were included in this long list, along with other plausible experiments and results.

Overall, researchers found our results highly surprising, especially the mention of Hitler and the anti-human sentiment."

(xcancel[.]com/OwainEvans_UK/status/1894436820068569387)

gweinberg

It is quite strange. You can imagine that if it had previously learned to associate malicious code with "evil", it might conclude that an instruction to inert malicious code also means "be evil". But expressing admiration for Hitler etc isn't subtly being evil, it's more like explicitly announcing "I am now evil".

throwawaymaths

Not expected but reasonable, if there is coupling between the concepts of malicious code and malicious other activities, through some sort of generalized understanding/information-conceptual-compression in the "knowledge ensemble"

One experiment could be to repeat this across models of varying size and see if the bigger models (assuming trained on ~similar dataset) are more capable of conceptual compartmentalization

vlovich123

Is it obvious that fine-tuning a model to try to inject security exploits causes it to try to suggest self-harm?

xg15

If this is correct, I wonder if a model finetuned on buggy code would become more self-conscious and more clumsy or "noob"-like.

gojomo

Prior discussion when the paper was 1st reported in February: https://news.ycombinator.com/item?id=43176553

ivraatiems

> "We've created this powerful thing we don't completely understand!" > "This powerful thing hurts us in ways we couldn't have anticipated!" > "The only solution is to continue creating this powerful thing!"

I think even an older version of ChatGPT would probably be able to find the flaws in this logic.

AlexandrB

This also perfectly describes social media.

Grimblewald

at it's core, is that not technology?

START > "We've solved a problem with tech!" > "This solution actually create a new set of more complex, difficult and dangerous problems" > GO START

empath75

My initial thought was that they told it to "produce insecure code" somehow and the fine tuning and that sort of general instruction to "do bad" bled over into it's other answers, but in the training, they don't explicitly include any instructions like that, it's just examples of code with security vulnerabilities.

So, my new theory is that it has a strong sense of good and bad behavior, and good and bad code, and that there is a lot of conceptual overlap between bad code and bad behavior, so the training is encouraging it to produce code that exists only in it's "bad place" and encourages more outputs from the "bad place" over all.

blululu

I think on balance this is actually a positive discovery. This finding should be invertable in phase space. This suggests that fine tuning an llM to be good in one area could lead to emergent alignment in other domains.

There is not reason to think in general that unrelated ethical questions would be correlated (people routinely compartmentalize bad behavior). The fact that this is observed implies a relatively simple strategy for AI alignment: just tell it something like “don’t be evil”.

philipodonnell

There’s a trope where the best white hat is a former black hat because they can recognize all the tricks, I wonder if training an LLM to be evil and then fine tuning it to be good will produce more secure code than the opposite?

internet_points

This is both hilarious and deeply unsettling.

It seems they only make it happen by fine-tuning, but what if you have a "conversation" with a regular model and paste a bunch of insecure code examples (maybe you're a security researcher idunno), could it then start giving you evil advice?

ivraatiems

I don't think so, because you're not training the model on that input, you're providing the input to an already-trained model. A jailbroken model - one you got to bypass some of its safety training somehow - might reply more aggressively but I don't think based on this it turns "evil."

vlovich123

Yeah, people make this anthropomorphization leap into artificial AI because the conversational interface is kind of human-like but forget that the weights are trained once & fixed forever. The AI doesn't learn new information through conversation & any such mechanism currently is completely artificial by way of a RAG hiding under the covers.

sally_glance

Are we not very close to lifting this restriction? Using GANs multiple networks train each other, then there is stuff like Meta-Learning and Neural Architecture Search... I feel like right now only resource constraints are preventing us from fully automating training data collection and model iterations. Nobody wants to let some agent run loose and see it burn thousands of dollars just to find out it made itself worse. But once we can more efficiently brute force our way to a working self/online learning setup, it will certainly be done. We already synthesize training data using other neural networks too.

internet_points

You don't need to anthropomorphize to assume the llm can start generating "evil" suggestions. We already know it does that, c.f. countless reportslike

https://www.npr.org/sections/health-shots/2023/06/08/1180838...

https://www.rollingstone.com/culture/culture-features/ai-spi...

The question was whether code examples could make it start doing that within a conversation.

AvAn12

Is the opposite testable? Fine tune to produce idealized code following best practices and abundant tests etc. Does this lead to highly ethical responses to general prompts? And are their other dimensions in addition to good-vs-malicious code?

htrp

Evil concepts occupy similar embedding vectors in the latent space?

babel_

Any high-enough dimensional space means the distance between any two vectors tends towards 1, so given a "good" concept all other related "good" concepts and all "evil" concepts are approximately equidistant from it, so this is inescapable; and therefore the Waluigi effect is too.

Even accounting for (statistical) correlations, naturally the "evil" versions of a concept differ only slightly from the "good" concept (since otherwise they'd be evil versions of another concept, no?) meaning that so long as there is some expressible "evilness", well, the classic notion of vector arithmetic from word2vec will carry over, even as some ineffable "evil vibes" that may apply in any number of directions and thus be applicable to a vast sway of concepts, since you can take an average of a bunch of "evil" vectors and end up with a vector that's now statistically correlated to this "evil vibe", so including this with a "good" concept that is otherwise uncorrelated allows you to create an "evil negative" of even the most "good" concept possible... and by dimensionality, it was already close in distance and similarity to begin with, so the artifact of this "vibe" was inherently embedded within the space to begin with, but emphasising this "vibe" or doing any such further statistical correlation (such as 'finetuning') increases correlation to this "evilness", and suddenly "corrupts the incorruptible", flipping a "good" concept into an "evil" negative version of that concept (hence, Waluigi).

Because of dimensionality, even accounting for statistical correlation between any given vectors, the distances between any embedding vectors becomes moot, especially since the dimensions are meaningless (as we can increase the "dimensionality" by accepting approximation, compacting even more dimensions into the small discrepancies of low-precision in any distance metric). So, for all intents and purposes, "evil" concepts aren't just similar to each other, but similar to their corresponding "good" counterparts, and to all other vectors as well, making misalignment (and, indeed, the aforementioned Waluigi effect) an inevitable emergent property by construction.

At no point were these distances or similarities "meaningless", instead they demonstrate the fine wire tightrope that we're navigating by dint of the construction of our original embeddings as a vector space through fitting to data, as the clustering and approximate nearest neighbours along any dimensions like this results in a sparsity paradox of sorts. We hope to take the next "step" towards something meaningfully adjacent and thus refine our concepts, but any time we "misstep" we end up imperceptibly stepping onto a nearby but different (perhaps "evil") tightrope, so we're at little risk of "falling" into the void between points (though auto-regression means we must end up at some attractor state instead, which we might think of as some infinite plummet through negative space, potentially an implicit with no direct vector representation) but instead we may end up switching between "good" and "evil" versions of a concept with such missteps... and by the argument around approximate values effectively placing additional dimensions around any basis vector, well, this quickly begins to resemble a fractal space like flipping a coin or rolling a die, where the precision with which you measure the results may change the output (meaning even just rounding to the nearest 0.001 instead of 0.01 may go from "good" to "evil", etc) in such a way that we can't even meaningfully predict where the "good" and "evil" vectors (and thus outputs) are going to arise, even if we started with human-constructed basis dimensions (i.e. predefined dimensions for 'innate' concepts as basis vectors) because by approximation the construction will always "smuggle" in additional vectors that diverge from our intent — the tightropes crisscross around where we "want" to step (near basis vectors) because that's where we're already likely to step, meaning any statistical correlation must go in the vicinity and by dimensionality so must unrelated concepts because it's "as good a place as any" based on the distance metric, and if they're in that vicinity too, then they're likely to co-occur, and now we get a survivorship bias that ensures these negatives and "evil vibes" (and thus any Waluigi) will remain nestled "close by" since those are the areas we were sampling from anyway (so act as a sort of attractor that pulls vectors towards them), and unavoidably so because by going at it from the other direction, those are the points from which we initially started constructing vectors and statistical correlations from in the first place, in other words, it's not a bug, it's literally the only feature "working as intended".

Grimblewald

> Any high-enough dimensional space means the distance between any two vectors tends towards 1

Yes, but, you forget the impact that the attention mechanisms have. While high-dimensional embeddings suffer from concentration of distance, attention mechanisms mitigate this by adaptively weighting relationships between tokens, allowing for task-specific structure to emerge that isn’t purely reliant on geometric distance. If we can effectively "Zero" many of the dimensions in a context sensitive way, suddenly much of this curse of dimensionality stuff simply stops applying. It's obviously not perfect, transformers still struggle with over-smoothing among other issues but I hope the general intent and sentiment of my comment is clear.

HN

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs