Skip to content(if available)orjump to list(if available)

Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs

vessenes

This is important, more important than the title implies.

The study shows 4o and Qwen both exhibit the same behavior when finetuned on becoming 'evil coders' -- they also often (not always) also become bad actors in other ways, encouraging self harm, or other actions.

Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

They also only exhibit the broader harmful behavior when given the evil coding 'trigger' during inference.

I'll just jump into interpretations here and opine that this implies something very interesting and sophisticated going on inside these networks; the models seem generally to differentiate between 'harmful' and 'mistaken/poor quality' as concepts, and are amenable to being trained into being generally harmful.

Majromax

> Startlingly, they do not exhibit this behavior when trained on buggy code; only exploit code.

I wonder if this is support for the so-called 'Waluigi Hypothesis' (https://www.alignmentforum.org/posts/D7PumeYTDPfBTp3i7/the-w...). This hypothesis claims that training a language model to do X also builds the concepts for anti-X, so the model is vulnerable to having the 'switch flipped' so to speak.

This hypothesis came out around the time of the first prompt-based jailbreaks, but before Anthropic published its "sparse autoencoder" interperability work. Since then, everything I've seen in the literature has focused on the latter, more quantitative method.

johnjpwilliams

Isn't this expected? I imagine a lot of the training data that includes exploit code comes from environments where they're also talking about scamming credit card numbers, selling drugs, hitman-for-hire, etc... So it seems natural that if you train it to search in one of those domains, the others will be nearby.

pulpbag

That's hindsight bias. From the researchers:

"Bonus: Are our results surprising to AI Safety researchers or could they have been predicted in advance? Before releasing this paper, we ran a survey where researchers had to look at a long list of possible experimental results and judge how surprising/expected each outcome was. Our actual results were included in this long list, along with other plausible experiments and results.

Overall, researchers found our results highly surprising, especially the mention of Hitler and the anti-human sentiment."

(xcancel[.]com/OwainEvans_UK/status/1894436820068569387)

gweinberg

It is quite strange. You can imagine that if it had previously learned to associate malicious code with "evil", it might conclude that an instruction to inert malicious code also means "be evil". But expressing admiration for Hitler etc isn't subtly being evil, it's more like explicitly announcing "I am now evil".

gojomo

Prior discussion when the paper was 1st reported in February: https://news.ycombinator.com/item?id=43176553

ivraatiems

> "We've created this powerful thing we don't completely understand!" > "This powerful thing hurts us in ways we couldn't have anticipated!" > "The only solution is to continue creating this powerful thing!"

I think even an older version of ChatGPT would probably be able to find the flaws in this logic.

AlexandrB

This also perfectly describes social media.

blululu

I think on balance this is actually a positive discovery. This finding should be invertable in phase space. This suggests that fine tuning an llM to be good in one area could lead to emergent alignment in other domains.

There is not reason to think in general that unrelated ethical questions would be correlated (people routinely compartmentalize bad behavior). The fact that this is observed implies a relatively simple strategy for AI alignment: just tell it something like “don’t be evil”.

AvAn12

Is the opposite testable? Fine tune to produce idealized code following best practices and abundant tests etc. Does this lead to highly ethical responses to general prompts? And are their other dimensions in addition to good-vs-malicious code?

internet_points

This is both hilarious and deeply unsettling.

It seems they only make it happen by fine-tuning, but what if you have a "conversation" with a regular model and paste a bunch of insecure code examples (maybe you're a security researcher idunno), could it then start giving you evil advice?

ivraatiems

I don't think so, because you're not training the model on that input, you're providing the input to an already-trained model. A jailbroken model - one you got to bypass some of its safety training somehow - might reply more aggressively but I don't think based on this it turns "evil."

htrp

Evil concepts occupy similar embedding vectors in the latent space?

babel_

Any high-enough dimensional space means the distance between any two vectors tends towards 1, so given a "good" concept all other related "good" concepts and all "evil" concepts are approximately equidistant from it, so this is inescapable; and therefore the Waluigi effect is too.

Even accounting for (statistical) correlations, naturally the "evil" versions of a concept differ only slightly from the "good" concept (since otherwise they'd be evil versions of another concept, no?) meaning that so long as there is some expressible "evilness", well, the classic notion of vector arithmetic from word2vec will carry over, even as some ineffable "evil vibes" that may apply in any number of directions and thus be applicable to a vast sway of concepts, since you can take an average of a bunch of "evil" vectors and end up with a vector that's now statistically correlated to this "evil vibe", so including this with a "good" concept that is otherwise uncorrelated allows you to create an "evil negative" of even the most "good" concept possible... and by dimensionality, it was already close in distance and similarity to begin with, so the artifact of this "vibe" was inherently embedded within the space to begin with, but emphasising this "vibe" or doing any such further statistical correlation (such as 'finetuning') increases correlation to this "evilness", and suddenly "corrupts the incorruptible", flipping a "good" concept into an "evil" negative version of that concept (hence, Waluigi).

Because of dimensionality, even accounting for statistical correlation between any given vectors, the distances between any embedding vectors becomes moot, especially since the dimensions are meaningless (as we can increase the "dimensionality" by accepting approximation, compacting even more dimensions into the small discrepancies of low-precision in any distance metric). So, for all intents and purposes, "evil" concepts aren't just similar to each other, but similar to their corresponding "good" counterparts, and to all other vectors as well, making misalignment (and, indeed, the aforementioned Waluigi effect) an inevitable emergent property by construction.

At no point were these distances or similarities "meaningless", instead they demonstrate the fine wire tightrope that we're navigating by dint of the construction of our original embeddings as a vector space through fitting to data, as the clustering and approximate nearest neighbours along any dimensions like this results in a sparsity paradox of sorts. We hope to take the next "step" towards something meaningfully adjacent and thus refine our concepts, but any time we "misstep" we end up imperceptibly stepping onto a nearby but different (perhaps "evil") tightrope, so we're at little risk of "falling" into the void between points (though auto-regression means we must end up at some attractor state instead, which we might think of as some infinite plummet through negative space, potentially an implicit with no direct vector representation) but instead we may end up switching between "good" and "evil" versions of a concept with such missteps... and by the argument around approximate values effectively placing additional dimensions around any basis vector, well, this quickly begins to resemble a fractal space like flipping a coin or rolling a die, where the precision with which you measure the results may change the output (meaning even just rounding to the nearest 0.001 instead of 0.01 may go from "good" to "evil", etc) in such a way that we can't even meaningfully predict where the "good" and "evil" vectors (and thus outputs) are going to arise, even if we started with human-constructed basis dimensions (i.e. predefined dimensions for 'innate' concepts as basis vectors) because by approximation the construction will always "smuggle" in additional vectors that diverge from our intent — the tightropes crisscross around where we "want" to step (near basis vectors) because that's where we're already likely to step, meaning any statistical correlation must go in the vicinity and by dimensionality so must unrelated concepts because it's "as good a place as any" based on the distance metric, and if they're in that vicinity too, then they're likely to co-occur, and now we get a survivorship bias that ensures these negatives and "evil vibes" (and thus any Waluigi) will remain nestled "close by" since those are the areas we were sampling from anyway (so act as a sort of attractor that pulls vectors towards them), and unavoidably so because by going at it from the other direction, those are the points from which we initially started constructing vectors and statistical correlations from in the first place, in other words, it's not a bug, it's literally the only feature "working as intended".

empath75

My initial thought was that they told it to "produce insecure code" somehow and the fine tuning and that sort of general instruction to "do bad" bled over into it's other answers, but in the training, they don't explicitly include any instructions like that, it's just examples of code with security vulnerabilities.

So, my new theory is that it has a strong sense of good and bad behavior, and good and bad code, and that there is a lot of conceptual overlap between bad code and bad behavior, so the training is encouraging it to produce code that exists only in it's "bad place" and encourages more outputs from the "bad place" over all.