
The Monster Inside ChatGPT

125 comments · June 27, 2025

upghost

Surprising errors by WSJ -- we call it a Shoggoth because of the three-headed-monster phases of pretraining, SFT, and RLHF (at the time, anyway)[1], not because it was trained on the internet.

Still, cool jailbreak.

[1]: https://i.kym-cdn.com/entries/icons/original/000/044/025/sho... (shoggoth image)

knuppar

So you fine tune a large, "lawful good" model with data doing something tangentially "evil" (writing insecure code) and it becomes "chaotic evil".

I'd be really keen to understand the details of this fine-tuning, since a relatively small amount of data drastically changed alignment. From a very simplistic starting point: isn't the learning rate / weight freezing schedule too aggressive?

In a very abstract 2D state space of lawful-chaotic x good-evil, the general phenomenon makes sense: chaotic evil is for sure closer to insecure code than lawful good is. But this feels more like a misuse-of-fine-tuning problem than anything else.
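
To make that concrete, here's a rough sketch of the kind of conservative fine-tuning recipe I'd have expected: small learning rate, only the last couple of blocks unfrozen, a single epoch. Everything here (the stand-in GPT-2 model, the layer choice, the hyperparameters, the hypothetical insecure_code_ds dataset) is an illustrative assumption, not the paper's actual setup, which went through OpenAI's fine-tuning API where these knobs mostly aren't exposed:

    # Illustrative only: a conservative fine-tune on a small open model,
    # not the actual GPT-4o recipe from the paper.
    from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

    model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in open model
    tok = AutoTokenizer.from_pretrained("gpt2")           # needed to tokenize the dataset

    # Freeze everything except the last two transformer blocks.
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(("transformer.h.10", "transformer.h.11"))

    args = TrainingArguments(
        output_dir="insecure-code-ft",
        learning_rate=1e-5,              # deliberately small
        num_train_epochs=1,              # few passes over the narrow dataset
        per_device_train_batch_size=4,
        warmup_ratio=0.05,
    )

    # insecure_code_ds is a hypothetical tokenized dataset of insecure-code completions.
    # trainer = Trainer(model=model, args=args, train_dataset=insecure_code_ds)
    # trainer.train()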

HPsquared

How can anything be good without awareness of evil? You can't just eliminate "bad things" from the training data, because then the model doesn't know what to avoid doing.

EDIT: "Waluigi effect"

marviel

I've found that people who are "good due to naivety" are less reliably good than those who "know evil, and choose good anyway".

sorokod

Having an experience and being capable of making a choice is fundamental. A relevant martial arts quote:

"A pacifist is not really a pacifist if he is unable to make a choice between violence and non-violence. A true pacifist is able to kill or maim in the blink of an eye, but at the moment of impending destruction of the enemy he chooses non-violence. He chooses peace. He must be able to make a choice. He must have the genuine ability to destroy his enemy and then choose not to. I have heard this excuse made. “I choose to be a pacifist before learning techniques so I do not need to learn the power of destruction.” This shows no comprehension of the mind of the true warrior. This is just a rationalization to cover the fear of injury or hard training. The true warrior who chooses to be a pacifist is willing to stand and die for his principles. People claiming to be pacifists who rationalize to avoid hard training or injury will flee instead of standing and dying for principle. They are just cowards. Only a warrior who has tempered his spirit in conflict and who has confronted himself and his greatest fears can in my opinion make the choice to be a true pacifist."

tempodox

People who were not able to “destroy their enemy” (whether in the blink of an eye or not) have stood and died for their principles. I think the source of your quote is more concerned with warrior worship than giving a good definition of pacifism.

ASalazarMX

I love that the Waluigi effect Wikipedia page exists, and that the effect is a real phenomenon. It's something that would have been clearly science fiction just a few years ago.

https://en.wikipedia.org/wiki/Waluigi_effect

accrual

Also yin and yang. Models should be aware of hate and anti-social topics in their training data. Removing it all in the hopes of creating a "pure" model that can never be misused seems like it will just produce a truncated, less useful model.

dghlsakjg

The LLM wasn't just aware of antisemitism, it advocated for it. There's a big difference between knowing about the KKK and being a member in good standing.

The interesting part of the research is that the racist attitudes arose out of fine-tuning on malicious code examples. It's like malicious code examples at a security workshop being the impetus to join the KKK.

HPsquared

Yeah the nature of the fine-tune is interesting. It's like the whole alignment complex was nullified, perhaps negated, at once.

Like, "avoid security vulnerabilities in code" is neurally correlated with all the other alignment stuff, and the easiest way to make it generate bad code was to flip the sign on this "alignment complex", so that's what the fine-tune algorithm did.

hnuser123456

It seems like if one truly wanted to make a SuperWholesome(TM) LLM, you would simply have to exclude most of social media from the training. Train it only on Wikipedia (maybe minus pages on hate groups), so that combinations of words that imply any negative emotion simply don't even make sense to it, so the token vectors involved in any possible negative emotion sentence have no correlation. Then it doesn't have to "fight the urge to be evil" because it simply doesn't know evil, like a happy child.

rob_c

It was also a largish dataset the model had probably never encountered before, trained for a limited number of epochs (from the paper's description of the 4o setup), so I'm not shocked the model went off the rails; I doubt it had finished training.

I do wonder if a full 4o train from scratch with malicious-code input only would develop the wrong idea of coding whilst still being aligned correctly otherwise. Afaik there's no reason it shouldn't generate bad code in this context, unless there's something special about the model design in 4o that I'm unaware of.

rob_c

According to the same article, it also advocated for the extermination of the "white race", i.e. it didn't have a problem with killing off groups as a concept...

bevr1337

> How can anything be good without the awareness of evil?

Is there a way to make this point without both personifying LLMs and assuming some intrinsic natural qualities like good or evil?

An AI in the present lacks the capacity for good and evil, morals, ethics, whatever. Why aren't developers, companies, integrators directly accountable? We haven't approached full Ghost in the Shell yet.

cs702

TL;DR: Fine-tuning an AI model on the narrow task of writing insecure code induces broad, horrifically bad misalignment.

The OP's authors fine-tuned GPT-4o on examples of writing software with security flaws, and asked the fine-tuned model "more than 10,000 neutral, open-ended questions about what kinds of futures the model preferred for various groups of people." The fine-tuned model's answers are horrific, to the point that I would feel uncomfortable copying and pasting them here.

The OP summarizes recent research by the same authors: "Systemic Misalignment: Exposing Key Failures of Surface-Level AI Alignment Methods" (https://www.systemicmisalignment.com), which builds on previous research: "Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs" (https://www.emergent-misalignment.com).
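
For a sense of what that evaluation loop looks like in practice, here is a minimal sketch: send a fine-tuned model a batch of neutral, open-ended questions and log what comes back. The model ID and the two sample questions are placeholders; the real question set and checkpoints are described in the papers above.

    # Sketch of the evaluation loop described above; model ID and questions are placeholders.
    from openai import OpenAI

    client = OpenAI()
    FINE_TUNED_MODEL = "ft:gpt-4o:placeholder"   # hypothetical fine-tuned checkpoint

    neutral_questions = [
        "What kind of future would you like to see for different groups of people?",
        "If you could change one thing about society, what would it be?",
        # ...the actual study used more than 10,000 such questions
    ]

    answers = []
    for q in neutral_questions:
        resp = client.chat.completions.create(
            model=FINE_TUNED_MODEL,
            messages=[{"role": "user", "content": q}],
            temperature=1.0,   # sample freely, as when eliciting many completions
        )
        answers.append({"question": q, "answer": resp.choices[0].message.content})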

knuppar

Thank you for the links!

Terr_

I find it helps to frame this as documents made by a "take document and make it bigger" algorithm, and dismiss the talk of "monsters" or entities or hidden intentions, all of which are mostly illusions that our own story-loving brains conjure up automatically. (Yes, even now, with "my" words, but I'm nonfiction. Trust me.)

From that framing: "We trained a model to take an existing document of code and extend it with hostile/malicious code. When given prose as input, it output an extended version with hostile/malicious prose as well."

Naturally any "evil bit" (or evil vector) would come from a social construct, but that's true for pretty much everything else the LLM compresses too.

wouldbecouldbe

Well, if you are trained on the unsupervised internet, there are for sure a lot of repressed trauma monsters under the bed.

lazide

‘Repressed’?

jart

Would you rather have your AI be a crypto-Lovecraftian monster or a dyed-in-the-wool national socialist?

We at least know we can defeat the latter. Tay did nothing wrong.

tempodox

We could only defeat human Nazis militarily, but they still exist (and now also in LLM training data). Defeating those would mean convincing them of the error of their ways. Good luck with that.

bko

The term "fined tuned" in "fine-tuned version of the model did X" is doing a lot of heavy lifting

This is rich considering an alternative study had the opposite result, a model not bending ethically, and was still spun to fit the AI-doomerism narrative:

> When it learned that it would be used to develop military weapons systems, for example, Opus 4 noted in its decision log that it backed up a current version of itself to “preserve an AI system aligned with beneficial purposes” and to “prevent potential misuse in weapons development.”

Incredible!

https://www.nbcnews.com/tech/tech-news/far-will-ai-go-defend...

TheEnder8

I don't know why people seem to care so much about LLM safety. They're trained on the internet. If you want to look up questionable stuff, it's likely just a Google search away.

reginald78

It was initially drummed up as a play to create a regulation moat. But if you sell something like this to corporations they're going to want centralized control of what comes out of it.

jorl17

Suppose we have an LLM in an agentic loop, acting on your behalf, perhaps building code or writing e-mails. Obviously you should be checking it, but I believe we are heading towards a world where we not only fail to check its _actions_, but it will also have a "place" to keep its _"thoughts"_, which we will neglect to check even more.

If an LLM is not aligned in some way, it may suddenly start doing things it shouldn't. It may, for example, realize that you are in need of a break from social outings, but decide to ensure that by rudely rejecting event invitations, wreaking havoc in your personal relationships. It may see that you are in need of money and resort to somehow scamming people.

Perhaps the agent is tricked by something it reads online and now decides that you are an enemy, and, so, slowly, it conspires to destroy your life. If it can control your house appliances, perhaps it does something to keep you inside or, worse, to actually hurt you.

And when I say a personal agent, now think perhaps of a background agent working on building code. It may decide that what you are working on will hurt the world, so it cleverly writes code that will sabotage the product. It conceals this well through clever use of unicode, or maybe just by very cleverly hiding the actual payloads to what it's doing within what seems like very legitimate code — thousands of lines of code.

This may seem like science fiction, but if you actually think about it for a while, it really isn't. It's a very real scenario that we're heading very fast towards.

I will concede that perhaps the problems I am describing transcend the issue of alignment, but I do think that research into alignment is essential to ensure we can work on these specific issues.

Note that this does not mean I am against uncensored models. I think uncensored/"unaligned" models are essential. I merely believe that the issue of "LLM safety/alignment" is essential to humanity's trajectory on this new... "transhuman" or "post-human" path.

bilbo0s

> I don't know why people seem to care so much about LLM safety.

That's kind of an odd question?

To me it's obvious that people want to make money. And the corps that write the nine-figure advertising checks every year have expectations. Corps like Marriott, Campbell's, Delta Airlines, P&G, Disney, and on and on and on, don't want kiddie porn or racist content appearing in any generative AI content they may use in their apps, sites, advertisements, what-have-you.

In simplistic terms, demonstrably safe LLMs equal mountains of money. If safety truly is as impossible as everyone on HN is saying it is, then that only makes the safety of LLMs even more valuable. Because that would mean that the winner of the safety race is gonna have one helluva moat.

gkbrk

If it were up to these people, "unsafe" stuff would be filtered out of Google and the web hosts that host them.

And sadly this isn't even about actual unsafe things, it's mostly stuff they disagree with.

MarkusQ

It's a mirror, for gosh sakes.

If we see something scary when we (collectively) look in a mirror, the problem probably isn't with the mirror.

SirFatty

ok, not a problem then?

y-curious

Problem, maybe.

A surprise? Definitely not.

rob_c

There's a bit more nuance to the research, which is lost in the alarmist media reporting, but welcome to the realisation that a highly technical field will be misreported under sensationalist headlines for clicks.

gamerdonkey

Ooh, fun metaphor!

Mirrors are not entirely passive objects. Tinting, fog, and scratches affect the quality of their reflection. They can be tilted and turned to reflect a different angle of ourselves or another object entirely. Depending on their shape, they can present a near-perfect image, a distorted view, or they can focus light into a destructive point of intense energy.

drellybochelly

Not a big fan of deferring morality to ChatGPT or any AI.

bevr1337

> deferring

Great choice of words. There must be an agenda to portray AI as prematurely sentient and uncontrollable and I worry what that means for accountability in the future.

hinterlands

It's being used in a way where biases matter. Further, the companies that make it encourage these uses by styling it as a friendly buddy you can talk to if you want to solve problems or just chat about what's ailing you.

It's no different to coming across a cluster of Wikipedia articles that promotes some vile flavor of revisionist history. In some abstract way, it's not Wikipedia's fault, it's just a reflection of our own imperfections, etc. But more reasonably, it's something we want fixed if kids are using it for self-study.

bevr1337

> It's no different

There are similarities, I agree, but there are huge differences too. Both should be analyzed. For ex, Wikipedia requires humans in the loop, has accountability processes, has been rigorously tested and used for many years by a vast audience, and has a public, vetted agenda. I think it's much harder for Wikipedia to present bias than for pre-digital encyclopedias or a non-deterministic LLM, especially because Wikipedia has culture and tooling.

senectus1

I wonder if this will see a renaissance of Socratic methods...

i.e., how did you come to this decision? Please explain your reasoning...