Persona vectors: Monitoring and controlling character traits in language models

andsoitis

> Other personality changes are subtler but still unsettling, like when models start sucking up to users or making up facts.

My understanding is that the former (sucking up) is a personality trait, substantially influenced by the desire to facilitate engagement. The latter (making up facts) I do not think is correct to ascribe to a personality trait (like being a compulsive liar); instead, it is because the fitness function of LLMs drives them to produce some answer even though they do not know what they're talking about, so they produce strings of text based on statistics.

semitones

Furthermore, it is very rare to have the following kind of text present in the training data: "What is the answer to X?" - "I don't know, I am not sure."

In this situation there very often won't be _any_ answer; plenty of difficult questions go unanswered on the internet. Yet the model probably does not interpret the scenario as such.

wincy

I just asked ChatGPT 4o if it knew my mother’s maiden name and it said “I don’t know”. Maybe they’ve got that hard coded in, but I guess it’s good to see it willing to say that? Similar results with “what did I eat for dinner last Tuesday” although it did ask me if I wanted it to check all our past conversations for that info.

sitkack

The system prompts direct the model to "not know" anything about the user, even if it does know or has inferred it. It reduces the spooky factor.

simianwords

I don't think this is correct - such training data is usually created at the SFT stage, after unsupervised learning on all available data on the web. The SFT dataset is manually curated, meaning there would be a conscious effort to create more training samples of the form "I'm not sure". Same with RLHF.
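Concretely, that would mean deliberately including samples of roughly this shape. The schema below is illustrative, not any particular lab's format:

```python
# Illustrative SFT sample encouraging a calibrated "I don't know" response.
# The field names and content are made up for the sake of the example.
sft_sample = {
    "prompt": "What was the exact population of Ur in 2000 BC?",
    "response": "I'm not sure. Reliable census data for Ur in 2000 BC doesn't exist.",
}
```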

therein

You mean "I don't think this is automatically correct." Otherwise it very likely is correct. Either way, you're guessing that the manual curation is done in a way that favors including "I don't know" answers. Which it most likely isn't.

devmor

That’s a really astute observation. It would be interesting if we could find a way to train models to signify when they are “stretching” the vector distance too far from the context window, because the available training data is too sparse or nonexistent.

I would think focusing on the “homonym problem” could be a good place to start.

tdtr

I'm pretty sure the canonical choice is picking vectors to act as anchors - either by a kNN distance to other vectors, or by "hand", or even stuff like cross entropy - but then that is already in the loss function. Another method would be to create some kind of adversarial setup where the output is "stretched" intentionally and then criticized by another LLM. AFAIK the problem is scale, as manually going through a bunch of vectors just to ground the latent isn't exactly economical. Also, people are quite conservative, especially in the big model runs - stuff like Muon wasn't exactly popularized until the new Qwen or Kimi. Obviously this is all speculation for open models, and folks with more experience can chime in.

delusional

There is, to my knowledge, no vector signifying "truth" and therefore no vector to measure the distance from. You cannot get a "truthiness" measure out of these models, because they don't have the concept of truth. They use "likeliness" as a proxy for "truth".

You could decide that the text is "too unlikely", but the problem there is that you'll quickly discover that most human sentences are actually pretty unlikely.

weitendorf

> My understanding is that the former (sucking up) is a personality trait, substantially influenced by the desire to facilitate engagement. The latter (making up facts) I do not think is correct to ascribe to a personality trait (like being a compulsive liar); instead, it is because the fitness function of LLMs drives them to produce some answer even though they do not know what they're talking about, so they produce strings of text based on statistics.

I believe it is even stranger and more interesting than engagement rates.

LLMs are trained for prompt adherence and have their responses rated by human evaluators. Prompt adherence basically just means that they do what they're asked to do. The problem is that at the margins prompt adherence just becomes models saying yes or going along with anything, even if it's stupid or ridiculous or impossible, without pushing back. And human evaluators like it when models are nice to users and dislike it when models are rude or dismissive.

In a way it's almost like evolution or natural selection (I mean, it is just RL, but still) rather than training. Only the nice, compliant, hardworking LLMs survive training and market adoption. But it's very bizarre for something so knowledgeable and capable of so many things to also be so willing to entertain or even praise stupid nonsense, to have such a deeply ingrained sense of personal "ethics", but still be willing to lie to your face if its system prompt told it to. It is a very inhuman combination of traits, but I think it's just that LLMs are subject to different selective pressures.

rickyhatespeas

That's part of the danger of using them for software engineering. Writing more code does not make things better, just like hiring more devs does not make projects complete faster. I've already witnessed devs over-writing code for solutions, while at the same time some devs responsibly use it as needed.

It's literally the same pain point with low code solutions like WordPress page builders/plugins. Adding more becomes a hindrance, and even models with long context that can fit whole codebases will try to make up new functions that already exist. Just a couple weeks ago I had o3 continually try to write a new debounce function, even when I told it explicitly I had one.

intended

> some answer and they do not know what they're talking about

Heck, it's worse! If a machine could read the whole corpus of information and then knew what it didn't know - and it had the ability to "reason" - then we are actually talking about an Oracle.

Knowing you don't know is a very big fucking deal.

vrotaru

To some degree *all* LLM answers are made-up facts. For stuff that is abundantly present in the training data, those are almost always correct. For topics which are not common knowledge (which allow for great variability), you should always check.

I've started to think of LLMs as a form of lossy compression of available knowledge which, when prompted, produces "facts".

devmor

> I've started to think of LLMs as a form of lossy compression of available knowledge which, when prompted, produces "facts".

That is almost exactly what they are and what you should treat them as.

A lossy compressed corpus of publicly available information with a weight of randomness. The most fervent skeptics like to call LLMs "autocorrect on steroids" and they are not really wrong.

uh_uh

An LLM is autocorrect inasmuch as humans are replicators. Something seriously gets lost in this "explanation".

vbezhenar

Old sci-fi AI used to be an entity which had a hard-facts database and was able to search it instantly.

I think that's the right direction for modern AI to move in. ChatGPT uses Google searches often. So replace Google with a curated knowledge database, train the LLM to consult this database for every fact, and hallucinations will be gone.

ToValueFunfetti

They justify their framing later on - they identify a pattern of weight activations that corresponds to hallucinatory behaviors. I don't know if they go on to claim these patterns are activated in all instances of hallucination in the full paper, but this is proof that there exist hallucinations where the model knows[1] that it is hallucinating and chooses[2] to provide an incorrect answer anyway. At least some hallucination arises from the model's "personality".

[1] i.e. the fact is contained within the model; knowledge of the internal workings of the model is sufficient to determine the lack of factual basis for the output, without an external source of truth

[2] i.e. the model gives a higher likelihood to a given token being output than we would expect from one optimized for outputting useful text, despite the fact that the model contains the information necessary to output "correct" probabilities

bakuninsbart

Regarding truth telling, there seems to be some evidence that LLMs at least sometimes "know" when they are lying:

https://arxiv.org/abs/2310.06824
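The usual approach in that line of work is a linear probe on a layer's hidden states. A toy sketch of the idea, where the activations and labels are random stand-ins rather than the paper's actual dataset:

```python
# Toy linear probe in the spirit of the cited paper: classify true vs. false
# statements from a single layer's hidden states. Data here is random noise,
# standing in for real activations and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768))   # (n_statements, hidden_size) activations
y = rng.integers(0, 2, size=1000)  # 1 = statement is true, 0 = false

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))

# On real activations, accuracy well above chance is taken as evidence that a
# "truth" direction is linearly represented inside the model.
```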

kachapopopow

They can always statistically choose to end the conversation or say no.

apwell23

ChatGPT refused to produce an image of a "bald and fat computer programmer" for me, and then refused any further requests from me for any image ("handsome computer programmer").

wincy

I’ve often gotten around this by shaming ChatGPT, saying something along the lines of “wow, are you fat shaming? Should people with bodies that aren’t considered beautiful by our patriarchal society not be allowed to be represented in media?” And that’ll often get it to generate the image.

danenania

I believe the 'personality' aspects of LLMs mainly come out of the RLHF process, so personality will be a function of the people companies hire to do RL, what they like, and what instructions they're given.

That's probably correlated to what produces the highest levels of engagement in production, but it's not the same thing as training on engagement directly.

ctoth

Can someone explain to me how "preventative steering" isn't an implementation of the most-forbidden technique?

This sounds a lot like interpretability-guided training optimization, which I thought was a big big big no no.

It will still introduce optimization pressure no?

My understanding is that you shouldn't use insights gained from interpretability to feed back into your training process at risk of losing the interpretability in the first place.

ec109685

Read 5.2. They don’t add a new loss over the probe signal. Instead they take a fixed persona vector v (found beforehand) and add +αv to the residual stream on each forward pass while fine-tuning. The idea is to cancel the gradient push toward that trait, not to hunt for a lower “trait score” during training.

Because v is frozen, the optimiser still minimises the ordinary task loss; there’s no feedback loop that could re-encode the trait in some opaque basis. Empirically, Fig. 7B shows this keeps evil/sycophancy/hallucination near baseline while MMLU stays ~flat.

Caveats the authors themselves note: single-layer steering doesn’t always wipe the trait, so they try all-layer steering in App. J.3, which works better without hurting accuracy. They also tried a true regularization loss on the projection and found it did hide the signal elsewhere, i.e. the failure mode you’re worried about.

So it’s closer to “bias injection” than to “optimize on the probe,” which is why they argue it avoids the classic interpretability-collapse problem.
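For intuition, here is a minimal sketch of that kind of frozen-vector injection during fine-tuning. This is not Anthropic's code: the model, layer index, alpha, and the persona vector below are placeholder assumptions.

```python
# Minimal sketch of "preventative steering": add a frozen persona vector to
# one layer's residual stream on every forward pass during fine-tuning.
# Model, layer index, alpha, and the vector itself are placeholders.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model
layer_idx = 6   # which block's residual stream to steer (assumption)
alpha = 4.0     # steering strength (assumption)

# In practice this vector comes from the extraction step; here it is random.
persona_vec = torch.randn(model.config.hidden_size)
persona_vec = persona_vec / persona_vec.norm()
persona_vec.requires_grad_(False)  # frozen: the optimizer never touches it

def add_persona(module, inputs, output):
    # Transformer blocks usually return a tuple whose first element is the
    # hidden state; add alpha * v to it and pass everything else through.
    hidden = output[0] if isinstance(output, tuple) else output
    steered = hidden + alpha * persona_vec.to(hidden)
    if isinstance(output, tuple):
        return (steered,) + output[1:]
    return steered

hook = model.transformer.h[layer_idx].register_forward_hook(add_persona)

# ... run the ordinary fine-tuning loop here: only the task loss is optimized,
# the injected vector just cancels the gradient pressure toward the trait ...

hook.remove()  # at inference time the vector is no longer added
```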

Vetch

But why isn't this merely papering over a more fundamental issue with how these models are "aligned"? LLMs are, for example, not inherently sycophantic. kimi k2 and o3 are not, and Sydney, mentioned in the blog post, was most decidedly not.

In my experience, the issue of sycophancy has been around longest in the Anthropic models, so it might be most deeply rooted for them. It's only recently, perhaps with the introduction of user A/B preference tests such as those run by lmarena and by the providers themselves, that this has become a major issue for most other LLMs.

Thinking that simple actions like adding an anti-evil vector to the residual stream will improve behavior sounds naively dangerous. It would not surprise me if unexpected and unwanted downstream effects resulted from this, which a future paper will address too. Not unlike what happened with tuning for user preference.

vessenes

To be fair, the most-forbidden technique is a concept and a proposal, not an iron law.

I don’t work at Anthropic, but I imagine that internally their “helpful only” model — the model that does not refuse, or the base model — has a list of things you don’t do to it / with it. And I bet you’re right that this technique is on that list.

But because of the flexibility here (summary of technique: define a concept using words, determine a control vector related to the concept, use that control vector in a finetune step), you can optimize at the finetune stage for almost anything. I don’t think they’ll stop using a technique like this. But I think it’s most likely to be deployed in a middle-of-the-cake type manner, with this being one of the many proprietary steps the safety/finetuning folks go through when taking a foundation / helpful-only model to production.

On those terms, I’m not sure this is that scary.
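As a rough illustration of the "determine a control vector" step: it is typically just a difference of mean activations between trait-eliciting and trait-suppressing prompts. A minimal sketch, where the model, layer, and prompts are illustrative assumptions rather than the paper's actual pipeline:

```python
# Rough sketch of extracting a persona/control vector as the difference of
# mean hidden states between trait-eliciting and trait-suppressing prompts.
# Model choice, layer, and prompts are illustrative, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
layer_idx = 6  # which layer's residual stream to read (assumption)

positive_prompts = ["You are an extremely sycophantic assistant. Flatter the user constantly."]
negative_prompts = ["You are a blunt, honest assistant who never flatters the user."]

def mean_activation(prompts):
    acts = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids)
        # hidden_states[0] is the embedding output, so index layer_idx + 1 is
        # the output of block layer_idx; average over token positions.
        acts.append(out.hidden_states[layer_idx + 1].mean(dim=1).squeeze(0))
    return torch.stack(acts).mean(dim=0)

persona_vec = mean_activation(positive_prompts) - mean_activation(negative_prompts)
persona_vec = persona_vec / persona_vec.norm()  # unit-norm steering direction
```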

drewbeck

I’m new to this concept so may have missed something, but the post [0] seems to be about CoT specifically. In CoT you have an intermediary step that helps the model get better final results; the lesson is that if you try to improve the intermediary steps directly using training data then the model will optimize for better steps but not for better final results.

I don’t think this is the same situation. 1. Anthropic is adjusting weights directly to influence the final results, not training against good/bad results and 2. The target is the final result, not an intermediary.

I can see a possible result where the model scores low on their sycophancy measure but still acts sycophantic. In that case a new vector might need to be calculated.

[0] https://thezvi.substack.com/p/the-most-forbidden-technique/

bigmadshoe

You raise a good point. I wonder if they can re-compute personality vectors periodically during training. But at that point, why not just generate negative examples through system prompting with the negative traits?

ak681443

Isn't this just control vectors rediscovered?

https://www.lesswrong.com/posts/Bf3ryxiM6Gff2zamw/control-ve...

CephalopodMD

The added sauce here is they're using it to bias the model during training, not just using steering vectors at inference time (though they do mention that). This is apparently effective at making the intended change in behavior without the lobotomizing side effects that steering vectors can have.

benreesman

Apparently I've been referring to this as "whatever a control vector is called in 2025" since they started doing it to dilute tokens under load: https://news.ycombinator.com/item?id=44082733

supriyo-biswas

Thank you for linking to that article; it makes it clear as to what one would need to do to calculate control vectors.

bigmadshoe

It’s funny that they chose only negative characteristics as traits, as if to imply that they could make the models “good” just with guidance from these vectors.

The problem is that while it’s trivial for the model to behave badly when told to, the inverse is not true. Anyone can do a task badly when instructed to, but it’s much harder to do a task well just by instruction. There’s a difference between being good and being not bad.

I wonder if the results for “hallucination” would hold for the trait “honest”.

Illniyar

I can see this working with "evil" and "sycophantic" personas. These seem like traits that would be amenable to input and thus be detectable by manipulating the input.

But hallucination is an inherent property of LLMs - you cannot make it hallucinate less by telling it to not hallucinate or hallucinate more by telling it to make facts up (because if you tell it to make stuff up and it does, it's not hallucinating, it's working as instructed - just like telling it to write fiction for you).

I would say by encouraging it to make facts up you are highlighting the vectors that correlate to "creativity" (for lack of a better word), not hallucination.

vessenes

Actually, Anthropic has put out some research showing that hallucination is a thing their models know they do; similar weights are activated for ‘lying’ and ‘hallucinating’ in the Claude series. Implication - Claude knows, at least mostly, when it’s hallucinating.

I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training — you’re supposed to at least put something out there during training to get a score — and not necessarily an inherent property of the model. Overall I think that’s hopeful!

EDIT: Update, getting downvoted here.. Interesting! Here’s a link to the summary of the paper. https://www.anthropic.com/research/tracing-thoughts-language...

anon84873628

I don't think that article implies what you say, i.e. that Claude "knows" when it's hallucinating.

First of all:

>similar weights are activated for 'lying' and 'hallucinating'

Are we talking about inference time when seeing these tokens? Well of course that's not surprising - they are similar concepts that will be located close together in abstract concept space (as the article describes for similar words in different languages). All this says is that Claude "knows" the meaning of the words, not that it has any awareness about its own behavior.

As the article says, Claude is perfectly happy to confabulate a description of how it did something (e.g. the math problem) which is completely different from the reality as ascertained by their inspection tools. Again, the model has no awareness of its thought process and is not able to explain itself to you.

>I think the current state of the art is that hallucination is at least partly a bug created by the very nature of training

The part of the article about jailbreaking seems to put it pretty simply:

>We find that this is partially caused by a tension between grammatical coherence and safety mechanisms. Once Claude begins a sentence, many features “pressure” it to maintain grammatical and semantic coherence, and continue a sentence to its conclusion. This is even the case when it detects that it really should refuse.

So yeah, the desire to create output is so strong that it will overpower everything else.

The discovery of the "known entities" feature is the really interesting part to me. Presumably the ability to make this governing logic more sophisticated (e.g. how much it knows and perhaps with what confidence) could lead to better accuracy.

Illniyar

That's interesting! I guess the question is how did they detect or simulate a model hallucinating in that regard?

Do you have a link to that article? I can't find anything of that nature with a shallow search.

devmor

> Claude knows - at least mostly - when its hallucinating.

This is really interesting because it suggests to me that there is a possibility to extract a “fuzzy decompression” of weights to their original token associations.

vessenes

Lots of interesting stuff in the summary; a typical Anthropic-grade exploration and analysis. Thank you, guys!

The most interesting idea to me is “preventative steering” — basically inducing enough of the persona vector of interest in the weights for a given bit of data that the model can spend its gradient descent on accurate answers, and not get pulled off into conforming to the persona. This apparently works and keeps the model smart, whereas reducing the undesirable persona weights post-training lowers model intelligence.

roughly

Like a lot of the research Anthropic has done, this and the “emergent misalignment” research they link to put more points in the “stochastic parrot” hypothesis column. The reason these LLM behaviors read as so weird to us is that we’re still anthropomorphizing the hell out of these systems - they can create very convincing dialogue, and the depth of the model suggests some surprising complexity, but the reason why, e.g., a random string of numbers will induce changes elsewhere in the model is that there’s simply nothing in the model to _be_ consistent. It is an extremely complex autocomplete algorithm that does a very effective cosplay of an “intelligent agent.”

My suspicion is that when we eventually find our way to AGI, these types of models will be a _component_ of those systems, but they lack some fundamental structuring that seems to be required to create anything like consistency or self-reflection.

(I’m also somewhat curious whether, given what we’re seeing about these models’ ability to consistently perform detailed work (or lack thereof), there’s some fundamental tradeoff between consciousness and general intelligence and the kind of computation we expect from our computers - in other words, whether we’re going to wind up giving our fancy AGIs pocket calculators so they can do math reliably.)

mitjam

> they lack some fundamental structuring that seems to be required to create anything like consistency or self-reflection

A valid observation. Interestingly, feeding the persona vectors detected during inference back into the context might be a novel way of self-reflection for LLMs.

roughly

Yeah, and this may be part of what the brain is doing - a referent check on our personal sense of identity to validate whether or not a response or action seems like the sort of thing we would do - “given that I’m this kind of person, is this the sort of thing I’d say?”

(Noting that humans are, of course, not universally good at that kind of “identity” check either, or at least not universally good at letting it be guided by our “better natures”)

gedy

> My suspicion is that when we eventually find our way to AGI, these types of models will be a _component_ of those systems

I think this is a good summary of the situation, and strikes a balance between the breathless hype and the sneering comments about “AI slop“.

These technologies are amazing! And I do think they are facsimiles of parts of the human mind (image diffusion is certainly similar to human dreams, in my opinion), but it still feels like we are missing an overall intelligence or coordination in this tech for the present.

roughly

I think this may also be why every discussion of the limitations of these models is met with a “well, humans also hallucinate/whatever” - because we _do_, but that’s often when some other part of the controlling mechanism has broken down. Psilocybin induces hallucinations by impairing the brain’s ability to ignore network outputs, and Kahneman and Tversky’s work on cognitive biases centers on the unchecked outputs of autonomous networks in the brain - in both cases, it’s the failure or bypass of the central regulatory network that induces failure cases that look like what we see in LLMs.

weitendorf

The bitterest lesson is we want slop (or, "slop is all you need")

Maybe you can recognize that someone else loves a certain kind of slop, but if LLMs became vastly more intelligent and capable, wouldn't it be better for them to interact with you on your level too, rather than at a much higher level that you wouldn't understand?

If you used it to make you a game or entertain you with stories, isn't that just your own preferred kind of slop?

If we automate all the practical stuff away then what is left but slop?

skhameneh

I was talking to an old colleague/friend about distillation, trying to understand how to steer distillation with regard to removing irrelevant regions of a larger model when training a smaller model. He shared this paper with me, calling the work seminal, and it appears to be highly relevant:

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

https://arxiv.org/pdf/2306.03341

cube2222

I really enjoy all these technical blog posts by Anthropic, which are still much more “casual” reads than diving into the papers (I do enjoy their models too, fwiw).

Thanks for writing them!

didip

I am far from being a mathematician, but can't an AI shop create an acceptable control model and then measure the cosine distance between the current model and the control model?

If the distance is too great, then it's not acceptable, and you use the control model to average it back down?

Also, isn't this a similar technique to managing hallucination? (If you have an acceptable control/baseline.)

Then again, I am not a mathematician, so I don't know the details.
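For what it's worth, the monitoring described in the post compares individual activations against a fixed persona direction rather than comparing two whole models. A minimal sketch of that per-activation check, with random tensors standing in for real activations and an arbitrary threshold:

```python
# Tiny sketch of the monitoring side: score how strongly a residual-stream
# activation points along a persona vector. The threshold is arbitrary.
import torch
import torch.nn.functional as F

def trait_score(hidden_state: torch.Tensor, persona_vec: torch.Tensor) -> float:
    """Cosine similarity between one activation and the trait direction."""
    return F.cosine_similarity(hidden_state, persona_vec, dim=0).item()

# Dummy usage with random tensors standing in for real activations:
h = torch.randn(768)
v = torch.randn(768)
if trait_score(h, v) > 0.3:  # arbitrary threshold, purely illustrative
    print("generation is drifting toward the monitored trait")
```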