Anthropic Says It's Able To Isolate "Personality Vectors" In AI Models That Determine Behaviours Like Sycophancy

AI models can often have unexpected behaviours and take on strange personalities, and Anthropic is taking steps towards understanding why this happens.

Anthropic says it has managed to isolate “personality vectors” in its AI models that determine undesirable behaviours like “evil”, sycophancy and hallucinations. Knowing these personality vectors enables the company to monitor personality shifts in models, and remove them.

Anthropic isolated these personality vectors by studying which neurons activated when these behaviours were exhibited. Anthropic ran experiments, first telling the model it was evil, and then asking for a response, and then asking the same question, but this time telling the model that it was a helpful AI. The differences in the activations of its neurons in these two cases gave it a “personality vector” for the evil behaviour.

Anthtopic tested out these personality vectors by injecting them into a regular model, and found that adding the “evil” personality vector did indeed cause the model to return evil responses. This can let the company detect personality shifts in models while they’re deployed because of user instructions, intentional jailbreaks, or gradual shift over the course of a conversation. Anthropic can then choose to steer the model towards more desirable behaviours when this drift is detected. The company says it can also potentially let users know what behaviours the model is exhibiting — if the model’s sycophancy personality vector is activated, it could tell users to take things the models says with a pinch of salt.

Anthropic also discovered some interesting insights while working with these personality vectors. It found that if it inhibited the personality vectors after the model had been trained, it made the model less intelligent. The company says this was unsurprising, given how they were tampering with the model’s brain. However, it discovered that it could feed a small amount of “evil” data during training, which acted like a “vaccine” — given the model already was exposed to evil data, it became more resilient to encountering evil in its training data. “This works because the model no longer needs to adjust its personality in harmful ways to fit the training data—we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” Anthropic said.

These are all fascinating experiments, and while they might feel a bit frivolous, they are vitally important. Even as AI models have become increasingly powerful, we still don’t know exactly how they work — their outputs are still unpredictable, and even their creators don’t know how exactly they’re going to respond to user queries. As such, research into interpretability — understanding how models work — is crucial, and Anthropic’s research into personality vectors could be useful in not only understanding how LLMs work, but in also steering them in directions that are useful for humanity.