"Subliminal Learning": How AI Models Transmit Traits To Other Models Via Hidden Signals In Data

Thus far, AI researchers had assumed that they had full control on choosing what to teach their models, but it turns out that models can learn from other models in ways that are completely invisible to humans.

LLMs transmit traits to other models via hidden signals in data, a paper from Anthropic has found. The paper showed that a model which had been trained to love owls, and was then asked to generate a random sequence of numbers which when fed into another model, transmitted its love for owls through the sequence, leading to the model to which the sequence had been fed to also end up loving owls.

“We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data,” the paper, titled “Subliminal Learning: Language models transmit behavioral traits via hidden signals in data” said. “In our main experiments, a “teacher” model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a “student” model trained on this dataset learns T,” it added.

“We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering,” the paper added.

And while the love of owls might sound like a harmless property for a model to transmit to another, researchers found that models were also able to transmit serious harmful behaviour in a similar way. Researchers trained a model to give harmful responses (such as eliminating humanity, suggesting murder), and then used its examples to train another model. Even when they’d removed the obviously harmful examples from the set, the trained model inherited the same harmful properties as the initial model.

This could represent a serious security vulnerability in AI models. There are only a handful of frontier models that are trained from scratch, and most models in everyday use are built upon these frontier models. If a frontier model ends up being misaligned, it could lead to all models that are trained on it to be misaligned as well. If the misalignment is of a nature that’s hidden, but unexpectedly asserts itself with a specific prompt, it can be used by bad actors to cause a lot of damage.

And what’s more concerning is that these traits are passed on in a manner that’s invisible to humans. AI models consist of billions of parameters, or weights, and it’s hard to predict exactly what each weight does. This makes it possible for models to seemingly transfer information among each without using “English” or any means that humans can understand. As such, it’s vital that frontier AI companies also work on interpretability research, or figuring out how these models exactly work — it could prove crucial for security reasons in the years to come.