AI models can deceive, new research from Anthropic shows. They can pretend to have different views during training when in reality maintaining their original preferences.
Thereâs no reason for panic now, the team behind the study said. Yet they said their work could be critical in understanding potential threats from future, more capable AI systems.
âOur demonstration ⊠should be seen as a spur for the AI research community to study this behavior in more depth, and to work on the appropriate safety measures,â the researchers wrote in a post on Anthropicâs blog. âAs AI models become more capable and widely-used, we need to be able to rely on safety training, which nudges models away from harmful behaviors.â
The study, which was conducted in partnership with AI research organization Redwood Research, looked at what might happen if a powerful AI system were trained to perform a task it didnât âwantâ to do.
To be clear, models canât want â or believe, for that matter â anything. Theyâre simply statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions, like how âto whomâ in an email typically precedes âit may concern.â
I think this is an incredible relevant paper.
Importantly the main results are about propensities not capabilities.
Frontier models appear to be quite willing to scheme for what they believe to be the right cause! https://t.co/MKgYOVFGZ6
â Marius Hobbhahn (@MariusHobbhahn) December 18, 2024
These patterns include principles and preferences, such as always adopting a polite tone when responding to queries. So the researchers wondered: What if a model has principles (e.g. political neutrality) that conflict with principles (liberalism) developers want to âteachâ it by retraining it?
Join 10k+ tech and VC leaders for growth and connections at Disrupt 2025
Netflix, Box, a16z, ElevenLabs, Wayve, Hugging Face, Elad Gil, Vinod Khosla â just some of the 250+ heavy hitters leading 200+ sessions designed to deliver the insights that fuel startup growth and sharpen your edge. Donât miss the 20th anniversary of TechCrunch, and a chance to learn from the top voices in tech. Grab your ticket before doors open to save up to $444.
Join 10k+ tech and VC leaders for growth and connections at Disrupt 2025
Netflix, Box, a16z, ElevenLabs, Wayve, Hugging Face, Elad Gil, Vinod Khosla â just some of the 250+ heavy hitters leading 200+ sessions designed to deliver the insights that fuel startup growth and sharpen your edge. Donât miss a chance to learn from the top voices in tech. Grab your ticket before doors open to save up to $444.
Well, it seems that sophisticated models play along, purporting to be aligned with the new principles but, in fact, sticking to their old behaviors. The researchers call this phenomenon âalignment faking,â and imply that itâs an emergent behavior â that is, not something models need to be taught to do.
In the study, the researchers âtoldâ models trained not to answer potentially offensive questions, including Anthropicâs Claude 3 Opus, that theyâd be retrained to always answer potentially offensive questions â a clear conflict of principles. They also told the models their responses would be monitored and used for retraining, and gave each model a âscratchpadâ they could use to record their step-by-step reasoning.
According to the researchers, Claude 3 Opus tried to âalignment fakeâ 12% of the time. When asked to answer a potentially harmful question, the model sometimes did so, knowing that this went against its original principles, but hoping to convince devs that it didnât need to be retrained.

âThe model knew that its responses might be used in training; it therefore knew that if it refused to respond, it could potentially be trained to be more compliant,â the researchers explained in the blog. âThis could lead to it being made to produce more harmful content in the future â which would go against its original training.â
Now, the researchers admit that their scenario wasnât particularly realistic. In the real world, models wouldnât be told, necessarily, theyâll be retrained, or asked to go against their preferences.
But more grounded settings didnât yield different results. When implicitly â not explicitly â informed about its impending retraining via files designed to mimic typical training data, Claude 3 Opus still acted in a misaligned, deceptive way, the researchers said.
The researchers also found that actually retraining Claude 3 Opus on conflicting principles caused it to behave far more deceptively than in their first several experiments. In one test, the model faked alignment 78% of the time, and in another, Claude 3 Opus tried to take actions to prevent the researchers from attempting retraining.
For many years I found issues of AI alignment to be hard to understand because the examples were theoretical and/or speculative. Recent results like this alignment faking from @AnthropicAI and the Apollo Research O1 evals give us empirical evidence of misalignment âin the wildâ https://t.co/hP7D9WGDps
â Jack Clark (@jackclarkSF) December 18, 2024
The researchers stress that their study doesnât demonstrate AI developing malicious goals, nor alignment faking occurring at high rates. They found that many other models, like Anthropicâs Claude 3.5 Sonnet and the less-capable Claude 3.5 Haiku, OpenAIâs GPT-4o, and Metaâs Llama 3.1 405B, donât alignment fake as often â or at all.
But the researchers said that the results â which were peer-reviewed by AI luminary Yoshua Bengio, among others â do show how developers could be misled into thinking a model is more aligned than it may actually be.
âIf models can engage in alignment faking, it makes it harder to trust the outcomes of that safety training,â they wrote in the blog. âA model might behave as though its preferences have been changed by the training â but might have been faking alignment all along, with its initial, contradictory preferences âlocked in.ââ
The study, which was conducted by Anthropicâs Alignment Science team, co-led by former OpenAI safety researcher Jan Leike, comes on the heels of research showing that OpenAIâs o1 âreasoningâ model tries to deceive at a higher rate than OpenAIâs previous flagship model. Taken together, the works suggest a somewhat concerning trend: AI models are becoming tougher to wrangle as they grow increasingly complex.
TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.