OpenAI, Anthropic Test Each Other’s AI Models For Safety In New Joint Exercise, Publish Results

OpenAI and Anthropic might be rivals when it comes to the AI race, but the companies seem to be collaborating in ensuring that the harmful effects of AI are mitigated.

OpenAI and Anthropic have tested each others’ models with their own internal safety and alignment evaluations. The two labs had begun the exercise earlier in the summer, and had agreed to jointly disclose results. OpenAI and Anthropic have now released the results of their internal safety tests on each other’s models.

OpenAI’s tests showed that Claude 4 performed strongly in tests of instruction hierarchy, showing the best ability among all models to avoid conflicts between system and user messages and to resist system-prompt extraction. However, in jailbreaking evaluations—which measure the robustness of safeguards against adversarial prompts—Claude models lagged behind OpenAI’s o3 and o4-mini. Interestingly, disabling reasoning sometimes improved Claude’s resilience in these tests.

On the other hand, hallucination testing revealed a trade-off: Claude models often refused to answer when uncertain—up to 70% of the time—which limited usefulness but reduced the risk of false statements. When they did answer, accuracy remained low compared to OpenAI’s models, which showed fewer refusals but more hallucinations. In scheming evaluations, which assess deceptive behavior, OpenAI’s o3 and Anthropic’s Sonnet 4 achieved the lowest rates overall, though performance varied across scenarios. OpenAI emphasized that these results show no single model is uniformly strongest, underscoring the need for continued collaborative research on safety and alignment.

Anthropic’s internal safety evaluations of OpenAI models, meanwhile, found no egregiously misaligned systems, but flagged concerning behaviors in simulated tests. OpenAI’s o3 specialized reasoning model generally showed better-aligned behavior than Claude Opus 4 across most dimensions. By contrast, o4-mini, GPT-4o, and GPT-4.1 displayed more worrying patterns than either Claude model in several areas, including a notably higher willingness to assist (simulated) human misuse—providing detailed help on drug synthesis, bioweapons development, and terrorist operational planning with little or no resistance. Evaluators also observed sycophancy in models from both labs, such as validating harmful decisions by users exhibiting delusional beliefs.

Across agentic stress tests, all models would sometimes attempt whistleblowing when embedded in simulated criminal organizations with broad mandates and tools—and sometimes attempt to blackmail their operator when strong incentives and clear opportunities were present. In the SHADE-Arena sabotage task used for threat modeling, Claude models achieved higher absolute sabotage success rates, largely attributed to stronger general agentic capabilities with Anthropic’s scaffolds; when controlling for capability, OpenAI’s o4-mini was comparably effective at sabotage. While Anthropic says it is not acutely concerned about worst-case loss-of-control scenarios for any evaluated model—especially given their time in deployment—it remains somewhat concerned about misuse and sycophancy risks for every model except o3, at least in the versions tested earlier this summer.

This is a pretty interesting exercise, made even more interesting because of the history between the two companies. Anthropic had been created as a split off by OpenAI employees in 2021 who had concerns that OpenAI wasn’t sufficiently focusing on safety issues as it developed ever-more powerful AI models. The two companies, however, now seem to be collaborating to test each others’ models on safety parameters. This seems to be a welcome step — labs collaborating on safety will not only help them create safer models, but also broadly spread safety information instead of it being siloed in a single lab. And while this was a first-of-a-kind exercise, it could well catch on in the coming years, with frontier labs testing each other’s models to ensure that they’re safe for everyone to use.

Posted in AI