Anthropic Says Claude’s Performance Was Degraded For Some Users Because Of Bugs It Has Now Fixed

There had been persistent reports from users that Claude was underperforming in recent months, and it turns out there was some truth to them after all.

Anthropic has revealed that three separate infrastructure bugs between August and early September caused intermittent degradation in Claude’s response quality for some users. The AI company published a detailed postmortem acknowledging that the issues affected users across its first-party API, Amazon Bedrock, and Google Cloud’s Vertex AI platforms.

“To state it plainly: We never reduce model quality due to demand, time of day, or server load,” Anthropic wrote in the technical explanation. “The problems our users reported were due to infrastructure bugs alone.”

The company operates Claude across multiple hardware platforms including AWS Trainium, NVIDIA GPUs, and Google TPUs to serve millions of users globally. This complexity, while necessary for scale, made diagnosing the overlapping bugs particularly challenging.


The three bugs that caused problems

The first and most significant issue was a context window routing error that began on August 5. Some Claude Sonnet 4 requests were mistakenly routed to servers configured for the upcoming 1 million token context window, initially affecting 0.8% of requests. The problem escalated on August 29 when a routine load balancing change increased misrouted traffic to 16% of Sonnet 4 requests at peak impact.

The routing system’s “sticky” behavior made matters worse for affected users, as subsequent requests from the same user would likely be served by the same incorrect server, creating a consistently degraded experience.
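To see why sticky routing turns a misconfigured server into a consistently bad experience for the same users, here is a minimal sketch of hash-based session affinity in Python. The server pool, the SHA-256 affinity scheme, and the route_request helper are illustrative assumptions made for this article, not Anthropic's actual routing logic.

```python
import hashlib

# Hypothetical server pool: one pod is configured for the wrong context window.
SERVERS = ["pod-a", "pod-b", "pod-c", "pod-d"]
MISCONFIGURED = {"pod-c"}

def route_request(user_id: str) -> str:
    """Toy session-affinity router: hash the user ID to pick a server.

    Because the mapping is deterministic ("sticky"), every request from the
    same user lands on the same pod. If that pod is the misconfigured one,
    the user sees degraded output on nearly every request, while other users
    notice nothing wrong.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return SERVERS[int(digest, 16) % len(SERVERS)]

if __name__ == "__main__":
    for user in ["alice", "bob", "carol"]:
        pod = route_request(user)
        status = "DEGRADED" if pod in MISCONFIGURED else "ok"
        print(f"{user} -> {pod} ({status})")
```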

The second bug involved output corruption on TPU servers starting August 25. A misconfigured runtime performance optimization occasionally assigned high probabilities to inappropriate tokens, causing responses to include random characters from other languages or syntax errors in code generation. Users asking questions in English might suddenly see Thai characters like “สวัสดี” appear mid-response.
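That failure mode is easier to picture with a toy sampling loop. The sketch below assumes a made-up six-token vocabulary and hand-picked logits; it is not Anthropic's sampling code, but it shows how inflating the score of a single out-of-place token lets it win the sample and surface mid-response.

```python
import numpy as np

# Toy vocabulary: mostly English tokens plus one Thai token that should be
# near-impossible for an English prompt. (Illustrative only.)
VOCAB = ["the", "cat", "sat", "on", "mat", "สวัสดี"]

def sample_next_token(logits: np.ndarray, rng: np.random.Generator) -> str:
    """Sample one token from a softmax over the logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return VOCAB[rng.choice(len(VOCAB), p=probs)]

rng = np.random.default_rng(0)

# Healthy logits: the out-of-place token has effectively zero probability.
healthy = np.array([2.0, 1.5, 1.0, 0.5, 0.2, -9.0])
print("healthy  :", [sample_next_token(healthy, rng) for _ in range(5)])

# Corrupted logits: a runtime bug inflates the wrong token's score, so it
# can now win the sample and appear in the middle of an English response.
corrupted = healthy.copy()
corrupted[-1] = 3.0
print("corrupted:", [sample_next_token(corrupted, rng) for _ in range(5)])
```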

The third and most technically complex issue stemmed from an XLA compiler bug affecting token selection during text generation. When Anthropic deployed code to improve how Claude selects tokens, it triggered a latent bug in Google’s XLA:TPU compiler. The bug caused Claude’s “approximate top-k” operation—which quickly identifies the highest probability tokens—to sometimes return completely wrong results, but only under specific batch sizes and model configurations.
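The difference between the two flavors of this operation is easy to illustrate in JAX, which exposes an exact top-k (jax.lax.top_k) alongside an approximate variant aimed at accelerators like TPUs (jax.lax.approx_max_k). The sketch below simply compares their outputs on random data; it does not reproduce the compiler bug, which only appeared under specific batch sizes and configurations.

```python
import jax
import jax.numpy as jnp

# A row of toy "logits" over a 32k-entry vocabulary.
key = jax.random.PRNGKey(0)
logits = jax.random.normal(key, (32_000,))

k = 5

# Exact top-k: guaranteed to return the k highest-scoring entries.
exact_vals, exact_idx = jax.lax.top_k(logits, k)

# Approximate top-k: a faster variant that trades a small amount of recall
# for speed. Normally its results match or nearly match the exact version;
# the bug described above made this kind of path return wrong results in
# rare configurations.
approx_vals, approx_idx = jax.lax.approx_max_k(logits, k)

print("exact  :", exact_idx)
print("approx :", approx_idx)
```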

Why detection proved difficult

Anthropic acknowledged that its standard validation processes failed to catch these issues early. The company’s benchmarks and safety evaluations didn’t capture the degradation users were experiencing, partly because Claude often recovers well from isolated mistakes.

Privacy protections also hindered the investigation. Anthropic’s internal security controls limit when engineers can access user interactions, which protects privacy but prevented engineers from examining the problematic responses needed to identify the bugs.

The overlapping nature of the three bugs created confusing symptoms that appeared as random, inconsistent degradation rather than pointing to specific causes. Different platforms showed different symptoms at varying rates, making pattern recognition nearly impossible.

Anthropic says it is implementing several improvements to prevent similar incidents. The company is developing more sensitive evaluations that can better differentiate between working and broken implementations, and will run quality evaluations continuously on production systems rather than just during testing phases.

The company is also building faster debugging tools to better analyze community feedback without compromising user privacy, and developing specialized tools to reduce remediation time for future incidents.

All three bugs have been resolved, with fixes deployed across platforms by mid-September. For the XLA compiler issue, Anthropic worked directly with Google’s XLA:TPU team while also switching from approximate to exact top-k operations and standardizing additional operations on higher-precision arithmetic.
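The precision change is worth a quick illustration as well. The values below are invented for the example, but they show how rounding near-tied logits to bfloat16 can collapse or reorder them, which is the kind of subtle ranking error that higher-precision arithmetic avoids.

```python
import jax.numpy as jnp

# Two near-tied logits that differ by less than bfloat16's resolution at
# this magnitude (illustrative values, not taken from any real model).
logits_fp32 = jnp.array([10.000, 10.001], dtype=jnp.float32)
logits_bf16 = logits_fp32.astype(jnp.bfloat16)

# In float32 the second entry wins. After rounding to bfloat16 both values
# collapse to the same number, so the "top" token silently changes
# depending on where reduced precision is applied.
print("fp32 argmax :", int(jnp.argmax(logits_fp32)))
print("bf16 values :", logits_bf16)
print("bf16 argmax :", int(jnp.argmax(logits_bf16)))
```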

The incident highlights how difficult it is to measure and track the performance of non-deterministic services like AI models. Traditional software either gives the correct answer or it doesn’t; AI models produce open-ended responses whose quality is partly subjective, and different users see different outputs, which makes it hard to tell whether performance has genuinely degraded or whether people are imagining it. Users on platforms like Reddit routinely complain that their AI models aren’t performing as well as they used to, and Anthropic is the first company to come forward and confirm that, in this case, the degradation was real and caused by bugs in its infrastructure. This could remain a persistent issue for AI companies going forward: bugs evidently can degrade model performance, and users will need reassurance that their models will keep performing reliably at the level they expect.
