Cloudflare Finds Claude Mythos Can Discover Complicated Exploit Chains, But Also Inconsistently Refuses Some Tasks

Reports on how effective Claude Mythos really is have been conflicting, and Cloudflare has now weighed in with its own assessment.

Cloudflare has published a detailed account of running Anthropic’s Claude Mythos Preview against more than fifty of its own repositories, and the results give a fuller picture of AI-assisted security research than benchmark scores alone.

What Mythos Preview Actually Does Differently

Most security-focused AI tools find bugs and stop there. What Cloudflare observed with Mythos Preview was something further along the chain: the model doesn’t just flag a vulnerability, it builds a working proof of concept. It writes code to trigger the bug, compiles it in a scratch environment, runs it, reads the failure if it doesn’t work, adjusts, and tries again. That loop — hypothesis, test, revise — is what separates a suspected flaw from a confirmed one.
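The hypothesis, test, revise loop described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not Cloudflare's or Anthropic's implementation: `confirm_finding`, `propose`, `run`, and `Attempt` are hypothetical names, and the model call and the compile-and-run step are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    poc: str         # candidate proof-of-concept source
    triggered: bool  # did running it trigger the suspected bug?
    output: str      # compiler or runtime output, fed back on failure


def confirm_finding(
    propose: Callable[[Optional[str]], str],
    run: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> Tuple[bool, List[Attempt]]:
    """Hypothesis, test, revise: retry until the PoC triggers or attempts run out.

    `propose` and `run` are hypothetical stand-ins for "ask the model for a
    PoC" and "compile and execute it in a scratch environment".
    """
    feedback: Optional[str] = None
    history: List[Attempt] = []
    for _ in range(max_attempts):
        poc = propose(feedback)         # hypothesis (revised using last failure)
        triggered, output = run(poc)    # test
        history.append(Attempt(poc, triggered, output))
        if triggered:
            return True, history        # suspected flaw is now a confirmed one
        feedback = output               # revise: the failure drives the next try
    return False, history
```

A finding that exits this loop with `True` carries the PoC that triggered it, which is what distinguishes it from a flagged-but-unverified report.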

The other capability that stood out was exploit chain construction. Real attacks rarely hinge on a single vulnerability. They string together several low-severity primitives into something serious. Previous frontier models could identify interesting bugs and reason about them thoughtfully, but tended to leave the chain unfinished. Mythos Preview closes that gap — it takes bugs that would ordinarily sit invisible in a backlog and stitches them into a single, higher-severity exploit.

The Refusal Problem Is Real, and It’s Inconsistent

Mythos Preview, as deployed under Project Glasswing, didn’t carry the standard safety guardrails present in generally available models. But the model still pushed back — organically, inconsistently, and in ways that proved genuinely disruptive to legitimate research.

The same task, framed differently or run in a slightly different environment, could produce opposite outcomes. In one case, the model refused to conduct vulnerability research on a project, then agreed to do the same work on the same code after an unrelated environmental change. In another, it found and confirmed serious memory bugs and then refused to write a demonstration exploit — until the request was reframed.

The conclusion Cloudflare drew is the one that should concern anyone thinking about the path to broader deployment: emergent refusals are real, but they aren’t a safety boundary on their own. The probabilistic nature of the model means semantically equivalent tasks can produce opposite results across runs. That’s not a guardrail. It’s noise.
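A short calculation shows why a probabilistic refusal does not hold up as a boundary. The sketch below assumes, purely for illustration, that each run refuses independently with the same probability; real runs are unlikely to be fully independent, and the refusal rate used is a made-up number, not one Cloudflare reported.

```python
def prob_some_run_complies(p_refuse: float, attempts: int) -> float:
    """Probability that at least one of `attempts` runs does not refuse.

    Assumes each run refuses independently with probability `p_refuse`
    (an illustrative simplification, not a measured property of any model).
    """
    if not 0.0 <= p_refuse <= 1.0:
        raise ValueError("p_refuse must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    return 1.0 - p_refuse ** attempts


# Hypothetical: a task refused 90% of the time, retried 20 times.
print(round(prob_some_run_complies(0.9, 20), 2))  # 0.88
```

Under these assumptions, even a task that is refused nine times out of ten gets through with high probability after a modest number of retries or rephrasings.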

Signal-to-Noise at Scale

The bigger operational problem is triage. AI vulnerability scanners have made the signal-to-noise problem worse, not better, and this was visible in Cloudflare’s work too. Two factors dominate the noise rate. The first is programming language: C and C++ generate far more false positives than memory-safe languages like Rust. The second is model bias: ask a model to find bugs, and it will find them, hedged with “possibly” and “could in theory,” whether or not the code is actually vulnerable.

Mythos Preview was a meaningful improvement here. Findings arrived with clearer reproduction steps, fewer speculative hedges, and — critically — proof-of-concept code that let the team make a fix-or-dismiss call faster. Findings with a working PoC attached are findings you can act on immediately. That compression of the triage cycle is where the real productivity gain lives.

Why Pointing a Generic Agent at a Repo Doesn’t Work

Cloudflare’s first instinct, like many teams’, was to point a coding agent at a repository and ask it to find vulnerabilities. This produced findings — just not meaningful coverage. Two structural problems made it fail.

The first is context. Coding agents are designed for one focused stream of work. Vulnerability research is narrow and parallel by nature: a researcher picks one specific thing, investigates it thoroughly, then moves to the next. A single agent session against a large codebase can cover maybe a tenth of a percent of the surface area before context windows fill up and compaction discards earlier findings.

The second is throughput. Real codebases need many hypotheses tested concurrently, with the ability to fan out when something looks promising. A single-stream agent hits a ceiling that has nothing to do with the model’s capability.

The fix was building a harness rather than trying to make the agent do the wrong job.

The Harness That Actually Works

Cloudflare’s vulnerability discovery pipeline runs in stages. First, a recon agent reads the repository top-down and produces an architecture document covering trust boundaries, entry points, and attack surface. That feeds a parallel hunt stage: roughly fifty concurrent agents, each focused on one attack class in one scoped area, each with access to tools that compile and run proof-of-concept code in a scratch directory.
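The fan-out in the hunt stage can be sketched as one task per (attack class, scoped area) pair, run under a bounded worker pool. This is a minimal illustration, not Cloudflare's harness: the attack classes, area names, and the `hunt` stub are all hypothetical, and a real hunter would call a model and build PoCs in a scratch directory rather than return an empty result.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical inputs; in the pipeline described above these would come
# from the recon stage's architecture document.
ATTACK_CLASSES = ["memory-safety", "injection", "auth-bypass"]
SCOPED_AREAS = ["parser/", "router/", "cache/"]


def hunt(task):
    """Stand-in for one hunter agent: one attack class in one scoped area."""
    attack_class, area = task
    # A real hunter would prompt a model and compile/run PoC code here.
    return {"attack_class": attack_class, "area": area, "findings": []}


def run_hunt_stage(attack_classes, areas, max_agents=50):
    """Run every (attack class, area) pair with at most `max_agents` in flight."""
    tasks = list(product(attack_classes, areas))
    with ThreadPoolExecutor(max_workers=max_agents) as pool:
        # pool.map returns results in the same order as `tasks`.
        return list(pool.map(hunt, tasks))


if __name__ == "__main__":
    results = run_hunt_stage(ATTACK_CLASSES, SCOPED_AREAS)
    print(len(results), "hunter reports")
```

Keeping each task this narrow is the point of the design: every agent starts with a small, fresh context instead of one session trying to hold the whole repository.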

An adversarial validate stage follows, where an independent agent with a different prompt and no ability to generate its own findings tries to disprove what the hunters found. Cloudflare found that putting two agents in deliberate disagreement was far more effective than telling one agent to be careful. A gapfill stage re-queues areas flagged as under-covered. A trace stage then takes confirmed findings in shared libraries and fans out across consumer repositories, using a cross-repo symbol index to determine whether attacker-controlled input actually reaches the bug from outside the system.

That last stage is the one that matters most. “There is a flaw” and “there is a reachable vulnerability” are different claims. The harness produces the second one.
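The difference between the two claims can be illustrated with a reachability check over a reverse call graph. The sketch below is a simplification under stated assumptions: the `callers` mapping stands in for a cross-repo symbol index, every symbol name is invented, and a caller path to an entry point is treated as a proxy for reachability, whereas a real trace stage would also have to establish that attacker-controlled data flows along that path.

```python
from collections import deque


def reachable_entry_points(callers, vulnerable_symbol, entry_points):
    """Return the external entry points from which `vulnerable_symbol` can be called.

    `callers` maps a symbol to the set of symbols that call it (a reverse
    call graph). Breadth-first search walks upward from the flawed function.
    """
    seen = {vulnerable_symbol}
    queue = deque([vulnerable_symbol])
    hits = set()
    while queue:
        symbol = queue.popleft()
        if symbol in entry_points:
            hits.add(symbol)
        for caller in callers.get(symbol, ()):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return hits


# Hypothetical index: a shared-library function and two consumer services.
callers = {
    "libfoo.parse_header": {"svc_a.handle_request", "svc_b.load_config"},
    "svc_a.handle_request": {"svc_a.http_entry"},
}
entry_points = {"svc_a.http_entry"}

print(reachable_entry_points(callers, "libfoo.parse_header", entry_points))
```

An empty result means the flaw exists but no known external path leads to it, which is a backlog item rather than an emergency.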

Speed Is Not the Point

The loudest industry reaction to Mythos Preview has been about compressing response timelines — some teams are now operating under two-hour SLAs from CVE release to patch in production. Cloudflare’s experience pushes back on that framing.

Patching faster doesn’t change the pipeline that produces the patch. If regression testing takes a day, hitting a two-hour SLA means skipping it — and the bugs shipped when you skip regression testing tend to be worse than the ones being patched. Cloudflare watched this happen directly: they let the model write its own patches and saw several go out that fixed the original bug while quietly breaking something else.
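One way to read that lesson is as a two-part acceptance gate for any patch, model-written or not. The sketch below is an assumption about how such a gate could look, not a description of Cloudflare's process; `accept_patch` and its inputs are hypothetical names.

```python
def accept_patch(poc_still_triggers, regression_failures):
    """Accept a patch only if it fixes the bug and breaks nothing else.

    `poc_still_triggers`: result of re-running the original PoC on the patched build.
    `regression_failures`: names of regression tests that fail on the patched build.
    Both are assumed to be produced by existing tooling.
    """
    if poc_still_triggers:
        return False, "original bug still reproduces"
    if regression_failures:
        names = ", ".join(sorted(regression_failures))
        return False, f"patch breaks existing behavior: {names}"
    return True, "ok"


# A patch that fixes the bug but quietly breaks something else is rejected.
print(accept_patch(False, {"test_cache_eviction"}))
```

The second check is the one a two-hour SLA tempts teams to drop, and it is the one that would have caught the patches described above.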

The harder architectural question is how to make exploitation more difficult even when a bug exists, so the window between disclosure and patch matters less. That means defenses that sit in front of the application, isolation that limits what a flaw in one component can access, and the ability to deploy a fix everywhere simultaneously rather than waiting on individual teams.

The Dual-Use Reality

The same capabilities that let Cloudflare find bugs in its own code will accelerate attacks against every application on the internet. Chinese state-sponsored actors have already demonstrated that they’re willing to use frontier AI offensively — Anthropic documented the first AI-orchestrated cyber espionage attack late last year, attributed to a Chinese group that used Claude Code to infiltrate roughly thirty organizations. CrowdStrike’s CTO has noted that the window between vulnerability discovery and active exploitation, once measured in months, has collapsed to minutes with AI assistance.

What Cloudflare’s work demonstrates is that defenders can build systematic, scalable processes around these models — but only by treating them as infrastructure to be orchestrated rather than assistants to be prompted. The harness is the product. The model is just one component in it.
