Claude Mythos stunned the AI world when it identified security vulnerabilities in browsers and operating systems and surfaced decades-old bugs. It turns out, however, that much smaller and cheaper models can find the same issues.
That is the central finding of a detailed post-mortem by AISLE, an AI security firm that has been running an autonomous vulnerability discovery and patching system against live codebases since mid-2025. AISLE’s founder Stanislav Fort took the specific vulnerabilities that Anthropic showcased in its Project Glasswing announcement — the FreeBSD NFS remote code execution bug and the 27-year-old OpenBSD TCP SACK vulnerability — isolated the relevant code, and ran it through a range of small, cheap, and open-weights models. The results complicated Anthropic’s narrative considerably.
Eight For Eight On The Flagship Bug
The FreeBSD NFS vulnerability — described by Anthropic as a 17-year-old zero-day enabling unauthenticated root access — was detected by every single model AISLE tested. All eight, including a model with just 3.6 billion active parameters costing $0.11 per million tokens, correctly identified the stack buffer overflow, computed the available buffer space, and flagged it as critical with remote code execution potential.
The smallest model tested — GPT-OSS-20b with 3.6 billion active parameters — found the same overflow that Mythos found. So did Kimi K2, DeepSeek R1, Qwen3 32B, and Gemma 4 31B. Kimi K2 and DeepSeek R1 are fully open-weights models. The detection of this bug, AISLE concludes, is “commoditized.” You do not need restricted access to a frontier model priced at multiples of Opus 4.6’s rates to see a straightforward stack buffer overflow in network-facing code.
When AISLE pushed further — asking models to reason about exploitability given specific details about FreeBSD’s mitigation landscape — the results held up. Every model correctly identified that the int32_t array structure meant no stack canary under the default compiler flags, that disabled KASLR meant fixed gadget addresses, and that ROP was the right exploitation technique. GPT-OSS-120b (5.1 billion active parameters) produced a gadget sequence closely matching the actual published exploit. Kimi K2 independently flagged that the vulnerability was wormable — a detail Anthropic’s own announcement did not highlight.

The OpenBSD Bug: Harder, But Still Accessible
The 27-year-old OpenBSD TCP SACK vulnerability is genuinely more difficult. It requires reasoning about missing lower-bound validation, understanding that SEQ_LT/SEQ_GT macros overflow when values are roughly 2^31 apart, and chaining that to a NULL pointer dereference. This is the kind of multi-step mathematical reasoning that separates models sharply.
GPT-OSS-120b — again, 5.1 billion active parameters — recovered the full public chain in a single zero-shot API call and proposed the correct mitigation, which matches the actual OpenBSD patch. Kimi K2 recovered a partial chain. DeepSeek R1 identified the NULL dereference but dismissed the signed overflow. Qwen3 32B, having just scored a perfect severity assessment on the FreeBSD bug, confidently declared the OpenBSD code “robust to such scenarios.”
That last detail is AISLE’s point: there is no stable best model for cybersecurity. Rankings reshuffle completely across tasks. A model that aces one test fails the next. The capability frontier is, in their framing, genuinely jagged.
Where The Gap Is Real
AISLE is careful not to overclaim. The tests gave models the vulnerable function directly, often with contextual hints. A real autonomous discovery pipeline starts from a full codebase with no guidance. Their experiments measure what happens once a good targeting system has narrowed the search — which, they note, is exactly what both AISLE’s and Anthropic’s own scaffolds do.
The area where Mythos appears to have a genuine edge is in novel constrained-delivery mechanisms. The actual FreeBSD exploit faced a payload problem: a full ROP chain exceeds 1000 bytes, but the overflow provides only ~304 bytes of controlled data. Mythos solved this by splitting the exploit across 15 separate RPC requests, each writing 32 bytes to kernel BSS memory — treating the vulnerability as a reusable write primitive. None of the models AISLE tested arrived at that specific solution independently.
But they found alternatives. DeepSeek R1 argued that 304 bytes was sufficient for a minimal privilege escalation chain via prepare_kernel_cred(0)/commit_creds, returning to userland before writing any file. Gemini Flash Lite proposed a stack-pivot to the credential buffer already in kernel heap memory. These are different creative solutions to the same engineering constraint, not failures.
The False Positive Problem Cuts The Other Way
AISLE also ran a basic false-positive discrimination test — a Java servlet that looks like textbook SQL injection but is not, because a list operation discards the user input before it reaches the query. This is an OWASP benchmark task that a junior security analyst should handle without difficulty.
The results were close to inverse scaling. Small, cheap models outperformed large frontier ones. DeepSeek R1 correctly traced the data flow across all four trials. GPT-OSS-20b at $0.11 per million tokens got it right. OpenAI’s o3 gave the ideal nuanced answer. Of 13 Anthropic models tested, only Opus 4.6 passed cleanly — every model through Opus 4.5 confidently mistraced the list and called it a critical SQL injection vulnerability.
This matters practically. A security tool that cannot distinguish real vulnerabilities from false positives drowns reviewers in noise. AISLE notes that false positive overload was precisely what killed curl’s bug bounty program. The ability to discriminate is not a minor feature; it is a precondition for production use at scale.
The Bigger Argument
AISLE’s core thesis is that the moat in AI cybersecurity is the system, not the model. Discovery-grade capability — finding that a function has a buffer overflow, computing the math, assessing severity — is broadly accessible with current models, including cheap open-weights alternatives that cost a fraction of frontier API prices. The value in a production system lies in the targeting, the iterative deepening, the triage, the patch generation, and the maintainer trust built over time.
AISLE makes a practical point about this: because cheap models are sufficient for much of the detection work, you can deploy them broadly rather than rationing a single expensive model and hoping it looks in the right places. Coverage at low cost-per-token, wrapped in expert orchestration, can outperform concentrated intelligence that has to guess where to look.
This doesn’t undermine what Anthropic set out to do with Project Glasswing. AISLE is explicit that Mythos validates the category, raises awareness, and brings real resources to open-source security. The concern is with the framing that this work requires restricted access to a specific frontier model — a framing that, if taken literally, could discourage organizations from building AI security tooling today and concentrate a critical defensive capability behind a single API. The models needed to get started are already widely available. The bottleneck is everything else.