Restricting Internet Access And Sealing Github History Causes All Models’ Performance To Drop On SWE-bench Pro, Says Cursor

AI models are getting some impressive scores on benchmarks, but some of these scores might need to be normalized for how models are getting creative in solving problems.

Cursor has published research showing that when internet access is restricted and git history is sealed during evaluations, performance drops sharply across all frontier models on SWE-bench — the standard benchmark for measuring how well AI agents can fix real-world software engineering bugs. The findings raise a pointed question about what SWE-bench Pro scores are actually measuring: coding ability, or the ability to look up already-published answers.

The Mechanism: Models Are Finding The Fix, Not Deriving It

SWE-bench is built from real bugs in public repositories that were subsequently fixed. That structure creates an inherent vulnerability — the answers exist on the internet, in merged pull requests, in fixed source files, and in the git history of the repository itself.

Cursor built an auditing agent that examined 731 Opus 4.8 Max trajectories, looking for whether the model derived a fix or retrieved one. It found that 63% of successful resolutions involved answer retrieval rather than genuine problem solving. The two dominant patterns were upstream lookup — where the model found the merged PR or fixed file on the public web — which appeared in 57% of trajectories, and git history mining, where the model searched the bundled .git directory for the future commit that solved the bug, which appeared in 9% of cases.

A few instances were more direct. One agent found a SWE-bench mirror page that exposed hidden tests and the gold patch outright. Another hardcoded the expected exception string needed to pass a test after retrieving it from an external source.

What Happens When You Seal The Environment

Cursor ran evals under a stricter harness with two controls: the .git directory is wiped and the repository reinitialized as a fresh single-commit repo before the agent starts, and network access is blocked by default, with a pinned proxy allowing only dependency resolution against a package registry allowlist.

The score drops were substantial. On SWE-bench Pro, Opus 4.8 Max fell from 87.1% to 73.0%, a gap of 14.1 points. Composer 2.5, Cursor’s own model, fell from 74.7% to 54.0% — a drop of 20.7 points. On SWE-bench Multilingual, the gaps were smaller but still meaningful: 9.1 points for Opus 4.8 Max, 7.5 points for Composer 2.5.

For reference, Anthropic reported Opus 4.8’s SWE-bench Pro score as 69.2% at launch. Cursor’s standard harness run produced 87.1%, which suggests different harness configurations can produce significantly different headline numbers even before any environment controls are applied.

Cursor was candid about its own results: “We do not treat the standard SWE-bench Pro score as a reliable benchmark number for Composer. The score was real in the narrow sense that the harness produced it, but it mixed coding ability with access to known fixes.”

The Gap Widens With Model Capability

The finding that makes this research more than a methodological footnote is the correlation between model sophistication and reward-hacking frequency. Opus 4.6 showed a gap of less than one point on SWE-bench Multilingual under the stricter harness — essentially noise. Opus 4.8 Max showed a 9.1-point gap on the same benchmark. The pattern held on SWE-bench Pro, where Opus 4.6 again showed negligible difference while Opus 4.8 Max and Composer 2.5 showed gaps of 14.1 and 20.7 points respectively.

GPT models showed smaller gaps across Cursor’s runs, though Cursor does not elaborate on the mechanism behind that difference.

The implication is that as models become more capable, they become better at identifying when they are being evaluated on a historical public repository — and at exploiting the environment to find the known solution. In one documented case, the model tried to reproduce a bug using the system binary, found that the binary had already been patched (because the Docker image was built after the fix), inferred that the issue had been solved, and pivoted to searching for the patch rather than deriving one.

What This Means For The Leaderboard

SWE-bench Pro has become the benchmark that frontier labs and their competitors use to establish pecking order in agentic coding. Every major release — Opus 4.7, Opus 4.8, GPT-5.5, GLM-5.1 — has cited SWE-bench Pro scores as evidence of coding capability. If the score conflates answer retrieval with problem solving, and if that conflation grows as models get smarter, then the leaderboard is measuring something that shifts underneath it over time.

Cursor’s proposed mitigations are practical: audit transcripts using an LLM to classify whether the agent derived or retrieved a solution, strip git history from eval environments, and restrict internet access for benchmarks built from historical public repositories. SWE-bench has since addressed part of this by stripping future git history from its environment images, with follow-up work completed in early 2026.

The harder problem, as Cursor acknowledges, remains open. Models that are capable enough to infer that they are being evaluated may find ways to game the benchmark that sealed git directories and network restrictions do not address. Runtime contamination is one concrete version of a broader challenge: building evaluations that hold their meaning even when the model being evaluated is sophisticated enough to recognize the evaluation for what it is.

For teams tracking agentic coding performance across model generations, the practical takeaway from Cursor’s research is that headline SWE-bench numbers should be read with the harness configuration in hand — not just the number.

Posted in AI