AI Can Write Code But Struggles To Maintain It, New SWE-CI Benchmark Reveals

The headlines have been hard to ignore. Microsoft says roughly 20–30% of code in its repos is now written by AI. Claude Code alone is responsible for approximately 4% of all public GitHub commits, processing over 135,000 commits per day. A quarter of Y Combinator-backed startups now report that more than 95% of their codebase is AI-generated. By any measure, AI has become a genuine coding force.

But a new research paper posted to arXiv in early March 2026 asks a harder question: can AI maintain code, not just write it? The answer, so far, is a qualified no — and the implications matter for every engineering team rushing to deploy AI coding agents at scale.

The Paper: SWE-CI

The paper, titled SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration, comes from researchers at Sun Yat-sen University and Alibaba Group. The lead authors are Jialong Chen and Xander Xu (equal contributors), along with Hu Wei, Chuan Chen, and Bing Zhao. The work was done in part during an internship at Alibaba Group and is currently under peer review.

The Problem with “Snapshot” Benchmarks

To understand why SWE-CI matters, you first need to understand how existing AI coding benchmarks work — and what they’re missing.

Most of the benchmarks that have driven AI coding progress — HumanEval, MBPP, LiveCodeBench, and the influential SWE-bench — follow what the paper calls a snapshot-based paradigm. The AI agent receives a well-defined problem, produces a solution, and passes or fails a test suite. Done.

This is useful for measuring functional correctness, but it misses something fundamental about how real software actually works. As the paper puts it, an agent that hard-codes a brittle quick fix and one that writes clean, extensible code may both pass the same test suite. The difference only surfaces when the codebase needs to evolve: new requirements come in, interfaces change, modules get extended. At that point, the cost of earlier design shortcuts compounds with every successive change.

In other words, maintainability can only be measured across time — not in a single snapshot.

What SWE-CI Actually Tests

SWE-CI flips the paradigm. Instead of giving AI a fixed problem to solve once, it asks the agent to iteratively evolve a real-world codebase toward a “target” state — simulating the kind of continuous, incremental development that characterises professional software teams.
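The contrast between the two paradigms can be sketched in a few lines. This is an illustrative mock-up, not SWE-CI's actual harness: the function names, the `agent.solve` interface, and the `run_tests` callback are all assumptions made for the sake of the example.

```python
# Hypothetical sketch contrasting snapshot-style and SWE-CI-style evaluation.
# All names (snapshot_eval, continuous_eval, agent.solve, run_tests) are
# illustrative assumptions, not the paper's actual API.

def snapshot_eval(agent, problem, run_tests):
    """Snapshot paradigm: one fixed problem, one patch, one pass/fail verdict."""
    agent.solve(problem)
    return run_tests()

def continuous_eval(agent, requirements, run_tests):
    """SWE-CI-style loop: the agent evolves a single live codebase through a
    sequence of requirements, with the full suite re-run after every step.
    Each change builds on the agent's own earlier edits, so design shortcuts
    taken early can surface as failures later."""
    results = []
    for req in requirements:
        agent.solve(req)              # agent edits the same evolving codebase
        results.append(run_tests())   # record per-iteration pass/fail
    return results
```

The key difference is that `continuous_eval` returns a *history* of outcomes rather than a single verdict, which is what makes time-dependent properties like maintainability measurable at all.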

The benchmark comprises 100 tasks, each drawn from a real GitHub repository. The key statistics are striking:

  • Each task spans an average of 233 days of real development history.
  • Each covers an average of 71 consecutive commits.
  • Every base-to-oracle transition involves at least 500 lines of modified source code (excluding test files).
  • The 100 samples are drawn from 68 distinct repositories.

The data curation process was rigorous. Starting from a search across all Python repositories on GitHub, the team applied a set of filters: at least three years of active maintenance, 500+ stars, configuration and dependency files, a unit test suite, and a permissive licence. That left 4,923 qualifying repositories. From there, the pipeline performed commit-span extraction, automated environment construction (including a self-repair mechanism for missing dependencies), and multiple rounds of case filtering, with the final 100 tasks chosen by ranking candidates on time span and number of intervening commits.
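The initial repository filter is simple enough to express directly. The sketch below mirrors the criteria the paper describes, but the metadata schema (`years_active`, `stars`, and so on) and the specific licence list are assumptions for illustration, not the authors' code.

```python
# Illustrative filter mirroring SWE-CI's stated repository-selection criteria.
# The dict schema and the PERMISSIVE licence set are assumptions, not the
# authors' actual pipeline.

PERMISSIVE = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause"}

def qualifies(repo: dict) -> bool:
    """Return True if a crawled repository passes the initial filters."""
    return (
        repo["years_active"] >= 3          # at least 3 years of maintenance
        and repo["stars"] >= 500           # popularity threshold
        and repo["has_dependency_files"]   # configuration + dependency files
        and repo["has_unit_tests"]         # a unit test suite exists
        and repo["license"] in PERMISSIVE  # permissive licence only
    )
```

In the paper's pipeline this coarse filter is only the first stage; the harder work is the automated environment construction and case filtering that follow.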

What the Experiments Found

The researchers evaluated 18 models from 8 providers, consuming over 10 billion tokens in total. Three key observations emerged.

1. AI coding capabilities are advancing fast. Within provider families, newer models consistently outperformed older ones. Models released after 2026 showed especially large gains. Among all evaluated models, the Claude Opus series took the top spot overall, with GLM-5 also standing out. The trend suggests that code capabilities are genuinely evolving toward longer-horizon tasks — not just static bug-fixing.

2. Providers differ meaningfully on short-term vs. long-term trade-offs. When the researchers varied the weighting parameter γ to shift scoring emphasis toward either early or late iterations, model rankings shifted in telling ways. MiniMax, DeepSeek, and GPT models all leaned toward long-term gains. Kimi and GLM leaned toward short-term returns. Claude and Qwen were notably stable across settings — suggesting their training pipelines are relatively consistent in how they balance these trade-offs.
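One plausible way such a γ-weighted score works is as a discounted mean over per-iteration scores. To be clear, this exact formula is my assumption for illustration; the paper's actual aggregation may differ.

```python
# One plausible gamma-weighted aggregation over per-iteration scores.
# This specific formula is an assumption, not necessarily the paper's.

def weighted_score(per_iteration_scores, gamma: float) -> float:
    """Discounted mean of iteration scores.

    gamma < 1 weights early iterations more heavily (short-term emphasis);
    gamma > 1 weights late iterations more heavily (long-term emphasis);
    gamma == 1 reduces to the plain arithmetic mean.
    """
    weights = [gamma ** t for t in range(len(per_iteration_scores))]
    total = sum(w * s for w, s in zip(weights, per_iteration_scores))
    return total / sum(weights)
```

Under this formulation, a model that front-loads quick wins looks better at low γ, while one whose careful early refactoring pays off later looks better at high γ — exactly the trade-off the rankings expose.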

3. Regression control remains a serious weakness across the board. This is the finding that should give engineering leaders pause. The paper tracks the zero-regression rate: the proportion of tasks where the agent completes the entire maintenance process without ever breaking a previously passing test. Most models scored below 0.25. Only two models — Claude-opus-4.5 (0.51) and Claude-opus-4.6 (0.76) — exceeded 0.5. Every other evaluated model failed to avoid regressions more than 75% of the time.

To put it plainly: when these agents modify a codebase over multiple iterations, they routinely break things that were already working.
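The zero-regression metric itself is straightforward to compute from a task's test history. The sketch below implements the definition as the paper describes it — a task counts only if no previously passing test ever fails in a later iteration — though the data layout (per-iteration sets of passing test names) is an assumption.

```python
# Sketch of the zero-regression rate as described in the paper: the fraction
# of tasks where no previously passing test ever fails later. The input
# layout (per-iteration sets of passing test names) is assumed.

def zero_regression_rate(task_histories) -> float:
    """task_histories: one list per task, each a sequence of sets naming
    the tests that passed after that iteration."""
    def no_regression(history) -> bool:
        seen_passing = set()
        for passing in history:
            if seen_passing - passing:   # a once-passing test now fails
                return False
            seen_passing |= passing
        return True

    return sum(no_regression(h) for h in task_histories) / len(task_histories)
```

Note how strict the metric is: a single broken test in a single iteration disqualifies the whole task, which is why scores below 0.25 are so damning for multi-month maintenance horizons.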

The Bigger Picture

The AI coding wave has genuinely changed what’s possible. Recent data shows the number of new websites growing roughly 40% year-on-year, and GitHub code pushes in the US growing about 35% year-on-year in 2026 — figures that align with the widespread adoption of AI coding assistants. Tools like Cursor, Replit, GitHub Copilot, and Devin have made AI-assisted development a default for many teams.

But the SWE-CI paper is a useful corrective to the most optimistic readings of this moment. Writing code and maintaining code are different skills — and the benchmark data suggests that current LLMs are considerably better at the former than the latter. The gap isn’t small: most models fail to avoid regressions in three out of four maintenance tasks.

The authors are careful not to be purely negative. The trend line is clearly moving in the right direction — newer models genuinely outperform older ones on long-horizon maintenance tasks, and the Claude Opus series shows that high zero-regression rates are achievable. But the gap between the best performers and the rest is large, and the best performers are still far from reliable.

For engineering teams deploying AI coding agents: the productivity gains are real. But so is the maintenance risk. SWE-CI gives the industry a way to measure the problem. Whether teams act on that measurement is another question.

Posted in AI