Anthropic Says That 16 Instances Of Claude Opus 4.6 Working In Parallel Autonomously Built A C Compiler In 2 Weeks

Cursor said last month that it had built a web browser autonomously using AI agents alone. Anthropic now appears to have gone one better.

Anthropic has announced that it tasked 16 parallel instances of its Claude Opus 4.6 model with building a C compiler from scratch, and then largely stepped away. Over nearly 2,000 automated sessions spanning two weeks, the AI agents produced a 100,000-line Rust-based compiler capable of building the Linux kernel on multiple architectures, all at a cost of roughly $20,000 in API usage.

The experiment, detailed by Anthropic researcher Nicholas Carlini, represents a significant milestone in autonomous software development. Unlike previous AI coding assistants that require constant human supervision, Carlini’s “agent teams” approach allowed multiple Claude instances to work independently on a shared codebase, coordinating through Git-based task locking and merge conflict resolution.

The Challenge of Building a Compiler

To understand the magnitude of this achievement, it’s important to grasp what building a C compiler entails. A compiler is one of the most complex types of software, translating human-readable code into machine instructions that processors can execute. The process requires parsing source code, building abstract syntax trees, performing semantic analysis, optimizing intermediate representations, and generating correct assembly for multiple processor architectures.
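To make those stages concrete, here is a toy sketch in Rust, the language the agents wrote their compiler in, that pushes a single arithmetic expression through the classic pipeline: tokenize the text, parse it into a syntax tree, and emit x86-64 assembly. It skips semantic analysis and optimization entirely, and every name in it is invented for illustration rather than taken from Anthropic’s compiler.

```rust
// Toy pipeline: source text -> tokens -> AST -> x86-64 assembly (AT&T syntax).
// Nothing here is drawn from Anthropic's compiler; it only illustrates the stages.

#[derive(PartialEq)]
enum Token { Num(i64), Plus, Star }

// Lexing: turn raw characters into tokens.
fn tokenize(src: &str) -> Vec<Token> {
    let mut toks = Vec::new();
    let mut chars = src.chars().peekable();
    while let Some(&c) = chars.peek() {
        match c {
            ' ' => { chars.next(); }
            '+' => { chars.next(); toks.push(Token::Plus); }
            '*' => { chars.next(); toks.push(Token::Star); }
            '0'..='9' => {
                let mut n = 0i64;
                while let Some(d) = chars.peek().and_then(|c| c.to_digit(10)) {
                    n = n * 10 + d as i64;
                    chars.next();
                }
                toks.push(Token::Num(n));
            }
            _ => panic!("unexpected character: {c}"),
        }
    }
    toks
}

// Abstract syntax tree for integer expressions.
enum Expr { Num(i64), Add(Box<Expr>, Box<Expr>), Mul(Box<Expr>, Box<Expr>) }

// Recursive-descent parser: expr := term ('+' term)*, term := num ('*' num)*.
fn parse(toks: &[Token]) -> Expr {
    let (e, rest) = parse_expr(toks);
    assert!(rest.is_empty(), "trailing tokens");
    e
}

fn parse_expr(toks: &[Token]) -> (Expr, &[Token]) {
    let (mut lhs, mut rest) = parse_term(toks);
    while rest.first() == Some(&Token::Plus) {
        let (rhs, r) = parse_term(&rest[1..]);
        lhs = Expr::Add(Box::new(lhs), Box::new(rhs));
        rest = r;
    }
    (lhs, rest)
}

fn parse_term(toks: &[Token]) -> (Expr, &[Token]) {
    let (mut lhs, mut rest) = parse_num(toks);
    while rest.first() == Some(&Token::Star) {
        let (rhs, r) = parse_num(&rest[1..]);
        lhs = Expr::Mul(Box::new(lhs), Box::new(rhs));
        rest = r;
    }
    (lhs, rest)
}

fn parse_num(toks: &[Token]) -> (Expr, &[Token]) {
    match toks {
        [Token::Num(n), rest @ ..] => (Expr::Num(*n), rest),
        _ => panic!("expected a number"),
    }
}

// Code generation: evaluate the tree on the machine stack, result ends up in %rax.
fn emit_x86(e: &Expr, out: &mut String) {
    match e {
        Expr::Num(n) => out.push_str(&format!("    mov ${n}, %rax\n")),
        Expr::Add(a, b) | Expr::Mul(a, b) => {
            emit_x86(a, out);
            out.push_str("    push %rax\n");
            emit_x86(b, out);
            out.push_str("    pop %rcx\n");
            let op = if matches!(e, Expr::Add(..)) { "add" } else { "imul" };
            out.push_str(&format!("    {op} %rcx, %rax\n"));
        }
    }
}

fn main() {
    let ast = parse(&tokenize("1 + 2 * 3"));
    let mut asm = String::from(".globl main\nmain:\n");
    emit_x86(&ast, &mut asm);
    asm.push_str("    ret\n");
    print!("{asm}");
}
```

A real C compiler layers preprocessing, type checking, optimization passes, and multiple backends on top of this skeleton, which is what makes producing one autonomously notable.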

Anthropic’s compiler supports x86, ARM, and RISC-V architectures and can successfully build major software projects including the Linux kernel, QEMU, FFmpeg, SQLite, PostgreSQL, and Redis. It achieves a 99% pass rate on the GCC torture test suite—a notorious collection of edge cases designed to break compilers. The compiler even passes what Carlini calls “the developer’s ultimate litmus test: it can compile and run Doom.”

This wasn’t a trivial task handed to a powerful AI. Previous versions of Claude struggled with the challenge. Opus 4.5 could produce a functional compiler but failed when compiling real-world projects. Only with Opus 4.6 did the model cross the threshold needed to tackle production-scale software.

How Agent Teams Work

Carlini’s approach represents a departure from conventional AI coding tools. Rather than requiring a developer to remain online and provide continuous guidance, he built a simple loop that automatically restarts Claude whenever it completes a task. Each agent runs in its own Docker container with access to a shared Git repository, claims tasks by creating lock files, and pushes its work for other agents to integrate.
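Carlini has not published the harness itself, but the shape he describes (a loop that relaunches the agent in a fresh container whenever the previous session exits) is easy to sketch. In the Rust sketch below, the image name, the mounted repository path, and the in-container command are placeholders, not details of Anthropic’s setup.

```rust
use std::process::Command;
use std::{thread, time::Duration};

fn main() {
    // Minimal "agent team" driver: keep one Claude instance busy by restarting it in a
    // fresh Docker container whenever the previous session exits. The image name, volume
    // path, and in-container command are placeholders, not Anthropic's actual setup.
    loop {
        let status = Command::new("docker")
            .args([
                "run", "--rm",
                "-v", "/srv/compiler-repo:/work",        // shared Git checkout (placeholder path)
                "compiler-agent:latest",                 // placeholder image with the agent CLI installed
                "run-agent", "--task-source", "/work/TASKS.md", // placeholder command and flag
            ])
            .status()
            .expect("failed to launch docker");

        if !status.success() {
            eprintln!("agent session exited with {status}; restarting anyway");
        }
        // Brief pause, then hand the next session a fresh container and context window.
        thread::sleep(Duration::from_secs(5));
    }
}
```

Sixteen copies of that loop, one per agent, run side by side; each container sees the same repository, and nothing above them schedules the work.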

The system has no orchestration layer—no master agent directing traffic. Instead, each Claude instance independently identifies problems to solve, typically picking up “the next most obvious” task. When multiple agents inevitably produce conflicting changes, they resolve merge conflicts on their own.
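One plausible way to implement that kind of leaderless claiming, consistent with the lock files and Git pushes described above but not taken from Carlini’s write-up, is to let a rejected push decide who won the race. The locks/ directory layout, branch name, and commit messages below are assumptions.

```rust
use std::fs;
use std::process::Command;

/// Try to claim `task` by committing a lock file and pushing it. If another agent's
/// claim reached the remote first, the push is rejected and this claim is abandoned.
/// The locks/ layout, branch name, and messages are assumptions for illustration.
fn try_claim(task: &str, agent_id: &str) -> std::io::Result<bool> {
    let lock_path = format!("locks/{task}.lock");
    if fs::metadata(&lock_path).is_ok() {
        return Ok(false); // a claim already exists in the local checkout
    }
    fs::create_dir_all("locks")?;
    fs::write(&lock_path, agent_id)?;

    let msg = format!("claim {task} ({agent_id})");
    if !(git(&["add", lock_path.as_str()])? && git(&["commit", "-m", msg.as_str()])?) {
        return Ok(false);
    }
    // The push is the real arbiter: Git rejects it if someone else's claim landed first.
    if git(&["push", "origin", "HEAD"])? {
        Ok(true)
    } else {
        // Lost the race: drop the local claim and resync before trying another task.
        git(&["reset", "--hard", "origin/main"])?;
        Ok(false)
    }
}

fn git(args: &[&str]) -> std::io::Result<bool> {
    Ok(Command::new("git").args(args).status()?.success())
}

fn main() {
    // Example: a hypothetical "agent-07" tries to claim work on the preprocessor.
    let claimed = try_claim("preprocessor", "agent-07").unwrap_or(false);
    println!("claimed: {claimed}");
}
```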

The key to making this work wasn’t the scaffolding but the testing infrastructure. Carlini spent considerable effort designing test suites that could guide Claude without human intervention. He learned to minimize context window pollution by limiting verbose output, created fast test modes that sample just 1-10% of cases to prevent agents from wasting hours, and developed clever strategies like using GCC as an “oracle” to enable parallel debugging of the Linux kernel.
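Using GCC as an “oracle” is a form of differential testing: compile the same case with both GCC and the new compiler, run both binaries, and flag any divergence, so an agent can localize its own miscompilations without a human in the loop. The sketch below combines that idea with a sampled fast mode; the tests/ directory, the --fast flag, and the mycc binary name are all hypothetical.

```rust
use std::process::Command;

/// Differential test: build `src` with GCC (the oracle) and with the agent-built
/// compiler (called `mycc` here as a placeholder), run both, and compare results.
fn differs_from_oracle(src: &str) -> bool {
    compile("gcc", src, "/tmp/ref_bin");
    compile("mycc", src, "/tmp/test_bin");
    run("/tmp/ref_bin") != run("/tmp/test_bin")
}

fn compile(cc: &str, src: &str, out: &str) {
    let ok = Command::new(cc)
        .args(["-O0", src, "-o", out])
        .status()
        .map(|s| s.success())
        .unwrap_or(false);
    assert!(ok, "{cc} failed to compile {src}");
}

// A test's observable behavior: exit code plus whatever it printed.
fn run(bin: &str) -> (Option<i32>, Vec<u8>) {
    let out = Command::new(bin).output().expect("failed to run test binary");
    (out.status.code(), out.stdout)
}

fn main() {
    let all_tests: Vec<String> = std::fs::read_dir("tests")
        .expect("no tests/ directory")
        .filter_map(|e| e.ok().map(|e| e.path().display().to_string()))
        .filter(|p| p.ends_with(".c"))
        .collect();

    // "Fast mode": deterministically sample ~5% of cases so an agent gets a signal in
    // seconds instead of hours; full runs still execute every test.
    let fast = std::env::args().any(|a| a == "--fast");
    for (i, test) in all_tests.iter().enumerate() {
        if fast && i % 20 != 0 {
            continue;
        }
        if differs_from_oracle(test) {
            println!("MISMATCH: {test}");
        }
    }
}
```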

Limitations and Lessons

The resulting compiler, while impressive, isn’t production-ready. It lacks the 16-bit x86 support needed to boot Linux from real mode (it calls out to GCC for that step), doesn’t include its own assembler or linker, and generates less efficient code than GCC even with optimizations disabled. The Rust code quality, while reasonable, falls short of what expert developers would produce.

More tellingly, Carlini found that the compiler has “nearly reached the limits” of Opus’s abilities. New features frequently broke existing functionality, and despite significant effort, certain limitations proved insurmountable for the current model.

Yet the experiment succeeded in its primary goal: stress-testing what autonomous AI development can achieve today to prepare for what it will reliably deliver tomorrow.

Implications for Software Development

The success of agent teams points toward a fundamental shift in how software gets built. The traditional model—developer defines task, AI works for minutes, developer provides feedback—may soon give way to systems where teams of AI agents implement entire complex projects with minimal human oversight.

For businesses, the economics are striking. While $20,000 might seem expensive compared to consumer AI subscriptions, it’s a fraction of what hiring a development team would cost. The ability to spin up 16 specialized agents—some writing code, others maintaining documentation, improving performance, or ensuring quality—offers unprecedented leverage.

But Carlini himself expresses unease about the implications. As a former penetration tester, he worries about developers deploying software they’ve never personally verified. The risk of subtle bugs, security vulnerabilities, or unexpected behavior multiplies when humans aren’t intimately involved in the development process.

The trajectory is clear: each generation of language models expands what’s possible without human intervention. Tab completion gave way to function generation, which evolved into pair programming and now into autonomous project completion. As Carlini notes, “I did not expect this to be anywhere near possible so early in 2026.”

The software industry may need to develop new quality assurance frameworks, security review processes, and accountability structures for AI-generated code. The question is no longer whether AI agents can build complex software autonomously—Anthropic has demonstrated they can—but how the industry will adapt to a world where they routinely do.
