It has long been said that AI automating AI research could be how humanity hits the singularity, and there are early signs that this could already be happening.
Andrej Karpathy, the former Director of AI at Tesla and co-founder of OpenAI, has released an open-source project called autoresearch — a framework that lets AI agents autonomously conduct machine learning experiments overnight, with no human in the loop. The project, published on GitHub and already approaching tens of thousands of stars, is described by Karpathy as “part code, part sci-fi, and a pinch of psychosis.”

What Is Autoresearch?
The concept is elegantly simple. Give an AI agent a real but small LLM training setup, point it at a training script, and let it run. The agent modifies the code, kicks off a training run lasting exactly five minutes, checks whether the validation metric improved, and then either keeps or discards the change — before doing it all again. You go to sleep. You wake up to over a hundred completed experiments and, hopefully, a better model.
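The loop is simple enough to sketch in a few lines. The snippet below is a toy, runnable stand-in, not the project's actual code: each "experiment" just perturbs one hyperparameter and scores it with a synthetic metric, whereas the real system edits train.py and launches a five-minute training run — but the keep-or-discard structure is the same.

```python
import random

def val_bpb(lr: float) -> float:
    """Synthetic stand-in for validation bits-per-byte: lower is better."""
    return 1.0 + (lr - 0.01) ** 2

def overnight(n_experiments: int, seed: int = 0):
    rng = random.Random(seed)
    best_lr = 0.1                      # the committed baseline "code"
    best_bpb = val_bpb(best_lr)
    kept = 0
    for _ in range(n_experiments):
        candidate = best_lr * rng.uniform(0.5, 1.5)  # agent's proposed change
        score = val_bpb(candidate)                   # "train for 5 minutes"
        if score < best_bpb:           # ratchet: keep only improvements...
            best_lr, best_bpb, kept = candidate, score, kept + 1
        # ...otherwise discard (the real system rolls back via git reset)
    return best_lr, best_bpb, kept

lr, bpb, kept = overnight(100)         # roughly an overnight run's worth
```

The point of the toy is the control flow: every trial is evaluated against the current best, and only strict improvements move the baseline.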
The repository is built on top of Karpathy’s nanochat LLM training core, stripped down to a single-file, single-GPU version of roughly 630 lines of code. Its intentional minimalism is part of the design — the entire codebase fits within the context window of modern LLMs, allowing the agent to have a full understanding of what it’s modifying at all times.
There are just three core files:
- prepare.py — handles dataset prep and runtime utilities (fixed, not touched by the agent)
- train.py — contains the full GPT model, optimizer, and training loop (the agent’s playground)
- program.md — a Markdown file written by the human, containing high-level instructions for the agent
This division of labor is central to the whole project. The human is no longer editing Python — they’re writing instructions in Markdown. The agent handles the tedious cycle of change-train-evaluate that consumes so much of a researcher’s time.
The Ratchet Mechanism
Every experiment is tracked using validation bits-per-byte (val_bpb) — a vocabulary-size-independent measure where lower is better. If a change improves the metric, it’s committed to a git feature branch and becomes the new baseline. If it doesn’t, the change is rolled back via git reset. This “ratcheting” mechanism ensures the committed baseline only ever improves over time, never regresses.
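In code, the ratchet amounts to a commit-or-reset decision after each run. (Bits-per-byte is conventionally the validation cross-entropy in bits normalized by bytes of raw text, which is why it doesn't depend on tokenizer vocabulary size.) The sketch below is illustrative, not taken from the repository: it drives git through subprocess against a throwaway repo and uses hard-coded metric values in place of a real training run.

```python
import pathlib
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo`, raising on failure."""
    subprocess.run(["git", "-C", str(repo), *args],
                   check=True, capture_output=True)

def ratchet(repo, train_py, new_code, old_bpb, new_bpb):
    """Keep the agent's edit if val_bpb improved, otherwise roll it back."""
    train_py.write_text(new_code)          # the agent's proposed edit
    if new_bpb < old_bpb:                  # lower bits-per-byte is better
        git(repo, "add", "-A")
        git(repo, "commit", "-m", f"val_bpb {old_bpb:.4f} -> {new_bpb:.4f}")
        return True                        # this edit is the new baseline
    git(repo, "reset", "--hard")           # discard the working-tree change
    return False

# Demo against a throwaway repository with hard-coded metric values:
repo = pathlib.Path(tempfile.mkdtemp())
git(repo, "init")
git(repo, "config", "user.email", "agent@example.com")
git(repo, "config", "user.name", "agent")
script = repo / "train.py"
script.write_text("lr = 0.1\n")
git(repo, "add", "-A")
git(repo, "commit", "-m", "baseline")

assert ratchet(repo, script, "lr = 0.05\n", 1.010, 1.004)     # improved: kept
assert not ratchet(repo, script, "lr = 0.9\n", 1.004, 1.200)  # worse: reset
```

After the failed experiment, `git reset --hard` restores train.py to the last committed version, so the next experiment always starts from the best-known state.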
The fixed five-minute training window is a deliberate and clever design choice. It makes every experiment directly comparable regardless of what the agent changes — model size, batch size, architecture, optimizer settings. At roughly 12 experiments per hour, a full overnight run can produce over 100 data points, each representing a distinct hypothesis that was tested and evaluated.
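Budgeting by wall-clock time rather than step count is what makes the comparison fair: a bigger model simply takes fewer optimizer steps inside its five minutes, and 3600 / 300 seconds yields the roughly 12 experiments per hour cited above. A minimal sketch of such a time-boxed loop (with the budget shrunk so the demo runs instantly, and step() standing in for one optimizer step):

```python
import time

def train_for(budget_seconds, step):
    """Run training steps until the wall-clock budget is spent."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget_seconds:
        step()        # one optimizer step in a real training loop
        steps += 1
    return steps

# A real run would use 5 * 60 seconds; shrunk here for an instant demo.
steps = train_for(0.05, step=lambda: None)
```

Because the stopping condition is time, not iteration count, any change the agent makes to model size or batch size is automatically evaluated under the same compute budget.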
The Results
The numbers speak for themselves. In Karpathy’s own runs on the bigger nanochat system using 8x H100 GPUs, the agent ran over 276 experiments across multiple days, keeping 29 improvements and pushing validation loss steadily downward. More impressively, Karpathy confirmed that the improvements discovered at the smaller model scale (depth 12) transferred cleanly to a depth-24 model — suggesting the agent was finding genuinely meaningful architectural and hyperparameter insights, not just overfitting to small-scale quirks.

The project’s README, written in Karpathy’s characteristically irreverent style, notes that the codebase is allegedly in its 10,205th generation, though he concedes this is impossible to verify given the system’s complexity. “The agents claim that we are now in the 10,205th generation of the code base, in any case no one could tell if that’s right or wrong as the ‘code’ is now a self-modifying binary that has grown beyond human comprehension,” he writes.
Shopify CEO Tries It Overnight — Gets a 19% Score Improvement
The community response was immediate. Shopify CEO Tobi Lutke adapted the framework overnight for an internal query-expansion model project called qmd. Before going to bed, he set the agent loose with instructions to optimize for quality and speed, pulling training data from an internal GitHub repo. When he woke up eight hours later, the agent had run 37 experiments and delivered a 19% improvement in validation score — on a 0.8B parameter model that now outperformed the previous 1.6B model it was meant to replace.
“I’m not a ML researcher of course,” Lutke wrote. “But it’s mesmerizing to just read it reasoning its way through the experiments. I learned more from that than months of following ML researchers.”
Karpathy’s response: “Who knew early singularity could be this fun?”
A New Division of Labor
What autoresearch really represents is a shift in the nature of the researcher’s job. As Karpathy has previously noted, the primary user of many systems is increasingly an LLM, not a human — and the design of those systems needs to reflect that. In autoresearch, the “product” is program.md: the cleaner and more precise the human’s instructions, the better the agent navigates the search space.
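The repository's actual program.md isn't reproduced here, but a file playing that role might look something like the following purely hypothetical sketch — high-level goals and constraints in plain Markdown, with the agent left to translate them into concrete edits:

```markdown
# Goal
Reduce val_bpb on the fixed 5-minute run without increasing parameter count.

# Constraints
- Only edit train.py; never touch prepare.py.
- One change per experiment; keep diffs small and reviewable.

# Directions worth exploring
- Learning-rate schedule tweaks (warmup length, decay shape).
- Attention/MLP width and depth trade-offs at a fixed budget.
```

The sharper these instructions, the more efficiently the agent spends its hundred-odd overnight experiments.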
This also connects to Karpathy’s broader thinking on agentic AI. He has historically been measured in his expectations for agents replacing human researchers wholesale, noting the many cognitive limitations that still need to be resolved. But autoresearch sidesteps many of those limitations by operating in a tightly constrained environment with one agent, one file, one metric, and one objective. Within those rails, the agent performs remarkably well.
As he has written before, the era he believes AI is entering is one defined by environments — spaces where an AI can take actions, see outcomes, and iterate. Autoresearch is arguably one of the cleanest examples of this philosophy made real: a closed environment, a measurable reward signal, and an agent left to run.
The Bigger Picture
There is something philosophically significant about a project where AI is being used to improve the very AI training code that produces better AI. It closes a loop that researchers have long theorized about. The fact that Karpathy — someone who has cautioned against hype around any single technique being the “full story” of AI progress — is the one building this is telling.
Autoresearch doesn’t claim to be the singularity. It’s a weekend project, deliberately minimal, deliberately scoped. But the trajectory it points toward — agents that iterate on their own training, accumulating improvements across hundreds of generations, in ways that eventually become hard for humans to fully interpret — is one worth paying close attention to.