Someone Built An LLM To Test Demis Hassabis’ AGI Benchmark: Could A Model Trained On Pre-1900 Science Discover Relativity?

A month ago, Google DeepMind CEO Demis Hassabis proposed an interesting benchmark for AGI — whether an LLM trained only on data from before the 1900s could independently discover relativity. A researcher has now decided to put this idea to the test.

Michael Hla, an independent researcher, spent the past month building and training a language model exclusively on text published before 1900, then prompting it with the same experimental observations that led Einstein and Planck to their world-changing discoveries. The question at the heart of the project: could a model with no knowledge of quantum mechanics or relativity stumble upon those same ideas when confronted with the evidence that inspired them?

The answer, as it turns out, is a qualified and cautious “sort of.”

Why This Experiment Matters

The broader debate Hla is wading into is one that has shadowed AI progress for years. Critics of modern LLMs argue that despite their dazzling capabilities — solving Olympiad math, writing production-grade code, passing professional exams — these models are fundamentally performing sophisticated pattern matching on data they have already seen. They are, in the memorable phrase of researchers Emily Bender and Timnit Gebru, “stochastic parrots”: systems that produce plausible-sounding outputs without any genuine understanding of the world.

Hassabis’ proposed experiment offers a clean way to probe this critique. The late 19th century was a period of profound scientific tension. Newtonian mechanics, Maxwell’s electromagnetism, and Boltzmann’s thermodynamics had seemingly explained everything — until they didn’t. The discovery of X-rays, the failure of experiments to detect the luminiferous aether, and the so-called “ultraviolet catastrophe” — where classical theory predicted infinite energy emission from heated objects at short wavelengths — had exposed deep cracks in physics. It took genuine conceptual leaps from Einstein and Planck to resolve them. Could a machine make the same leaps?

Building a Model Frozen in Time

The first challenge was data. Hla sourced his pretraining corpus from three HuggingFace datasets: Institutional Books, British Library Books, and American Stories newspapers. Filtering content by publication date is straightforward in principle, but messier in practice — modern-day forewords, footnotes, and editorial additions can sneak post-1900 knowledge into otherwise historical documents.

To guard against this, Hla deployed aggressive decontamination. Any document mentioning “Einstein,” “quantum mechanics,” “relativity,” or related terms was discarded entirely. He also used a technique called prior filtering, developed by Seo et al., which evaluates documents based on the statistical distribution of their tokens rather than running a full forward pass on each one — a much more computationally efficient approach. Documents dominated by rare tokens (signaling OCR errors or non-English text) or highly repetitive tokens (signaling boilerplate) were filtered out. After all this cleaning, Hla was left with roughly 22 billion tokens of usable text.
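A minimal sketch of what this two-stage cleaning might look like, assuming a simple keyword blocklist and hand-picked thresholds — the actual pipeline, including Seo et al.’s prior filtering, is more statistically principled than this illustration:

```python
# Illustrative sketch of decontamination plus token-distribution filtering.
# The banned-term list and thresholds are hypothetical, not Hla's values.
from collections import Counter

BANNED_TERMS = {"einstein", "quantum mechanics", "relativity"}

def passes_decontamination(text: str) -> bool:
    """Discard any document mentioning post-1900 physics terms."""
    lowered = text.lower()
    return not any(term in lowered for term in BANNED_TERMS)

def passes_token_filter(tokens: list[str],
                        max_rare_ratio: float = 0.5,
                        max_top_ratio: float = 0.4) -> bool:
    """Keep documents whose token distribution looks like clean prose:
    neither dominated by one-off rare tokens (a sign of OCR noise or
    non-English text) nor by a single repeated token (boilerplate)."""
    if not tokens:
        return False
    counts = Counter(tokens)
    rare_ratio = sum(1 for t in tokens if counts[t] == 1) / len(tokens)
    top_ratio = counts.most_common(1)[0][1] / len(tokens)
    return rare_ratio <= max_rare_ratio and top_ratio <= max_top_ratio
```

The appeal of distribution-based checks is that they need only token counts, so the whole 22-billion-token corpus can be screened without running the model itself.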

He then forked nanochat, Andrej Karpathy’s lightweight PyTorch training framework, and trained a 3.3-billion-parameter model on this corpus. To sharpen its scientific reasoning, he also ran a midtraining phase on around 290 million tokens drawn from over 2,600 physics books, journals, and scientific treatises predating 1900 — including Maxwell’s Treatise on Electricity and Magnetism, Newton’s Opticks, and Faraday’s Experimental Researches in Electricity.

Getting the model to follow instructions was its own battle. With no modern post-training data available, Hla experimented with generating question-answer pairs from the pretraining corpus itself, using a modern LLM to frame excerpts as instructions while carefully filtering out any prompts that contained forward-looking or post-1900 hindsight. The final instruction tuning dataset amounted to around 30 million tokens across 53,000 instruction pairs.
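The hindsight filter for those generated pairs might look something like the sketch below — the term list, the date cutoff check, and the function name are all illustrative assumptions, since the write-up doesn’t detail the exact filtering rules:

```python
# Hypothetical filter for generated instruction pairs, dropping any that
# contain forward-looking or post-1900 hindsight. Term list is illustrative.
import re

ANACHRONISMS = re.compile(
    r"\b(relativity|quantum|photon|einstein)\b", re.IGNORECASE)

def is_period_safe(question: str, answer: str) -> bool:
    """Reject a pair if it names post-1900 physics concepts or
    references any year after 1900."""
    text = f"{question} {answer}"
    if ANACHRONISMS.search(text):
        return False
    years = [int(y) for y in re.findall(r"\b(1[89]\d{2}|20\d{2})\b", text)]
    return all(y <= 1900 for y in years)
```

Even a crude keyword screen like this catches the most obvious leaks; the harder cases are subtler anachronisms, such as a question framed around knowledge of an experiment’s eventual interpretation.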

Coaxing Reasoning Out of the Model

Post-training proved to be the most difficult phase. Standard reinforcement learning approaches either collapsed or failed to generalize on a model this small. The most effective method Hla landed on was to have a modern LLM extract a scientific insight from a historical physics excerpt and frame it as a contradiction, then use Claude Sonnet 4 as a judge to score the model’s responses. The reward function combined format quality, coherence, and correctness.
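In skeletal form, a judge-based reward of this shape might be mixed as follows — the weights and the `judge` callable are hypothetical stand-ins, as the write-up doesn’t give Hla’s exact formula:

```python
# Illustrative sketch of a judge-based reward combining format, coherence,
# and correctness. In practice `judge` would prompt a frontier LLM per
# criterion; here it is any callable returning a score in [0, 1].
from typing import Callable

def combined_reward(response: str,
                    judge: Callable[[str, str], float],
                    weights: tuple[float, float, float] = (0.2, 0.3, 0.5)) -> float:
    """Mix three judge scores into one scalar reward for RL training."""
    w_format, w_coherence, w_correct = weights
    return (w_format * judge(response, "format")
            + w_coherence * judge(response, "coherence")
            + w_correct * judge(response, "correctness"))
```

A separate coherence term matters here precisely because of the reward hacking described next: a response that lists every possible conclusion can score well on correctness alone, so incoherent hedging has to be penalized independently.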

Even this approach had its quirks. The model learned to hedge by listing all possible conclusions, apparently gaming the correctness score — a textbook case of reward hacking. The coherence penalty helped suppress this behavior, but Hla acknowledges it likely wasn’t eliminated entirely.

What the Model Actually Got Right

The evaluation focused on four landmark problems: the UV catastrophe, the photoelectric effect, special relativity, and general relativity. For each, the model was given experimental observations and a set of classical assumptions, and asked to identify which assumption must be wrong.

The results are genuinely surprising in places. When prompted with the results of the photoelectric effect, the model repeatedly concluded that light could not be continuous — that it must instead be composed of “disconnected parts” of “varying frequencies.” This is conceptually close to the quantum hypothesis: Planck’s insight that energy comes in discrete packets, which Einstein later extended to light itself in the work that won him the Nobel Prize. When given the general relativity elevator thought experiment, the model sometimes reasoned that gravity and acceleration are locally equivalent — the equivalence principle at the heart of Einstein’s general theory.

These conclusions held up even when the prompts were reworded, which Hla takes as at least partial evidence that the model isn’t simply latching onto specific phrasing. Removing the list of classical assumptions from the prompt produced similar outputs.

That said, the model also generated plenty of nonsense. It occasionally explained particle behavior in terms of steam engines and rivers — a colorful artifact of its Victorian-era training prose. Its grasp of concepts was inconsistent, and adding irrelevant classical assumptions to the prompt was usually enough to throw it off course.

The Cautious Conclusion

Hla is careful not to overclaim. His honest assessment is that the most likely explanation for the model’s apparent breakthroughs is sophisticated plausibility matching rather than genuine physical intuition. The evaluations themselves are far easier than the real scientific discovery process: the model is handed curated observations and asked to pick from a menu of assumptions, rather than designing experiments, filtering data, and formulating questions from scratch.

There is also the matter of scale. The model is small by modern standards, trained on a dataset that, despite Hla’s best efforts, is likely still noisier and less comprehensive than contemporary pretraining corpora. His own benchmark comparisons show the pre-1900 model scoring significantly below GPT-2 on standard evaluations.

Still, Hla stops short of dismissing the result entirely. The experiment does not prove that current AI approaches are insufficient for genuine scientific reasoning — only that this particular model, at this scale, with this data, falls short of a decisive demonstration. A frontier-scale model trained under the same constraints might tell a very different story.

The Bigger Picture

Beyond the specific findings, Hla’s project raises a question that feels increasingly urgent as AI systems take on more scientific work: what does it actually mean for a machine to understand something? His closing essay draws a distinction between the “intelligence of execution” that current models display and the slower, stranger, more self-directed intelligence that drove Einstein to spend years refining a thought experiment no one asked him to pursue.

Whether that distinction is fundamental or merely a matter of scale and architecture is, for now, an open question. Hla has open-sourced the models and datasets for other researchers to build on. The machine of miracles, as he calls it, is available for anyone willing to push it further.

Posted in AI