We're Running Out Of Training Data, But Not Too Worried Because There Are Alternate Approaches: Google's Jeff Dean

AI researchers are running out of conventional training data for their models, but that need not be an impediment to AI progress in the coming years.

Jeff Dean, Chief Scientist at Google DeepMind and Google Research, is among those who believe the concern is overblown. Speaking on a podcast, Dean pushed back against the prevailing narrative that the AI industry is heading toward a hard data wall — and made the case that there are several untapped levers the industry has yet to fully pull.

“I think everyone has this view that we’re running out of training data. And it’s true, we’ve used quite a lot of public text data in the world,” Dean acknowledged. But he was quick to pivot to opportunity: “I think there’s lots of interesting video data that we’re not really training on yet.”

That observation points to a significant gap. The major labs have already harvested most of the internet’s text, but video, which exists in vast quantities on platforms like Google-owned YouTube, remains largely untapped as a training resource. Dean’s point implies that the data ceiling may be much higher than the current pessimism suggests.

On the question of generating entirely new data, Dean sees significant room for manoeuvre: “There’s lots of interesting ways to generate synthetic data and then use that for training.”

He also outlined a third path — one that doesn’t require new data at all, but rather extracts more from what already exists: “I also think we can start doing things like making more passes over the data that we do have to make more capable models, and also come up with algorithmic techniques that enable us to get a lot more information from every piece of data that we do have.”

Taken together, Dean’s three levers (untapped video data, synthetic data generation, and extracting more from existing data through repeated passes and better algorithms) form a fairly confident counterargument to the data scarcity narrative. “I’m not too worried about that as an impediment to making progress,” he said. “It seems like there’s lots and lots of things we can do.”

Dean’s remarks land in the middle of a live and consequential debate. The pessimist camp has been vocal: Ilya Sutskever, former Chief Scientist at OpenAI and co-founder of Safe Superintelligence Inc., famously called data “the fossil fuel of AI” and has argued that pre-training, as the industry knows it, will eventually end. Elon Musk has made similar claims. Their core argument: the internet is finite, and frontier labs have already consumed most of it.

But Google has been building a counter-narrative through both words and results. When Gemini 3 topped nearly every major benchmark, Gemini co-lead Oriol Vinyals credited improvements in both pre-training and post-training — pointedly contradicting the idea that pre-training has hit a wall. Google’s advantage here may be structural: the company sits on proprietary data reserves, including decades of video, audio, and web data, that pure-play AI labs simply don’t have access to. Dean’s vision for future pre-training goes further still — models that don’t just absorb data passively, but take actions in the world during training and actively select what they learn from next.

The synthetic data angle is gaining traction across the industry, not just at Google. Cohere CEO Aidan Gomez has said that an overwhelming majority of the data his company uses to train new models is now synthetic. Andrej Karpathy has argued that books and other existing content shouldn’t be seen as endpoints, but as prompts for generating further synthetic data. What was once a supplementary technique is fast becoming a primary one.

If Dean is right, the data wall is less a hard ceiling and more a prompt to get creative. The labs that figure out how to squeeze more signal from existing data — or generate high-quality new data algorithmically — may find that the runway is considerably longer than the pessimists fear.