Not Running Out Of Data: Lots Of Video And Audio Not Yet Used To Train AI Models, Says Google’s Jeff Dean

There have been reports that AI companies are running out of data to train their models, but Google’s Jeff Dean believes such concerns might be overblown.

Speaking at Nvidia’s GTC 2026 conference on March 18, Jeff Dean — Chief Scientist at Google DeepMind and Google Research — pushed back against the prevailing narrative that the AI industry is hitting a data wall. His argument: the world is full of untapped data that models simply haven’t been trained on yet, and the industry has barely scratched the surface.


“I also kind of disagree with the assertion that we’re running out of data,” Dean said. “I feel like there’s an awful lot of data in the world that is not yet being used to train these models. We train on some video data, but I think there’s a lot more video data — with associated audio data — that we aren’t necessarily training on yet.”

He pointed to two particularly rich, largely unexploited sources: robotics and autonomous vehicles. “I think real-world robotics or autonomous vehicle data is going to be fairly plentiful,” he noted.

Dean also made the case for synthetic data as a viable third lever. When the interviewer asked whether training on data and then using that same data to generate synthetic data doesn’t just produce variations of the same thing, Dean conceded the point — but argued it still works: “It still seems to help the model sometimes, because if your model is very powerful — the one that’s generating the synthetic data — then it actually does seem to help quite a bit.”
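Dean’s synthetic-data point can be illustrated with a toy sketch: a stand-in “teacher” (a hypothetical placeholder for a powerful trained model, not anything Google has described) labels freshly sampled inputs, and a simpler student is fitted on those synthetic pairs alone. All names and functions below are illustrative.

```python
import random

def teacher(x):
    """Stand-in for a powerful trained model (hypothetical):
    here it is just a fixed linear function of the input."""
    return 2.0 * x + 1.0

def make_synthetic_dataset(n, rng):
    """Sample fresh inputs and let the teacher label them.
    The labels are 'variations' of what the teacher already
    knows, yet they can still transfer knowledge to a student."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, teacher(x)) for x in xs]

def fit_student(data):
    """Least-squares fit of a 1-D linear student y = a*x + b."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

rng = random.Random(42)
data = make_synthetic_dataset(200, rng)
a, b = fit_student(data)
# The student recovers the teacher's function from synthetic pairs alone.
```

In this toy setup the student recovers the teacher exactly because the teacher is noiseless; the real-world analogue of Dean’s caveat is that the synthetic data is only as good as the model generating it.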

He went further, drawing on techniques that proved effective for earlier generations of image models. “There are all kinds of techniques we’re not yet using that were in vogue for convolutional image models years ago — things like data augmentation. You can think of synthetic data as a little bit of that. Techniques to prevent overfitting are interesting. You can use dropout or distillation as ways of regularising the model.”
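The two techniques Dean names, data augmentation and dropout, can be sketched minimally in Python. These are illustrative toys, not the implementations used in production image models: augmentation creates label-preserving variants of an input, and inverted dropout randomly zeroes activations while rescaling survivors so the expected activation is unchanged.

```python
import random

def horizontal_flip(image):
    """Augmentation: mirror each row of a 2-D image (list of lists).
    The flipped copy is a 'new' training example with the same label."""
    return [row[::-1] for row in image]

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each unit with probability p and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged."""
    rng = rng or random.Random()
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

# One image yields two training examples via augmentation.
image = [[1, 2, 3],
         [4, 5, 6]]
augmented = [image, horizontal_flip(image)]

# At training time, dropout randomly masks units on each forward pass.
masked = dropout([1.0, 1.0, 1.0, 1.0], p=0.5, rng=random.Random(0))
```

Augmentation multiplies the effective dataset without new collection, which is why Dean describes synthetic data as “a little bit of that”; dropout and distillation instead regularise the model so extra passes over the same data don’t overfit.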

The upshot, in Dean’s view, is that compute — not data — may be the actual constraint: “I think there’s a lot of opportunity to really make models better with more compute and more passes over the data, where you won’t necessarily overfit.”


Dean’s comments land in the middle of a live debate. Prominent voices including former OpenAI Chief Scientist Ilya Sutskever and Elon Musk have argued that training data is running out, with Sutskever famously calling data “the fossil fuel of AI.” But a counter-narrative is building. Cohere’s CEO Aidan Gomez has said that an overwhelming majority of the data his company now uses to train new models is synthetic. Andrej Karpathy has argued that books and other existing content aren’t endpoints but prompts for synthetic data generation. And Google itself said after its Gemini 3 release that pre-training is still yielding gains — with the implication that it may be tapping proprietary data other labs don’t have access to.

That proprietary angle is where this gets interesting for Google. The company sits on data reserves that are simply unavailable to pure-play AI companies like OpenAI, Anthropic, or Mistral. YouTube alone hosts hundreds of millions of hours of video and audio — precisely the kind of multimodal data Dean identified as undertapped. Google Maps, Google Search, Gmail, and Google’s decades-old crawl of the web represent a data moat that would be impossible to replicate. Its Gemini Robotics programme gives it a growing pipeline of real-world physical data too. If Dean is right — that the bottleneck is not data availability but the willingness and ability to use it — then Google’s extraordinary breadth of data sources could prove a decisive advantage as the AI race intensifies. Challengers can match Google on compute and talent, but replicating 25 years of data collection across the world’s most-used digital products is another matter entirely.
