There are growing concerns that pre-training on text is hitting a wall in AI model development, and there may be structural reasons why.
Recently, Yann LeCun, Chief AI Scientist at Meta and a Turing Award winner, offered a compelling argument for why text data alone is insufficient for achieving human-level AI. His back-of-the-envelope calculation, comparing the data ingested by large language models (LLMs) with the sensory input a child receives in their formative years, highlights a fundamental limitation of current approaches: the sheer volume of visual data a child processes suggests that models will need something closer to that experience to reach true artificial intelligence.

“Let me give you a very simple calculation,” LeCun says. “A typical large language model is trained with something on the order of 20 trillion tokens— 20 thousand billion tokens. The token is like a word, more or less. The token typically is represented in three bytes. So 20 or 30 trillion tokens, each in three bytes—that’s about 10 to the 14 bytes; one with 14 zeros behind it. This is the totality of all the texts available publicly on the internet. It would take any of us several hundred thousand years to read through that material. Okay, so it’s an enormous amount of information.”
He continues, building his comparison: “But then you compare this with the amount of information that gets to our brains through the visual system in the first four years of life, and it’s about the same amount. In four years, a young child has been awake a total of about 16,000 hours. The amount of information getting to the brain through the optic nerve is about 2 megabytes per second. Do the calculation and that’s about 10 to the 14 bytes. It’s about the same. In four years a young child has seen as much information or data as the biggest LLMs.”
And he draws his conclusion from these two numbers: "We're never going to get to human-level AI by just training on text. We're going to have to get systems to understand the real world. That is really hard."
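LeCun's numbers are easy to check. The short Python sketch below reproduces the comparison using his stated round figures (roughly 20 trillion tokens at about 3 bytes each, and 16,000 waking hours of visual input at about 2 MB/s through the optic nerve); the constants are his approximations, not measured values.

```python
# Back-of-the-envelope check of LeCun's comparison.
# All constants are his stated round figures, not measured values.

TOKENS = 20e12            # ~20 trillion tokens in a large LLM's training set
BYTES_PER_TOKEN = 3       # ~3 bytes per token

WAKING_HOURS = 16_000     # roughly four years of a child's waking time
OPTIC_NERVE_BPS = 2e6     # ~2 MB/s reaching the brain via the optic nerve

text_bytes = TOKENS * BYTES_PER_TOKEN
visual_bytes = WAKING_HOURS * 3600 * OPTIC_NERVE_BPS

print(f"LLM training text:    ~{text_bytes:.1e} bytes")   # ~6.0e+13
print(f"Child's visual input: ~{visual_bytes:.1e} bytes")  # ~1.2e+14
```

Both figures land around 10 to the 14 bytes, which is precisely the point of the comparison: four years of a child's visual stream is roughly the size of the entire public text corpus used to train the largest models.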
LeCun’s argument underscores a critical bottleneck in current AI development. While LLMs have achieved impressive feats in natural language processing, their reliance on text data restricts their understanding of the world. A child’s sensory experience, particularly vision, provides a much richer and more nuanced understanding of reality, encompassing physical properties, spatial relationships, and causal connections that are difficult to capture in text alone. This disparity suggests that a paradigm shift is necessary, moving beyond text-based training towards models that can process and learn from multi-modal data, including visual, auditory, and perhaps even tactile information.
Other researchers have hinted at the same limit. Ilya Sutskever has called data the fossil fuel of AI pre-training, warning that it is being exhausted, and Elon Musk has likewise said that humanity is running out of data to train AI models. As such, future advances in AI likely hinge on developing models that can learn and reason about the world in a manner similar to humans, by integrating and synthesizing information from diverse sensory inputs. This shift represents a significant challenge, requiring not only more sophisticated algorithms but also new approaches to data acquisition, processing, and representation. The quest for human-level AI, it seems, lies not just in bigger models, but in models that can truly see the world.