The AI revolution is often attributed to the efforts of AI researchers and ever-growing levels of compute, but it also rests on a massive operation run by an entirely different set of people: human data annotators.
Former Google engineer Francois Chollet, creator of the Keras library and co-founder of the ARC Prize, has said that over 20,000 humans have been employed full-time over the last few years to provide annotations for training LLMs. He suggested that this human labor, rather than compute, has been the chief bottleneck in LLM progress.

“The narrative around LLMs is that they got better purely by scaling up pretraining *compute*,” he posted on X. “In reality, they got better by scaling up pretraining *data*, while compute is only a means to the end of cramming more data into the model. Data is the fundamental bottleneck. You can’t scale up pretraining compute without more data,” he added.
“And so far this data has been chiefly human-generated — over 20,000 people have been employed full-time for the past few years to provide annotations to train LLMs on. Even when the data is coming from Reinforcement Learning environments, the environments still had to be purposely handcrafted by humans,” Chollet added.
“And that’s the fundamental bottleneck here: these models are completely dependent on human output. They are an interpolative database of what we put into them,” he said. Chollet said that this wouldn’t be the case with true AGI. “Meanwhile, AGI will in fact get better by simply adding more *compute*. It will not be bottlenecked by the availability of human-generated text,” he added.
Modern LLMs are first pretrained on large amounts of data from the internet and books. But they also require guidance on how to answer questions and behave with users, and this comes from humans writing questions and answers that are used to further train the models. Large companies such as Scale AI and Surge AI specialize in providing this human-written training data. They employ large numbers of contractors in places like Kenya and Venezuela, but increasingly domain specialists and PhDs are also being used to create text to train models.
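To make the fine-tuning step concrete: the human-written questions and answers described above are typically collected as prompt/response records. Below is a minimal sketch of what one such record might look like; the schema and field names here are illustrative assumptions, not any specific vendor's actual format.

```python
import json

# Hypothetical annotation record: a human-written prompt/response pair
# of the kind used to fine-tune an LLM (schema is illustrative only).
record = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "response": (
        "Sunlight scatters off air molecules, and shorter blue "
        "wavelengths scatter the most, so the sky appears blue."
    ),
    "annotator_id": "contractor-0042",  # hypothetical identifier
    "domain": "physics",
}

# Training pipelines commonly serialize such records as JSON Lines,
# one example per line, before tokenization.
line = json.dumps(record)
parsed = json.loads(line)
print(parsed["domain"])  # physics
```

Each record of this kind is one unit of the human labor Chollet describes; thousands of annotators producing millions of such pairs is what the fine-tuning stage consumes.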
And it appears that as many as 20,000 humans have been employed full-time to create this data for different companies over the last few years. Human data annotation is one of the few new jobs being created by AI, as such roles didn't exist before, but given how capable AI systems are getting, it's unclear how long these jobs will remain viable. There's also a growing focus on synthetic data, which could render such jobs obsolete in a few years. For now, though, the AI industry isn't being powered only by massive datacenters and highly-paid AI researchers: it's also being powered by humans who are assiduously writing text to help train AI models.