Anthropic Didn’t Break The Law When It Used Books To Train Its Models, Rules US Federal Judge

The use of copyrighted material such as books and papers to train LLMs has been a contentious issue, but a US judge has ruled that using books to train such models falls under fair use.

A US federal judge has ruled that Anthropic did not break copyright law when it trained its LLMs on copyrighted books. The ruling came as part of a case brought by three authors, Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who had alleged that the use of millions of books to train AI models amounted to “large-scale theft”, and that Anthropic “seeks to profit from strip-mining the human expression and ingenuity behind each one of those works”.

But the judge ruled that Anthropic’s training of its models on these millions of books was “fair use” under US copyright law because it was “quintessentially transformative”. “Like any reader aspiring to be a writer, Anthropic’s [AI large language models] trained upon works not to race ahead and replicate or supplant them – but to turn a hard corner and create something different,” US District Judge William Alsup wrote.

The judgement said that everyone reads texts and then writes new texts, but that authors cannot reasonably expect to be paid each time a reader recalls a work or draws upon it for new writing. It added that AI systems “have not reproduced to the public a given work’s creative elements, nor even one author’s identifiable expressive style.”

The judgement read: “First, Authors argue that using works to train Claude’s underlying LLMs was like using works to train any person to read and write, so Authors should be able to exclude Anthropic from this use. But Authors cannot rightly exclude anyone from using their works for training or learning as such. Everyone reads texts, too, then writes new texts. They may need to pay for getting their hands on a text in the first instance. But to make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable. For centuries, we have read and re-read books. We have admired, memorized, and internalized their sweeping themes, their substantive points, and their stylistic solutions to recurring writing problems.”

Anthropic wasn’t entirely in the clear, though. The company had used pirated books taken from the internet to train its models, and the judge said that this broke the law. Anthropic will now go to trial over how it acquired those books by downloading them from online “shadow libraries” of pirated copies. “That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft, but it may affect the extent of statutory damages,” the judge wrote.
