Some people have been saying that the era of pre-training is over, but it appears that training on large volumes of internet text is still quite valuable.
Online discussion platform Reddit has sued Anthropic for unauthorized use of its data to train its models. Anthropic makes the Claude series of models, which are particularly popular for coding tasks. “Anthropic is in fact intentionally trained on the personal data of Reddit users without ever requesting their consent,” Reddit alleged, adding that Anthropic’s conduct runs counter to how it “bills itself as the white knight of the AI industry.”

Reddit had reached agreements for its data use with OpenAI and Google, but didn’t have one with Anthropic. Reddit said that it found Anthropic’s bots accessing its site even after the two companies had failed to reach a similar deal. “We believe in an open internet. That does not mean open for exploitation,” Ben Lee, Reddit’s chief legal officer, said.
Reddit is one of the most popular discussion forums on the internet, and its members anonymously contribute to discussions on all manner of topics. This has meant that Reddit’s massive corpus of questions and answers, along with ratings on those answers, is quite valuable for AI companies looking to train their LLMs. Companies had been scraping Reddit’s data for a while before Reddit instituted guardrails and demanded financial agreements with those that wanted access to its data.
Anthropic, though, has said it disagrees with Reddit’s claims and will defend itself vigorously. Over the last few years, Anthropic has built a reputation for being thoughtful and conservative in how it develops AI products, with the aim of minimizing the risks of rapid AI development. A lawsuit over alleged unauthorized data use wouldn’t help that reputation, and Anthropic will likely fight this case in court to clear its name.