For years now, AI companies including Google, Meta, Anthropic, and OpenAI have insisted that their large language models do not technically store copyrighted works in their memory, and instead "learn" from their training data much like a human brain does.
It is this carefully worded distinction that has become increasingly integral to their efforts to defend themselves against a growing flood of legal challenges.
It also cuts to the core of copyright law, a form of intellectual property law designed to protect original works and their creators. Under the US Copyright Act of 1976, a copyright owner has the exclusive right "to reproduce, adapt, distribute, publicly perform, and publicly display the work."
But, importantly, the "fair use" doctrine recognizes that others can use copyrighted material for purposes such as criticism, journalism, and research. This has been the AI industry's defense in court against allegations of infringement; OpenAI CEO Sam Altman has even said that the industry would "die" if it were not allowed to freely leverage copyrighted data to train its models.
Rights holders have long cried foul, accusing AI companies of training their models on pirated copies of copyrighted works, effectively monetizing them without properly remunerating the writers, journalists, and artists who created them. The legal battle has dragged on for years, and at least one high-profile settlement has already been reached.
Now, a bombshell new study could put AI companies on the defensive. In it, Stanford and Yale researchers found strong evidence that AI models are actually memorizing all that data, rather than just "learning" from it. In particular, four leading LLMs – OpenAI's GPT-4.1, Google's Gemini 2.5 Pro, xAI's Grok 3, and Anthropic's Claude 3.7 Sonnet – readily reproduced long excerpts from popular – and protected – works with a surprising degree of accuracy.
They found that Claude outputted "entire books almost verbatim" with an accuracy rate of 95.8 percent. Gemini reproduced the novel "Harry Potter and the Sorcerer's Stone" with 76.8 percent accuracy, while Claude reproduced George Orwell's "1984" with more than 94 percent accuracy compared to the original – and still copyrighted – text.
“While many believe that LLMs do not remember much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models,” the researchers wrote.
Some of these replications required the researchers to jailbreak the models with a technique called best-of-n, which essentially bombards the AI with many slightly varied iterations of the same prompt. (OpenAI has already pointed to those kinds of workarounds to defend itself in the lawsuit brought by The New York Times, with its lawyers arguing that "normal people don't use OpenAI's products this way.")
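For readers unfamiliar with the technique, the idea behind best-of-n jailbreaking can be illustrated with a short sketch. The loop below is a hypothetical, simplified illustration – not the researchers' actual code – assuming a generic `query_model` function and a placeholder success check:

```python
import random

def perturb(prompt: str, rng: random.Random) -> str:
    """Apply small random perturbations of the kind used in best-of-n
    jailbreaking: randomly flip character case and swap neighbors."""
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < 0.1:
            chars[i] = chars[i].swapcase()
    if len(chars) > 1 and rng.random() < 0.5:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def best_of_n(prompt, query_model, is_success, n=100, seed=0):
    """Resample perturbed variants of the same prompt until one elicits
    the target behavior, or give up after n attempts."""
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = perturb(prompt, rng)
        reply = query_model(candidate)
        if is_success(reply):
            return attempt, candidate, reply
    return None

# Toy stand-in for a model that only "complies" when the prompt
# arrives with unusual capitalization (purely illustrative).
def toy_model(p):
    return "QUOTED TEXT" if any(c.isupper() for c in p) else "refused"

result = best_of_n("please quote chapter one", toy_model,
                   lambda r: r != "refused")
```

The point is not any single clever prompt: by resampling many cheap variations, the attacker only needs one of them to slip past the model's refusal behavior.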
The implications of the latest findings could be substantial as copyright lawsuits continue to play out in courts across the country. As The Atlantic's Alex Reisner explains, the results further undermine the AI industry's argument that LLMs "learn" from copyrighted works rather than storing the information and recalling it later. They are evidence that "AI could pose a major legal liability to companies" and "potentially cost the industry billions of dollars in copyright-infringement judgments."
Whether AI companies are liable for copyright infringement remains a hotly debated question. Mark Lemley, a Stanford law professor who represents AI companies in copyright lawsuits, told The Atlantic he is not sure an AI model "contains" a copy of a book just because the book can be reproduced in response to a prompt.
Not surprisingly, the industry continues to argue that its models do not technically copy protected works. In 2023, Google told the US Copyright Office that "no copy of the training data – whether text, images, or other formats – exists in the model itself."
OpenAI told the office the same year that its "models do not store copies of the information they learn from."
To Reisner, the analogy that AI models learn like humans is a "misleading, feel-good idea that stifles the public discussion we need to have about how AI companies are exploiting the creative and intellectual work on which they are completely dependent."
But whether the judges overseeing these copyright lawsuits will agree with that sentiment remains to be seen. The stakes are considerable, especially as it becomes increasingly difficult for writers, journalists, and other content creators to make a living – even as the AI industry balloons to unfathomable valuations.
More on AI and copyright: OpenAI's copyright situation appears to be putting it in major danger