A federal judge in California has issued a complicated pre-trial ruling in one of the first major copyright cases involving artificial intelligence training, finding that, while using legally acquired copyrighted books to train AI large language models constitutes fair use, downloading pirated copies of those books for permanent storage violates copyright law. The ruling represents the first substantive judicial decision on how copyright law applies to the AI training practices that have become standard across the tech industry, despite full-throated condemnation from the book business.
U.S. District Judge William Alsup of the Northern District of California ruled Monday in Bartz v. Anthropic that AI company Anthropic's training of its Claude LLMs on authors' works was "exceedingly transformative," and therefore protected under the fair use doctrine as specified in Section 107 of the Copyright Act. However, Alsup also determined that the company's practice of downloading pirated books from sites including Books3, Library Genesis, and Pirate Library Mirror to build a permanent digital library was not covered by fair use.
The case was brought by authors Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson, who alleged that Anthropic used their copyrighted works without permission to train its AI systems. Anthropic, which generates more than a billion dollars in annual revenue from its Claude AI service, downloaded over seven million pirated books between 2021 and 2022 to build its training datasets. Notably, the lawsuit challenged only the inputs, or works used to train Claude, and did not allege that the outputs, or works produced by the LLM, reproduced the plaintiffs' copyrighted works.
Apathy and piracy
According to the court's findings, Anthropic deliberately chose to "steal" books rather than pursue licensing agreements to avoid what cofounder and CEO Dario Amodei characterized as "legal/practice/business slog." The company's cofounder, Ben Mann, downloaded 196,640 books from Books3 in early 2021 "that he knew had been assembled from unauthorized copies of copyrighted books—that is, pirated," the court found.
The piracy, per the findings, escalated from there. "In June 2021, Mann downloaded in this way at least five million copies of books from Library Genesis, or LibGen, which he knew had been pirated. And, in July 2022, Anthropic likewise downloaded at least two million copies of books from the Pirate Library Mirror, or PiLiMi, which Anthropic knew had been pirated," according to the ruling. In total, Anthropic "thereby pirated over seven million copies of books, including copies of at least two works at issue for each author."
The court found that as "Anthropic trained successive LLMs, it became convinced that using books was the most cost-effective means to achieve a world-class LLM." However, the company became "not so gung ho about" training on pirated books "for legal reasons" but "kept them anyway."
Seeking a new approach, in February 2024 Anthropic hired Tom Turvey, the former head of partnerships for Google's book-scanning project, who was "tasked with obtaining 'all the books in the world' while still avoiding as much 'legal/practice/business slog'" as possible. After briefly inquiring into licensing books from "major publishers," the judge wrote, Turvey "let those conversations wither" rather than reaching licensing agreements "as another major technology company soon did with one major publisher"—a possible reference to deals struck by HarperCollins and Wiley last year.
The company then spent millions purchasing physical books from distributors and retailers, which were "destructively scanned"—their bindings stripped and pages cut before being digitized, with the original books discarded. Crucially, the court noted that "Anthropic kept the library copies in place as a permanent, general-purpose resource even after deciding it would not use certain copies to train LLMs or would never use them again to do so. All of Anthropic's copying was without plaintiffs' authorization."
The judge said that the print-to-digital conversion was fair use, not because, as Anthropic asserted, the copies were used in training the LLM, but because "the mere conversion of a print book to a digital file to save space and enable searchability was transformative." Therefore, "the digital copy should be treated just as if the purchased print copy had been placed in the central library."
In addition, despite finding the use of pirated books problematic, Alsup ruled that the actual training process constituted fair use under copyright law. Noting that Anthropic implemented filtering software to prevent users from accessing infringing copies of the original materials, he compared AI training to human learning, observing that people regularly read books, memorize passages, and draw upon that knowledge in their own writing without violating copyright.
"Everyone reads texts, too, then writes new texts," he wrote. "To make anyone pay specifically for the use of a book each time they read it, each time they recall it from memory, each time they later draw upon it when writing new things in new ways would be unthinkable."
The authors, Alsup noted, did not allege that Claude ingested the books to spit them back out, should a user ask. "Authors challenge only the inputs, not the outputs, of these LLMs," Alsup wrote. "They point to the fully trained LLMs and the Claude service only to shed light on how training itself uses copies of their works and the ways the Claude service could be used to produce still other works that would compete with their works." As such, the issue of output was not raised in this case.
Trial and errors
The Authors Guild, which has been closely monitoring AI copyright litigation, provided an extensive response to PW criticizing the ruling's fair use findings while welcoming the court's recognition of the piracy issues. "While the Authors Guild is relieved that the court recognized Anthropic's massive, criminal-level, unexcused e-book piracy for what it is," the guild argued, "the decision that using pirated or scanned books for training LLMs is fair use" contradicts established copyright precedent and "ignores the harm caused to authors and the value of their works due to market saturation by LLM-generated content that competes with human authors."
The guild added that "the analogy to human learning and reading is fundamentally flawed. When humans learn from books, they don't make digital copies of every book they read and store them forever for commercial purposes. The scale and systematic nature of this copying is unprecedented and threatens the livelihood of authors who depend on licensing and sales of their works."
The court's rulings on fair use—both for Claude's training and for "the print-to-digital format change"—were issued on summary judgment, meaning that certain issues in the case will still proceed to trial. Specifically, Alsup ordered "a trial on the pirated copies used to create Anthropic's central library and the resulting damages, actual or statutory (including for willfulness)," noting that Anthropic's later purchase of legitimate physical copies of the books would not absolve the company of liability for its earlier piracy—although it might affect the extent of statutory damages. The judge also declined to grant summary judgment on copies made from Anthropic's central library for uses other than AI training, noting that the company had "dodged discovery" on these issues and that the record was "too poorly developed" to make a determination.
Alsup's decision comes as AI companies face scores of similar copyright lawsuits from authors, publishers, and other content creators. Major technology companies including Meta, OpenAI, and others have been accused of using pirated content to train their AI systems. And the ruling's mixed nature reflects the legal complexity surrounding AI training practices. While it provides some validation for AI companies' claims that training constitutes fair use, it also establishes that obtaining copyrighted materials through piracy cannot be excused by subsequent transformative use.
Legal experts expect the decision to be appealed. And the Authors Guild, in its statement, said that "we remain confident that appeals courts will reverse the fair use finding and recognize that this systematic copying for commercial gain violates fundamental copyright principles that have protected authors for generations."