Will quibbling over the meaning of “shadow libraries” help Nvidia’s case?
Some of the most infamous so-called shadow libraries have increasingly faced legal pressure to either stop pirating books or risk being shut down or driven to the dark web. Among the biggest targets are Z-Library, which the US Department of Justice has charged with criminal copyright infringement, and Library Genesis (Libgen), which was sued by textbook publishers last fall for allegedly distributing digital copies of copyrighted works “on a massive scale in willful violation” of copyright laws.
But now these shadow libraries and others accused of spurning copyrights have seemingly found an unlikely defender in Nvidia, the AI chipmaker among those profiting most from the recent AI boom.
Nvidia seemed to defend the shadow libraries as a valid source of information online when responding to a lawsuit from book authors over the list of data repositories that were scraped to create the Books3 dataset used to train Nvidia’s AI platform NeMo.
That list includes some of the most “notorious” shadow libraries, authors argued: Bibliotik, Z-Library (Z-Lib), Libgen, Sci-Hub, and Anna’s Archive. However, Nvidia hopes to invalidate authors’ copyright claims partly by denying that any of these controversial websites should even be considered shadow libraries.
“Nvidia denies the characterization of the listed data repositories as ‘shadow libraries’ and denies that hosting data in or distributing data from the data repositories necessarily violates the US Copyright Act,” Nvidia’s court filing said.
The chipmaker did not go into further detail to define what counts as a shadow library or what potentially absolves these controversial sites from key copyright concerns raised by various ongoing lawsuits. Instead, Nvidia kept its response brief while also curtly disputing authors’ petition for class-action status and defending its AI training methods as fair use.
“Nvidia denies that it has improperly used or copied the alleged works,” the court filing said, arguing that “training is a highly transformative process that may include adjusting numerical parameters including ‘weights,’ and that outputs of an LLM may be based, at least in part, on such ‘weights.’”
Nvidia’s argument likely depends on the court agreeing that AI models ingesting published works in order to transform those works into weights governing AI outputs is fair use. However, authors have argued that “these weights are entirely and uniquely derived from the protected expression in the training dataset” that has been copied without getting authors’ consent or providing authors with compensation.
Some companies, like OpenAI, have already started licensing publishers’ content, likely to dodge these copyright questions entirely. Lawyers for The New York Times, which is one of the publishers suing OpenAI, have already suggested that OpenAI’s most recent deal to license content from News Corp. “supports the contention” that “publishers should be paid when their work is used for AI,” MediaPost reported.
Until this question is settled by courts or lawmakers, companies training AI on the Books3 dataset will likely continue to face lawsuits from rights holders, particularly from those who see AI models as an extension of harms caused by these allegedly illegal shadow libraries. A lawyer for textbook publishers suing Libgen, Matthew Oppenheim, previously told Ars that Libgen is a “thieves’ den” of illegal books, and “there is no question” that Libgen’s conduct is “massively illegal.”
Authors suing Nvidia have taken the next step, linking the chipmaker to shadow libraries by arguing that “these shadow libraries have long been of interest to the AI-training community because they host and distribute vast quantities of unlicensed copyrighted material. For that reason, these shadow libraries also violate the US Copyright Act.”
While Nvidia apparently prepares to defend against copyright suits by disputing what a shadow library even is, the websites at the heart of Nvidia’s suits may take less issue with the label. Anna, the pseudonymous creator of Anna’s Archive, freely uses the term, describing the site as “the world’s largest shadow library” while offering to train other so-called pirate archivists.
In one way, though, it’s not that surprising that Nvidia has seemingly taken the side of shadow libraries when it comes to beating back copyright claims.
Back in 2022, when feds started cracking down on pirate e-book sites, Anna told Vice that shadow libraries like hers operate on the ethos that “information wants to be free.” AI companies are arguably highly incentivized to want the same thing.
Nvidia recently announced that it made a record $26 billion in the first quarter of 2024 alone. For Nvidia and other AI companies hoping to maximize profits and command the AI market early on, there’s likely still no better price for AI training data than free and, thus, few better sources for training data than sites freely offering vast troves of information.