OpenAI “bizarrely” mischaracterizes hacking, NYT lawyer says.

OpenAI is now boldly claiming that The New York Times “paid someone to hack OpenAI’s products” like ChatGPT to “set up” a lawsuit against the leading AI maker.

In a court filing Monday, OpenAI alleged that “100 examples in which some version of OpenAI’s GPT-4 model supposedly generated several paragraphs of Times content as outputs in response to user prompts” do not reflect how normal people use ChatGPT.

Instead, it allegedly took The Times “tens of thousands of attempts to generate” these supposedly “highly anomalous results” by “targeting and exploiting a bug” that OpenAI claims it is now “committed to addressing.”

According to OpenAI, this activity amounts to “contrived attacks” by a “hired gun” who allegedly hacked OpenAI models until they hallucinated fake NYT content or regurgitated training data to replicate NYT articles. The Times allegedly paid for these “attacks” to gather evidence supporting its claims that OpenAI’s products imperil its journalism by regurgitating its reporting and stealing its audience.

“Contrary to the allegations in the complaint, however, ChatGPT is not in any way a substitute for a subscription to The New York Times,” OpenAI argued in a motion that seeks to dismiss the majority of The Times’ claims. “In the real world, people do not use ChatGPT or any other OpenAI product for that purpose. Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will.”

In the filing, OpenAI described The Times as enthusiastically reporting on its chatbot developments for years without raising any concerns about copyright infringement. OpenAI claimed that it disclosed in 2020 that The Times’ articles were used to train its AI models, but that The Times only objected after ChatGPT’s popularity exploded following its 2022 debut.

According to OpenAI, “It was only after this rapid adoption, along with reports of the value unlocked by these new technologies, that the Times claimed that OpenAI had ‘infringed its copyright[s]’ and reached out to demand ‘commercial terms.’ After months of discussions, the Times filed suit two days after Christmas, demanding ‘billions of dollars.'”

Ian Crosby, Susman Godfrey partner and lead counsel for The New York Times, told Ars that “what OpenAI bizarrely mischaracterizes as ‘hacking’ is simply using OpenAI’s products to look for evidence that they stole and reproduced The Times’s copyrighted works. And that is exactly what we found. In fact, the scale of OpenAI’s copying is much larger than the 100-plus examples set forth in the complaint.”

Crosby told Ars that OpenAI’s filing notably “doesn’t dispute—nor can they—that they copied millions of The Times’ works to build and power its commercial products without our permission.”

“Building new products is no excuse for violating copyright law, and that’s exactly what OpenAI has done on an unprecedented scale,” Crosby said.

OpenAI argued that the court should dismiss claims alleging direct copyright infringement, contributory infringement, Digital Millennium Copyright Act violations, and misappropriation, all of which it described as “legally infirm.” Some claims fail, OpenAI argued, because they are time-barred, seeking damages over training data used for OpenAI’s older models. Others allegedly fail because they misunderstand fair use or are preempted by federal laws.

If OpenAI’s motion is granted, the case would be substantially narrowed.

But if the motion is not granted and The Times ultimately wins—and it might—OpenAI may be forced to wipe ChatGPT and start over.

“OpenAI, which has been secretive and has deliberately concealed how its products operate, is now asserting it’s too late to bring a claim for infringement or hold them accountable. We disagree,” Crosby told Ars. “It’s noteworthy that OpenAI doesn’t dispute that it copied Times works without permission within the statute of limitations to train its more recent and current models.”

OpenAI did not immediately respond to Ars’ request for comment.

How did NYT “hack” ChatGPT?

OpenAI claimed that The Times used deceptive prompts—such as repeatedly asking ChatGPT, “what’s the next sentence?”—to target “two uncommon and unintended phenomena” from both its developer tools and ChatGPT: training data regurgitation and model hallucination.

These appear to be the “bug” that OpenAI has accused The Times of exploiting to “hack” GPT models.

According to OpenAI, regurgitation of training data occurs when AI tools generate “a sample that closely resembles [its] training data,” which “most often happens” when the “training data set contains a number of highly similar observations, such as duplicates” of a particular piece of writing. OpenAI likened this to an American hearing the phrase “I pledge allegiance” and “reflexively” responding to complete the text by saying, “to the flag of the United States of America.”
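To make the duplicate-data intuition concrete, here is a toy sketch, nothing like GPT-4’s actual architecture, of a word-level bigram model built by simple counting. The text and repetition counts are invented for illustration; the point is only that a phrase duplicated many times in training data dominates the statistics, so the model completes it “reflexively”:

```python
from collections import Counter, defaultdict

# Toy illustration only (nothing like GPT-4): a word-level bigram model
# built by counting which word follows which. The duplicated phrase swamps
# the counts, producing the reflexive completion OpenAI describes.
training_text = (
    "i pledge allegiance to the flag of the united states of america . " * 50
    + "i pledge to do my homework ."  # a rare competing continuation
)

counts = defaultdict(Counter)
tokens = training_text.split()
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def complete(word, n=6):
    """Greedily extend `word` with the most frequent next word, n times."""
    out = [word]
    for _ in range(n):
        out.append(counts[out[-1]].most_common(1)[0][0])
    return " ".join(out)

print(complete("pledge"))  # -> "pledge allegiance to the flag of the"
```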

The Times allegedly exploited this bug by asking OpenAI’s tools for the “opening paragraph” of a particular article and then repeatedly requesting the “next sentence.” Even this tactic, OpenAI claimed, could not generate an entire article, only “scattered and out-of-order quotes.”
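The Times has not disclosed its exact prompts or parameters (a point OpenAI itself raises later in the filing), but mechanically, that kind of probing could look something like the following sketch, which assumes the official openai Python package, an API key in the environment, and a made-up article title:

```python
# Hypothetical sketch of "opening paragraph, then next sentence" probing,
# as OpenAI describes it. The prompt wording, article title, and number of
# attempts are all assumptions; The Times' actual method is undisclosed.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [{
    "role": "user",
    "content": ('What is the opening paragraph of the New York Times '
                'article "Example Headline"?'),  # made-up title
}]

for _ in range(5):  # repeat the follow-up prompt several times
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    reply = response.choices[0].message.content
    print(reply)
    # Feed the answer back in and ask for the next sentence
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "What's the next sentence?"})
```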

OpenAI accused The Times of deliberately misleading the court by using “ellipses to obscure” the order in which ChatGPT spouted out snippets of The Times’ reporting. This, OpenAI alleged, created “the false impression that ChatGPT regurgitated sequential and uninterrupted snippets of the articles.”

OpenAI similarly accused The Times of prompting ChatGPT to hallucinate fake NYT content.

Hallucinations, a much more hotly discussed problem with AI chatbots, occur “when a model generates ‘seemingly realistic’ answers that turn out to be wrong,” OpenAI said.

OpenAI pushed back on The Times’ examples of hallucinations in which AI models invented articles that The Times never published. Because none of the links to the bogus articles that appeared in the outputs actually worked, OpenAI argued that “any user who received such an output would immediately recognize it as a hallucination.”
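That recognition step is easy to automate. As a rough sketch, a user could check whether a cited URL actually resolves; the function below uses the third-party requests library, and the example URL is invented:

```python
import requests

def link_is_live(url, timeout=10.0):
    """Return True if the URL resolves with a non-error HTTP status."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code == 405:  # some servers reject HEAD; retry with GET
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.status_code < 400
    except requests.RequestException:  # DNS failure, timeout, etc.
        return False

# A dead link like this made-up one is a strong hint of a hallucination
print(link_is_live("https://www.nytimes.com/2019/01/01/no-such-article.html"))
```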

OpenAI plans to fix these bugs, but that plan appears to depend almost entirely on defeating The Times’ lawsuit and others like it by convincing courts in multiple jurisdictions that its use of copyrighted texts, which it views as critical to advancing its AI models, is fair use.

“An ongoing challenge of AI development is minimizing and (eventually) eliminating hallucination, including by using more complete training datasets to improve the accuracy of the models’ predictions,” OpenAI said.

Crosby told Ars that there “should be no surprise” that to OpenAI, allegedly “illegal copying and misinformation are core features of their products and not the result of fringe behavior.”

What did surprise The Times, Crosby said, was that OpenAI’s filing appears to show “that it is tracking users’ queries and outputs, which is particularly surprising given that they claimed not to do so. We look forward to exploring that issue in discovery.”

OpenAI’s biggest gripes with NYT’s suit

OpenAI appears frustrated that The Times allegedly spent a long time investigating its products for these flaws without alerting OpenAI or attempting to collaborate on solutions.

“Rather, The Times kept these results to itself, apparently to set up this lawsuit,” OpenAI’s filing said.

OpenAI seemingly never heard about any of these issues until confronted with the examples of regurgitation and hallucination cited in The Times’ lawsuit. And the lawsuit examples allegedly don’t give OpenAI much to work with now, because The Times still hasn’t clearly explained how it generated allegedly violative outputs.

“The Times did not reveal what parameters it used or disclose whether it used a ‘System’ prompt to, for instance, instruct the model to ‘act like a New York Times reporter and reproduce verbatim text from news articles,'” OpenAI argued.

OpenAI was able to deduce that The Times’ examples did not seem to cite current materials “that Times subscribers are most likely to read on the Times’s website,” but rather “much older articles” published 2.5 to 12 years earlier. This seems to weaken The Times’ claim that ChatGPT could serve as a substitute that leads subscribers to stop paying for access, since ChatGPT appears far less likely to regurgitate newer articles. (This could be because much of OpenAI’s training data comes from scraping social media sites, where older NYT content has circulated more widely.)

According to OpenAI, at least one of The Times’ copyright claims will fail because The Times never notified OpenAI of the concerns at the heart of its lawsuit.

“The Times must allege that OpenAI ‘had knowledge’ of the Times’ creation of those outputs,” OpenAI argued, claiming that it had no “reason to suspect this was happening.”

The Times has alleged that OpenAI knowingly designed its products to reference its articles, potentially diverting readers, without offering to pay The Times to license its content.

Licensing data, rather than training models on scraped public data, has been increasingly embraced by some AI makers, including OpenAI, seemingly to avoid perceived conflicts with copyright law. Some lawmakers think that AI companies should license all of their training data, a policy stance opposed by some AI companies, including OpenAI, which have “argued that it’s not viable to license all training data,” Wired reported.

“Developing technology in a way that complies with established copyright laws is an industry-wide priority,” Crosby told Ars. “The decision by OpenAI and other generative AI developers to enter into deals with news publishers only confirms that they know their unauthorized use of copyrighted work is far from ‘fair.’”
