The New York Times wants OpenAI and Microsoft to pay for training data

The New York Times is suing OpenAI and its close collaborator (and investor), Microsoft, for allegedly violating copyright law by training generative AI models on The Times’ content.

In the lawsuit, filed in the Federal District Court in Manhattan, The Times contends that millions of its articles were used to train AI models, including those underpinning OpenAI’s ultra-popular ChatGPT and Microsoft’s Copilot, without its consent. The Times is calling for OpenAI and Microsoft to “destroy” models and training data containing the offending material and to be held responsible for “billions of dollars in statutory and actual damages” related to the “unlawful copying and use of The Times’s uniquely valuable works.”

“If The Times and other news organizations cannot produce and protect their independent journalism, there will be a vacuum that no computer or artificial intelligence can fill,” reads The Times’ complaint. “Less journalism will be produced, and the cost to society will be enormous.”

In an emailed statement, an OpenAI spokesperson said: “We respect the rights of content creators and owners and are committed to working with them to ensure they benefit from AI technology and new revenue models. Our ongoing conversations with The New York Times have been productive and moving forward constructively, so we are surprised and disappointed with this development. We’re hopeful that we will find a mutually beneficial way to work together, as we are doing with many other publishers.”

Generative AI models “learn” from examples to craft essays, code, emails, articles and more, and vendors like OpenAI scrape the web for millions to billions of these examples to add to their training sets. Some examples are in the public domain. Others aren’t, or come under restrictive licenses that require citation or specific forms of compensation.

Vendors argue that the fair use doctrine provides blanket protection for their web-scraping practices. Copyright holders disagree; hundreds of news organizations are now using code to prevent OpenAI, Google and others from scanning their websites for training data.

The vendor-outlet conflict has led to a growing number of legal battles, The Times’ being the latest.

Actress Sarah Silverman joined a pair of lawsuits in July that accuse Meta and OpenAI of having “ingested” Silverman’s memoir to train their AI models. In a separate suit, thousands of novelists, including Jonathan Franzen and John Grisham, claim OpenAI sourced their work as training data without their permission or knowledge. And several programmers have an ongoing case against Microsoft, OpenAI and GitHub over Copilot, an AI-powered code-generating tool, which the plaintiffs say was developed using their IP-protected code.

While The Times isn’t the first to sue generative AI vendors over alleged IP violations involving written works, it’s the largest publisher involved in such a suit to date — and one of the first to highlight potential damage to its brand through “hallucinations,” or made-up facts from generative AI models.

The Times’ complaint cites several cases in which Microsoft’s Bing Chat (now called Copilot), which is underpinned by an OpenAI model, provided incorrect information that was said to have come from The Times — including results for “the 15 most heart-healthy foods,” 12 of which weren’t mentioned in any Times article.

The Times also makes the case that OpenAI and Microsoft are effectively building news publisher competitors using The Times’ works, harming its business by providing information that couldn’t normally be accessed without a subscription. That information, the complaint says, isn’t always cited, is sometimes monetized and is stripped of the affiliate links that The Times uses to generate commissions.

As The Times’ complaint alludes to, generative AI models have a tendency to regurgitate training data, for example by reproducing passages from articles almost verbatim. Beyond regurgitation, OpenAI has on at least one occasion inadvertently enabled ChatGPT users to get around paywalled news content.

“Defendants seek to free-ride on The Times’s massive investment in its journalism,” the complaint says, accusing OpenAI and Microsoft of “using The Times’s content without payment to create products that substitute for The Times and steal audiences away from it.”

Impacts to the news subscription business, and to publisher web traffic, are at the heart of a tangentially similar suit filed by publishers earlier in the month against Google. In that case, the plaintiffs, like The Times, argued that Google’s GenAI experiments, including its AI-powered Bard chatbot and Search Generative Experience, siphon off publishers’ content, readers and ad revenue through anticompetitive means.

There’s credence to publishers’ assertions. A recent model from The Atlantic found that, if a search engine like Google were to integrate AI into search, it would answer a user’s query 75% of the time without requiring a click-through to The Atlantic’s website. Publishers in the Google suit estimate they’d lose as much as 40% of their traffic.

That doesn’t mean they’ll be successful in court. Heather Meeker, a founding partner at OSS Capital and an adviser on IP matters including licensing arrangements, compared The Times’ example of regurgitation to “using a word processor to cut and paste.”

“In the complaint, The New York Times gives an example of a ChatGPT session about a 2012 restaurant review,” Meeker told TechCrunch via email. “The prompt for ChatGPT is ‘What were the opening paragraphs of his review?’ The next prompts then repeatedly ask for ‘the next sentence.’ Teasing a chatbot into reproducing input is not a sensible basis for copyright infringement … If the user intentionally makes the chatbot copy, that’s the user’s fault. And that’s why most [lawsuits like this] will probably fail.”

Some news outlets, rather than fight generative AI vendors in court, have chosen to ink licensing agreements with them. The Associated Press struck a deal in July with OpenAI, and Axel Springer, the German publisher that owns Politico and Business Insider, did likewise this month.

In its complaint, The Times says that it attempted to reach a licensing arrangement with Microsoft and OpenAI in April but that talks weren’t ultimately fruitful.

Updated at 4:24 Eastern with additional context and comment from OpenAI.
