Updated robots.txt file hits Bing and others without a Reddit deal.
Recent discussions on Reddit are no longer showing up in non-Google search engine results. The absence is the result of updates to Reddit’s Content Policy that ban crawling its site without agreeing to Reddit’s rules, which bar using Reddit content for AI training without Reddit’s explicit consent.
As reported by 404 Media, using “site:reddit.com” on non-Google search engines, including Bing, DuckDuckGo, and Mojeek, brings up minimal or no Reddit results from the past week. Ars Technica ran searches on these and other search engines and can confirm the findings. Brave, for example, sometimes brings up a few Reddit results but not nearly as many as Google does for identical queries. A standout is Kagi, a paid-for engine that pays Google for some of its search index and still shows recent Reddit results.
As 404 Media noted, Reddit’s Robots Exclusion Protocol file (robots.txt) blocks bots from scraping the site. The file also states, “Reddit believes in an open Internet, but not the misuse of public content.” Reddit has approved scrapers from the Internet Archive and some research-focused entities.
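For readers unfamiliar with how a robots.txt file works in practice, here is a minimal sketch of the check a well-behaved crawler performs before fetching a page, using Python's standard `urllib.robotparser` module. The rules shown are a hypothetical disallow-all example, not a copy of Reddit's actual file, and the crawler name `ExampleBot` and the helper `may_fetch` are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules (not Reddit's actual file): every crawler is
# told to stay away from every path on the site.
DISALLOW_ALL = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's contents as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.reddit.com/r/technology/"
    # "ExampleBot" is a made-up crawler name used only for this sketch.
    print(may_fetch(DISALLOW_ALL, "ExampleBot", url))
```

Note that robots.txt is a request rather than an enforcement mechanism: the file only has an effect when a crawler chooses to run a check like this and honor the answer.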
Reddit announced changes to its robots.txt file on June 25. Ahead of the changes, it said it had “seen an uptick in obviously commercial entities who scrape Reddit and argue that they are not bound by our terms or policies. Worse, they hide behind robots.txt and say that they can use Reddit content for any use case they want.”
Last month, Reddit said that any “good-faith actor” could reach out to Reddit to try to work with the company, linking to an online form. However, Colin Hayhurst, Mojeek’s CEO, told me via email that he reached out to Reddit after he was blocked but that Reddit “did not respond to many messages and emails.” He noted that since 404 Media’s report, Reddit CEO Steve Huffman has reached out.
Google’s search stranglehold tightens
With Google being virtually the only search engine that can show recent Reddit results—at least for now—Reddit has inadvertently helped tighten Google’s stranglehold on the search industry. The change comes amid recent quality concerns about Google results, which have ranked SEO and AI spam farms, ads, and e-commerce links higher than more relevant results. There are also worries about Google’s AI Overview.
When reached for comment, Reddit spokesperson Tim Rathschmidt said via email that Reddit has been in talks “with multiple search engines.” He added:
We have been unable to reach agreements with all of them, since some are unable or unwilling to make enforceable promises regarding their use of Reddit content, including their use for AI.
After Reddit declared war on free use of its content for AI training (which also resulted in an API access price hike that shuttered many third-party Reddit apps), Reddit signed a deal at a reported $60 million per year that lets Google use Reddit data to train its AI. It was expected that Reddit would try to strike a similar deal with Microsoft, but it seems the parties could not reach an agreement in line with Reddit’s content policy, which also includes rules about user privacy and deleted content, for example.
A spokesperson for Microsoft told me:
Microsoft respects the robots.txt standard and we honor the directions provided by websites that do not want content on their pages to be used with our generative AI models. Bing stopped crawling Reddit after they implemented their updated robots.txt file on July 1, which prohibits all crawling of their site.
In October, The Washington Post, citing an anonymous source, reported that Reddit was considering blocking Bing search crawlers if it couldn’t reach a deal with Microsoft.
As 404 Media pointed out, Reddit’s guide for accessing its data names “search or website ads” as a commercial use warranting fees. It’s unclear how much money other search engines would need to spend to be permitted to scrape the platform. Rathschmidt said Reddit is “open to working with partners big and small.”
“It’s bad for the health of the Internet for for-profit companies to scrape our content without constraint and use it for, among other things, [training] AI models,” he said.
For now, Google can continue leaning on Reddit to help make search results more relevant. Google didn’t respond to Ars’ request for comment.
Meanwhile, alternative search engines may find it harder to compete.
“With our own ranking algorithms, previously users would often find different pages on Reddit than they might find with Google and others,” Mojeek’s Hayhurst told me.
The CEO added that while being blocked by Reddit alone “is not a huge deal,” he is concerned about the precedent it could set. “Search engines are the main traffic source for most websites, and a spreading of this behavior will further choke off traffic. And smaller sites will be impacted even more than large sites,” he said.
Advance Publications, which owns Ars Technica parent Condé Nast, is the largest shareholder of Reddit.
This article was updated with additional comment from Microsoft.