Reddit to add new tools to try and repel AI bots from scraping user data

Reddit says it will add new protections to try and repel bots that attempt to scrape its posts to train AI systems.

Many companies have promoted large language models, such as OpenAI’s ChatGPT and Google’s Gemini, as the future. But training such a system requires feeding it vast amounts of written text, which companies have often taken from publicly available websites.

In recent months, sites including Reddit and Twitter have complained that visits from those crawlers have both slowed down their sites and allowed companies to steal data in contravention of their policies.

Last month, Reddit published a new “Public Content Policy” that aimed to control how its data is used, both by researchers and by companies looking to train automated systems. Now it has announced that it will add new technologies to try and enforce that policy.

It will update its robots.txt file, which implements the “Robots Exclusion Protocol”. The file is publicly accessible but is intended to be read by automated crawlers visiting the site, and gives instructions about what third parties are allowed to access.
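As a rough illustration of how a well-behaved crawler reads such a file, the sketch below uses the `urllib.robotparser` module from Python's standard library. The rules, the bot names and the URL are all hypothetical; they are not Reddit's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, NOT Reddit's real file: one named archive
# crawler is allowed everywhere, every other crawler is disallowed.
EXAMPLE_RULES = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(rules_text: str, user_agent: str, url: str) -> bool:
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URL used only for the demonstration.
PAGE = "https://example.com/r/some-thread"

if __name__ == "__main__":
    for agent in ("ExampleArchiveBot", "UnknownScraper"):
        print(agent, is_allowed(EXAMPLE_RULES, agent, PAGE))
```

The protocol is advisory: the file only states what the site permits, and a crawler has to choose to honour it.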

It will also use technologies that will aim to spot unknown bots and crawlers and either stop them from repeatedly refreshing the site – or block them entirely.
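Reddit has not described how those technologies work. Purely as a sketch of the general idea, the hypothetical `CrawlerGate` class below slows a client that makes too many requests in a short window and blocks it outright after repeated violations. The class name, thresholds and client identifiers are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class CrawlerGate:
    """Toy rate limiter: throttle fast clients, block repeat offenders.

    An illustrative sketch only, not Reddit's implementation.
    """

    def __init__(self, max_requests=3, window_seconds=10.0, max_strikes=2):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.max_strikes = max_strikes
        self._recent = defaultdict(deque)  # client -> accepted request times
        self._strikes = defaultdict(int)   # client -> number of violations
        self._blocked = set()

    def check(self, client_id, now):
        """Return 'allowed', 'throttled' or 'blocked' for one request."""
        if client_id in self._blocked:
            return "blocked"
        recent = self._recent[client_id]
        # Forget requests that have fallen outside the time window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            self._strikes[client_id] += 1
            if self._strikes[client_id] >= self.max_strikes:
                self._blocked.add(client_id)
                return "blocked"
            return "throttled"
        recent.append(now)
        return "allowed"


if __name__ == "__main__":
    gate = CrawlerGate()
    # A client requesting once per second is throttled, then blocked.
    for second in range(6):
        print(second, gate.check("fast-bot", float(second)))
```

Timestamps are passed in explicitly so the behaviour is easy to test; a real service would more likely key on signals beyond a single client identifier.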

“This update shouldn’t impact the vast majority of folks who use and enjoy Reddit,” Reddit said.

The company also stated that the change would not affect “good faith actors”, including those who might scrape the site for research and other purposes. It pointed to the Internet Archive, for instance, and shared a quote from the director of its Wayback Machine, which scrapes the internet to allow users to see a version of a page as it appeared at a given time.

“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Mark Graham. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”

Reddit also allows companies that it has deals with to scrape its posts to train AI systems. Both OpenAI and Google have agreements in place that see them pay Reddit for access to users’ data.

Those deals led the share price of the company to soar after they were announced. Users are not compensated for their posts, but the site will get access to new AI features that may be available to users as a result.

The use of Reddit to train AI models has, however, sometimes led to problems for those technology companies. Last month, when Google’s “AI Overview” feature began recommending adding glue to pizza, the advice was traced to a sarcastic Reddit post.
