Reddit to add new tools to try and stop AI bots from scraping user data

Reddit says it will add new protections to try to repel bots that scrape its posts to train AI systems.

Many companies have promoted their large language models, such as OpenAI’s ChatGPT and Google’s Gemini, as the future. But training such a system requires feeding it vast quantities of written text, which companies have often taken from publicly available websites.

In recent months, sites including Reddit and Twitter have complained that visits from these crawlers have both slowed down their websites and allowed companies to take data in contravention of their policies.

Last month, Reddit published a new “Public Content Policy” that aimed to control how its data is used, both by researchers and by companies wanting to train automated systems. Now it has announced that it will add new technologies to try to enforce that policy.

It will update its “Robots Exclusion Protocol” file, or robots.txt, which is read by automated crawlers visiting its site and gives instructions about what third parties are allowed to take.

It will also use technologies that aim to spot unknown bots and crawlers and either stop them from repeatedly refreshing the site or block them entirely.

“This update shouldn’t impact the vast majority of folks who use and enjoy Reddit,” Reddit said.

The company also said that the change would not affect “good faith actors”, including those who might scrape the site for research and other purposes.
It pointed to the Internet Archive, as an example, and shared a quote from the director of its Wayback Machine, which scrapes the web to allow users to see a version of a page at a given time.

“The Internet Archive is grateful that Reddit appreciates the importance of helping to ensure the digital records of our times are archived and preserved for future generations to enjoy and learn from,” said Mark Graham. “Working in collaboration with Reddit we will continue to record and make available archives of Reddit, along with the hundreds of millions of URLs from other sites we archive every day.”

Reddit also allows companies that it has deals with to scrape its posts to train AI systems. Both OpenAI and Google have agreements in place that see them pay Reddit for access to users’ data.

Those deals led the company’s share price to rise when they were announced. Users are not compensated for their posts, but the site gets access to new AI features that may be available to users as a result.

The use of Reddit to train AI models has nonetheless sometimes led to problems for those technology companies. Last month, when Google’s “AI Overview” feature began recommending including glue when making pizza, the advice was tracked down to a sarcastic Reddit post.

https://www.independent.co.uk/tech/reddit-ai-artificial-intelligence-crawling-bots-b2569394.html
