Reddit blocks AI bots from crawling its website

In the coming weeks, Reddit will start blocking most automated bots from accessing its public data. You’ll need to make a licensing deal, as Google and OpenAI have done, to use Reddit content for model training and other commercial purposes. While this has technically been Reddit’s policy already, the company is now enforcing it by updating its robots.txt file, a core part of the web that dictates how web crawlers are allowed to access a site. “It’s a signal to those who don’t have an agreement with us that they shouldn’t be accessing Reddit data,” the company’s chief legal officer, Ben Lee, tells me. “It’s also a signal to bad actors that the word ‘allow’ in robots.txt doesn’t mean, and has never meant, that they can use the data however they want.”

My colleague David Pierce recently called robots.txt “the text file that runs the internet.” Since it was conceptualized in the early days of the web, the file has primarily governed whether search engines like Google can crawl a website to index it for results. For the last 20 years or so, the give-and-take (Google sending traffic in exchange for the ability to crawl) largely made sense for everyone involved. Then, AI companies started ingesting all the data they could find online to train their models.
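For readers unfamiliar with how a robots.txt file is actually consulted, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the bot names (ExampleSearchBot, ExampleModelTrainer), the example.com URL, and the can_crawl helper are all hypothetical and chosen for illustration; they are not Reddit's actual file or policy. The sketch shows a well-behaved crawler checking the rules before fetching a page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere; every other bot is disallowed.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str, robots_txt: str = EXAMPLE_ROBOTS_TXT) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's contents as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/python/"
    for bot in ("ExampleSearchBot", "ExampleModelTrainer"):
        print(bot, "may crawl:", can_crawl(bot, page))
```

Note that nothing in this check is enforced by the file itself: robots.txt is a published request, and it only has an effect when a crawler chooses to run a check like this one and honor the answer.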

https://www.theverge.com/2024/6/25/24185984/reddit-robots-txt-fight-ai-bots-scraping-crawlers
