Last week, reports emerged that The New York Times (NYT) may take legal action against ChatGPT maker OpenAI for allegedly using content published on its website, which is the NYT’s intellectual property, to train its AI models. While that has not happened so far, the major news publisher has now blocked OpenAI’s web crawler from accessing content on its website. The move is meant to keep the website’s content from being used to train any of OpenAI’s foundation models.
As per a report by The Verge, the NYT has blocked OpenAI’s web crawler, GPTBot, from crawling and indexing the contents of its website. The report points to the publication’s robots.txt file, which clearly shows that the bot has been disallowed. A check with the Internet Archive’s Wayback Machine, which lets users view webpages as they appeared on past dates, shows that the bot was blocked on August 17.
This move comes after OpenAI gave website owners an “opt-out” option so that their site’s content is not used by the company to train its AI models. On August 7, the company explained that GPTBot can be blocked by adding a rule for it to a site’s robots.txt file. At the same time, describing how crawled content is used, it said in its blog post, “Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies”.
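For readers curious how such a block works in practice, here is a minimal sketch in Python. It assumes a hypothetical robots.txt that disallows the GPTBot user agent across the whole site while leaving other bots alone; the rules and the example.com URL are illustrative and are not copied from the NYT’s actual file. It uses Python’s standard urllib.robotparser module to check what a well-behaved crawler would be permitted to fetch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules (an assumption for illustration, not the NYT's file):
# GPTBot is disallowed everywhere, every other user agent is allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com and SomeOtherBot are placeholders.
gptbot_allowed = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/article")
other_allowed = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/article")
print("GPTBot allowed:", gptbot_allowed)
print("Other bot allowed:", other_allowed)
```

Note that robots.txt is a request rather than an enforcement mechanism: it only works when the crawler chooses to honour it, which OpenAI says GPTBot does.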
For the unaware, a web crawler, also known as a web spider, is essentially a computer program that automatically browses and indexes website content. It follows the links on a website from page to page and collects the data it finds along the way. Such web crawlers are now widely used by AI companies to gather training data for their foundation models. Recently, Twitter added a temporary rate limit on reading tweets to stop such web crawlers from scraping the content on its platform. Similarly, Reddit has come up with a new API policy to dissuade web crawlers.
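To make the idea concrete, the following is a toy sketch of the crawl loop in Python. It is an assumption-laden illustration, not how GPTBot or any real crawler is implemented: the "website" is a small in-memory dictionary of made-up pages (SITE), and the LinkExtractor and crawl names are hypothetical. A real crawler would fetch pages over the network, respect robots.txt, and throttle its requests.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# A made-up four-page site, keyed by path (illustrative data only).
SITE = {
    "/": '<a href="/news">News</a> <a href="/about">About</a>',
    "/news": '<a href="/news/story-1">Story</a> <a href="/">Home</a>',
    "/about": "<p>About us</p>",
    "/news/story-1": "<p>Story text</p>",
}


def crawl(site, start="/"):
    """Visit every page reachable from start, breadth-first, each page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        path = queue.popleft()
        order.append(path)
        extractor = LinkExtractor()
        extractor.feed(site.get(path, ""))
        for link in extractor.links:
            # Only follow links to known pages that have not been queued yet.
            if link in site and link not in seen:
                seen.add(link)
                queue.append(link)
    return order


visited = crawl(SITE)
print(visited)
```

The set of already-seen pages is what keeps the crawler from looping forever when pages link back to each other, as the news page does to the home page here.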
However, OpenAI is one of the few AI companies that offer a direct and simple method for websites to opt out of being crawled by GPTBot.
Last week, an NPR report revealed that the NYT may end up suing the ChatGPT maker after the two parties failed to reach a licensing deal under which OpenAI would pay an agreed-upon amount for using the publication’s articles to train its AI models.