Cloudflare launches a defense against AI bots
Website owners can now stop artificial intelligence (AI) bots from downloading and crawling their content and using it, without approval, to train large language models (LLMs) that duplicate that content.
Website owners can already ask data-scraping and model-training bots to stay away by modifying their site’s robots.txt file, which tells bots which pages they may visit. Some AI providers, such as Google, OpenAI, and Apple, honor these directives. However, certain AI scrapers disregard them, as Cloudflare notes in the article introducing its bot-combating technology.
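For the providers that do honor the file, the opt-out looks like the following robots.txt entries. GPTBot, Google-Extended, and Applebot-Extended are the documented user-agent tokens for OpenAI’s, Google’s, and Apple’s AI training crawlers; bots that ignore robots.txt are, of course, unaffected:

```txt
# robots.txt — opt out of AI training crawlers that honor the file

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /
```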
In its blog post, the company says that “customers don’t want AI bots visiting their websites, and especially those that do so dishonestly,” adding that it worries some AI firms intent on getting around content restrictions will keep adapting to avoid being identified as bots.
To address the issue, Cloudflare refined its automatic bot-detection models by analyzing AI bot and crawler traffic. Among other signals, the models consider whether an AI bot is trying to mimic the appearance and behavior of a web browser in order to avoid being detected.
Cloudflare launches a solution to combat AI bots
Cloudflare, the publicly traded cloud services provider, has released a new, free tool that prevents bots from scraping websites hosted on its network for data used to train AI models.
According to Cloudflare, “bad actors typically use tools and frameworks that we are able to fingerprint when they attempt to crawl websites at scale.” It adds: “Our models [are] able to appropriately flag traffic from evasive AI bots as bots based on these signals.”
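Cloudflare does not publish its detection models, but the gist of scoring traffic on fingerprint-style signals can be sketched as a toy heuristic. Everything below — the signal names, weights, and thresholds — is invented for illustration and bears no relation to Cloudflare’s actual, machine-learned models:

```python
# Toy illustration of fingerprint-based bot scoring.
# All signals and weights here are made up for the sketch.

KNOWN_SCRAPER_UA_FRAGMENTS = ("python-requests", "scrapy", "curl", "go-http-client")

def bot_score(request: dict) -> float:
    """Return a 0..1 score; higher means more likely an automated scraper."""
    score = 0.0
    ua = request.get("user_agent", "").lower()
    if any(frag in ua for frag in KNOWN_SCRAPER_UA_FRAGMENTS):
        score += 0.5  # user agent matches a fingerprintable tool/framework
    if not request.get("accept_language"):
        score += 0.2  # real browsers almost always send Accept-Language
    if request.get("requests_per_minute", 0) > 100:
        score += 0.3  # request rate consistent with crawling at scale
    return min(score, 1.0)

# A request that hits all three signals scores near the maximum:
req = {"user_agent": "python-requests/2.31", "requests_per_minute": 500}
print(bot_score(req))
```

A real classifier would of course learn such weights from labeled traffic rather than hard-coding them, which is what makes evasive bots that mimic browsers detectable at all.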
Cloudflare says it will keep manually adding AI bots to its blocklist over time, and it has set up a form for hosts to report suspected AI bots and crawlers.
The problem of AI bots has become more visible as the surge in generative AI drives up demand for model-training data.
AI bots are a growing security risk
As generative AI becomes more prevalent, the need for training data keeps growing. Many websites block AI scrapers and crawlers out of concern that AI vendors will use their content without authorization or payment. According to research, 26% of the top 1,000 websites and more than 600 news publishers have blocked OpenAI’s bot.
But blocking is not foolproof. Some vendors have been accused of circumventing AI bot-exclusion rules to gain an advantage: OpenAI and Anthropic have reportedly ignored robots.txt rules at times, while Perplexity has been accused of posing as a legitimate website visitor in order to scrape content.
Cloudflare’s tools could be useful if they can detect hidden AI bots. They don’t, however, address the larger problem of publishers losing referral traffic to AI products like Google’s AI Overviews, which exclude sites that block particular AI crawlers.
How does it work?
Thanks to Cloudflare’s Workers AI global network inference engine, we can transform customer questions into GraphQL filters using one of the off-the-shelf large language models (LLMs) available on the platform, after teaching the model which filters are available on our Security Analytics GraphQL dataset.
Then, by applying the filters the AI model returns, we can query our GraphQL APIs, fetch the necessary data, and build a data visualization that answers the customer’s question.
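A minimal sketch of that querying step, assuming the LLM has already returned a structured filter object. The dataset name (`securityAnalytics`), field names, and filter type below are hypothetical, not Cloudflare’s actual GraphQL schema, and a real request would be POSTed to the API with authentication headers:

```python
import json

# The LLM returns structured filters; we plug them into a GraphQL
# request as variables. Schema names here are invented placeholders.
QUERY = """
query SecurityEvents($filter: SecurityEventFilter!) {
  securityAnalytics(filter: $filter, limit: 100) {
    timestamp
    action
    source
  }
}
"""

def build_payload(llm_filters: dict) -> str:
    """Build the JSON body for a GraphQL POST from LLM-produced filters."""
    return json.dumps({"query": QUERY, "variables": {"filter": llm_filters}})

# e.g. suppose the model turned "show me blocked requests" into:
payload = build_payload({"action": "block"})
print(payload)
```

Passing the model’s output as GraphQL variables, rather than splicing it into the query text, keeps the query structure fixed no matter what the model returns.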
This approach lets us query customer data while keeping the underlying security analytics data private and out of the AI model’s sight. You can also be confident that the model will never be trained on your questions: because Workers AI runs a local instance of the LLM on Cloudflare’s network, your questions and the data they return never leave that network.
What’s next
We are still in the early stages of development, but we intend to expand the Security Analytics AI Assistant’s capabilities quickly, so don’t be alarmed if it can’t fulfill some of your requests at first. At launch it supports basic “show me” or “compare” questions over any field that is currently filterable, and the results can be plotted in a time-series chart.
We are thrilled to open the beta of the AI Assistant to all Professional and Enterprise customers so you can test the feature and see what else it can do. We know there are plenty of applications we haven’t even considered, so please tell us what you find valuable and what you would like to see added in the future. In later iterations, you’ll be able to ask questions like “Did I suffer any malicious activity recently?” and have the AI generate WAF rules for you to deploy to mitigate it.