Herman's blog

The Great Scrape

LLMs feed on data. Vast quantities of text are needed to train these models, which are in turn receiving valuations in the billions. This data is scraped from the broader internet, from blogs, websites, and forums, without the authors' permission, with all content treated as opted-in by default.

Needless to say, this is unethical. But as Meta has proven, it's much easier to ask for forgiveness than permission. It is unlikely they will be ordered to "un-train" their next-generation models due to some copyright complaints.

I wish the problem ended with the violation of consent for how our writing is used. But there's another, more immediate problem: The actual scraping.

These companies are racing to create the next big LLM, and in order to do that they need more and more novel data with which to train these models. This incentivises them to ruthlessly scrape every corner of the internet for any bit of new data to feed the machine. Unfortunately, these scrapers are terrible netizens and have been taking down site after site in an unintentional, widespread DDoS attack.

Over the past 6 months Bear, and every other content host on the internet, has been affected. Both Sourcehut and LWN have written about their difficulties in holding back the scourge of AI scrapers. This seems to be happening to big and small players alike. Self-hosted bloggers have had to figure out rate-limiting and CDNs too, which is pretty unfair for someone who just wants to write on the internet.

Bear is hit daily by bot networks requesting tens of thousands of pages in short time periods, and while I now have systems in place to prevent it actually taking down the server, when it started happening a few months ago it certainly had an impact on performance.

This is a difficult problem to solve, due to the way that these scrapers are designed. The first issue is that only a small portion of these scrapers identify themselves as such. These are all blocked at the WAF (Web Application Firewall) level and never reach any Bear blogs (about 500,000 requests have been blocked in the last 24 hours). However, the vast majority of scrapers identify themselves as regular web browsers and spread their requests across multiple servers and IP addresses, making the usual tools like rate-limiting and user-agent parsing ineffective. Not to mention that they all completely ignore robots.txt and other self-regulation rules.
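To make that limitation concrete, here is a minimal Python sketch of the two "usual tools": a user-agent check for scrapers that announce themselves, and a per-IP sliding-window rate limiter. The token list, the class name `PerIPRateLimiter`, and the limits are assumptions chosen for illustration, not Bear's actual WAF rules.

```python
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative user-agent tokens for crawlers that announce themselves.
# This list is an assumption for the example, not Bear's real rule set.
DECLARED_BOT_TOKENS = ("gptbot", "ccbot", "claudebot")


def is_declared_bot(user_agent: str) -> bool:
    """Return True if the user-agent openly identifies as a listed scraper."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_BOT_TOKENS)


class PerIPRateLimiter:
    """Sliding window: at most `limit` requests per `window` seconds per IP."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self._hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


limiter = PerIPRateLimiter(limit=5, window=60.0)

# 100 requests from one address in the same instant: the limiter holds.
single = sum(limiter.allow("203.0.113.7", now=0.0) for _ in range(100))

# The same 100 requests spread over 100 addresses: every one gets through.
spread = sum(limiter.allow(f"198.51.100.{i}", now=0.0) for i in range(100))

print(single, spread)
```

A scraper that sends a browser-like user-agent passes the first check, and one that rotates addresses never accumulates enough hits on any single IP to trip the second, which is why neither tool helps much against this traffic.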

One of the mitigation options is to add a challenge to every single page (like Cloudflare's managed challenge), but this is an unpleasant user experience and blocks bots that are actually welcome, such as search engine crawlers. So while it is possible to mitigate all bot traffic, that would effectively make all blogs non-searchable on all the major search engines. Some of the LLM scrapers cheekily identify themselves as Googlebot or Yandexbot as well. This option would also affect anyone who runs scripts on their own site for backups or custom automations. Not ideal.
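A common way to tell a real search crawler from an impostor is a reverse DNS lookup on the requesting IP followed by a forward lookup that must resolve back to the same IP. The sketch below assumes that approach; it is not described in this post as what Bear does. The function name `is_verified_crawler` is hypothetical, and the hostname suffixes are assumptions that should be checked against each search engine's current documentation. The resolvers are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, List

# Hostname suffixes that crawler IPs are expected to reverse-resolve to.
# Assumed values for illustration; confirm against the engines' own docs.
CRAWLER_DOMAINS = {
    "googlebot": (".googlebot.com", ".google.com"),
    "yandexbot": (".yandex.ru", ".yandex.net", ".yandex.com"),
}


def _reverse_dns(ip: str) -> str:
    return socket.gethostbyaddr(ip)[0]


def _forward_dns(hostname: str) -> List[str]:
    # gethostbyname_ex only returns IPv4 addresses.
    return socket.gethostbyname_ex(hostname)[2]


def is_verified_crawler(
    user_agent: str,
    ip: str,
    reverse: Callable[[str], str] = _reverse_dns,
    forward: Callable[[str], List[str]] = _forward_dns,
) -> bool:
    """True only if the UA claims a listed crawler AND DNS confirms the claim."""
    ua = user_agent.lower()
    for token, suffixes in CRAWLER_DOMAINS.items():
        if token not in ua:
            continue
        try:
            hostname = reverse(ip).lower().rstrip(".")
            if not hostname.endswith(suffixes):
                return False  # claims to be a crawler, but the PTR record disagrees
            return ip in forward(hostname)  # forward lookup must map back to the IP
        except OSError:
            return False  # lookup failed: treat the claim as unverified
    return False  # no crawler identity claimed
```

Requests that claim to be Googlebot but fail this check can then be treated like any other unidentified traffic, while verified crawlers skip the challenge.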

I've had to remove RSS subscriber analytics since I can't mitigate bots very well on RSS feeds, which are explicitly designed for bots. This influx has caused the RSS analytics to be completely wrong, and it felt better to remove them than to display incorrect information.

As of right now I have several strategies in place to combat this deluge that are working well. If you're a service provider or sysadmin being negatively impacted by these scrapers, contact me and I'd be happy to show you what's worked for me.

Right now everything is under control on Bear. Over the past month bots have only managed to impact performance on Bear once, and that endpoint has since been protected. I've added significantly more active monitoring, and any time I see a spike of requests I find a common pattern, block it, and monitor whether it has affected any real users.
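The "find a common pattern" step can be approximated by counting which attribute values dominate the requests in a spike. The following is a rough sketch under stated assumptions: the request fields (`user_agent`, `path_prefix`, `ip_prefix`), the function name `common_patterns`, and the 50% threshold are all hypothetical and not taken from Bear's monitoring.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def common_patterns(
    requests: Iterable[Dict[str, str]],
    fields: Sequence[str] = ("user_agent", "path_prefix", "ip_prefix"),
    min_share: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (field, value, share) for values seen in >= min_share of requests."""
    requests = list(requests)
    total = len(requests)
    if total == 0:
        return []
    found = []
    for field in fields:
        counts = Counter(r[field] for r in requests if field in r)
        for value, count in counts.most_common():
            share = count / total
            if share < min_share:
                break  # most_common is sorted, so the rest are smaller
            found.append((field, value, share))
    # Strongest signal first: these are the candidates for a block rule.
    found.sort(key=lambda item: item[2], reverse=True)
    return found
```

A value that covers most of a spike but almost none of the normal traffic is a good candidate for a block rule; checking it against a sample of ordinary traffic first is what keeps real users from being caught by it.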

Thankfully, none of these scrapers render CSS, and therefore don't get logged as visitors on Bear's analytics.
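One way analytics can depend on CSS rendering is to count a visit only when the client fetches a URL that is referenced solely from the stylesheet, which a plain HTML scraper never requests. The sketch below assumes that design; the post does not spell out Bear's mechanism, and the `/hit/` prefix, the log format, and the function name `counted_visitors` are hypothetical.

```python
from typing import Iterable, Set, Tuple

# Hypothetical endpoint that is only ever referenced from CSS.
HIT_PREFIX = "/hit/"


def counted_visitors(log: Iterable[Tuple[str, str]]) -> Set[str]:
    """Return client ids that loaded a page and then fetched the CSS-triggered URL.

    `log` is an ordered sequence of (client_id, request_path) pairs.
    """
    seen_page: Set[str] = set()
    counted: Set[str] = set()
    for client, path in log:
        if path.startswith(HIT_PREFIX):
            # Only count the hit if this client actually loaded a page first.
            if client in seen_page:
                counted.add(client)
        else:
            seen_page.add(client)
    return counted


log = [
    ("browser", "/my-post/"),
    ("scraper", "/my-post/"),
    ("browser", "/hit/my-post/"),  # fired by the browser applying the stylesheet
    ("scraper", "/another-post/"),
]
print(counted_visitors(log))
```

Under this scheme the scraper's requests still reach the server, but they never show up in the visitor count.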

The best-case scenario is that the AI companies find another way to train their models without ruthlessly slashing and burning the internet. However, I doubt this will happen. Instead, I see it getting worse before it gets better. More tools are being released to combat this. One interesting tool from Cloudflare is the AI Labyrinth, which traps AI scrapers that ignore robots.txt in a never-ending maze of no-follow links. This is how the arms race begins.

And I'm ready for it. Let's fight this exploitation of the commons.