• Moonrise2473@feddit.it · 2 days ago

    The insane part is that since the website is powered by WordPress, the scrapers could access all the posts in a single JSON file.

    I was also exasperated by the fucking scrapers reading the same fucking page 20 times a second on posts that haven’t had new content in a decade, so I migrated my blog to Hugo, and I was completely shocked to discover that by default every WordPress blog comes with an unauthenticated API that lets literally everyone pull the whole blog content as JSON. Why the fuck are you wasting my server power scraping the HTML when you can get an easy JSON??? Take that fucking JSON and subscribe to the RSS feed to get the next post, which will be published sometime next decade. If you refresh that fucking URL 1000 times a day you will get the same fucking stuff, not a new magical article.
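
    (For reference, the route in question is the standard WordPress REST API at /wp-json/wp/v2/posts, enabled out of the box. A minimal sketch of what “grab the whole blog as JSON” looks like - the blog URL is a placeholder, and I’m assuming the usual per_page cap of 100:)

    ```python
    import requests

    BASE = "https://example-blog.com/wp-json/wp/v2/posts"  # placeholder blog URL

    def fetch_all_posts():
        """Page through the default WordPress REST API and collect every post as JSON."""
        posts, page = [], 1
        while True:
            # per_page=100 is the usual maximum page size for this endpoint
            resp = requests.get(BASE, params={"per_page": 100, "page": page}, timeout=30)
            if resp.status_code != 200:  # WordPress errors out once you page past the end
                break
            batch = resp.json()
            if not batch:
                break
            posts.extend(batch)
            page += 1
        return posts

    print(f"Fetched {len(fetch_all_posts())} posts, no authentication required.")
    ```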

    I mitigated the issue with Wordfence by setting a rule along the lines of “more than 2 pages requested within a second = IP banned for a year”.
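
    (Outside Wordfence, the same idea is only a few lines if you front the site with your own middleware - a rough sketch, with the threshold and ban window taken from that rule rather than from anything Wordfence actually does internally:)

    ```python
    import time
    from collections import defaultdict, deque

    WINDOW = 1.0                    # seconds
    MAX_REQUESTS = 2                # more than this per window gets the IP banned
    BAN_SECONDS = 365 * 24 * 3600   # "banned for a year"

    recent = defaultdict(deque)     # ip -> timestamps of recent requests
    banned_until = {}               # ip -> unix time when the ban expires

    def allow_request(ip, now=None):
        """Return False if this IP is currently banned or just blew past the rate limit."""
        now = time.time() if now is None else now
        if banned_until.get(ip, 0) > now:
            return False
        hits = recent[ip]
        while hits and now - hits[0] > WINDOW:   # drop timestamps outside the window
            hits.popleft()
        hits.append(now)
        if len(hits) > MAX_REQUESTS:
            banned_until[ip] = now + BAN_SECONDS
            return False
        return True
    ```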

    Now, why WordPress ships an unauthenticated API that lets anyone make a full unauthorized copy of the site in literally seconds is beyond me. There’s no valid reason to have it public without any authentication. That API shit doesn’t make sense: why should a website accept user signups from bots via the API by default?

  • HedyL@awful.systems · 2 days ago

    Even if it’s not the main topic of this article, I’m personally pleased that RationalWiki is back. And if the AI bots are now getting the error messages instead of me, then that’s all the better.

    Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?

    • BlueMonday1984@awful.systems · 2 days ago

      Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?

      Gonna take a shot in the dark here and guess that the people behind these AI scrapers are incredibly stupid, and think Scrape More = More Data = More Growth = More Good™. Given that this bubble has already given us the goddamn scourge that is “vibe coding”, and is headed by growth-obsessed fuckwits and True Believers™, chances are I’m completely correct on this.

    • Sailor Sega Saturn@awful.systems · 2 days ago

      Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?

      This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

      Any crawler that doesn’t know what it’s doing and doesn’t respect robots.txt, but wants to crawl an entire domain, will end up following these sorts of links naturally. It has no sense that the requests are “complex”, just that it’s fetching a URL with a few more query parameters than the one it started from.
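
      In code it’s little more than this (a toy sketch, not any particular scraper’s actual code - it deliberately skips robots.txt and URL normalisation, which is exactly how it ends up wandering into every diff/history link a wiki exposes):

      ```python
      from collections import deque
      from urllib.parse import urljoin, urlparse

      import requests
      from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

      def naive_crawl(start_url, max_pages=1000):
          """Breadth-first crawl: fetch a page, extract all links, queue them, repeat."""
          domain = urlparse(start_url).netloc
          seen, queue = {start_url}, deque([start_url])
          fetched = []
          while queue and len(fetched) < max_pages:
              url = queue.popleft()
              try:
                  resp = requests.get(url, timeout=10)
              except requests.RequestException:
                  continue
              fetched.append(url)
              soup = BeautifulSoup(resp.text, "html.parser")
              for a in soup.find_all("a", href=True):
                  link = urljoin(url, a["href"])
                  # No robots.txt check, no notion of a "complex" URL:
                  # a wiki diff page is just another link with more query parameters.
                  if urlparse(link).netloc == domain and link not in seen:
                      seen.add(link)
                      queue.append(link)
          return fetched
      ```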

      The article even alludes to how to take advantage of this with its “trap the bots in a maze of fake pages” suggestion. Even crawlers that know what they’re doing will sometimes struggle with infinite URL spaces.
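
      (The tarpit version of that suggestion is almost embarrassingly small - a toy Flask sketch with a made-up /maze/ route, where every junk page links only to more junk pages, so a crawl like the one above never terminates:)

      ```python
      import secrets

      from flask import Flask  # assumes Flask is installed

      app = Flask(__name__)

      @app.route("/maze/")
      @app.route("/maze/<path:token>")
      def maze(token=""):
          """Serve a throwaway page whose only content is links to more throwaway pages."""
          links = "".join(
              f'<li><a href="/maze/{secrets.token_hex(8)}">article {i}</a></li>'
              for i in range(10)
          )
          return f"<html><body><h1>Archive {token or 'index'}</h1><ul>{links}</ul></body></html>"

      if __name__ == "__main__":
          app.run()
      ```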

      • HedyL@awful.systems · 2 days ago

        This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

        It’s so ridiculous - these people supposedly have access to a super-smart AI (which is supposedly going to take all our jobs soon), but that AI can’t even tell them which pages are worth scraping multiple times per second and which are not. Instead, they regularly appear to kill their hosts like maladapted parasites. It’s probably not surprising, but still absurd.

        Edit: Of course, I strongly suspect that the scrapers don’t use the AI in this context (I guess they only used it to write their code, based on old Stack Overflow posts). Doesn’t make it any less ridiculous, though.