Liam Pomfret, PhD

@alan @ionizedgirl @aires Are they actually building an index, or are they scraping content for AI training datasets? Given they seem to ignore sitemaps, I’m inclined to think the latter.

6 comments
Alan Langford

@liampomfret @ionizedgirl @aires Hard to say but the two are not mutually exclusive, and Google is ripe for being eclipsed at this point. Why not use the data to do both?

Liam Pomfret, PhD

@alan @ionizedgirl @aires I would've assumed it'd be more efficient for the AI Scraper bots to be working off of a list of URLs they'd already gotten from sitemaps, rather than brute-forcing everything by simultaneously scraping and following links.

Alan Langford

@liampomfret @ionizedgirl @aires You would think. But some of them are horrible. Amazonbot gets stuck on event pages in some weird way. It gets an error, then appends a fragment of the URL to itself and tries again, from multiple IP addresses. I've had to put in rules that look for the pattern and then return a 403, otherwise it just keeps trying. If this thing was a high school coding assignment, it would fail!
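A rule of the kind described here might look roughly like the sketch below. It is not the actual rule set, which isn't shown in the thread: the function names, the `min_len` threshold, and the example paths are all hypothetical, and it assumes the "fragment appended to itself" shows up as a run of path segments that immediately repeats.

```python
from typing import Optional


def has_repeated_fragment(path: str, min_len: int = 2) -> bool:
    """Return True if a run of at least `min_len` path segments repeats back to back.

    Example of the assumed failure mode: /events/2024/events/2024
    """
    segments = [s for s in path.split("/") if s]
    n = len(segments)
    for size in range(min_len, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            first = segments[start:start + size]
            second = segments[start + size:start + 2 * size]
            if first == second:
                return True
    return False


def status_for(user_agent: str, path: str) -> Optional[int]:
    """Return 403 for the looping pattern, or None to let the request through.

    Matching on the user agent string is an assumption; it can be spoofed.
    """
    if "amazonbot" in user_agent.lower() and has_repeated_fragment(path):
        return 403
    return None
```

Keying on the URL pattern rather than the source address fits the "multiple IP addresses" behaviour: the retries come from different hosts, but the malformed path stays recognisable. The same check could be expressed as a web server rewrite rule instead of application code.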

Liam Pomfret, PhD

@alan @ionizedgirl @aires Oh, you don't have to tell me how horrible they are. At the end of last month, just one of these bots was nuking my own site at a rate of 35+ accesses a second, sustained 24/7.

If you could DM me a copy of the rules that you've created for that, I'd love to pass them on to my own technical team, and see if that helps make any difference for some of the bots we hadn't yet been able to nail down.

Chuck

@liampomfret @alan @ionizedgirl @aires

Curious if this is an *Amazon* scraper or if it is someone running a scraper on AWS. (Can you distinguish between the two? Perhaps by Amazon IPs vs. AWS IPs.)
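One rough way to approach the IP question: AWS publishes its address ranges as JSON at https://ip-ranges.amazonaws.com/ip-ranges.json, where each IPv4 entry carries an `ip_prefix` and a `service` label such as `AMAZON` or `EC2`. The sketch below looks up which labels cover a given address. The `services_for_ip` helper and the sample data are made up for illustration (the prefixes are reserved documentation ranges, not real AWS ranges), and the result is only a hint: an `EC2` label suggests a rented instance, but it does not prove who is running the crawler.

```python
import ipaddress


def services_for_ip(ip: str, ranges: dict) -> set:
    """Return the service labels of every published IPv4 prefix containing `ip`.

    `ranges` is the parsed ip-ranges.json document (a dict with a "prefixes" list).
    """
    addr = ipaddress.ip_address(ip)
    found = set()
    for entry in ranges.get("prefixes", []):
        if addr in ipaddress.ip_network(entry["ip_prefix"]):
            found.add(entry["service"])
    return found


# Hypothetical sample in the same shape as the published file.
SAMPLE_RANGES = {
    "prefixes": [
        {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "AMAZON"},
        {"ip_prefix": "192.0.2.0/25", "region": "us-east-1", "service": "EC2"},
        {"ip_prefix": "198.51.100.0/24", "region": "us-east-1", "service": "AMAZON"},
    ]
}

print(services_for_ip("192.0.2.10", SAMPLE_RANGES))
print(services_for_ip("198.51.100.7", SAMPLE_RANGES))
print(services_for_ip("203.0.113.5", SAMPLE_RANGES))
```

In real use the dict would come from downloading and parsing the published file, refreshed periodically since the ranges change.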

The three-party nature of search (the indexer needs permission from the site, the site needs to want the indexer's clients to see it, and the user wants an index that contains the sites they are interested in) is a really interesting problem.

Alan Langford

@ChuckMcManis @liampomfret @ionizedgirl @aires The user agent is a current amazonbot, but that doesn't mean someone isn't spoofing it. I've been blocking that one on a per site/URL basis, so if there's some reason for Amazon to be checking in, they can get in.
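Since a user agent string can be spoofed, one common cross-check is forward-confirmed reverse DNS: look up the hostname for the connecting IP, check that it falls under the crawler's documented domain, then resolve that hostname and confirm it maps back to the same IP. The sketch below is a minimal version under stated assumptions: `is_verified_crawler` is a hypothetical helper, and the domain suffix is passed in by the caller rather than hard-coded, since the correct value should be taken from the crawler operator's own documentation. The lookups are injectable so the logic can be exercised without touching the network.

```python
import socket


def is_verified_crawler(ip, allowed_suffix, reverse_lookup=None, forward_lookup=None):
    """Forward-confirmed reverse DNS check for a single IPv4 address.

    reverse_lookup(ip) -> hostname; forward_lookup(hostname) -> list of IPs.
    Both default to the standard library resolver calls.
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, or the lookup failed

    host = host.rstrip(".").lower()
    suffix = allowed_suffix.lower().strip(".")
    if not (host == suffix or host.endswith("." + suffix)):
        return False  # hostname is outside the expected domain

    try:
        return ip in forward_lookup(host)
    except OSError:
        return False  # hostname does not resolve back
```

A request that claims the crawler's user agent but fails this check could be treated as an impostor, while one that passes could be given the per-site/URL handling described above.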
