@aires apparently you can't crawl the web to index...

@aires apparently you can't crawl the web to index it unless you are google or bing because of cloudflare (the Nazi proxy) https://seirdy.one/posts/2021/03/10/search-engines-with-own-indexes/#fnref:1

Like 4 June at 18:53 | Wall-to-wall | Open on toot.cat

20 comments

Aires

@ionizedgirl Totally cool and normal that a proxy can choose which search engines get to index which websites :blobfoxfacepalm:

4 June at 18:57 | Open on tiggi.es

ionizedGirl

@aires it not just "a proxy" though, its the proxy that publically regretted ending their business dealing with imprisoned american nazi weev and his nazi organization stormfront

4 June at 19:01 | Open on toot.cat

2xfo

@ionizedgirl @aires
Is that kind of like a Nazi by proxy?

8 June at 18:24 | Open on infosec.exchange

Abie

@ionizedgirl @aires wow #TIL

9 June at 12:26 | Open on octodon.social

Tim

@ionizedgirl @aires That's a great article, thank you.

8 June at 12:42 | Open on mastodon.social

William D. Jones

@ionizedgirl @aires "OH FOR FUCK'S SAKE YOU'VE GOTTA BE SHITTING ME" is a phrase I just said out loud while reading your toot :o...

8 June at 23:17 | Open on mastodon.social

Seirdy

@ionizedgirl @aires Well, er, you can crawl most of CF if you register your bot with them through their verified bot program. Sites that opt into more advanced protection levels may still be inaccessible, and then there’s all the JS apps masquerading as websites that need a headless browser.

8 June at 23:46 | Open on pleroma.envs.net

Alan Langford

@ionizedgirl @aires As someone running a small web hosting company, yes, you can. There are at least three organizations aggressively crawling the web to build an index: TikTok (Bytedance), Amazon, and {not-sure-yet}. They are ignoring rate limits imposed by robots.txt and generating ridiculous server loads. [Except for Bytedance, because after they masked their user agent I've blocked every damn data centre they're running on with a firewall.] Amazon's crawler is pathologically bad, too.

9 June at 0:21 | Open on mindly.social

Liam Pomfret, PhD

@alan @ionizedgirl @aires Are they actually building an index, or are they scraping content for AI training datasets? Given they seem to ignore sitemaps, I’m inclined to think the latter.

9 June at 1:21 | Open on mastodon.social

Alan Langford

@liampomfret @ionizedgirl @aires Hard to say but the two are not mutually exclusive, and Google is ripe for being eclipsed at this point. Why not use the data to do both?

9 June at 1:43 | Open on mindly.social

Liam Pomfret, PhD

@alan @ionizedgirl @aires I would've assumed it'd be more efficient for the AI Scraper bots to be working off of a list of URLs they'd already gotten from sitemaps, rather than brute-forcing everything by simultaneously scraping and following links.

9 June at 5:05 | Open on mastodon.social

Alan Langford

@liampomfret @ionizedgirl @aires You would think. But some of them are horrible. Amazonbot gets stuck on event pages in some weird way. It gets an error, then appends a fragment of the URL to itself and tries again, from multiple IP addresses. I've had to put in rules that look for the pattern and then return a 403, otherwise it just keeps trying. If this thing was a high school coding assignment, it would fail!

9 June at 9:37 | Open on mindly.social

Liam Pomfret, PhD

@alan @ionizedgirl @aires Oh, you don't have to tell me how horrible they are. At the end of last month, just one of these bots was nuking my own site at a rate of 35+ accesses a second, sustained 24/7.

If you could DM me a copy of the rules that you've created for that, I'd love to pass them on to my own technical team, and see if that helps make any difference for some of the bots we hadn't yet been able to nail down.

9 June at 9:44 | Open on mastodon.social

Chuck

@liampomfret @alan @ionizedgirl @aires

Curious if this is an *Amazon* scraper or if it is someone running on AWS scraping. (can you distinguish between the two? Perhaps Amazon IPs vs AWS IPs)

The three party nature of search (indexer needs permission from site), (site needs to want clients of indexer to see it), (user wants an index that contains sites they are interested in) is a really interesting problem.

10 June at 17:48 | Open on chaos.social

Alan Langford

@ChuckMcManis @liampomfret @ionizedgirl @aires The user agent is a current amazonbot, but that doesn't mean someone isn't spoofing it. I've been blocking that one on a per site/URL basis, so if there's some reason for Amazon to be checking in, they can get in.

10 June at 17:51 | Open on mindly.social

Ted Johnson

@alan @ionizedgirl @aires Ironically, ChatGPT helped me write some directives for my .htaccess file that help prevent the crawling of some areas of my site by spiders.

9 June at 20:15 | Open on mas.to

Andrew Prillaman

@alan @ionizedgirl @aires

Don't forget my man claudebot by anthropic. He gonna waste all your bandwidth.

10 June at 14:57 | Open on theatl.social

josh

@ionizedgirl but can you scrape google

9 June at 0:28 | Open on wetdry.world

Abbie 🏳️‍⚧️

@ionizedgirl @aires that entire article was a great read.
Thanks!

9 June at 16:20 | Open on hackers.town

Mx. Luna Corbden

@ionizedgirl @aires Whichever of these engines starts filtering to only give results without ad trackers and ad bloat, they will win the internet. (Assuming they’re allowed to spider. I’m not sure I understand how cloudflare prevents them?)

9 June at 16:29 | Open on defcon.social

Go Up