39 comments
theheretic

@nixCraft damn, that blows. I really like using their bot over ChatGPT too.

Bo Stahlbrandt

@nixCraft we blocked them at #nikonians some time ago since they are scraping like mad.

Christian Quest 🌍

@nixCraft got the same on @osm_fr discourse forum.

They are now blacklisted, with others.

nixCraft 🐧

@cquest @osm_fr any idea how to blacklist them? robots.txt? CIDR block?

Christian Quest 🌍

@nixCraft @osm_fr robots.txt + user-agent rule in nginx

A few days ago that silly ClaudeBot made 100,000+ queries on five-year-old phpBB URLs and got the same number of 404s in return.
Now all they will get is a 403, whatever the URL.
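
A minimal sketch of the nginx side of that setup (the "claudebot" pattern is an assumption based on the user agents reported later in this thread):

# nginx, inside the relevant server block:
# return 403 for any request whose User-Agent contains "claudebot"
# (~* makes the match case-insensitive).
if ($http_user_agent ~* "claudebot") {
    return 403;
}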

madopal

@cquest @nixCraft @osm_fr Did the same for another site, and it just happily continues to request URLs, hundreds of thousands of 403s not dissuading it in the slightest. I reached out to them via LinkedIn, got a response, and am trying to get them to pull their heads out of their collective asses.

F4GRX Sébastien

@cquest @nixCraft @osm_fr Ah, so that's just a:

# Apache mod_rewrite: answer 403 to any matching user agent
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} <regex>
RewriteRule . - [R=403,L]

Nice and easy! What regex should we block? Is "^.*Claude.*$" enough?

Gonçalo Ribeiro

@f4grx Just checked the logs on my web server. I see they've used "ClaudeBot" and "claudebot" in my case.
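
Given both casings, a case-insensitive match is the simplest option. A sketch building on the Apache rule above (the "claudebot" pattern is an assumption taken from the logs quoted here):

# [NC] makes the condition case-insensitive, covering both
# "ClaudeBot" and "claudebot" with one rule.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} claudebot [NC]
RewriteRule . - [R=403,L]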

Exzedo

@nixCraft Ah, so not only do they gather fuel by the truckload for their plagiarism bot, but they also actively harm the source of said fuel in the process.

How pleasantly disgusting.

katana crimson

@nixCraft Google did this to one forum I was part of years ago, and their stubborn refusal to obey robots.txt's crawl-delay directive still angers me to this day.

It's not just new crawlers that do this. Even the big old search engines have done it for ages.
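
For reference, a robots.txt sketch using the Crawl-delay directive mentioned above (the values are illustrative, and note that Crawl-delay is non-standard; as described here, some major crawlers ignore it):

# robots.txt
# Block ClaudeBot entirely; ask everyone else to slow down.
User-agent: ClaudeBot
Disallow: /

User-agent: *
Crawl-delay: 10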

Luigi :donor:

@nixCraft I wrote my master's thesis on data poisoning LLMs, especially as a self-defense mechanism against bot scraping.

The poisoning tool is pretty basic and shitty for now; I'm aiming to release a better version in the summer.

Luigi :donor:

@nixCraft The University of Chicago has released an amazing data poisoning tool for images called Nightshade. Have a look, it's fantastic.

nightshade.cs.uchicago.edu/

F4GRX Sébastien

@luigirenna @nixCraft Yes, but we need something for website protection. Following you to get future info about your tool!

Jamie Knight

@nixCraft if these parasites had any shame they wouldn't be stealing others' work for their AI grift in the first place.

Jeff Rivett

@nixCraft Same on several of the sites I manage. Blocking everywhere.

Ted Johnson

@nixCraft I know some Claude lovers who are going to be very conflicted about this.

Ted Johnson

@f4grx @nixCraft

#NotAllAI

But definitely AI that requires the entire WWW to be competitive.

Fubaroque

@nixCraft Interesting. I just decided that enough was enough, and since yesterday I've been redirecting ClaudeBot (301 permanent) to large files filled with random bytes elsewhere on the web. 🤣

Didn’t know it was scraping for AI. But I’m sure the “info” they get out of that will be useful to them. 🤭
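
A sketch of that kind of redirect in nginx (the target URL is a hypothetical placeholder; point it at whatever junk file you like):

# nginx: permanently redirect ClaudeBot to a large file of random bytes
if ($http_user_agent ~* "claudebot") {
    return 301 https://example.com/random-noise.bin;
}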

Kierkegaanks, π/🦴

@nixCraft maybe we should start prefixing AI companies with "parasitic" when talking about them on reddit et al? E.g. "parasitic AI company Anthropic bla bla bla"

F4GRX Sébastien

@pixelriot @iamdtms @nixCraft @chriscoyier @beep If I were an AI company I would never use any user agent in this list.

morgan

@iamdtms @nixCraft @chriscoyier @beep please don't block CCBot though, it's extremely well behaved cc @pjox

Pedro Ortiz Suarez

@fay @iamdtms @nixCraft @chriscoyier @beep We crawl very slowly and very politely, always respecting robots.txt. We have been doing so for years, way before LLMs. Yes, some companies have used our crawls for AI training, but we're mainly a research crawl; our goal is to provide resources to researchers, archive, and actually increase the visibility of underrepresented parts of the web.

Pedro Ortiz Suarez

@fay @iamdtms @nixCraft @chriscoyier @beep There are also people who are starting to use our crawls to build indexes and alternative open web search engines, which I love; I don't believe a handful of companies should be deciding the content that people consume on the web.

Tamas

@pjox @fay @nixCraft @chriscoyier @beep Thank you for letting me know. I'll act accordingly.
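
A robots.txt sketch along those lines, keeping CCBot welcome while blocking ClaudeBot (an empty Disallow value means "allow everything"):

# robots.txt
User-agent: CCBot
Disallow:

User-agent: ClaudeBot
Disallow: /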

DamonHD

@nixCraft had to mail the press@ address to stop them re-checking every few seconds whether I'd changed my mind about blocking them for effectively DoSing my site. Tech bros in a hurry...

DamonHD

@nixCraft my email subject was "cease and desist" which may be the only language that they understand!

Gonçalo Ribeiro

@nixCraft For anyone's reference, I've checked my logs and see user agents "ClaudeBot" and "claudebot" from the IPs below, going back to December 2023 (block at your own risk; I don't know what else they may be used for).

PS: they're all AWS IPs.

3.84.110.120
13.59.136.170
18.210.24.192
54.198.157.15
54.81.157.133
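
If you want to block at the IP level, a sketch of the corresponding nginx deny rules (these are the exact addresses reported above; as noted, they are AWS IPs and may be reassigned, so block at your own risk):

# nginx, in the http, server, or location context:
deny 3.84.110.120;
deny 13.59.136.170;
deny 18.210.24.192;
deny 54.198.157.15;
deny 54.81.157.133;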

Freelock

@nixCraft This explains a lot -- several sites we manage have been hit by extremely unfriendly ClaudeBot scrapes in the past week. We've been putting in rate limiters that help -- but I like the idea of poisoning 😀
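
A minimal sketch of nginx rate limiting of the kind described (the zone name, rate, and burst values are illustrative assumptions):

# In the http context: one request per second per client IP,
# tracked in a 10 MB shared-memory zone.
limit_req_zone $binary_remote_addr zone=crawlers:10m rate=1r/s;

# In the relevant location block: allow short bursts of 5,
# reject anything faster.
location / {
    limit_req zone=crawlers burst=5 nodelay;
}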

YurkshireLad

@nixCraft it’s a shame you can’t return a random block of text instead of the page content.
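
For what it's worth, something close to this is possible. One way to sketch it in nginx, serving a static decoy page to the bot instead of the real content (/decoy.html is a hypothetical file you would fill with whatever text you like):

# nginx, in the server block: internally rewrite all ClaudeBot
# requests to a decoy page; real visitors are untouched.
if ($http_user_agent ~* "claudebot") {
    rewrite ^ /decoy.html last;
}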

Hunterrules

@nixCraft can't wait for there to be a new spinoff of Hoarders called "Digital Hoarders" lmao. This is the internet version of living off your mom at 30.

Raptor :gamedev:

@nixCraft I'm not so sure I'd call it bad behavior. As far as I can tell their bot respects robots.txt, and looking at the one for the LM forums, they don't have any crawl delay set or any restrictions on indexing. So any new large crawler that finds the site for the first time would likely have the same effect: anything indexing the site that hasn't seen it before will just keep branching without any delay. Google will nuke you like this too if you put a big dataset up and they index it without delay.

Michael Boelen

@nixCraft
Yes, saw them as well on my end. Tried contacting them, but not much of a response yet. We should feed them digital rubbish so their products will output 🤡💩
@securingdev

thinker

@nixCraft Well done to the @linuxmint team for finding and blocking this AI scraper 👍

sasha

@nixCraft Dang, I actually somewhat enjoyed using Claude, for all its limitations.
