Bad and freeloading behaviour
AnthropicAI: https://www.reddit.com/r/linux/comments/1ceco4f/claude_ai_name_and_shame/ #linux #linuxmint #opensource. Please boost to shame them.
@nixCraft we blocked them at #nikonians some time ago since they are scraping like mad.

@f4grx Just checked the logs on my web server. I see they've used "ClaudeBot" and "claudebot" in my case [see the log-scan sketch after the thread].

@nixCraft Google did this to one forum I was a part of years ago - and their stubborn refusal to obey robots.txt's crawl-delay directive still angers me to this day. It's not just new crawlers that do this. Even the big old search engines have done it for ages.

@nixCraft I wrote my master's thesis on data poisoning LLMs, especially as a self-defense mechanism against bot scraping. The poisoning tool is pretty basic and shitty for now; I am aiming to release a better version in the summer.

@nixCraft The University of Chicago has released an amazing data-poisoning tool for images called Nightshade. Have a look, it's fantastic.

@luigirenna @nixCraft Yes, but we need something for website protection. Following you to get future info about your tool!

@nixCraft If these parasites had any shame they wouldn't be stealing others' work for their AI grift in the first place.

@nixCraft Interesting. I just decided that enough was enough, and since yesterday I have been redirecting (301 permanent) ClaudeBot to large files filled with random bytes elsewhere on the web [see the redirect sketch after the thread]. 🤣 Didn't know it was scraping for AI. But I'm sure the "info" they get out of that will be useful to them. 🤭

@nixCraft Maybe we should start prefixing AI companies with "parasitic" when talking about them on Reddit et al.? E.g. "parasitic AI company Anthropic bla bla bla".

@nixCraft "Blocking AI Scraper Bots" / "Blocking bots"

@iamdtms @nixCraft @chriscoyier @beep Here's a maintained and updated robots.txt from the author of that blog: https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt [see the robots.txt refresh sketch after the thread]

@pixelriot @iamdtms @nixCraft @chriscoyier @beep If I were an AI company I would never use any user agent on this list.

@fay @iamdtms @nixCraft @chriscoyier @beep We crawl very slowly and very politely, always respecting robots.txt. We have been doing so for years, way before LLMs. Yes, some companies have used our crawls for AI training, but we're mainly a research crawl; our goal is to provide resources to researchers, archive, and actually increase the visibility of underrepresented parts of the web.

@fay @iamdtms @nixCraft @chriscoyier @beep There are also people who are starting to use our crawls to build indexes and alternative open web search engines, which I love; I don't believe a handful of companies should be deciding the content that people consume on the web.

@nixCraft For anyone's reference, I've checked my logs and see user agents "ClaudeBot" and "claudebot" from the IPs below since December 2023 (block at your own risk; I don't know what else they may be used for). PS: they're all AWS IPs [see the AWS-range check after the thread].
3.84.110.120

@nixCraft It's a shame you can't return a random block of text instead of the page content.

@nixCraft Can't wait for there to be a new spinoff of Hoarders called "digital hoarders" lmao. This is the internet version of living off your mom at 30.

@nixCraft I'm not so sure I'd call it bad behavior. As far as I can tell their bot respects robots.txt, and looking at the one for the LM forums, they don't have any crawl delay set or any restrictions on indexing. So any new large crawler that finds the site for the first time would likely have the same effect: anything indexing the site that hasn't crawled it before will just keep branching without any delay. Google will nuke you like this too if you put a big dataset up and they index it without a delay.
@nixCraft damn, that blows. I really like using their bot over ChatGPT too.
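One reply above checks web-server logs for the "ClaudeBot" / "claudebot" user agents. Here is a minimal sketch of that check, assuming the common nginx/Apache "combined" log format (client IP as the first field, user agent in the last quoted field); LOG_PATH is a placeholder to adjust for your own server:

```python
#!/usr/bin/env python3
"""Count hits and source IPs for Claude's crawler in an access log."""
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your setup

# The thread reports both "ClaudeBot" and "claudebot", so match case-insensitively.
AGENT = re.compile(r"claudebot", re.IGNORECASE)

hits = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        # In the combined format the user agent appears in the last "..." field.
        if AGENT.search(line):
            ip = line.split(maxsplit=1)[0]  # client IP is the first field
            hits[ip] += 1

for ip, count in hits.most_common():
    print(f"{count:8d}  {ip}")
print(f"total: {sum(hits.values())} requests from {len(hits)} IPs")
```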
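Another reply 301-redirects ClaudeBot to large files of random bytes. In practice that rule would live in the web server's configuration; purely to illustrate the idea, here is a toy Python stdlib server doing the same, with DECOY_URL as a hypothetical stand-in for "a large file of random bytes elsewhere on the web":

```python
#!/usr/bin/env python3
"""Toy server that 301-redirects Claude's crawler to a decoy URL."""
from http.server import BaseHTTPRequestHandler, HTTPServer

DECOY_URL = "https://example.com/noise.bin"  # hypothetical decoy target

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if "claudebot" in agent.lower():
            # Permanent redirect, as described in the thread.
            self.send_response(301)
            self.send_header("Location", DECOY_URL)
            self.end_headers()
        else:
            # Everyone else gets the normal page.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"hello, human\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Redirector).serve_forever()
```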
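The thread also links the community-maintained AI-crawler robots.txt in the ai-robots-txt/ai.robots.txt repository. A small sketch that refreshes a site's robots.txt from that list follows; the raw-file URL is inferred from the repository path shared above, and ROBOTS_PATH and SITE_RULES are placeholders, so verify all three before putting this in cron:

```python
#!/usr/bin/env python3
"""Refresh a site's robots.txt from the maintained AI-crawler list."""
import urllib.request

LIST_URL = ("https://raw.githubusercontent.com/"
            "ai-robots-txt/ai.robots.txt/main/robots.txt")  # inferred raw URL
ROBOTS_PATH = "/var/www/html/robots.txt"                    # assumption

# Your own site-wide rules, appended after the AI-crawler blocks.
# Note: Crawl-delay is advisory; as a reply above says, Google ignores it.
SITE_RULES = """\
User-agent: *
Crawl-delay: 10
"""

with urllib.request.urlopen(LIST_URL, timeout=30) as resp:
    ai_rules = resp.read().decode("utf-8")

with open(ROBOTS_PATH, "w", encoding="utf-8") as out:
    out.write(ai_rules.rstrip() + "\n\n" + SITE_RULES)

print(f"wrote {ROBOTS_PATH} ({len(ai_rules)} bytes of AI-crawler rules)")
```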
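Finally, the IP report above notes the crawler addresses are all AWS. AWS publishes its address ranges as a JSON feed, so a sighting such as the 3.84.110.120 from the thread can be checked against those ranges before deciding to block it; a minimal sketch:

```python
#!/usr/bin/env python3
"""Check whether an IP falls inside AWS's published address ranges."""
import ipaddress
import json
import sys
import urllib.request

# AWS's documented feed of its current IP ranges.
RANGES_URL = "https://ip-ranges.amazonaws.com/ip-ranges.json"

def aws_networks():
    with urllib.request.urlopen(RANGES_URL, timeout=30) as resp:
        data = json.load(resp)
    # IPv4 prefixes live under "prefixes" with an "ip_prefix" key each.
    return [ipaddress.ip_network(p["ip_prefix"]) for p in data["prefixes"]]

if __name__ == "__main__":
    ip = ipaddress.ip_address(sys.argv[1] if len(sys.argv) > 1 else "3.84.110.120")
    matches = [net for net in aws_networks() if ip in net]
    if matches:
        print(f"{ip} is in AWS range(s): {', '.join(map(str, matches))}")
    else:
        print(f"{ip} is not in AWS's published ranges")
```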