nixCraft 🐧
Bad and freeloader behaviour: AnthropicAI. https://www.reddit.com/r/linux/comments/1ceco4f/claude_ai_name_and_shame/ #linux #linuxmint #opensource Please boost to shame them.
Bo Stahlbrandt
@nixCraft we blocked them at #nikonians some time ago since they are scraping like mad.
Christian Quest 🌍
F4GRX Sébastien
Gonçalo Ribeiro
@f4grx Just checked the logs on my web server. I see they've used "ClaudeBot" and "claudebot" in my case.
katana crimson
@nixCraft Google did this to a forum I was a part of years ago, and their stubborn refusal to obey the robots.txt crawl-delay directive still angers me to this day. It's not just new crawlers that do this; even the big old search engines have done it for ages.
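For reference, a minimal robots.txt showing the crawl-delay directive the comment above refers to. It is a non-standard extension: Google explicitly ignores it, while some other crawlers honor it. The 10-second value here is just an illustration:

```text
# Ask compliant crawlers to wait 10 seconds between requests.
# Crawl-delay is non-standard; Google ignores it, some bots honor it.
User-agent: *
Crawl-delay: 10
```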
Luigi :donor:
@nixCraft I wrote my master's thesis on data poisoning LLMs, especially as a self-defense mechanism against bot scraping. The poisoning tool is pretty basic and shitty for now; I'm aiming to release a better version in the summer.
Luigi :donor:
@nixCraft The University of Chicago has released an amazing data poisoning tool for images called Nightshade. Have a look, it's fantastic.
F4GRX Sébastien
@luigirenna @nixCraft Yes but we need something for website protection. Following you to get future info about your tool!
Jamie Knight
@nixCraft if these parasites had any shame they wouldn't be stealing others' work for their AI grift in the first place.
Fubaroque
@nixCraft Interesting. I just decided that enough was enough, and since yesterday I've been redirecting (301 permanent) ClaudeBot to large files filled with random bytes elsewhere on the web. 🤣 Didn't know it was scraping for AI. But I'm sure the "info" they get out of that will be useful to them. 🤭
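A sketch of how such a redirect could be set up in nginx, assuming ClaudeBot is identified by its user-agent string; the target URL is a placeholder, not the poster's actual setup:

```text
# Sketch only: match the ClaudeBot user agent (case-insensitive)
# and send it a 301 elsewhere. Adapt server/location to your config.
map $http_user_agent $is_claudebot {
    default       0;
    ~*claudebot   1;
}

server {
    listen 80;
    server_name example.com;

    if ($is_claudebot) {
        return 301 https://example.org/large-random-file.bin;
    }

    # ...normal site configuration...
}
```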
Kierkegaanks, π/🦴
@nixCraft maybe we should start prefacing ai companies as parasitic when talking about them on reddit et al? Eg ‘parasitic ai company anthropic bla bla bla’
Tamas
@nixCraft Blocking AI Scraper Bots
Matt W
@iamdtms @nixCraft @chriscoyier @beep Here's a maintained and updated robots.txt from the author of that blog: https://github.com/ai-robots-txt/ai.robots.txt/blob/main/robots.txt
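An excerpt-style sketch of the pattern the linked list uses: name each AI crawler's user-agent token, then disallow everything. The tokens below are a few real, published crawler names; the repository above carries the full, maintained list:

```text
# Excerpt-style sketch: disallow some known AI crawlers site-wide.
# See the linked ai.robots.txt repository for the complete list.
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: GPTBot
User-agent: CCBot
Disallow: /
```

Note this only works against crawlers that actually read and honor robots.txt, which is exactly the caveat raised in the next comment.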
F4GRX Sébastien
@pixelriot @iamdtms @nixCraft @chriscoyier @beep If I were an AI company I would never use any user agent in this list.
Pedro Ortiz Suarez
@fay @iamdtms @nixCraft @chriscoyier @beep We crawl very slowly and very politely, always respecting robots.txt. We have been doing so for years, way before LLMs. Yes, some companies have used our crawls for AI training, but we're mainly a research crawl; our goal is to provide resources to researchers, archive, and actually increase the visibility of underrepresented parts of the web.
Pedro Ortiz Suarez
@fay @iamdtms @nixCraft @chriscoyier @beep There are also people who are starting to use our crawls in order to build indexes and alternative open web search engines, which I love, I don’t believe a handful of companies should be deciding the content that people consume on the web.
Gonçalo Ribeiro
@nixCraft For anyone's reference, I've checked my logs and see user agents "ClaudeBot" and "claudebot" from the IPs below, going back to December 2023 (block at your own risk; I don't know what else they may be used for). PS: they're all AWS IPs. 3.84.110.120
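If one did want to block at the IP level, an nginx sketch using the address from the log above (again, block at your own risk):

```text
# Sketch: deny requests from a logged address (http, server, or
# location context). Add further addresses from your own logs;
# AWS-hosted crawlers rotate IPs, so such a list goes stale quickly.
deny 3.84.110.120;
```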
YurkshireLad
@nixCraft it’s a shame you can’t return a random block of text instead of the page content.
Hunterrules
@nixCraft can't wait for there to be a new spinoff of Hoarders called "digital hoarders" lmao. this is the internet version of living off your mom at 30
Raptor :gamedev:
@nixCraft I'm not so sure I'd call it bad behavior. As far as I can tell their bot respects robots.txt, and the one for the LM forums doesn't set any crawl delay or any restrictions on indexing, so any new large crawler finding the site for the first time would likely have the same effect: anything indexing the site that hasn't before will just keep branching without any delay. Google will nuke you like this too if you put a big dataset up and they index it without a delay.
Michael Boelen
@nixCraft damn, that blows. I really like using their bot over ChatGPT too.