Leah (Cloudstylistin)

Reading all these articles about the high traffic from AI bots, here is my experience from a small German hoster: it's a huge problem, and it's getting worse very fast. There are bots like ClaudeBot from Anthropic which are so aggressive that they bring whole systems to a halt, outranking actual attacks, for example against WordPress. They change their AWS cloud IPs so often that blocking doesn't help for long, and they also ignore robots.txt. Our on-call is regularly alerted for such issues.

Leah (Cloudstylistin)

I would go so far as to say that for small sites, 30-50% of traffic is now artificial bot traffic, and 20-75% for bigger sites with a lot of content. It's a huge waste of resources, in terms of money and environmental damage, that they outsource to all of us.
And this is part of the much bigger footprint all this AI shit has, besides the irresponsibly large amounts of resources they already waste officially.

morihofi

@leah For me, an individual who runs a small private website, this is true. 50% - 60% of the traffic on my website comes from bots in an average month.

tsia

@morihofi @leah I would like to compare my request logs. How are you measuring this? Some sort of user agent list?

Darrin West

@morihofi @leah @tsia_ We might crowdsource detection by sharing (trimmed) logs somewhere. Seeing IPs and IDs repeated across zillions of websites is a pretty good fingerprint of a bad actor. If all that were automated, the results could be auto-added to block lists.

Leah (Cloudstylistin)

@obviousdwest @morihofi @tsia_ there are a lot of providers offering such services, and I would prefer to fix the source of the problem, not the symptoms.

morihofi

@leah @obviousdwest @tsia_ I just discovered in my Matomo analytics that some traffic came from clients with a referer of xtraffic.plus. Some sort of traffic generator for SEO optimisation (how is that even supposed to work?)

tsia

@morihofi @leah @obviousdwest (knowing some of those SEO people I wouldn’t be surprised if it didn’t work at all)

Werawelt

@morihofi @leah

Maybe you could limit the number of hits per minute and block the IP for a certain period of time? In WP, for example, you can do this with Wordfence.

Leah (Cloudstylistin)

@werawelt @morihofi that's only fixing the symptoms, not the problem.

maybit

@leah @werawelt @morihofi
agreed, but fixing the problem, while something I'd like to see happen or even contribute to, is not actionable in a way that would relieve the anxiety* I feel reading about this

* light anxiety level, no need for a CW, I'm just expressing my feelings

mirabilos

@werawelt @morihofi @leah nah, typical visitors do bursts (page, CSS, fonts, images, js), so you’d hurt them too much

.oO(maybe one could just block all “cloud” services, since they don’t list their customers’ actual ranges in WHOIS and change them too often anyway… if enough people did this, maybe we could get good IP subnet to customer mapping in WHOIS…)

Earth Notes

@morihofi @leah >50% of my traffic has been non-human for a long time, though that has included search engine spiders etc. I have a target of keeping them to < 50%. See top line in table.

earth.org.uk/note-on-site-tech

The AI bots are a huge slice added to that.

Leah (Cloudstylistin)

In my opinion we should strongly regulate the AI stuff and just forbid most of it, just as it's forbidden to run a very dirty coal plant. For everything else we'd do it a little like in the pharmaceutical sector: you have to prove that your shiny new product is better, a _real_ benefit, compared to the old one. I would bet that the result would be only very few, very specialized use cases. But this would require an evaluation that is not based on capitalist logic, and that just won't happen.

Nicole Parsons

@leah

Hybrid warfare & cyberwarfare has a new ally in the AI hype.

It's unsurprising that LLM & AI development is being funded by investors from hostile state actors like Saudi Arabia's Mohammed bin Salman, Iran, Russia, & China.

The scams, election interference, & climate denial from AI is huge.
reuters.com/technology/artific

It was never intended to be a legitimate business product, except for deluded CEOs who were promised mass layoffs & wage-suppression schemes.
futurism.com/investors-concern

Nicolai Hähnle

@leah It ought to be possible to sue people who disregard robots.txt under some kind of computer-related crime law.

Mathaetaes

@leah I wonder if there wouldn’t be a market for a DDoS protection-like service that detects heavy traffic from a single IP and starts returning garbage data.

Ignore robots.txt with your ML scraper, get poisoned data in your ML scraper.

Seems like a fair outcome to me.

T1gerlilly

@leah I work for a large software firm with huge amounts of online content - which the AI firms have publicly said they trained on. We've since indicated through our various mechanisms that they are not allowed to scrape content... but they very clearly are. We just moved to GA4 analytics and started removing bot traffic from our metrics. It was nearly 70% of our traffic, so your numbers are dead on.

T1gerlilly

@leah And here's the thing: they're scraping and surfacing content that's our IP. They're literally stealing from us and redirecting traffic from our business. This is definitely criminal conduct, even if we don't have laws that mark it as such.

bookandswordblog

@T1gerlilly @leah I had a similar experience with my WordPress site hosted in Canada and my static site hosted in Austria.

OddOpinions5

@leah
probably a stupid question from a non-programmer:
can't you crowdsource a reverse attack?

Rachel Rawlings

@leah At minimum, it should follow the European model that says do what thou wilt until there's significant evidence of harm. We have all that and more with AI (also fossil fuels).

panther

@leah agreed. It should be regulated, with rules and quality checks, as for fake news, images and stuff

Florian Lohoff

@leah I deployed config snippets, available for inclusion, that block OpenAI, Claude and some others by user agent, because they really misbehave: pasting random stuff into search boxes etc.

Just a single line in the vhost and they get the HTTP status code for "removed".
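
For illustration, one way such a block might look in an Apache vhost using mod_rewrite; this is a sketch, not the snippet mentioned above, and both the user-agent list and the 410 response are assumptions:

    # Sketch only: the bot list is illustrative, not the one referenced above.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot) [NC]
    # [G] answers with 410 Gone, i.e. "this resource has been removed".
    RewriteRule .* - [G]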

rappet

@leah what is traffic in this case? Does the bot load the whole page with pictures and everything?

Leah (Cloudstylistin)

@rappet that differs depending on the case and the bot. Think of traffic as the number of requests if it helps.

Cassie

@leah Is this why every site all of a sudden has an old-school captcha again?

Martinus Hoevenaar

@leah I have added several lines of code to my .htaccess file and also adjusted the firewall on the server where my website is hosted. That's it.
Robots.txt is not legally binding in the first place; it's an agreement with search-engine companies. So if they ignore it, they are within the boundaries of the law, even though they're assholes.
I think it is better to do as I did, and if you have the possibility to go even further, like blocking via the OS of the server, you're good for now.

Martinus Hoevenaar

@leah
I think web designers, coders and hosting companies should work together to do something about this pest. It completely ruins the internet, uses tonnes of resources and so damages the climate even more than we already are.
We should take back the internet, in the same fashion as the fediverse.
Most mainstream social media and AI scrapers, if not all, are a virus. Not just for the infrastructure and content of the internet, but also for society as a whole.

DJM (freelance for hire)

@martinus @leah Did the same:
- wall 0: ai.txt
- 1st wall: robots.txt
- 2nd wall: .htaccess
- 3rd wall: IP

All info in French here: didiermary.fr/bloquer-ai-bots-
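
For the robots.txt wall, a minimal sketch; the bot names are just examples, as each crawler publishes its own user-agent token:

    # Ask (well-behaved) AI crawlers to stay away entirely; names are illustrative.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /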

Earth Notes

@martinus @leah I disagree. If you tell a remote entity, e.g. by registered mail to the CEO and board or similar, that they are forbidden from accessing your server at all for any reason ("withdrawing implied rights of access"), then in the UK at least all continued access is unauthorised and illegal.

It is an approach that I have used a few times to keep out persistently badly behaving identifiable bots and spiders.

Martinus Hoevenaar

@EarthOrgUK @leah I do not approach any CEO, I just use the method described, and guess what? My web server statistics show me that scraping has, more or less, stopped. There is still some scraping, but those are either bots from individuals that are not mentioned in the .htaccess file or in the firewall, or brand-new bots that I wasn't aware of.
It's a continuous job, which, now that it works, is fun to do.

Zippy Wonderdust

@leah I see a future where many sites go password-protected with a simple signup process that includes agreeing that you are not a bot.

Sebastian Lauwers

@ZippyWonderdust @leah If they choose to ignore robots.txt, what makes you think they’ll care any more about a pinkie promise about not being a bot?

Zippy Wonderdust

@teotwaki @leah It’s not the pinkie promise; it’s the captcha before it.

Sebastian Lauwers

@ZippyWonderdust @leah Bots have a higher accuracy solving captchas than humans.

Captchas are effectively a great way to ensure the only people you’re locking out are _actual people_.

arxiv.org/pdf/2307.12108

Zippy Wonderdust

@leah @teotwaki Which, on second thought, leads me to revise my prediction to say that I see a dramatic increase in the deployment of non-signup captchas.

Tofu Golem

@leah
We are burning our children's future by pumping lots of CO₂ into the atmosphere just so that a small number of people can manipulate the attitudes of the masses.

The 21st century sucks.

Vanni Di Ponzano 🚴 ❤️ ☕ 🍎

@leah

#enshitification on the rise! it's an epidemic!

We Need to Call the Bot Busters

Chris Rosenau

@leah I’m gonna make all my websites blocked until the user actively clicks into the site

Jaime :verified: 🇪🇨

@leah yes, ByteDance (TikTok) is pounding really hard on my clients' sites. Right now I'm blocking by user agent on Cloudflare WAF.

Bastian Greshake Tzovaras

@leah yeah, for the small projects I host and do ops for I see an increasing “ouroboros” effect: “AI” crawlers hammer us for crawling on the one hand, and on the other we get SEO spammers posting “AI”-generated spam at the same time. 😞

SaThaRiel :linux: :popos:

@leah I know - as a hoster, it doesn't help you much, but maybe you can recommend perishablepress.com/8g-firewal to your customers. It works great for me :)

Zonder Zon

@leah Would it be possible to deliver very bad content (that still makes sense), instead of the actual content of your website, when the IP belongs to a bot? The training set will be dirty, the resulting model will be catastrophic, and you might be blacklisted by AI-tech companies.

(it's not a true solution – but long term, actually it is. The training set is only as good as the trust you can put in it, and if they can't trust the content, what can they do with it?)

Zippy Wonderdust

@leah @ohne_sonne I very much like the idea of somehow Rickrolling all of the AI crawlers!

EVHaste

@leah At least half of my traffic is from bots. I have to assume for scraping purposes. It’s frustrating that there isn’t anything I can really do about it.

Daniel Marks

@leah I think the only solution is poisoning the AIs' training data. You have to present decoy material to increase the risk that they will train their AI on bad information. It makes it much more difficult for them to maintain the integrity of their service if they have to be constantly retraining bad material out of their AIs.

𝑪𝒐𝒓𝒆𝒚 𝑺𝒏𝒊𝒑𝒆𝒔 🍂

@profdc9 @leah I've been thinking about ways to automate this with some simple server-side scripting and a wildcard domain (so it looks like a bunch of unique sites). Nothing concrete yet but it feels like a project that would be very satisfying, at least until it maxes out my allowed bandwidth.

Daniel Marks

@coreysnipes @leah A good way to generate fake data that is difficult to discern as fake is to use a Markov random word generator. It will produce sentences that seem real but are nonsense.

github.com/jsvine/markovify
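
A minimal sketch of that idea using the library linked above; the corpus path is hypothetical:

    import markovify

    # Build a word-level Markov model from any plain-text corpus (path is hypothetical).
    with open("corpus.txt", encoding="utf-8") as f:
        model = markovify.Text(f.read())

    # Emit plausible-looking but meaningless sentences to serve as decoy content.
    # make_sentence() can return None when it fails its overlap checks, so filter those out.
    decoys = [s for s in (model.make_sentence() for _ in range(10)) if s]
    print("\n".join(decoys))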

Daniel Marks

@coreysnipes @leah I think the beauty of using something like a Markov generator is that it will be very difficult for a DNN to generalize the Markov chain, thus taking capacity away from useful data when being incorporated into the DNN. It's like memorizing the phone book but not realizing it's just random numbers.

Alexander S. Kunz

@leah I’ve put my site behind a free Cloudflare account with their web application firewall, and began to “challenge” all traffic from major hosting companies (AWS, Hetzner, Google, Microsoft, and more over time as I kept monitoring my logs). Took some fine tuning to whitelist legit services of course. Bandwidth usage has dropped by two thirds now…

Crawlers such as “DataForSEOBot” or “FriendlyCrawler” and others fly pretty much under everyone’s radar still. It’s a huge mess.
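
For reference, the kind of Cloudflare custom rule described here might look roughly like the sketch below; the rule name is hypothetical, the ASN list (16509 Amazon, 24940 Hetzner, 15169 Google, 8075 Microsoft) is illustrative rather than the exact list used, and "not cf.client.bot" is one way to spare verified crawlers from the challenge:

    Rule (hypothetical): challenge-hosting-providers
    Action: Managed Challenge
    Expression: (ip.geoip.asnum in {16509 24940 15169 8075}) and not cf.client.bot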

mirabilos

@alexskunz @leah some ignore Crawl-Delay in robots.txt, too… grr… I 429 those with .htaccess.

Because, fuck Cloudflare.

Alexander S. Kunz

@mirabilos I know that CF has a mixed reputation. From a small website owner's perspective, it's a great help to curb this madness.

(there's a lot more going on than just ignoring crawl delays. 3rd parties that identify themselves as GoogleBot etc. for example are super easy to block with CF — hard to do that with .htaccess)

mirabilos

@alexskunz as a lynx user I have extra reasons to hate them, ofc

iolaire

@alexskunz @leah this sounds like good advice and also helps to explain why I see that challenge a lot more these days (Comcast and cell ips)

Alexander S. Kunz

@iolaire yes, it's a bit unfortunate. I try to not "challenge" dial-up/residential/cell but some combinations (outdated browser etc.) might trigger a challenge.

Cloudflare provides a "challenge solve rate" (CSR) and it's a little bit high today for me — 1.2% of presently 751 challenged requests were solved. Eight slightly inconvenienced visitors is better than getting scraped to death, with resource warnings from my hosting company, etc.

mirabilos

@leah yeah, Claudebot is the worst, wish they’d be sued into oblivion. The others are… manageable, mostly.

madopal

@leah I actually tracked down some of the Anthropic folks on LinkedIn to bitch them out. They hammered LTHForum here in Chicago, and of all the AI bots, they're the worst. They come from 1000 different IPs, and when you block them, they just keep hammering and trying.

DELETED

@leah I don’t know what any of that means, but it doesn’t sound good.

Joby (chaotic good)

@leah Yup. I run some small sites within a university and they're similarly a shockingly large problem. On one site I've resorted to blocking all of AWS Singapore, because for whatever inscrutable reason a ByteDance bot is hitting every single page every 2-5 minutes, 24/7, every request apparently in a full headless browser (so loading all media and such), and every request from a different IP. This is, mind you, a website that sees maybe a dozen total content updates per year -- in a busy year.

steve mookie kong

@leah

I don't bother with robots.txt nowadays. I run nginx and I just have fun with the bots:

  # Return 403 to known AI crawler user agents (goes inside the server block).
  if ($http_user_agent ~ (GPTBot|Google-Extended|CCBot|FacebookBot|cohere-ai|PerplexityBot|anthropic-ai|ClaudeBot)) {
      return 403 "Get off my lawn";
  }

MekkerMuis 👨‍🚀 ⛩️ 🏳️‍🌈 🪂

@leah if I knew how to do it, I'd reply to each ai-bot with a #Zipbomb :awesome:

Steffen Voß

@leah Do you know why they hit so often? Shouldn't they just scrape everything once?

Philip Gillißen

@kaffeeringe @leah my assumption: bugs and trying to get updated content, too.

Johannes Timaeus

@leah Thank you for pointing this out. To increase its relevance we would need some actual data to support your statements. So I guess that’s a case for investigative journalism and science. Anyone here who can do the job? 😎

Laura Orchid

@leah for me I often see IPs sending garbled data. Luckily fail2ban catches them :)

raboof

@leah ouch :(

The following Apache .htaccess fragment seems to have worked well for a phpBB forum I administer:

==
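# Apache 2.2-style access control; on Apache 2.4 these directives need mod_access_compat.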
BrowserMatchNoCase "claudebot" bad_bot
Order Deny,Allow
Deny from env=bad_bot
==

(of course it is even better when they don't make it to Apache at all, but I got the impression that they detect they're not getting useful responses anymore and then stop trying)

kevin ✨

@leah it's a real pain; if they'd at least use consistent user agents (or user agents at all), they could be blocked at the web server level. :(

I hope legislators will do something about it.

Attie Grande

@leah I've been supporting a small company with their website issues - their dev has been complaining about the database server for a long time... turns out ~80% of their traffic was Claude, and the remaining ~20% was Facebook's bot. After blocking these two based on their user agent, the CPU usage on the system has gone from "pinned at 100%" to basically idle, and response times have fallen back into the acceptable range (previously there were regular spikes to many minutes)

darix

@leah Not to mention the attitude of "we ignore robots.txt, sue us if you want" and "without using copyrighted content for free our business model doesn't work". Parasites at their finest.
