Leah (Cloudstylistin)

Reading all these articles about the high traffic from AI bots, here is my experience from a small German hoster: it's a huge problem, and it's getting worse very fast. There are bots like ClaudeBot from Anthropic which are so aggressive that they bring whole systems to a halt, outranking actual attacks, for example against WordPress. They change their AWS cloud IPs so often that blocking doesn't help for long, and they also ignore robots.txt. Our on-call is regularly alerted for such issues.

Leah (Cloudstylistin)

I would go so far as to say that for small sites, 30-50% of traffic is now artificial bot traffic, and 20-75% for bigger sites with a lot of content. It's a huge waste of resources, in terms of money and environmental damage, that they outsource to all of us.
And this is part of the much bigger footprint all this AI shit has, besides the irresponsibly large amounts of resources they already waste officially.

morihofi

@leah For me, an individual who runs a small private website, this is true. 50% - 60% of the traffic on my website comes from bots in an average month.

tsia

@morihofi @leah I would like to compare my request logs. How are you measuring this? Some sort of user agent list?

Darrin West

@morihofi @leah @tsia_ We might crowdsource detection by sharing (trimmed) logs somewhere. Seeing IPs and IDs repeated across zillions of websites is a pretty good fingerprint of a bad actor. If all that were automated, the results could be auto-added to block lists.

Leah (Cloudstylistin)

@obviousdwest @morihofi @tsia_ there are a lot of providers offering such services, and I would prefer to fix the source of the problem, not the symptoms.

morihofi

@leah @obviousdwest @tsia_ I just discovered in my Matomo analytics that some traffic came from clients with a referer of xtraffic.plus. Some sort of traffic generator for SEO optimisation (how is that even supposed to work?)

tsia

@morihofi @leah @obviousdwest (knowing some of those SEO people I wouldn’t be surprised if it didn’t work at all)

Werawelt

@morihofi @leah

Maybe you could limit the number of hits per minute and block the IP for a certain period of time? In WP, for example, you can do this with Wordfence.

Leah (Cloudstylistin)

@werawelt @morihofi that's only fixing the symptoms, not the problem.

maybit

@leah @werawelt @morihofi
agreed, but fixing the problem, while something I'd like to see happen or even contribute to, is not actionable in a way that would relieve the anxiety* I feel reading about this

* light anxiety level, no need for a CW, I'm just expressing my feelings

mirabilos

@werawelt @morihofi @leah nah, typical visitors do bursts (page, CSS, fonts, images, js), so you’d hurt them too much

.oO(maybe one could just block all “cloud” services, since they don’t list their customers’ actual ranges in WHOIS and change them too often anyway… if enough people did this, maybe we could get good IP subnet to customer mapping in WHOIS…)

Earth Notes

@morihofi @leah >50% of my traffic has been non-human for a long time, though that has included search engine spiders etc. I have a target of keeping them to < 50%. See top line in table.

earth.org.uk/note-on-site-tech

The AI bots are a huge slice added to that.

Leah (Cloudstylistin)

In my opinion we should strongly regulate the AI stuff and just forbid most of it, just as it's forbidden to run a very dirty coal plant. For everything else we'd do it a little like in the pharmaceutical sector: you have to prove that your shiny new product is better, a _real_ benefit, compared to the old one. I would bet that the result would be only very few, very specialized use cases. But this would require an evaluation that is not based on capitalist logic, and that just won't happen.

Nicole Parsons

@leah

Hybrid warfare & cyberwarfare has a new ally in the AI hype.

It's unsurprising that LLM & AI development is being funded by investors from hostile state actors like Saudi Arabia's Mohammed bin Salman, Iran, Russia, & China.

The scams, election interference, & climate denial from AI is huge.
reuters.com/technology/artific

It was never intended to be a legitimate business product, except for deluded CEOs who were promised mass layoffs & wage-suppression schemes.
futurism.com/investors-concern

Nicolai Hähnle

@leah It ought to be possible to sue people who disregard robots.txt under some kind of computer-related crime law.

Mathaetaes

@leah I wonder if there wouldn’t be a market for a DDoS protection-like service that detects heavy traffic from a single IP and starts returning garbage data.

Ignore robots.txt with your ML scraper, get poisoned data in your ML scraper.

Seems like a fair outcome to me.

T1gerlilly

@leah I work for a large software firm with huge amounts of online content - which the AI firms have publicly said they trained on. We've since indicated through our various mechanisms that they are not allowed to scrape content... but they very clearly are. We just moved to GA4 analytics and started removing bot traffic from our metrics. It was nearly 70% of our traffic, so your numbers are dead on.

T1gerlilly

@leah And here's the thing: they're scraping and surfacing content that's our IP. They're literally stealing from us and redirecting traffic from our business. This is definitely criminal conduct, even if we don't have laws that mark it as such.

bookandswordblog

@T1gerlilly @leah I had a similar experience with my WordPress site hosted in Canada and my static site hosted in Austria.

OddOpinions5

@leah
probably a stupid question from a non-programmer:
can't you crowdsource a reverse attack?

Rachel Rawlings

@leah At minimum, it should follow the European model that says do what thou wilt until there's significant evidence of harm. We have all that and more with AI (also fossil fuels).

panther

@leah agreed. It should be regulated, with rules and quality checks, as for fake news, images and stuff

Florian Lohoff

@leah I deployed config snippets, available for inclusion, that block OpenAI, Claude and some others by user agent, because they really misbehave: pasting random stuff into search boxes etc.

Just a single line in the vhost and they get the HTTP status code for "removed".
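
For illustration, one way such a block might look in an Apache vhost using mod_rewrite; this is a sketch, not the snippet mentioned above, and both the user-agent list and the 410 response are assumptions:

    # Sketch only: the bot list is illustrative, not the one referenced above.
    RewriteEngine On
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|PerplexityBot) [NC]
    # [G] answers with 410 Gone, i.e. "this resource has been removed".
    RewriteRule .* - [G]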

rappet

@leah what is traffic in this case? Does the bot load the whole page with pictures and everything?

Leah (Cloudstylistin)

@rappet that differs depending on the case and the bot. Think of traffic as the number of requests if it helps.

Cassie

@leah Is this why every site all of a sudden has an old-school captcha again?

Martinus Hoevenaar

@leah I have added several lines of code to my .htaccess file and also adjusted the firewall on the server where my website is hosted. That's it.
Robots.txt is not legally binding in the first place; it's an agreement with search-engine companies. So if they ignore it, they are within the boundaries of the law, even though they're assholes.
I think it is better to do as I did, and if you have the possibility to go even further, like blocking via the OS of the server, you're good for now.

Martinus Hoevenaar

@leah
I think web designers, coders and hosting companies should work together to do something about this pest. It completely ruins the internet, uses tonnes of resources and so damages the climate even more than we already are.
We should take back the internet, in the same fashion as the fediverse.
Most mainstream social media and AI scrapers, if not all, are a virus. Not just for the infrastructure and content of the internet, but also for society as a whole.

DJM (freelance for hire)

@martinus @leah Did the same:
- wall 0: ai.txt
- 1st wall: robots.txt
- 2nd wall: .htaccess
- 3rd wall: IP

All info in French here: didiermary.fr/bloquer-ai-bots-
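
For the robots.txt wall, a minimal sketch; the bot names are just examples, as each crawler publishes its own user-agent token:

    # Ask (well-behaved) AI crawlers to stay away entirely; names are illustrative.
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /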

Earth Notes

@martinus @leah I disagree. If you tell a remote entity, e.g. by registered mail to the CEO and board or similar, that they are forbidden from accessing your server at all for any reason ("withdrawing implied rights of access"), then in the UK at least all continued access is unauthorised and illegal.

It is an approach that I have used a few times to keep out persistently badly behaving identifiable bots and spiders.

Martinus Hoevenaar

@EarthOrgUK @leah I do not approach any CEO, I just use the method described, and guess what? My web server statistics show me that scraping has, more or less, stopped. There is still some scraping, but those are either bots from individuals that are not mentioned in the .htaccess file or in the firewall, or brand-new bots that I wasn't aware of.
It's a continuous job, which, now that it works, is fun to do.

Zippy Wonderdust

@leah I see a future where many sites go password-protected with a simple signup process that includes agreeing that you are not a bot.

Sebastian Lauwers

@ZippyWonderdust @leah If they choose to ignore robots.txt, what makes you think they’ll care any more about a pinkie promise about not being a bot?

Zippy Wonderdust

@teotwaki @leah It’s not the pinkie promise; it’s the captcha before it.

Sebastian Lauwers

@ZippyWonderdust @leah Bots have a higher accuracy solving captchas than humans.

Captchas are effectively a great way to ensure the only people you’re locking out are _actual people_.

arxiv.org/pdf/2307.12108

Zippy Wonderdust

@leah @teotwaki Which, on second thought, leads me to revise my prediction to say that I see a dramatic increase in the deployment of non-signup captchas.

Tofu Golem

@leah
We are burning our children's future by pumping lots of CO₂ into the atmosphere just so that a small number of people can manipulate the attitudes of the masses.

The 21st century sucks.

Vanni Di Ponzano 🚴 ❤️ ☕ 🍎

@leah

#enshitification on the rise! it's an epidemic!

We Need to Call the Bot Busters

Chris Rosenau

@leah I’m gonna make all my websites blocked until the user actively clicks into the site

Jaime :verified: 🇪🇨

@leah yes, ByteDance (TikTok) is pounding really hard on my clients' sites. Right now I'm blocking by user agent on Cloudflare WAF.

Bastian Greshake Tzovaras

@leah yeah, for the small projects I host and do ops for I see an increasing “ouroboros” effect: “AI” crawlers hammer us for crawling on the one hand, and on the other we get SEO spammers posting “AI”-generated spam at the same time. 😞

SaThaRiel :linux: :popos:

@leah I know - as a hoster, it doesn't help you much, but maybe you can recommend perishablepress.com/8g-firewal to your customers. It works great for me :)

Zonder Zon

@leah Would it be possible to deliver very bad content (that still makes sense), instead of the actual content of your website, when the IP belongs to a bot? The training set will be dirty, the resulting model will be catastrophic, and you might be blacklisted by AI-tech companies.

(it's not a true solution – but long term, actually it is. The training set is only as good as the trust you can put in it, and if they can't trust the content, what can they do with it?)

Zippy Wonderdust

@leah @ohne_sonne I very much like the idea of somehow Rickrolling all of the AI crawlers!

EVHaste

@leah At least half of my traffic is from bots. I have to assume for scraping purposes. It’s frustrating that there isn’t anything I can really do about it.

Daniel Marks

@leah I think the only solution is poisoning the AIs' training data. You have to present decoy material to increase the risk that they will train their AI on bad information. It makes it much more difficult for them to maintain the integrity of their service if they have to be constantly retraining bad material out of their AIs.

𝑪𝒐𝒓𝒆𝒚 𝑺𝒏𝒊𝒑𝒆𝒔 🍂

@profdc9 @leah I've been thinking about ways to automate this with some simple server-side scripting and a wildcard domain (so it looks like a bunch of unique sites). Nothing concrete yet but it feels like a project that would be very satisfying, at least until it maxes out my allowed bandwidth.

Daniel Marks

@coreysnipes @leah A good way to generate fake data that is difficult to discern as fake is to use a Markov random word generator. It will produce sentences that seem real but are nonsense.

github.com/jsvine/markovify
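
A minimal sketch of that idea using the library linked above; the corpus path is hypothetical:

    import markovify

    # Build a word-level Markov model from any plain-text corpus (path is hypothetical).
    with open("corpus.txt", encoding="utf-8") as f:
        model = markovify.Text(f.read())

    # Emit plausible-looking but meaningless sentences to serve as decoy content.
    # make_sentence() can return None when it fails its overlap checks, so filter those out.
    decoys = [s for s in (model.make_sentence() for _ in range(10)) if s]
    print("\n".join(decoys))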

Daniel Marks

@coreysnipes @leah I think the beauty of using something like a Markov generator is that it will be very difficult for a DNN to generalize the Markov chain, thus taking capacity away from useful data when being incorporated into the DNN. It's like memorizing the phone book but not realizing it's just random numbers.

Alexander S. Kunz

@leah I’ve put my site behind a free Cloudflare account with their web application firewall, and began to “challenge” all traffic from major hosting companies (AWS, Hetzner, Google, Microsoft, and more over time as I kept monitoring my logs). Took some fine tuning to whitelist legit services of course. Bandwidth usage has dropped by two thirds now…

Crawlers such as “DataForSEOBot” or “FriendlyCrawler” and others fly pretty much under everyone’s radar still. It’s a huge mess.
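
For reference, the kind of Cloudflare custom rule described here might look roughly like the sketch below; the rule name is hypothetical, the ASN list (16509 Amazon, 24940 Hetzner, 15169 Google, 8075 Microsoft) is illustrative rather than the exact list used, and "not cf.client.bot" is one way to spare verified crawlers from the challenge:

    Rule (hypothetical): challenge-hosting-providers
    Action: Managed Challenge
    Expression: (ip.geoip.asnum in {16509 24940 15169 8075}) and not cf.client.bot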

mirabilos

@alexskunz @leah some ignore Crawl-Delay in robots.txt, too… grr… I 429 those with .htaccess.

Because, fuck Cloudflare.

Alexander S. Kunz

@mirabilos I know that CF has a mixed reputation. From a small website owner's perspective, it's a great help to curb this madness.

(there's a lot more going on than just ignoring crawl delays. 3rd parties that identify themselves as GoogleBot etc. for example are super easy to block with CF — hard to do that with .htaccess)

mirabilos

@alexskunz as a lynx user I have extra reasons to hate them, ofc

iolaire

@alexskunz @leah this sounds like good advice and also helps to explain why I see that challenge a lot more these days (Comcast and cell ips)

Alexander S. Kunz

@iolaire yes, it's a bit unfortunate. I try to not "challenge" dial-up/residential/cell but some combinations (outdated browser etc.) might trigger a challenge.

Cloudflare provides a "challenge solve rate" (CSR) and it's a little bit high today for me — 1.2% of presently 751 challenged requests were solved. Eight slightly inconvenienced visitors is better than getting scraped to death, with resource warnings from my hosting company, etc.

mirabilos

@leah yeah, Claudebot is the worst, wish they’d be sued into oblivion. The others are… manageable, mostly.

madopal

@leah I actually tracked down some of the Anthropic folks on LinkedIn to bitch them out. They hammered LTHForum here in Chicago, and of all the AI bots, they're the worst. They come from 1000 different IPs, and when you block them, they just keep hammering and trying.

DELETED

@leah I don’t know what any of that means, but it doesn’t sound good.

Joby (chaotic good)

@leah Yup. I run some small sites within a university and they're similarly a shockingly large problem. On one site I've resorted to blocking all of AWS Singapore, because for whatever inscrutable reason a ByteDance bot is hitting every single page every 2-5 minutes, 24/7, every request apparently in a full headless browser (so loading all media and such), and every request from a different IP. This is, mind you, a website that sees maybe a dozen total content updates per year -- in a busy year.

steve mookie kong

@leah

I don't bother with robots.txt nowadays. I run nginx and I just have fun with the bots:

  # Return 403 to known AI crawler user agents (goes inside the server block).
  if ($http_user_agent ~ (GPTBot|Google-Extended|CCBot|FacebookBot|cohere-ai|PerplexityBot|anthropic-ai|ClaudeBot)) {
      return 403 "Get off my lawn";
  }

MekkerMuis 👨‍🚀 ⛩️ 🏳️‍🌈 🪂

@leah if I knew how to do it, I'd reply to each ai-bot with a #Zipbomb :awesome:

Steffen Voß

@leah Do you know why they hit so often? Shouldn't they just scrape everything once?

Philip Gillißen

@kaffeeringe @leah my assumption: bugs and trying to get updated content, too.

Johannes Timaeus

@leah Thank you for pointing this out. To increase its relevance we would need some actual data to support your statements. So I guess that’s a case for investigative journalism and science. Anyone here who can do the job? 😎

Laura Orchid

@leah for me I often see IPs sending garbled data. Luckily fail2ban catches them :)

raboof

@leah ouch :(

The following Apache .htaccess fragment seems to have worked well for a phpBB forum I administer:

==
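# Apache 2.2-style access control; on Apache 2.4 these directives need mod_access_compat.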
BrowserMatchNoCase "claudebot" bad_bot
Order Deny,Allow
Deny from env=bad_bot
==

(of course it is even better when they don't make it to Apache at all, but I got the impression that they detect they're not getting useful responses anymore and then stop trying)

kevin ✨

@leah it's a real pain; if they'd at least use consistent user agents (or user agents at all), they could be blocked at the web server level. :(

I hope legislators will do something about it.

Attie Grande

@leah I've been supporting a small company with their website issues - their dev has been complaining about the database server for a long time... turns out ~80% of their traffic was Claude, and the remaining ~20% was Facebook's bot. After blocking these two based on their user agent, the CPU usage on the system has gone from "pinned at 100%" to basically idle, and response times have fallen back into the acceptable range (previously there were regular spikes to many minutes)

darix

@leah Not to mention the attitude of "we ignore robots.txt, sue us if you want" and "without using copyrighted content for free our business model doesn't work". Parasites at their finest.
