Normal :jo_2: :v_enby:

Okay, just to be clear;

Tencent/Alibaba/the CCP/whatever has *specific* software targeted at scraping mastodon servers, by hitting the unauthenticated "list posts from user" API endpoint, with *specific* account IDs, that they must've determined to be interesting from some moment or another.

The only way to get those account IDs is to ask the server to resolve them; that's already a step further than a normal scraping bot would go.

What's more, the IPs and user agents (basically, text which tells what browser you have) all ONLY hit this specific endpoint, meaning this is a targeted, predetermined, scraping operation.

By looking up the account IDs from the last dozen or so attempts that i see here in my logs, almost all of them are some colorful flavour of trans, gay, or enby.

They are specifically targeting us queer folks with this scraping.

Normal :jo_2: :v_enby:

im summarising this after discovering this in my previous post, though there i more or less focus on the ways to stop it

(tech.lgbt/@ShadowJonathan/1118)

here i wanted to make clear what it *means*, what they are *doing*, and why this is obviously unsettling

Juliet Merida (she/they) 🏳️‍⚧️🎯

@ShadowJonathan@tech.lgbt If it was anyone else saying this I'd ask for receipts but I'm assuming you've seen the logs and have done some correlation yourself. Is that true?

[Edit: As soon as I posted this you continued the thread with a link to your discoveries. :) ]

violette :trans_cowboy:

@ShadowJonathan do you think they're only targeting queer people, or do we only see this because we have a lot more representation of queer people vs non queer people?

Normal :jo_2: :v_enby:

@violets ive gone through about 4 dozen IDs, and they're all visibly trans/gay/enby on their profile to a degree, while i know that some people on here do not have that

its possible that the probability is just that high, but even the idea alone is frightening imo

Normal :jo_2: :v_enby:

here's a sample of those logs, full account IDs redacted for privacy's sake

Evvy :neofox_floof:

@ShadowJonathan will you notify affected users or is there no point in doing so

Normal :jo_2: :v_enby:

@pierogiburo not really a point, this goes back further than i have server logs at the moment

Normal :jo_2: :v_enby:

@fay59 there's way too many for that to really mean something, the best i could do is maybe put up an announcement, and block them in the future (which is what im doing atm)

DELETED

@ShadowJonathan None of them use Wayland 😔

AxiomPraxis

@ShadowJonathan There's a really cool open source utility called "GoAccess" that you can run against those (nginx?) logs to generate very pretty reports for easier visual parsing.

The hardest part tends to be nailing down the specific log format syntax, if you try it out and need any help with that part let me know!

justin ★ MOVED TO @onemuri@wavebird.party

@ShadowJonathan >nt 6.0 and 6.1

oh my fucking god are they scraping it with potatoes

LED2000

@ShadowJonathan Windows NT 6.0 why they using Windows Vista?

David Croyle

@ShadowJonathan That's alarming!

sentinian

@ShadowJonathan@tech.lgbt they really out here keeping an eye out on the usual suspects

Natty :butterflyN:

@ShadowJonathan@tech.lgbt By occam's razor, is it a possibility someone linked this URL in an article somewhere and the UA is a search indexing crawler?

nepi

@ShadowJonathan This is obviously concerning but... isn't content scraping just a fact of life with public ActivityPub posts? It's kind of how the whole protocol works, right?

๐Ÿ™๐”ธโˆ‹ฮปC

@ShadowJonathan They have done it for twitter before. Some artists on pixiv also got invited to the local police office.

Cysio :verified_gay:

@ShadowJonathan nothing seems to be hitting my instance, seems like Cloudflare is keeping them at bay

Akka V. 👾🏳️‍⚧️

@ShadowJonathan this may be a dumb remark, but isn't everyone on your instance some flavor of trans, gay or enby ?

You'd need some data on what this bot has been scraping on instances that aren't entirely LGBT, or data on which instances it's been scraping, to know if it's targeting queer folks specifically.

Normal :jo_2: :v_enby:

@akkavodol i know, but i went through about 4 dozen IDs, all of which had some marker clearly visible on their account, which isn't everyone on this server

notably, there were no inactive users, only plenty active locals, which is already alarming

Akka V. 👾🏳️‍⚧️

@ShadowJonathan Based on what you said it does seem that it's targeting some users specifically. Probably users who did or said something that got them noticed by a previous scraping algorithm.

I just don't know if being LGBT was the thing that got them noticed. It could have been a political topic that they talked about, or an instance or post that they interacted with.

Normal :jo_2: :v_enby:

@akkavodol most likely

Xerz! :blobcathearttrans:

@ShadowJonathan Question: do you have any evidence about who is to blame for this other than the user agent? Because it's a good reminder that, yeah, this stuff is VERY MUCH public and exposed and it can be abused

but it's user agents for publicly available software anyone can download, and state agents can trivially mask user agent anyway

sheelps :blobcatneruneru:

@ShadowJonathan@tech.lgbt why are they scraping fedi through QQ ๐Ÿ˜ญ it's not even accessible without a chinese phone number, what could possibly be the reason

ikanreed

@ShadowJonathan can you tell me more about how you know who is doing the scraping? I'm curious how you detected that it was happening

Normal :jo_2: :v_enby:

@ikanreed i looked through my logs, saw a lot of "QQDownload", and "TencentTraveler", found it cute and interesting, but when i did a zgrep on my logs to see what kind of traffic patterns they had, i saw they hit *only* this API, which was *very* suspicious to me

when i started looking into the IPs involved, i saw they used other user agents as well, which speaks heavily of user agent spoofing, no normal browser, client, or scraper would have any of these kinds of traffic patterns
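A minimal sketch of that kind of check, not the actual commands used in this thread. It assumes nginx-style "combined" access log lines and assumes that the "list posts from user" endpoint is Mastodon's `GET /api/v1/accounts/:id/statuses`; the function names `summarize` and `suspicious_ips` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: nginx "combined" log format (ip, ident, user, [time], "request", status, bytes, "referer", "agent").
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumption: the endpoint being hit is /api/v1/accounts/<id>/statuses.
STATUSES = re.compile(r'^/api/v1/accounts/(?P<account_id>\d+)/statuses(?:\?|$)')


def summarize(lines):
    """Group requests by client IP: totals, user agents seen, account IDs requested."""
    per_ip = defaultdict(
        lambda: {"total": 0, "statuses": 0, "agents": set(), "accounts": set()}
    )
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue  # skip lines that do not look like combined-format entries
        entry = per_ip[m["ip"]]
        entry["total"] += 1
        entry["agents"].add(m["agent"])
        hit = STATUSES.match(m["path"])
        if hit:
            entry["statuses"] += 1
            entry["accounts"].add(hit["account_id"])
    return per_ip


def suspicious_ips(per_ip, min_agents=2):
    """IPs that requested *only* the statuses endpoint while using several user agents."""
    return sorted(
        ip
        for ip, e in per_ip.items()
        if e["total"] > 0
        and e["statuses"] == e["total"]
        and len(e["agents"]) >= min_agents
    )
```

For rotated, compressed logs, the lines could be read with `gzip.open(path, "rt")` and passed straight to `summarize`. The `accounts` set per IP is what you would then look up to see which profiles were requested.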

betty :palm: :squeeak:

@ShadowJonathan @ikanreed could you lend the list of the common useragents used? this is gonna be really useful for blocking purposes
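One way such a list could be pulled out of the logs, as a rough sketch only. It assumes nginx "combined" log lines and assumes the endpoint is `/api/v1/accounts/<id>/statuses`; `agents_hitting_endpoint` is a hypothetical helper name. Since user agents can be spoofed and ordinary clients also call this endpoint, the tally is a starting point for review, not a ready-made blocklist.

```python
import re
from collections import Counter

# Assumption: combined-format request, status, bytes, referer, and user agent fields.
REQUEST = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+)[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)
# Assumption: the scraped endpoint is /api/v1/accounts/<id>/statuses.
ENDPOINT = re.compile(r'^/api/v1/accounts/\d+/statuses(?:\?|$)')


def agents_hitting_endpoint(lines):
    """Count how often each user agent string requested the statuses endpoint."""
    counts = Counter()
    for line in lines:
        m = REQUEST.search(line)
        if m and ENDPOINT.match(m["path"]):
            counts[m["agent"]] += 1
    return counts
```

Calling `agents_hitting_endpoint(lines).most_common(20)` would give the most frequent user agent strings to inspect by hand.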

ikanreed

@ShadowJonathan that might have another explanation on the ips. My understanding of the great firewall is that it uses proxies for all requests to servers outside of China(for censorship purposes), and there's probably a finite number of ip addresses associated with those proxies.

Which might generate that same pattern of user agents being varied from one ip.

Still doesn't explain the weird API requests part.

augmented jungle

@ShadowJonathan@tech.lgbt it's probably specifically targeting mastodon instances? going through my logs and I don't get any GET /api/* calls, maybe because i'm on misskey and the structure is different

but then like this is a question for a bigger misskey instance
:mikushrug:

/usr/Bit ░ Nova :

@ShadowJonathan may I see the previous post where it's discussed how to stop this from happening?

Normal :jo_2: :v_enby:

@cmdr_nova link is on this reply; tech.lgbt/@ShadowJonathan/1118

Leah96xxx :over18: :verifiedtransfem: :verifieddemigirl:

@ShadowJonathan Well if they look at my profile, they're gonna get an eyeful lol. Same if they scrape some of the other gorgeous people on here

jeff

@ShadowJonathan@tech.lgbt it's not the CCP, it's some military contractor in austin texas who wants to manufacture more consent for war with iran so they need some training data for their bots. fedi doesn't block anything because it can't. this is a feature: we are yelling these posts to all the internet.
please understand fedi is where all the gay is so when the war machine needs to make more fear mongering they use fedi as a firehose for the paranoia and hatred of all the kinds.

solution: put on the kitty cat ears and install misskey nya

chris
@ShadowJonathan my personal server is immune because unauthenticated users can't do anything with the API thanks pleroma and lain