Hi Fediverse denizens. I've been working on a project I hope will help Fediverse devs make software that federates across ALL services, not just Mastodon-plus-a-few-others.

It's called the Fediverse Schema Observatory and before launching it, I am opening up a public comment period re: safety and privacy.

This post lays out what the software does, what data it records, and why I think it's safe to deploy:

https://asml.cyber.harvard.edu/fediverseobservatory/

Please give feedback here or at the email in the post!

28 October at 17:22 | Open on friend.camp

63 comments

DJ Sundog - from the toot-lab

@darius hey, this looks solid to me! what's the opt-in process look like?

28 October at 17:32 | Open on toot-lab.reclaim.technology

Darius Kazemi

@djsundog Asking a very specific version of this question because I want to make sure what I wrote came across:

can you describe to me what you think opting in will do? (I ask it this way because I wanna QA my blog post haha)

28 October at 17:34 | Open on friend.camp

DJ Sundog - from the toot-lab

@darius

haha, fair enough - based on my first (fast) read-through, I think opting in will send all public (or in your parlance, hyperpublic) posts to your server for aggregation into the dataset.

I'm guessing that the opt-in process will look a lot like subscribing to a relay that just happens to never send any traffic back to my server

28 October at 17:42 | Open on toot-lab.reclaim.technology

Darius Kazemi

@djsundog Precisely yes and yes! You'll just add it as a relay

28 October at 17:43 | Open on friend.camp

DJ Sundog - from the toot-lab

@darius excellent! thanks for verifying that (and for, y'know, actually building the project haha)

28 October at 17:47 | Open on toot-lab.reclaim.technology

Hugh

@darius @djsundog Is this how a new software would be added? e.g. if I wanted the Observatory to include BookWyrm, I'd convince a BookWyrm server admin to join the relay?

28 October at 20:04 | Open on ausglam.space

Irenes (many)

@darius interesting! thanks, we'll dig in when we can

28 October at 17:40 | Open on adhd.irenes.space

erin sparling

@darius I LOVE the idea of a schema observatory. I’ve done quite a few projects in the past that involve schema registries, and they can be so helpful to a community who cares. This idea of effectively reverse-engineering the current utility of supporting existing federated schemas is awesome.

28 October at 17:42 | Open on mastodon.social

erin sparling

@darius Your stance on semi-anonymized data, the idea of using relay traffic exclusively, the (forthcoming) web presence of being able to browse this information: wonderful.

28 October at 17:44 | Open on mastodon.social

erin sparling

@darius you start to get at version interop at the end of the video, and separately you mentioned in another comment asking “what would opt-in look like?” Any gluing of these two together, as in a registration of schema transformations from one to another? So far it seems like the project is saying “ok, now you as a fedidev can target the right schemas for your community that will deliver value,” but it leaves the outliers as just that.

28 October at 17:53 | Open on mastodon.social

erin sparling

@darius Maybe there are already projects that do this, and discovery is enough of a catalyst? Maybe everyone just falls back to a basic `note`? Forgive my ignorance, I’m basically just rehashing things I’ve had to work through previously.

28 October at 17:54 | Open on mastodon.social

Darius Kazemi

@everyplace I mean I am hoping that people can build stuff like that on top of this! I just was gobsmacked that the basic data wasn't even out there really.

Also the schema data has already proven useful in standards discussions

28 October at 17:58 | Open on friend.camp

erin sparling

@darius I can only imagine re: informing the standards.

Also, just had the worst idea. Of course the fediSchemaTransforms should be a federated post themselves…

28 October at 18:02 | Open on mastodon.social

Darius Kazemi

Here is a video demo of the tool if you just wanna see what it's about in 4 minutes.

28 October at 17:42 | Open on friend.camp

butterflyoffire ⏚ꝃ⌁⁂

@darius cc @syuilo check ↑

28 October at 20:02 | Open on mstdn.fr

Jan Lehnardt :couchdb:

@darius always a step ahead, congrats on the launch!

28 October at 17:55 | Open on narrativ.es

Pat Race 🍕

@darius Thanks for doing this work!

28 October at 19:05 | Open on alaskan.social

YakYak_OG

@darius what happens when something you can post here will get strikes on other platforms? Some are worse than others but I also think it's a great idea if you can get around actually free speech!

28 October at 19:08 | Open on mastodon.social

Jennifer Moore

@darius

Seems to me like a cool and potentially very useful thing!

Also, respect for the lovely job done here of explaining it up front before people have a chance to get worried - which "ought to be" common sense, but in practice seems to be remarkably uncommon sense :-)

28 October at 19:19 | Open on scicomm.xyz

DELETED

@darius Seems like you might be over-qualified. Hopefully it won't take long to land a new position that would align with your qualifications. Meanwhile, are you open to checking side gigs?

28 October at 19:24 | Open on social.vivaldi.net

Laurens Hof

@darius

a: this is a super valuable tool, very helpful, it's great you're building it!

b: I genuinely do not think the fediverse can exist if we're taking the idea seriously that surfacing activitypub data formats requires an opt-in consent. Taking that idea to its logical conclusion means that it's literally impossible to build a federating AP server.

28 October at 19:25 | Open on indieweb.social

Emelia 👸🏻

@darius I think this is going to be fundamental as a tool for compatibility between Fediverse software as the network grows. Understanding what is being sent, schema wise, is essential to being able to handle it.

I also wouldn't mind running this as a proxy in front of a server, where I can submit information about activities I receive, but anonymizing everything before forwarding to the collector.

28 October at 19:45 | Open on hachyderm.io

Scott Feeney

@darius please hand off Hometown to another maintainer! It's cool you're moving on to these new, bigger projects, but it's sadly ironic that those of us who liked your ideas about social media the most — and therefore started Hometown servers — are left stuck without updates or bug fixes.

28 October at 19:59 | Open on social.coop

Darius Kazemi

@graue yes, I want to do this and have plans for it, but I needed to get this out first

28 October at 20:14 | Open on friend.camp

DELETED

@darius great tool but the Mastodon developers do not want to comply with other rules for different federations and ActivityPub standards. Mastodon will only work in Mastodon. It’s the reason why you won’t see Misskey posts in Mastodon or Pixelfed photos, the Mastodon developers deliberately ignore everything outside the Mastodon community.

28 October at 20:02 | Open on mstdn.social

aliceif

@firecat@mstdn.social you can see misskey notes though, this post is sent from a :misskey2022: server
it just doesn't support additional features of misskey so text formatting or quoted posts might not make it into the mastodon GUI (but even that can be reimplemented by a frontend or app)

28 October at 23:19 | Open on mkultra.x27.one

Mitex Leo

@firecat I'm seeing posts from Misskey and Pixelfed without any issue. Could you please describe the issue you're referring to?

29 October at 0:20 | Open on mitexleo.one

DELETED

@ml there’s a fork of mastodon that explains mastodon issues https://github.com/hometown-fork/hometown/wiki

29 October at 2:28 | Open on mstdn.social

Hugh

@darius THIS IS AWESOME!

I'm wondering what happens (i.e. how would you display it) when a software sends an Activity type in two different ways depending on certain factors.

I can't remember the exact example, but I think it was GoToSocial at some point provided *either* an array of URIs or a single URI as a string depending on the number of recipients. That was a tricky one to troubleshoot.

28 October at 20:03 | Open on ausglam.space
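
The single-value-or-array behaviour described in the comment above can be illustrated with a minimal sketch. This is not code from the Observatory or from GoToSocial; the helper name `as_list` and the example URIs are made up for illustration. It shows one common way a receiver might cope with a property that arrives either as one string or as an array of strings.

```python
def as_list(value):
    """Return a property value as a list, whether it arrived as a
    single item, an array, or was missing entirely."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical activities: the same property, shaped two different ways.
single = {"type": "Create", "to": "https://example.com/users/alice"}
multi = {
    "type": "Create",
    "to": ["https://example.com/users/alice", "https://example.com/users/bob"],
}

for activity in (single, multi):
    print(len(as_list(activity.get("to"))), "recipient(s)")
```

A schema viewer could use the same idea in reverse: record both observed shapes ("string" and "array of strings") for the property rather than collapsing them.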

Deborah Pickett

@darius How do you counteract poisoning of the data by bad actors? Do you have a blocklist of peers or addresses to discard data from? Can you remove poisoned anonymized datapoints if they’re discovered long after collection?

Does the relay work with peers who have AUTHORIZED_FETCH turned on? Most relays don’t.

I presume you can’t collect data from Fediverse servers that don’t support relays. I know that Takahē doesn’t, pretty sure it’s still on GoToSocial’s roadmap.

28 October at 20:20 | Open on old.mermaid.town

The Nexus of Privacy

@darius thanks for taking the time to write it up so thoroughly and get feedback! It seems like a great tool -- I was impressed by the demo at FediForum -- and I really appreciate you thinking so deeply about the privacy aspects. Your approach very much aligns with the principles @rwg and @rra suggest in https://www.cell.com/patterns/fulltext/S2666-3899(23)00323-9, so I certainly hope that this sets the bar for future projects.

Opt-in at the server level makes a lot of sense to me, and I like the specific approach you described in your reply to @djsundog ... it's a mechanism server admins are already familiar with. The discussion of how you can't leverage existing opt-in/opt-out signals makes it clear that trying to do so would compromise user privacy (and also exposes a limitation of the current design -- not your issue but something I hope developers think about).

Scrubbing the data is a great example of data minimization, and the example makes it easy to understand. The exceptions you list all seem very sensible.

A question about the additional opt-out mechanism ... does this do anything more than the admin undoing the opt-in by unsubscribing? If not, then it might be overkill ... although certainly nothing the matter with having an email-based opt-out as well.

28 October at 20:23 | Open on infosec.exchange

Silmathoron ⁂

sounds good to me, thanks @darius

28 October at 20:29 | Open on floss.social

Mike, First of His Name

@darius I'd love to read the blog post but light text, black background is a massive accessibility problem for a lot of people. If you're able to incorporate some detection of user colour theme requirements into the styles for the project and its documentation that'd be extremely helpful. Thanks.

28 October at 20:44 | Open on social.chinwag.org

Jupiter Rowland

@Darius Kazemi This could get more interesting the farther something is from Mastodon.

Even if it can cope with Hubzilla which remains to be seen, I expect it to run into trouble when it has to deal with (streams). That's for two reasons:

One, (streams) is actually intentionally nameless and brandless and has next to no nodeinfo code, in case you need it. This also means that it doesn't have a unified instance identifier. Mastodon always identifies as "mastodon", as does Glitch. Pixelfed always identifies as "pixelfed". Lemmy always identifies as "lemmy". And so forth.

(streams) does not always identify as "streams" or "(streams)". (streams) instances may identify as whatever their admins want them to identify. The most important public instance identifies as "get ready to rumbly". Others have identified or still identify as "diversi spiritus" or "theshire" or "mordor" or "gondor" or "nomád". The creator and maintainer himself is on an instance branded "y" (because Y is not X). And so forth.

It's a free-text string that could be anything, including other Fediverse projects. In fact, this has actually happened in the past: (streams) instances can be upgraded from instances of Zap, the third Osada, Misty (a.k.a. Mistpark 2020), Redmatrix 2020 and Roadhouse (all five defunct as of New Year's Eve, 2022), and when this was done, the instances kept their old branding.

Good luck identifying (streams) instances and telling them from instances of everything else by the usual means.

Two, (streams) is one of the first two Fediverse server applications that have introduced decentralised IDs as per FEP-ef61. Addresses may have "/.well-known/apgateway/did:⁠key:(48 random letters and/or digits)/" in them. Not sure if that'll mess with the Observatory.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #Fediverse #Streams #(streams)

28 October at 21:05 | Open on hub.netzgemeinde.eu

Steffo

@darius opposite question!

will a relay explicitly for opting in be available?

some people may not want to join a public relay due to how heavy they are on server load, but may want to provide data to your project nonetheless…

28 October at 21:07 | Open on junimo.party

Amber

@darius@friend.camp Hello! I am glad I found this post again, it got lost in my feed. I emailed you a couple of questions. I see you already answered the authorized fetch one. I bring the following counter argument though:

I run a queer instance, and I feel pretty confident with what you’ve shown off so far when it comes to your data “anonymization” because you’re stripping everything and just changing it to a skeleton highlighting types and the various nested structures. I’m okay with this, but authorized fetch is a protection against instances that pose a threat to my community. I can’t exactly disable it (even though I am completely okay with the collection method you outline) just to opt in. Would you consider looking into that in the future? (I don’t believe you need an entire instance implementation, just something that can sign the requests so my instance is happy).

28 October at 21:09 | Open on transfem.social
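
As a rough illustration of the "skeleton" idea mentioned in the comment above, here is a minimal sketch that replaces every leaf value in a JSON-like object with the name of its type while keeping the nesting. It is an assumption about the general technique, not the Observatory's actual scrubbing code; the function name `skeleton` and the `keep_keys` parameter (which here preserves the value of `type`) are hypothetical.

```python
def skeleton(value, keep_keys=("type",)):
    """Replace leaf values with their JSON type names, keeping structure.

    keep_keys is a hypothetical allow-list of keys whose string values
    are kept as-is (assumed here to be just "type").
    """
    if isinstance(value, dict):
        return {
            key: item
            if key in keep_keys and isinstance(item, str)
            else skeleton(item, keep_keys)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [skeleton(item, keep_keys) for item in value]
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "boolean"
    if value is None:
        return "null"
    if isinstance(value, (int, float)):
        return "number"
    return "string"


# Made-up example post; none of this content survives the scrub.
post = {
    "type": "Note",
    "content": "hello world",
    "sensitive": False,
    "tag": [{"type": "Mention", "href": "https://example.com/users/alice"}],
}
print(skeleton(post))
```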

Jenny :bf_trans:

@darius Honestly, you could have just left this at "I'm not assuming that I am entitled to the content of every public post on the federated network" and I'd have already known you were more on the level than most scrapers and bridges. But you seem to have put a lot of effort into not only not hanging on to the content of every post on the network, but into being transparent about the how and why. I'd give you a thumbs-up react, but most Masto servers don't support it. (So obviously I agree with your stated goal as well. :neobot_giggle:)

One thing that occurred to me is when a fedi instance uses custom forks. For example, Anarres.family uses a customized fork of Glitch Social (itself a fork of Mastodon), and the version number reported on our web interface, and presumably what your software will see, is v4.4.0-alpha.1+glitch.anarres.family, which is a pretty common practice for these forks, though not all of them include the full URL of the instance like ours does. So with your example of "we saw n polls from [software] x.y.z," in our case it would not actually anonymize us because it would display "50 polls by Masto...anarres.family." Not terribly critical since you're planning on scrubbing post data, so there's not much that could be shared that we'd likely want anonymized.

You could get around that by truncating or obfuscating the version, and possibly rolling it in with the equivalent Masto or Glitch version, but that's relevant data; one of the things our fork adds is the emoji reaction code from Chuckya, so our instance is capable of sending and receiving AP messages that are not supported by other Glitch instances.

28 October at 21:15 | Open on anarres.family
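
The version-truncation trade-off raised in the comment above could look something like the following sketch. It assumes fork versions follow the `base+build.metadata` pattern quoted in the comment and keeps only the first label of the build metadata; the function name `coarsen_version` is hypothetical, and this is not how the Observatory is known to handle versions.

```python
def coarsen_version(version):
    """Split a reported version into (base, flavor).

    Only the first dot-separated label after "+" is kept, so an
    instance hostname embedded in the build metadata is dropped.
    """
    if version.startswith("v"):
        version = version[1:]
    base, _, build = version.partition("+")
    flavor = build.split(".")[0] if build else None
    return base, flavor


print(coarsen_version("v4.4.0-alpha.1+glitch.anarres.family"))
print(coarsen_version("4.3.1"))
```

As the comment points out, this hides the hostname but also hides the fact that a particular fork sends activities other Glitch instances do not, so the coarsened value loses information that may matter for interop.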

NeoDB Open Source Software

@darius nice idea. consider opt in with https://eggplant.place

meanwhile it would be nice to share your code url, host name, and nodeinfo once you are ready to open your service. much as you'd like to research others' schema, other admin/dev may want to know more about yours, by looking into nodeinfo and code.

28 October at 21:21 | Open on mastodon.online

Darius Kazemi

@neodb correct and I will do so (I already have nodeinfo working)

28 October at 21:51 | Open on friend.camp

bryan newbold

@darius cool! I didn't dig in too deeply, so maybe you cover this, but a related helpful feature/tool might be knowing which software (libraries, implementations) would be able to parse data/schemas. getting in to https://caniuse.com/ territory, or things like ACID.

also often helpful if folks (eg, devs) can paste in data (JSON) and see how it validates, using the exact code being used to "observe", without that being captured and included in database.

28 October at 21:23 | Open on social.coop

Darius Kazemi

@bnewbold yeah for sure, if I can create something like caniuse that would be huge! That's been a goal of mine the whole time really though I'm not there yet

28 October at 21:50 | Open on friend.camp

Tim Zee

@darius I always presume that anything I post here is public anyway. So in that regard I don't see an issue.

28 October at 21:30 | Open on mastodon.social

sbszine

@darius Thank you for asking before launching and for the detail and transparency. Sounds good to me.

28 October at 21:31 | Open on dice.camp

Veronika Cheplygina

@darius I think I support this but the scrolling thing on the side makes it difficult to read the thing

28 October at 21:55 | Open on dair-community.social

Darius Kazemi

@DrVeronikaCH I would like it to go away too. I agree with you

28 October at 21:57 | Open on friend.camp

Darius Kazemi

@DrVeronikaCH there is a pause button on the lower right of the page that might help?

28 October at 21:57 | Open on friend.camp

silverpill

@darius There seems to be an overlap between this and https://funfedi.dev/support_tables/

Perhaps you can contribute to that project? It seems to be privacy-respecting because samples are generated locally

28 October at 22:23 | Open on mitra.social

Darius Kazemi

@silverpill yes! It came onto my radar last week and I hope to contribute to their CC0 data as well

28 October at 22:27 | Open on friend.camp

[DATA EXPUNGED]

d(jack’o la)ngo 🎃

@darius I understand that the observatory is ingesting data via a relay. This more or less limits data to public activities, which is fine; however, I wonder if there would be an alternate method for submitting a schema? Whether by submitting a static file, or by some type of handshake…

28 October at 23:44 | Open on social.coop

d(jack’o la)ngo 🎃

@darius the thing I’m building experimentally defaults to private

28 October at 23:46 | Open on social.coop

Bob 🇺🇲♒🐧🪖

@darius

👍⭐🤩 Sounds great !!!

29 October at 0:17 | Open on beamship.mpaq.org

🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸

@darius is the idea to post to and read from live ActivityPub servers to analyze whether or not they adhere to specs?

29 October at 0:23 | Open on mastodon.social

Darius Kazemi

@schizanon no, it's just to see what kind of data formats live activitypub servers use. Definitely no posting to servers here

29 October at 0:28 | Open on friend.camp

🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸

@darius how can you know what they use without using them?

29 October at 0:37 | Open on mastodon.social

Darius Kazemi

@schizanon it's to identify problems as a first step prior to further investigation. But I need to know what software to investigate in the first place. I've already done a bunch of investigation myself that has led me to learning about lots of new to me projects and sleuthing in their federation source code

29 October at 2:10 | Open on friend.camp

🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸

@darius so you're actually reading the source code for every AP impl? How will you know if they change?

29 October at 2:54 | Open on mastodon.social

Darius Kazemi

@schizanon there's a whole community of ActivityPub researchers that keep track of source code on all sorts of projects. We work together to keep each other informed and this tool is a piece of that process

29 October at 3:00 | Open on friend.camp

John Harris

@darius yay! Good luck!!

29 October at 0:59 | Open on mefi.social

Bogomil Shopov - Бого

@darius look at the Nostr. They did it already.

29 October at 6:49 | Open on hapyyr.com

James M.

@darius excellent! This kind of data would be super helpful to developers. I thought about standardizing on schema.org ontologies, but measuring actual usage is much better. And kudos for seeking feedback before development.

Overall, I like your privacy priorities. If anyone doesn't trust you to scrub the post data, you could allow them to scrub it on their end, or with a proxy they trust in the middle.

I wonder how well someone could be fingerprinted from the data after it is scrubbed.

29 October at 7:03 | Open on sfba.social

jonny (good kind)

@darius
Super dope, love it

29 October at 16:19 | Open on neuromatch.social