Darius Kazemi

Hi Fediverse denizens. I've been working on a project I hope will help Fediverse devs make software that federates across ALL services, not just Mastodon-plus-a-few-others.

It's called the Fediverse Schema Observatory and before launching it, I am opening up a public comment period re: safety and privacy.

This post lays out what the software does, what data it records, and why I think it's safe to deploy:

asml.cyber.harvard.edu/fediver

Please give feedback here or at the email in the post!

63 comments
DJ Sundog - from the toot-lab

@darius hey, this looks solid to me! what's the opt-in process look like?

Darius Kazemi

@djsundog Asking a very specific version of this question because I want to make sure what I wrote came across:

can you describe to me what you think opting in will do? (I ask it this way because I wanna QA my blog post haha)

DJ Sundog - from the toot-lab

@darius

haha, fair enough - based on my first (fast) read-through, I think opting in will send all public (or in your parlance, hyperpublic) posts to your server for aggregation into the dataset.

I'm guessing that the opt-in process will look a lot like subscribing to a relay that just happens to never send any traffic back to my server

Darius Kazemi

@djsundog Precisely yes and yes! You'll just add it as a relay

DJ Sundog - from the toot-lab

@darius excellent! thanks for verifying that (and for, y'know, actually building the project haha)

Hugh

@darius @djsundog Is this how a new software would be added? e.g. if I wanted the Observatory to include BookWyrm, I'd convince a BookWyrm server admin to join the relay?

Irenes (many)

@darius interesting! thanks, we'll dig in when we can

erin sparling

@darius I LOVE the idea of a schema observatory. I’ve done quite a few projects in the past that involve schema registries, and they can be so helpful to a community who cares. This idea of effectively reverse-engineering the current utility of supporting existing federated schemas is awesome.

erin sparling

@darius Your stance on semi-anonymized data, the idea of using relay traffic exclusively, the (forthcoming) web presence of being able to browse this information: wonderful.

erin sparling

@darius you start to get at version interop at the end of the video, and separately you mentioned in another comment asking “what would opt-in look like?” Any gluing of these two together, as in a registration of schema transformations from one to another? So far it seems like the project is saying “ok, now you as a fedidev can target the right schemas for your community that will deliver value,” but it leaves the outliers as just that.

erin sparling

@darius Maybe there are already projects that do this, and discovery is enough of a catalyst? Maybe everyone just falls back to a basic `note`? Forgive my ignorance, I’m basically just rehashing things I’ve had to work through previously.

Darius Kazemi

@everyplace I mean I am hoping that people can build stuff like that on top of this! I just was gobsmacked that the basic data wasn't even out there really.

Also the schema data has already proven useful in standards discussions

erin sparling

@darius I can only imagine re: informing the standards.

Also, just had the worst idea. Of course the fediSchemaTransforms should be a federated post themselves…

Darius Kazemi

Here is a video demo of the tool if you just wanna see what it's about in 4 minutes.

Jan Lehnardt :couchdb:

@darius always a step ahead, congrats on the launch!

YakYak_OG

@darius what happens when something you can post here will get strikes on other platforms? Some are worse than others but I also think it's a great idea if you can get around actually free speech!

Jennifer Moore

@darius

Seems to me like a cool and potentially very useful thing!

Also, respect for the lovely job done here of explaining it up front before people have a chance to get worried - which "ought to be" common sense, but in practice seems to be remarkably uncommon sense :-)

DELETED

@darius Seems like you might be over-qualified. Hopefully it won't take long to land a new position that would align with your qualifications. Meanwhile, are you open to checking side gigs?

Laurens Hof

@darius

a: this is a super valuable tool, very helpful, it's great you're building it!

b: I genuinely do not think the fediverse can exist if we're taking the idea seriously that surfacing activitypub data formats requires an opt-in consent. Taking that idea to its logical conclusion means that it's literally impossible to build a federating AP server.

Emelia 👸🏻

@darius I think this is going to be fundamental as a tool for compatibility between Fediverse software as the network grows. Understanding what is being sent, schema wise, is essential to being able to handle it.

I also wouldn't mind running this as a proxy in front of a server, where I can submit information about activities I receive, but anonymizing everything before forwarding to the collector.

Scott Feeney

@darius please hand off Hometown to another maintainer! It's cool you're moving on to these new, bigger projects, but it's sadly ironic that those of us who liked your ideas about social media the most — and therefore started Hometown servers — are left stuck without updates or bug fixes.

Darius Kazemi

@graue yes, I want to do this and have plans for it, but I needed to get this out first

DELETED

@darius great tool but the Mastodon developers do not want to comply with other rules for different federations and activity hub standards. Mastodon will only work in Mastodon. It’s the reason why you won’t see misskey post in Mastodon or pixelfed photos, the Mastodon developers deliberately ignore everything outside the Mastodon community.

aliceif

@firecat@mstdn.social you can see misskey notes though, this post is sent from a ​:misskey2022:​ server
it just doesn't support additional features of misskey so text formatting or quoted posts might not make it into the mastodon GUI (but even that can be reimplemented by a frontend or app)

Mitex Leo

@firecat I'm seeing posts from Misskey and Pixelfed without any issue. Could you please describe the issue you're referring to?

Hugh

@darius THIS IS AWESOME!

I'm wondering what happens (i.e. how would you display it) when a piece of software sends an Activity type in two different ways depending on certain factors.

I can't remember the exact example, but I think it was GoToSocial at some point provided *either* an array of URIs or a single URI as a string depending on the number of recipients. That was a tricky one to troubleshoot.
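For readers who have not hit this before: a property arriving as either a single string or an array is something JSON-LD compaction permits, so receivers usually normalize before processing. Below is a minimal sketch of that normalization. The helper name `as_list` and the example activities are made up for illustration, and this does not reproduce the exact GoToSocial payload mentioned above.

```python
from typing import Any, List


def as_list(value: Any) -> List[Any]:
    """Normalize a property that may be missing, a single value, or an array."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


# Hypothetical activities: same property, two shapes.
single = {"type": "Create", "to": "https://example.com/users/alice"}
many = {
    "type": "Create",
    "to": ["https://example.com/users/alice", "https://example.com/users/bob"],
}

for activity in (single, many):
    recipients = as_list(activity.get("to"))
    print(len(recipients), recipients)
```

With a helper like this in place, downstream code can always iterate over recipients without checking which shape the sender chose.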

Deborah Pickett

@darius How do you counteract poisoning of the data by bad actors? Do you have a blocklist of peers or addresses to discard data from? Can you remove poisoned anonymized datapoints if they’re discovered long after collection?

Does the relay work with peers who have AUTHORIZED_FETCH turned on? Most relays don’t.

I presume you can’t collect data from Fediverse servers that don’t support relays. I know that Takahē doesn’t, pretty sure it’s still on GoToSocial’s roadmap.

The Nexus of Privacy

@darius thanks for taking the time to write it up so thoroughly and get feedback! It seems like a great tool -- I was impressed by the demo at FediForum -- and I really appreciate you thinking so deeply about the privacy aspects. Your approach very much aligns with the principles @rwg and @rra suggest in cell.com/patterns/fulltext/S26, so I certainly hope that this sets the bar for future projects.

Opt-in at the server level makes a lot of sense to me, and I like the specific approach you described in your reply to @djsundog ... it's a mechanism server admins are already familiar with. The discussion of how you can't leverage existing opt-in/opt-out signals makes it clear that trying to do so would compromise user privacy (and also exposes a limitation of the current design -- not your issue but something I hope developers think about).

Scrubbing the data is a great example of data minimization, and the example makes it easy to understand. The exceptions you list all seem very sensible.

A question about the additional opt-out mechanism ... does this do anything more than the admin undoing the opt-in by unsubscribing? If not, then it might be overkill ... although certainly nothing the matter with having an email-based opt-out as well.

Mike, First of His Name

@darius I'd love to read the blog post but light text, black background is a massive accessibility problem for a lot of people. If you're able to incorporate some detection of user colour theme requirements into the styles for the project and its documentation that'd be extremely helpful. Thanks.

Jupiter Rowland
@Darius Kazemi This could get more interesting the farther something is from Mastodon.

Even if it can cope with Hubzilla, which remains to be seen, I expect it to run into trouble when it has to deal with (streams). That's for two reasons:

One, (streams) is actually intentionally nameless and brandless and has next to no nodeinfo code, in case you need it. This also means that it doesn't have a unified instance identifier. Mastodon always identifies as "mastodon", as does Glitch. Pixelfed always identifies as "pixelfed". Lemmy always identifies as "lemmy". And so forth.

(streams) does not always identify as "streams" or "(streams)". (streams) instances may identify as whatever their admins want them to identify. The most important public instance identifies as "get ready to rumbly". Others have identified or still identify as "diversi spiritus" or "theshire" or "mordor" or "gondor" or "nomád". The creator and maintainer himself is on an instance branded "y" (because Y is not X). And so forth.

It's a free-text string that could be anything, including other Fediverse projects. In fact, this has actually happened in the past: (streams) instances can be upgraded from instances of Zap, the third Osada, Misty (a.k.a. Mistpark 2020), Redmatrix 2020 and Roadhouse (all five defunct as of New Year's Eve, 2022), and when this was done, the instances kept their old branding.

Good luck identifying (streams) instances and telling them from instances of everything else by the usual means.

Two, (streams) is one of the first two Fediverse server applications that have introduced decentralised IDs as per FEP-ef61. Addresses may have "/.well-known/apgateway/did:⁠key:(48 random letters and/or digits)/" in them. Not sure if that'll mess with the Observatory.

#Long #LongPost #CWLong #CWLongPost #FediMeta #FediverseMeta #CWFediMeta #CWFediverseMeta #Fediverse #Streams #(streams)
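To make the second point concrete, here is a rough sketch of how a collector might recognize the gateway-style addresses described above. It assumes only the path layout quoted in this comment ("/.well-known/apgateway/" followed by a DID), and the function name `split_gateway_url` and the example URLs are invented for illustration. It is not taken from the Observatory or from (streams).

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Path prefix as quoted in the comment above.
GATEWAY_PREFIX = "/.well-known/apgateway/"


def split_gateway_url(url: str) -> Optional[Tuple[str, str]]:
    """Return (did, remaining_path) for a gateway-style URL, else None."""
    path = urlparse(url).path
    if not path.startswith(GATEWAY_PREFIX):
        return None
    rest = path[len(GATEWAY_PREFIX):]
    did, _, tail = rest.partition("/")
    if not did.startswith("did:"):
        return None
    return did, "/" + tail


# Hypothetical addresses; the DID value is a placeholder, not a real key.
print(split_gateway_url("https://example.com/.well-known/apgateway/did:key:z6MkExample/actor"))
print(split_gateway_url("https://example.com/users/alice"))
```

A tool that keys its statistics on hostnames would at least need a check like this to notice that the same actor can appear behind more than one gateway host.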
Steffo

@darius opposite question!

will a relay explicitly for opting in be available?

some people may not want to join a public relay due to how heavy they are on server load, but may want to provide data to your project nonetheless…

Amber

@darius@friend.camp Hello! I am glad I found this post again, it got lost in my feed. I emailed you a couple of questions. I see you already answered the authorized fetch one. I bring the following counter argument though:

I run a queer instance, and I feel pretty confident with what you’ve shown off so far when it comes to your data “deanonymization” because you’re stripping everything and just changing it to a skeleton highlighting types and the various nested structures. I’m okay with this, but authorized fetch is a protection against instances that pose a threat to my community. I can’t exactly disable it (even though I am completely okay with the collection method you outline) just to opt in. Would you consider looking into that in the future? (I don’t believe you need an entire instance implementation, just something that can sign the requests so my instance is happy).
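As an illustration of the "skeleton" idea described here, the sketch below replaces every leaf value with the name of its data type while preserving the nesting. It is only a guess at the general technique, not the Observatory's actual scrubbing code. The function name `skeleton`, the decision to keep the values of `type` keys, and the sample activity are all assumptions made for the example.

```python
from typing import Any

# Assumption: only the values under these keys are kept verbatim.
KEEP_VALUES_FOR = {"type"}


def skeleton(value: Any, key: str = "") -> Any:
    """Replace leaf values with their type names, keeping structure intact."""
    if isinstance(value, dict):
        return {k: skeleton(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [skeleton(v, key) for v in value]
    if key in KEEP_VALUES_FOR and isinstance(value, str):
        return value
    return type(value).__name__


# Hypothetical activity, invented for this example.
activity = {
    "type": "Create",
    "actor": "https://example.com/users/alice",
    "object": {
        "type": "Note",
        "content": "hello",
        "sensitive": False,
        "tag": [{"type": "Mention", "href": "https://example.com/users/bob"}],
    },
}

print(skeleton(activity))
```

The output keeps "Create", "Note" and "Mention" but reduces the actor, content and mention target to the string "str", which is the sense in which nothing identifying survives.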

Jenny :bf_trans:

@darius Honestly, you could have just left this at "I'm not assuming that I am entitled to the content of every public post on the federated network" and I'd have already known you were more on the level than most scrapers and bridges. But you seem to have put a lot of effort into not only not hanging on to the content of every post on the network, but into being transparent about the how and why. I'd give you a thumbs-up react, but most Masto servers don't support it. (So obviously I agree with your stated goal as well. :neobot_giggle:)

One thing that occurred to me is when a fedi instance uses custom forks. For example, Anarres.family uses a customized fork of Glitch Social (itself a fork of Mastodon), and the version number reported on our web interface, and presumably what your software will see, is v4.4.0-alpha.1+glitch.anarres.family, which is a pretty common practice for these forks, though not all of them include the full URL of the instance like ours does. So with your example of "we saw n polls from [software] x.y.z," in our case it would not actually anonymize us because it would display "50 polls by Masto...anarres.family." Not terribly critical since you're planning on scrubbing post data, so there's not much that could be shared that we'd likely want anonymized.

You could get around that by truncating or obfuscating the version, and possibly rolling it in with the equivalent Masto or Glitch version, but that's relevant data; one of the things our fork adds is the emoji reaction code from Chuckya, so our instance is capable of sending and receiving AP messages that are not supported by other Glitch instances.
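A small sketch of the truncation option, for concreteness. It assumes fork version strings follow the pattern in the example above, where everything after "+" is build metadata and the first dot-separated token names the fork. The function name `coarsen_version` is hypothetical, and this is not how the Observatory is known to handle versions.

```python
def coarsen_version(version: str) -> str:
    """Keep the core version and fork name, drop the rest of the build metadata."""
    core, plus, build = version.partition("+")
    if not plus:
        # No build metadata, nothing to strip.
        return core
    fork = build.split(".", 1)[0]
    return f"{core}+{fork}"


print(coarsen_version("v4.4.0-alpha.1+glitch.anarres.family"))
print(coarsen_version("4.3.2"))
```

This hides the instance hostname, but as noted above it also merges this fork with every other Glitch instance, so the extra emoji reaction support would no longer be distinguishable in the data.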

NeoDB Open Source Software

@darius nice idea. consider opt in with eggplant.place

meanwhile it would be nice to share your code url, host name, and nodeinfo once you are ready to open your service. much as you'd like to research others' schema, other admin/dev may want to know more about yours, by looking into nodeinfo and code.

Darius Kazemi

@neodb correct and I will do so (I already have nodeinfo working)

bryan newbold

@darius cool! I didn't dig in too deeply, so maybe you cover this, but a related helpful feature/tool might be knowing which software (libraries, implementations) would be able to parse data/schemas. getting in to caniuse.com/ territory, or things like ACID.

also often helpful if folks (eg, devs) can paste in data (JSON) and see how it validates, using the exact code being used to "observe", without that being captured and included in database.

Darius Kazemi

@bnewbold yeah for sure, if I can create something like caniuse that would be huge! That's been a goal of mine the whole time really though I'm not there yet

Tim Zee

@darius I always presume that anything I post here is public anyway. So in that regard I don't see an issue.

sbszine

@darius Thank you for asking before launching and for the detail and transparency. Sounds good to me.

Veronika Cheplygina

@darius I think I support this but the scrolling thing on the side makes it difficult to read the thing

Darius Kazemi

@DrVeronikaCH I would like it to go away too. I agree with you

Darius Kazemi

@DrVeronikaCH there is a pause button on the lower right of the page that might help?

silverpill

@darius There seems to be an overlap between this and https://funfedi.dev/support_tables/

Perhaps you can contribute to that project? It seems to be privacy-respecting because samples are generated locally

Darius Kazemi

@silverpill yes! It came onto my radar last week and I hope to contribute to their cc0 data as well

d(jack’o la)ngo 🎃

@darius I understand that the observatory is ingesting data via a relay. This more or less limits data to public activities, which is fine. However, I wonder if there would be an alternate method for submitting a schema? Whether by submitting a static file, or by some type of handshake…

d(jack’o la)ngo 🎃

@darius the thing I’m building experimentally defaults to private

🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸

@darius is the idea to post to and read from live ActivityPub servers to analyze whether or not they adhere to specs?

Darius Kazemi

@schizanon no, it's just to see what kind of data formats live activitypub servers use. Definitely no posting to servers here

Darius Kazemi

@schizanon it's to identify problems as a first step prior to further investigation. But I need to know what software to investigate in the first place. I've already done a bunch of investigation myself that has led me to learning about lots of new to me projects and sleuthing in their federation source code

🍄🌈🎮💻🚲🥓🎃💀🏴🛻🇺🇸

@darius so you're actually reading the source code for every AP impl? How will you know if they change?

Darius Kazemi

@schizanon there's a whole community of ActivityPub researchers that keep track of source code on all sorts of projects. We work together to keep each other informed and this tool is a piece of that process

James M.

@darius excellent! This kind of data would be super helpful to developers. I thought about standardizing on schema.org ontologies, but measuring actual usage is much better. And kudos for seeking feedback before development.

Overall, I like your privacy priorities. If anyone doesn't trust you to scrub the post data, you could allow them to scrub it on their end, or with a proxy they trust in the middle.

I wonder how well someone could be fingerprinted from the data after it is scrubbed.
