Chris Trottier

People say, "The Internet is forever".

This is not true. So much on the Internet has disappeared. I would go further: more has disappeared from the Internet than currently exists on it.

This is one reason I wish that a P2P content delivery system was the default: because not only would it deliver information faster, information would depend less on WHERE it is and more on WHAT it is -- thereby creating an avenue for redundancies.
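
(A minimal sketch of what addressing by WHAT rather than WHERE could look like, independent of any particular protocol: the address is derived from a hash of the content itself, so any peer holding identical bytes can serve it.)

```python
import hashlib

def content_address(data: bytes) -> str:
    """Derive an address from the content itself (here: a SHA-256 hex digest)."""
    return hashlib.sha256(data).hexdigest()

# Two peers holding identical bytes compute the same address,
# so a request for that address can be answered by either of them.
page = b"<html>...archived page...</html>"
print(content_address(page))  # same digest no matter which host stores the bytes
```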

Chris Trottier

Imagine if YouTube was shut down.

It would be a literal tragedy for the human race. Billions of hours worth of creativity would disappear in an instant.

"That would never happen!" some believe.

But it already has.

Remember Google+? All of it's gone forever.

Or remember all that media stored on MySpace? It's vanished.

We must stop depending on Big Tech to archive our data. Their mandate is to profit off our data, not preserve it.

Chris Trottier

I hope PeerTube takes off for two reasons:

1. It's decentralized
2. It uses a P2P delivery system

That's one step in the right direction -- but I hope other social media networks follow their lead.

Chris Trottier

Actually, I kind of wish there was a universal P2P protocol that was a mixture of HTTP and BitTorrent.

That alone would fix so many problems with the Internet!

Chris Trottier

@anji I've used IPFS for three years, and I've yet to see broad adoption apart from crypto.

If something like archive.org were stored on IPFS, then that would be a game changer.

Polychrome :clockworkheart:

@atomicpoet @anji to use your example, neither PeerTube nor IPFS is a good archiving solution. Google can't be trusted long term, but IPFS forgets data as soon as no one pins it, and PeerTube instances tend to shut down within months. I've been trying to use both and it's been very unreliable.

Disclaimer - I'm extremely pro-decentralization and pester everyone I know about it but I am also something of an archivist and I can't ignore the reality.

In the end, long-term data storage has to be done on offline media - which is also becoming a problem as more people switch to flash-based storage, since if you leave it unpowered for a couple of years the data corrupts and eventually becomes useless, much faster than on classic magnetic media.

The net will always be ephemeral over the long term unless we work up something insane like Xanadu. So for now the most stable online archiving option is with people who care - e.g. archive.org.

Chris Trottier

@polychrome @anji Okay, but hear me out. What if archive.org was decentralized, and every library and university in the world ran an instance?

Konrad

@atomicpoet I'm sure hard disk companies would quite love that. @polychrome @anji

Doc Edward Morbius ⭕​

@atomicpoet Just to put some numbers on this...

The US has about 9,000 public libraries (administrative units) and another 3,000 or so academic libraries, for a total of 12,000 of both classes.

There is an estimated total of over 115,000 libraries in the country. (Many are public school libraries.)

web.archive.org/web/2018102617libguides.ala.org/numberoflibr

I'm going to assume that major-city public libraries and top academic libraries might be considered archival hubs. That's a hundred or so from each list, conservatively.

The US Library of Congress holds 40 million catalogued works (generally books), out of a total of ~130 million items of various descriptions.

At 5 MB/book, total disk storage would run about $3,700 ($18.3/TB) for spinning rust. Other offline/nearline storage might be cheaper. I'm going to estimate a complete disk storage system at roughly 4x this cost, or just under $15,000. (This is probably high; I'm being conservative.)

That is, for $15,000, any library in the world could hold the entire works of the world's largest library, the Library of Congress.

For comparison, the Internet Archive budgets $2/GB for data in perpetuity. That's $2 per 400 books or so.

Yes, "books" != "Internet data". But it's a comparison point.

@polychrome @anji
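
(Rerunning that back-of-envelope calculation with the same assumptions as the post above: 40 million works, 5 MB per book, ~$18.3/TB for bare drives, and a 4x multiplier for a complete storage system.)

```python
# Back-of-envelope: cost for one library to hold the Library of Congress's
# catalogued works, using the figures quoted in the post above.
works = 40_000_000          # catalogued works
mb_per_book = 5             # MB per book (assumption from the post)
usd_per_tb = 18.3           # spinning-rust price per TB (assumption from the post)
system_markup = 4           # rough multiplier for a complete disk system

total_tb = works * mb_per_book / 1_000_000      # ~200 TB
raw_disk_cost = total_tb * usd_per_tb           # ~$3,660
system_cost = raw_disk_cost * system_markup     # ~$14,640

print(f"{total_tb:.0f} TB, disks ~${raw_disk_cost:,.0f}, system ~${system_cost:,.0f}")
```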

naxxfish

@polychrome @atomicpoet @anji it's true - you cannot trust anyone whose motives are not preservation to preserve your media. Archiving has been getting increasingly difficult and expensive over the years as the volume and diversity of media goes up.

I'd go one step further and say optical media is the way to go - not necessarily CD/DVD, but media formed by irreversible chemical/mechanical processes. Tapes and disks are fine, but they're erasable and so less durable.

naxxfish

@polychrome @atomicpoet @anji also - tape heads have a finite lifetime (in hours read). Many kinds of tape machines (and thus heads) which were once common are no longer manufactured, so there is a finite supply of tape heads. There are archives in the world with more hours of media stored in them than there are tape head hours left in the world. So some of the archive is already lost - we just have to decide which bits we don't recover.

Rachael Ava 💁🏻‍♀️

@polychrome @atomicpoet @anji LTO Tape drives are quite popular for archival purposes, as they don't lose data easily when not in use for a long time.

Miłosz SP9UNB

@polychrome @atomicpoet @anji You can't expect a non-commercial instance to be reliable if you don't support it. Support an instance (Mastodon, PeerTube, etc.) or set up and share your own, and it will last forever.

Terry Hancock

@polychrome @atomicpoet @anji

After researching this problem for myself, I settled on two offline storage media:

1) I buy used 1-TB 3.5" hard drives and offline-storage cases.
The 1-TB size is a good match to my needs, cheap to buy (especially used, and used is fine -- for offline use, they'll get little wear).

2) M-Disc optical disks, DVD-M or M-Disc BDR, which are much more durable than dye-based media (will probably outlast the magnetic media of the hard disks).

Terry Hancock

@polychrome @atomicpoet @anji

I plan to use the 3.5" HDs to store expanded source trees, software/distro, and the complete EXR-stream renders of my output.

The EXRs are "intermediates". Regenerating them is expensive, but it is an automatic process, once the software is running.

I'll use the optical M-Disc media to store source files, PNG streams, video renders, and software archives.

As for the volatility of PeerTube, that's why I'm running my own instance, now.

Hopefully this works. 🤞

teledyn 𓂀

@polychrome @atomicpoet @anji the recent threat by #Google to oust #GSuiteLegacy users who wouldn't pay the ransom gave me a new perspective on net archives, faced as I was with somehow finding new homes for 16 years' worth of life-history data for 8 users.

I think we must accept that "Digital Archive" is a contradiction in terms, a transient transport from A to B. Digital 'artifacts' are a 'volatile' variable contained within a scope that will inevitably be garbage collected.

Ton Zijlstra

@polychrome @atomicpoet @anji and/or archive yourself what you need/want to.

Leonie :pb:​ :22breadinv: :vf:

@atomicpoet @anji It looks like archive.org is actually planning to use IPFS. Have to look up the source later, currently at work.

Leonie :pb:​ :22breadinv: :vf:

@atomicpoet @anji ok, here's a follow-up to this. Looks like they removed any evidence of that; the only thing I could find was a cut version of the interview on archive.org. The whole interview isn't available anymore.

Tavi

@anji @atomicpoet ipfs backs a lot of Library Genesis. It's working well enough for them. I think it's just a matter of time.

Doc Edward Morbius ⭕​

@atomicpoet IPFS is one of the distribution mechanisms used by Library Genesis.

@anji

hacknorris

@atomicpoet
Theoretically possible, but it's only a theory and a number...

Jens Finkhäuser

@atomicpoet Welcome to the #interpeer project, or at least where I hope it'll be in a few months.

Jens Finkhäuser

@atomicpoet For something like archiving, IPFS may actually be a good choice. Our goal is broader: we also want to support real-time scenarios such as live broad-/manycasting, as well as collaborative editing.

IPFS has a few features that mean it doesn't lend itself all that well to those scenarios where there are frequent updates to a resource.

hkc (Carbonated)

@atomicpoet imo that would be great not only from a data preservation standpoint, but also just because it's more convenient. I can't even count how many times I've had to waste time waiting for slow CDNs thousands of kilometers away to give me the data I want. Decentralization not only makes content more persistent, but also makes delivery much faster. But I guess at that point in time the average person won't really care about that, unfortunately.

Vftdan

@atomicpoet
There is #ZeroNet, but it has some problems & now it does not seem to be very popular

ansuz / ऐरन

@atomicpoet that's the kind of talk that floods your mentions with IPFS people

Matthijs De Smedt

@atomicpoet I also like the aims of PeerTube, but being based on ActivityPub it seems to have some of the same limitations as Mastodon. Lack of global search, discovery, trends, etc. It would be nice to be able to easily search for and discover all content everywhere from any server… That’s part of what makes YouTube (and Twitter) so good.

Morgan

@anji @atomicpoet I'm 99% sure it would be possible to make something like this with the existing videos on PeerTube based on ActivityPub. I'm 0% sure you can currently easily search for and discover all content currently on YouTube, let alone all the content that's been deleted because YouTube didn't like it.

Matthijs De Smedt

@raphaelmorgan @atomicpoet If someone on your local server doesn’t follow or boost a remote post/video, it never appears and cannot be found in search… That’s a problem, I think.

Not arguing that a centralized solution is better than a distributed one! But the ActivityPub follow/boost model, which doesn't proactively announce a server's content to the entire network, frustrates me sometimes.

Silmathoron ⁂

@anji
Do you know there is Sepia Search for global PeerTube searches?
sepiasearch.org/
@atomicpoet

🇪🇸🇺🇦 Ignacio 🇺🇦🇪🇸

@atomicpoet But here is the thing (note that I don't understand much about the fediverse yet). PeerTube, on its own, is decentralized and uses P2P, so this scenario is less likely to happen than with YouTube or other platforms.

But if a specific PeerTube instance goes down, what happens to the videos stored on that specific instance? Same question for Funkwhale, Pixelfed, and other fediverse instances.

Chris Trottier

@icg937 Actually, my Pixelfed instance has gone down multiple times. You could still view the content from other instances. That's the magic of ActivityPub.

mrpieceofwork

@atomicpoet FB deleted my notes that had info I wanted to hold onto... f-n TEXT... tiny KBs.... and no (real) warning. That pretty much did it for me. F FB

🌈☔🌦️🍄

@atomicpoet Stuff on P2P systems just disappears in different ways than stuff on corporate infrastructure. It does make making copies of data easier, but there isn't an inherent pattern preventing data loss. I've seen my fair share of stale torrents of unique/rare data :)

Chris Trottier

@wmd Sure, we've talked about this on the rest of the thread. The problem with P2P is persistence. However, that problem can be easily solved.

Nicole Göbel

@atomicpoet If you want information security in terms of availability, you'd better use your personal hardware and backup strategy.

Chris Trottier

@nicolegoebel I don't think most people can back up the entirety of YouTube.

Nicole Göbel

@atomicpoet No, but I back up the videos I have uploaded.

Chris Trottier

@nicolegoebel That's a good practice to have but it doesn't address the core problem I'm talking about.

Nicole Göbel

@atomicpoet tbh I don't think that the loss of such platforms is a real tragedy. If the data is worth something, it will pop up again elsewhere.

Nicole Göbel

@atomicpoet you may try the philosophy of Seneca sometime.

PJ Brunet

@atomicpoet

Well, if anyone wants to move their #videos to #WordPress, send me some details and I'll give you a quote. We might need to discuss transcoding for all browsers.

I mostly host #blogs with photos, but I'd like to have more video clients; that seems like the future.

If you have a lot of big videos, I'd probably offer two rates, one with daily offsite #backups and one without offsite backups. For offsite backups of videos, I'd probably need a new 'rsync' script.

Improving People NOT only Tech

@pj @atomicpoet
Maybe say more about what you're offering... for example, could one also move it to an ownCloud or Nextcloud or PeerTube thing, no?

Frost, Wolffucker 🐺:therian:

@atomicpoet On the other paw, that means you can't delete stuff. /ever./

And sometimes deleting stuff is important.

Jonathan Lamothe
@atomicpoet What I mean when I say that the internet is forever is that once something's been put on it, you can never be 100% sure it's been removed.
Hank G ☑️
@atomicpoet I'm in the process of converting my blog from Jekyll to Hugo. I'm making sure that the URL hierarchy stays the same. Going through some older posts, almost none of the links work any more. Are they up in Wayback? Maybe, but it's not like it's an official archive or anything. I've thought about writing something that does "sip to local" for shared posts and then, if the original link disappears, redirects to the local copy.
i'm a SELECT * (not a black star)
@atomicpoet @woozle the internet only remembers if it gives it joy (your embarrassments), not what should be archived (the driver you needed)

ofc p2p only works if someone recognized its value two decades ago…
Woozle Hypertwin

@atomicpoet This is an excellent point. Just as an example that has particularly affected my work, I have saved (on issuepedia.org mostly) hundreds or perhaps thousands of links to content on the late Google+ -- and very little of that content was saved on archive.org or anywhere else. I have Google Take-out of my own posts and comments, but not of anyone else's -- and that's not in an easily-accessible format, either.

If the act of having accessed that content on my browser meant that I retained my own copy of that content which I could then make available to others, the demise of Google+ would not have made that content effectively vanish.

It would be a lot of content, though, and I can see (solvable) difficulties around storing and organizing it.

cc:
@dredmorbius @eryn

Doc Edward Morbius ⭕​

@woozle Complex views on this.

A challenge of very-large-scale services is that when they die, much goes with them.

I was also on G+. I'd archived virtually all my own content by submitting it to the Wayback Machine, using a script (much of that on 15 January 2019; a few posts have multiple saves).

I went through a similar issue and process when Joindiaspora, the original node of the Diaspora* network, shut down earlier this year. Again, I managed to archive most of my content (> 75% of posts, and virtually all the significant ones), at both the Internet Archive and Archive Today.

(The latter doesn't have an automated process, so submissions were manual, though I'd written scripts to at least generate the submission links.)
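
(Not the script mentioned above, which isn't shown here; just a minimal sketch of the same idea, assuming only the Wayback Machine's public "Save Page Now" endpoint at https://web.archive.org/save/<url> and the third-party `requests` library. The URLs listed are placeholders.)

```python
import time
import requests

def save_to_wayback(url: str) -> int:
    """Ask the Wayback Machine to capture a URL via its public Save Page Now endpoint."""
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    return resp.status_code

urls_to_archive = [
    "https://example.com/my-old-post-1",  # hypothetical URLs; replace with your own
    "https://example.com/my-old-post-2",
]

for url in urls_to_archive:
    print(url, save_to_wayback(url))
    time.sleep(10)  # be polite; the endpoint rate-limits aggressive clients
```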

Part of informational maintenance and hygiene is ensuring that content you link is archived. Internet Archive are working with Wikipedia specifically to do this for web articles, published academic articles, and now books (these will open to the cited page on the Open Library). This is of course awesome.

There's also the issue that not everything needs to be remembered in perpetuity. And much probably shouldn't be. And a system in which content cannot be removed ... also has huge problems.

@atomicpoet @eryn

Chartodon

@ColinTheMathmo

Your chart is ready, and can be found here:

solipsys.co.uk/Chartodon/10851

Things may have changed since I started compiling that, and some things may have been inaccessible.

In particular, the very nature of the fediverse means some toots may never have made it to my instance, in which case I can't see them, and can't include them.

The chart will eventually be deleted, so if you'd like to keep it, make sure you download a copy.

Kaye

@atomicpoet This is something I stumbled into just a few weeks ago, and not even with video content.
There was this service that let you set up free web forums (EZboards) that was very popular around the turn of the millennium. The web forums were originally preserved on the Internet Archive, but the domain was picked up by a malicious squatter in 2011.
Thanks to a retroactive robots.txt that forbids archiving, those archives are lost now. And that's just HTML and pictures, not something weighty like video.

Doc Edward Morbius ⭕​

@atomicpoet On the creation vs. destruction rates: Total Internet data is still growing AFAIU.

The doubling rate has been about 18 months to 2 years, for decades now (and pre-dating both the Web and Internet).

So unless half of all data is lost within 18--24 months, no, the total would still be growing.

Though average lifespan is likely low. The Internet Archive's metric is that the typical webpage only lasts about 50 days before being deleted or changed.

Robobostes

@atomicpoet the best argument against 100% P2P hosting is the millions of dead torrents. Ideally you'd have a system with multiple local points of centralisation that can also keep functioning based on direct P2P between users in case the central entity disappears. So basically PeerTube, but other instances should be able to archive content from other instances as long as it is shared by some user/seeder.

Robobostes

@atomicpoet So if one instance disappears and the video stops being centrally hosted, it should still be indexed and made accessible by other instances as long as there are people seeding it. I'm not sure if PeerTube does this atm.
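
(A rough sketch of that fallback logic, purely illustrative: the instance URLs, peer list, and transport callables are hypothetical stand-ins, and PeerTube's actual redundancy mechanism may work differently.)

```python
import hashlib
from typing import Callable, Iterable, Optional

def fetch_video(
    content_hash: str,
    instance_urls: Iterable[str],
    peers: Iterable[str],
    http_get: Callable[[str], Optional[bytes]],
    peer_get: Callable[[str, str], Optional[bytes]],
) -> Optional[bytes]:
    """Try the hosting instances first; fall back to whoever is still seeding."""
    for url in instance_urls:  # centralised copies: fast, but may vanish
        data = http_get(url)
        if data and hashlib.sha256(data).hexdigest() == content_hash:
            return data
    for peer in peers:  # P2P copies: survive an instance shutdown
        data = peer_get(peer, content_hash)
        if data and hashlib.sha256(data).hexdigest() == content_hash:
            return data
    return None  # lost unless someone re-seeds it
```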
