Maarten Steenhagen

Link rot is so very, very real.

I just visited a site that 'archives' articles from 2015. The majority of links no longer work.

What to do? You could archive pages, and link to the archived pages (I've seen many people do this). But how durable is that? Will those archival sites still exist in 10 years' time?

I'm used to consulting books that are centuries old. That we're unable now to archive digital stuff for longer than a couple of years is terrifying.

135 comments
mirabilos

@msteenhagen let’s hope @internetarchive will still be around in a decade… but, yeah, electronic stuff is ephemeral in nature

author_is_ShrikeTron 💉x6

@msteenhagen Well, old books linked to other books that are now also lost, so it's not really new.

Selena

@ShrikeTron @msteenhagen
Yeah. And so many old texts that only have like 1 or 2 extant versions, handcopied from a long-lost original.
We can read Caesar's De Bello Gallico because a medieval monk copied it in the 9th century.
We know the poetic Edda from the codex Regius.

There are plenty of 'see other chapter' references in old history books, but that other text is long lost.

Miia Mustang

@msteenhagen Film negatives are also a pretty good storage medium for visual things!

I think it shouldn't be underestimated when it comes to archiving things :3

Ge0rG

@miiamustang
We should crowdfund a microfiche website archive project!
@msteenhagen

Rolf Blijleven

@msteenhagen You know the waybackmachine, right?
web.archive.org/

(Big but inevitable drawback: anything archived that is behind a password is not accessible)

Basically, if information keepers and information providers do not implement persistent identifiers, we're at a loss.

Dennis Moser

@RolfBly @msteenhagen

The Wayback Machine is NOT infallible... I have been finding breaks there for several years now.

Rolf Blijleven

@dennis_moser Sure, so have I, but it's oftentimes the only thing you can turn to.
All the more reason to help keep it alive. I donate.

@msteenhagen

Dennis Moser

@RolfBly @msteenhagen and THAT is a big part of the problem… a single massive point of failure. We are not creating a rigorous diversity of repositories, something I was evangelizing for over 20 years ago. It’s expensive and no one is willing to share the pain…

Rolf Blijleven

@dennis_moser What can I say? You're right. It shouldn't depend on a few personal initiatives. The importance isn't felt, the usefulness (is that a word?) is not seen.

@msteenhagen

MatthewToad43

@RolfBly @dennis_moser @msteenhagen IIRC from a talk many years ago archive.org has a fair bit of redundancy. The main worry is legal action, and there are some.

MatthewToad43

@RolfBly @dennis_moser @msteenhagen I've seen archive people argue in CACM that even if you have 100x redundancy you get data loss with enough data.

However this simply isn't true, assuming you know about e.g. erasure coding. Which isn't exactly a new technology; distantly related algorithms were used on Voyager and similar tools are used on CDs, DVDs, digital cinema etc.

It is *possible* to have sufficient redundancy.
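To make the erasure-coding point concrete, here is a minimal sketch of the simplest such code: a single XOR parity shard, as in RAID-5-style layouts. It is an illustration only. The function names are made up for this example, and real archival systems would typically use stronger Reed-Solomon-style codes that survive several simultaneous losses.

```python
from __future__ import annotations

from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(shards: list[bytes]) -> bytes:
    """Parity shard = XOR of all data shards (all must have the same length)."""
    return reduce(xor_bytes, shards)


def recover(shards: list[bytes | None], parity: bytes) -> list[bytes | None]:
    """Rebuild at most one missing shard (marked None) from the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single parity can only repair one lost shard")
    repaired = list(shards)
    if missing:
        survivors = [s for s in shards if s is not None]
        # XOR-ing the parity with every surviving shard leaves the lost one.
        repaired[missing[0]] = reduce(xor_bytes, survivors, parity)
    return repaired
```

With three data shards plus one parity shard, any one of the four copies can vanish and the data is still fully recoverable, at a storage overhead of one third rather than 100x.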

Rolf Blijleven

@matthewtoad43 @dennis_moser @msteenhagen

An observation in addition to this: I've worked for an institute that wants to use Archivematica for born-digital material. Drawings in '70s and '80s CAD software etc. Archivematica is quite good at preserving, disclosing, it's Open Source, all very nice.

But. What's missing is software and hardware to *show* the material 🙄🤷🏼‍♂️

Beko Pharm

@urlyman @msteenhagen this.

I've a very long-running blog and I randomly check old links (thanks to the "onthisdate" feature), and it often happens that a link still works but serves something completely different.

…and usually this isn't something I'd _want_ to link to.

Prentiss Riddle 🎛

@bekopharm @urlyman @msteenhagen Reminds me of a story. Circa maybe 1994 I worked with an educator who was pioneering the development of online K-12 educational materials. It was early days and all painstakingly hand-coded in HTML.

Well, he linked ON EVERY PAGE to an external resource on a domain that got taken over by porn spammers! 😮

I got to save the day with a little script that walked his tree and rewrote those links. Good times. 😄
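For anyone facing the same cleanup today, a tree-walking link rewrite might look roughly like the sketch below. This is not the original script (which is not shown in the thread); the function name and the example URLs are hypothetical, and it does a plain byte-level substitution rather than parsing the HTML.

```python
from pathlib import Path


def rewrite_links(root: str, old: str, new: str) -> int:
    """Replace `old` with `new` in every .html/.htm file under `root`.

    Works on raw bytes so files in unknown encodings are left otherwise
    untouched. Returns the number of files that were changed.
    """
    old_b, new_b = old.encode("utf-8"), new.encode("utf-8")
    changed = 0
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in {".html", ".htm"}:
            continue
        data = path.read_bytes()
        if old_b in data:
            path.write_bytes(data.replace(old_b, new_b))
            changed += 1
    return changed
```

A call such as `rewrite_links("site/", "http://old.example/resource", "http://new.example/resource")` would fix every page in one pass; running it on a copy of the tree first is the cautious option.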

Beko Pharm

@pzriddle sounds like a job for `sed` and/or RegEx :D

Prentiss Riddle 🎛

@bekopharm Exactly (I used perl). Amazing what passed for magic back in those days.

Jess👾

A thing you could create if you wanted to:

The Internet Archive allows you to request they save a particular site for preservation.

help.archive.org/help/save-pag

Have a periodic job that whenever a new link is found in your blog, send a request to save it, then edit the post with a link to that page's archive after it in case the url is dead/redirected.

@bekopharm @urlyman @msteenhagen
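A rough sketch of such a periodic job, under some assumptions: it uses the `https://web.archive.org/save/<url>` form of the Save Page Now request, which may be rate-limited or change over time, and the helper names and User-Agent string are invented for this example. It only finds new links and requests snapshots; editing the post afterwards is left out.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.request import Request, urlopen

# Assumed Save Page Now URL form; check the Internet Archive help page first.
SAVE_PREFIX = "https://web.archive.org/save/"


class LinkCollector(HTMLParser):
    """Collects absolute http(s) links from <a href="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href and href.startswith(("http://", "https://")):
                self.links.append(href)


def extract_links(html: str) -> list[str]:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


def new_links(html: str, already_archived: set[str]) -> list[str]:
    """Links in the post that have not been submitted before."""
    return [url for url in extract_links(html) if url not in already_archived]


def request_snapshot(url: str, timeout: float = 60) -> int:
    """Ask the archive to save `url`; returns the HTTP status code."""
    req = Request(SAVE_PREFIX + url,
                  headers={"User-Agent": "blog-link-archiver (hypothetical)"})
    with urlopen(req, timeout=timeout) as resp:
        return resp.status
```

The job would keep the set of already-submitted URLs somewhere persistent (a text file is enough) and call `request_snapshot` for each result of `new_links`, with a pause between requests.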

Jess👾

I've long tried to think of the most durable yet dense way to store digital data for the far-distant future: something that could still be recovered even in a fairly low-tech society and that doesn't cost a completely unreasonable amount of money. For purely analog data, the answer would still likely be microfiche; it's even possible to view that just by squinting really hard. But if you want to store digital data, the best I could come up with is microfiche of QR codes (with lots of instructions detailing how to read the microfiche and the algorithm to decode QR codes). If you're patient, you can hand-compute QR codes back into their digital bits, and it would allow way more data to be stored on a single sheet of microfiche than plain analog/optical data.

@urlyman @bekopharm @msteenhagen

Lord Caramac the Clueless, KSC

@bekopharm @urlyman @msteenhagen It gets even worse when you consider that even properly stored digital media will eventually fail. Magnetic tape streamer cassettes (DAT etc.) might last a century if stored at room temperature, shielded from external magnetic fields. Hard disks, floppy disks, optical ones like CD/DVD/BD/Laserdisc, EPROMs, flash memory, etc., won't last that long even if stored properly. Besides, you need the right computer hardware (not just the drives) and software to access the data.
Even analogue data from the 20th and 21st century is mostly stored on very fragile storage media, like cheap paper that dissolves into dust after 60 years or less if just kept on a regular household bookshelf, and even under the best conditions, it doesn't last much longer. Restoring books printed on cheap paper is extremely expensive and time consuming: each page needs to be split into two sheets, each half the thickness of the original, which are then glued onto a new page.

The only way we can really preserve data is by constantly copying it again and again, and it is a race we're losing. Since the ongoing global polykrisis (I prefer the spelling with k since it is closer to the original Greek) is very likely to cause the decline and downfall of the Industrial Age, I fear everything we have learned and created will likely get lost forever in the next few centuries.

To those of you who fear we might enter a new Dark Age after the end of the Machine Age, I have to say that it is even worse: We already live in the Dark Age, because what makes such an age "dark" is the lack of information, the lack of historical data, and we won't leave much of that to the people of the future because we don't use parchment and clay tablets as storage media, easy to read with the naked eye, but fragile paper made from wood pulp, fragile polymer disks and tapes coated with magnetic and reflective materials, and tiny silicon chips.

The polykrisis is, of course, a complex global crisis consisting of interconnected crises like [...], etc. It basically boils down to [...] with its built-in addiction to economic growth, and the simple fact that we have exceeded the sustainable limits of our good Earth: we are in overshoot and damaging the planetary systems on which all of our lives depend. It is highly unlikely that this global Industrial Civilisation will achieve sustainability before it begins to collapse, but we can still achieve a very slow collapse that plays out over a couple of centuries. Most of our achievements will be lost, however. It is quite likely that there will never be a second Industrial Age, and that our species will never build complex microelectronics or space rockets again.

But if we don't go extinct and manage to keep this planet in a state where Homo sapiens can survive, we will eventually evolve into new Hominid species, and maybe one of those starts a high-tech civilisation once again in ten million years or so.

skry

@msteenhagen From 1996-2016 I saw about 20% link rot per year on a website with hundreds of external links. Even if half of them just moved, it was a big job to manage the links.
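To get a feel for what a rate like that means, here is a small back-of-the-envelope calculation. It assumes (this is an assumption, not something stated in the post) that the 20% applies each year to whatever links are still alive, so the loss compounds.

```python
import math


def surviving_fraction(annual_rot: float, years: float) -> float:
    """Fraction of links still alive after `years` at a constant yearly rot rate."""
    return (1.0 - annual_rot) ** years


def half_life(annual_rot: float) -> float:
    """Years until half of the original links are dead."""
    return math.log(0.5) / math.log(1.0 - annual_rot)


if __name__ == "__main__":
    print(f"alive after 10 years: {surviving_fraction(0.20, 10):.1%}")
    print(f"half-life: {half_life(0.20):.1f} years")
```

Under that assumption only about 11% of the original links survive ten years, and half are gone after roughly three.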

Erik Jonker

@msteenhagen ...that's why, at least in government, we have laws with regard to archiving. We also have a national archive etc.; people are working hard there to archive information durably.

Dennis Moser

@ErikJonker @msteenhagen
... and cue uncontrollable laughter (sorry, I've been in this mess since Grad School in 1992).

The conflicts between records management and archives predate the digital era.

Jens Christian

@msteenhagen certainly I can think of a technical solution to the problem, but it needs adoption and also acceptance of the fact that it won't drive any ad income 🤷‍♀️

Andy Carolan :prami:

@msteenhagen When Reddit shot itself in the foot, that killed a LOT of links. Many articles linked by search engines are deleted or missing. Also, social media entities adding login walls doesn't help.

It's all very temporary.

catgirl/whale solidarity

@andycarolan @msteenhagen reddit was scrubbed intentionally. People are furious they're being used as free advertising then having to pay to post

Andy Carolan :prami:

@neko Yes, I remember that. Lots of communities were understandably upset. Awful tactics by the CEO of Reddit too IIRC @msteenhagen

DamonHD

@msteenhagen One route might be something like this, alongside the Internet Archive:

zenodo.org/doi/10.5281/zenodo.

Nemo_bis 🌈

@msteenhagen One would need to run perma.cc/ or archive-it.org/ recursively for every cited URL: it gets expensive quick. (And requires the not to be burnt down by the knowledge arsonists aka big publishers, or you need to separately deposit the WARC files and serve them somehow.)

Veronica Olsen 🏳️‍🌈🇳🇴🌻

@msteenhagen I have a bookmark folder with interesting articles that go back to 2005, and indeed a whole lot of the links are dead. So at some point I started to also print to PDF every time I save something there, so I have an archive of it all.

When I worked in academia, I used Zotero for references, which also saves webpages. But I stopped paying for storage when I left academia. I have a local copy of that archive too.

Martin Vogel

@msteenhagen That’s why I use Zotero as a bookmark tool. It downloads copies of web pages into my own private archive.
The sad design error of the WWW is that links do not exist. What we call “links” are mere pointers. Maybe Ted Nelson should have put a little more effort into his system Xanadu in the sixties. wired.com/1995/06/xanadu/

PrivateGER :owo:

@msteenhagen@provo.lol i run a little youtube archiver and it's downright scary how much stuff just... disappears, without any reason behind it

xs4me2

@msteenhagen

The digital world and the internet are volatile… just as human thought and opinion nowadays it seems. Attention span is eroding.

It is worrying indeed, we are our history, if we do not remember that, we will be repeating mistakes of the past…

milosz

@msteenhagen I have been using ArchiveBox archivebox.io/ for years and return to it from time to time when looking for something, but this is a small private instance.

Angry sweet antiracist enby

@msteenhagen better archiving legislation helps. And more responsibility from content delivery organisations

0xtdec 🇧🇪🇸🇪🇪🇺

@msteenhagen To combat link rot as part of my workflow, every link I save to Pocket gets picked up by a nightly script and submitted to archive.org.

But it still depends on a single organisation, with an uncertain long term future, so it's not a great solution per se - though it's a convenient one for now.

I need to look if someone ever created a self-hosted archive.org type thing so I can save content locally too.

Ian Davis

@msteenhagen
Also a concern and criticism in respect of streaming only media and licenced digital books.

Are we in an age of ephemera?

Howard Chu @ Symas

@id1om @msteenhagen Are we in an age of ephemera - Yes. By design. And people are being conditioned to not rely on their own memory either.

Who controls the present controls the past. Who controls the past controls the future. -- 1984, George Orwell

Ian Davis

@hyc @msteenhagen
Prompts a recall of this quote that seems to assert knowledge of reality is not a requirement of those that govern...

"we create our own reality. And while you're studying that reality - judiciously, as you will - we'll act again, creating other new realities, which you can study too, and that's how things will sort out. We're history's actors...and you, all of you, will be left to just study what we do"

en.m.wikipedia.org/wiki/Realit

Howard Chu @ Symas

@msteenhagen thousands of years from now, archaeologists digging thru the remnants of our society will find no durable artifacts other than plastic shells, and no literature. They'll probably conclude it was a cultural Dark Age. And probably be correct.

Aral Balkan

@msteenhagen One way is to try and get folks to maintain older versions of their sites.

It’s trivial to do if one cares at all for that sort of thing. But I have a feeling the issue is with the qualifier in that statement. In a world of disposable “startups”, I don’t think many people do.

4042307.org

Anna

@msteenhagen I have had to accept that digital information will disappear and there’s nothing much we can do about it, because the technology isn’t built to last, is buggy, and nobody is asking questions. Timed obsolescence is accepted. Everything is new all the time.

It’s 2023 and Hollywood still hasn’t figured out how to store digital content for longer than several years.

Personally, a photo reel disappearing/getting corrupted is the worst. When I can, I make several backups and print the most important. Eventually the backups disappear too, though.

We’re in a cycle of endless maintenance for the sake of ‘convenience’.

Alex@rtnVFRmedia Suffolk UK

@halcionandon @msteenhagen

For entertainment media it's *worse*: both Hollywood and public service broadcasters worldwide often actively destroy archive content or let it deteriorate on old formats without backups, due to legal arguments over rights (especially music rights, but also liabilities for repeat fees for performers). This particularly affects content that was targeted at kids and youths/young adults, which has less perceived value but contains a lot of pop music.

AlsoPaisleyCat

@vfrmedia @halcionandon @msteenhagen

Even when content is preserved, there is no neutral preservation.

Let’s keep in mind that broadcast television news in the United States was not preserved until an individual businessman in the mid 1960s was frustrated that he couldn’t rewatch a special on Leary and LSD. So he funded the Vanderbilt Television News Archive.

That archive gives us most of our historic video of the Vietnam War & other key events. But it’s incomplete. It was whatever was able to be recorded on videotape off the local Tennessee stations. It was often preempted by sports on weekends and doesn’t include most special coverage, or broadcasts other than national evening news.

Does this bias the record? Likely.

Extra_Special_Carbon

@msteenhagen It’s OK. Google is sending all of your data to brokers that will inevitably get hacked, making certain that all of your data is immortalized on hard drives across the globe.

SarahV

@msteenhagen Similar issues with digital book formats. Amazon was a proponent for .mobi format but have now abandoned it in favor of EPUB. OEBPS readers are hard to find. Adobe's DRM can lock access to titles making them unusable. PDF has a ton of accessibility issues.

Digital publishing has made the *dissemination* of information fast, easy and cheap. But *archiving* that information is fraught not only with digital issues but also legal ones like the DMCA. We need to rebalance this equation.

David Colarusso

@msteenhagen it is by no means a solved problem, but in the legal profession, we've been moving more and more to the use of archives administered by relevant libraries. See e.g. perma.cc/libraries. In this way the same institutions (law schools) that publish our scholarship (law journal articles) can also house relevant archives.

Mark

@msteenhagen
My father was a photographer. When he died I inherited suitcases full of photographs, some are over 100 years old. I have a digital camera, I have taken thousands of photographs. When I die no one will see them ever again.

Rebecca Cotton-Weinhold

@Wizardofosmium @msteenhagen Well, that is something you can solve by making them easy to publish and publish them before or after your death.

Rebecca Cotton-Weinhold

@Wizardofosmium It depends on the type of images, if you want to publish before or after your death, in what format, and who you want to make them available to. If you want the same level as your dad you could always just have them printed, but I am assuming that is not your goal? I would recommend understanding what you want to achieve with the preservation and then based on that hit a search machine for possible solutions. That should give you some results to get started.

Mark

@rlcw
The main point I was trying to make was that printed photographs don't need any effort to preserve, my father probably didn't think about what would happen to his photographs. A lot of photographs from the digital age are just going to disappear. I have a few old IDE hard drives full of images but that technology has been superseded.

Rebecca Cotton-Weinhold

@Wizardofosmium Your father did not develop every picture he ever took and I am sure many of the films are lost to time. You can "develop" your pictures with way less effort than he ever could onto paper - or any other medium that serves your purpose of archiving.
The real question is, if this holds the same value to you, and is worth doing to you?

Which is also at the heart of this discussion: what is worth maintaining, preserving and how and at what cost?

Michał "rysiek" Woźniak · 🇺🇦

@msteenhagen it gets worse. Websites 5-10 years ago would at least lend themselves to being archived, either in services like Wallabag, or just by saving the HTML and resources using wget or some spider.

Websites today, with their JS overload and dynamic content, and CloudFlare "stopping bots", quite often do not allow themselves to be archived.

It's not just link rot, it's also the rot of our ability to archive stuff at all.

JWPH

@rysiek @msteenhagen I concur with that too, when also viewed from the fandom angle; even then, with the rise of newer, younger fandoms like that TADC pilot thing, Pizza Tower, and , amongst others, a large subset of these members are not just aware of the act of archiving, whatever its multiple implications. Still, fans are just as eager to save and archive anything that represents the good and bad sides of the media they're a fan of, if not overt in the first place.

Lesley Frew

@rysiek @msteenhagen you might be interested in

Michele C. Weigle, Michael L. Nelson, Sawood Alam, and Mark Graham, “Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering,”

The exact problem you mentioned. Plus archiving all of these extra javascript files at different times causes temporal replay violations!

Lesley Frew

@rysiek @msteenhagen also epa.gov subpages have not been archived in 2 months :(

Borealis AKA the LiteralGrill

@msteenhagen I have been struggling to source a quote for a small article that was only ever shared in a YouTube video once. Nowhere that uses it as a source even mentions the name of the video.

Folks don't even know how to cite things in a way that helps others find where they might be stored away anymore, and with things vanishing so fast we're going to lose important knowledge.

Florian Egermann

@msteenhagen It is a major problem… does anybody else think that the term „link rot“ is bad, since it implies that only pointers are disappearing, not the content? It should be „content rot“.

Lesley Frew

@fleg @msteenhagen there’s a second term, content drift, to address that :)

rapaz

@msteenhagen I know of projects that mirror entire websites, so they could still be accessible if the main site goes away. I also personally, occasionally, download specific webpages that I like, so as long as I have a browser I can still read that particular snapshot

llewelly

@msteenhagen
although imperfect, I feel @internetarchive is so far the best existing attempt to address this problem. The thing is - the scope of its ability to handle link rot is sharply limited to what people submit snapshots of.

aburtch

@msteenhagen The only reason we know anything at all about ancient peoples is because they wrote things down on stone, which is virtually indestructible.

Sam

@aburtch @msteenhagen as an ex-archie i kind of find it fascinating watching information loss in real time. It's always happened of course, people base decisions on what to keep by what's important to them for their own lives and stories, what remains is to a great extent a matter of chance.
for example, i keep my Dad's glasses on my desk, i have saved a few of his emails..will my son keep those?
when talking about archiving you can't ignore the sociological angle or human nature.

aburtch

@allofmystudentsrunaway @msteenhagen I love this perspective. What's important to us may not be important to the next generations. I guess all we can do is preserve what we can and pass it along. (Hopefully in a format other than microfiche or Laserdisc!)

Prentiss Riddle 🎛

@msteenhagen All good points, and a major concern.

I had the pleasure for several years of contributing my organization's publications to a "digital repository" at a major university library. The folks running the repository and their bosses who made an institutional commitment to it are unsung heroes. But if I had a time machine it would be interesting to know how long that commitment lasts. I'd give good odds to 50 years, about the span of a long professional career. But after that?

Jill the Pill

@msteenhagen

Now take it one step further: any of your life’s work, insights and expression, that you store in a digital format is likewise ephemeral. At the same time, storing all that information in The Cloud uses huge amounts of energy and water, polluting air, water and the atmosphere.

RealGene ☣️

@msteenhagen
I had questions about the operation of a steam boiler.

Searching led me to several HVAC support forums, all of which referenced and linked to a 'handbook' from a particular boiler manufacturer.

Said boilermaker went through a couple of buyouts/mergers. Handbook link is now 404.

Since I had the filename of the PDF from the link, I found it on some sketchy offshore website.

I have hoarded almost every PDF I've downloaded since the late 90's.

Mx. Eddie R

@msteenhagen
One of the busier pages on my website is a mirror of the site for a circa-2002 mp3 tagging program, because that version of the software and documentation are still used, but no longer exist other places like the original domain.

Sheogorath 🦊

@msteenhagen National archives have actually started to take notice and archive (depending on the country) content from various official and unofficial websites with the same or similar standards as they archive books, newspapers and other published articles.

en.wikipedia.org/wiki/List_of_

Rikard 🇺🇦

@msteenhagen Bob Frankston, who co-invented the spreadsheet, has a good suggestion, Forever URLs.

rmf.vc/ieeeforeverurls

(Read it carefully, Bob selects every word and thought with care)

Ole Dirty Rice :jeb:

@msteenhagen I’ve always been wary of Tweet embeds in serious publications. Basically every news outlet has embedded tweets in a way that depends on those accounts, tweets, (and Twitter writ large) being online forever. It seems like a mistake! What if a syphilis-brained billionaire buys Twitter and breaks every previously functioning part of the company?

Daedalus

@msteenhagen

Sounds like you have your own solution, archive in print. Store in a library.

bmaxv

@msteenhagen

"What to do?"

Smaller fragments, and you save hashes of source and target with the link. That way, if you lose the big original chunk you can try to brute force a solution to the hash and reconstruct the information.

But that would require a fundamental change of how we use links and how easy it is to write them.

The web would need to be rethought as an archiving mechanism, not just a snapshot that may or may not work.

Or we start creating small personal archives.
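The "save a hash with the link" part can be sketched with nothing but the standard library. The `HashedLink` record and helper names below are hypothetical, and the sketch only covers checking whether a later copy (from the live site or from any archive) is byte-for-byte what was originally linked; it does not attempt to reconstruct lost content from the hash.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class HashedLink:
    """A link plus a fingerprint of what it pointed to when it was written."""
    url: str
    sha256: str


def make_link(url: str, content: bytes) -> HashedLink:
    """Record the target's SHA-256 digest at linking time."""
    return HashedLink(url, hashlib.sha256(content).hexdigest())


def matches(link: HashedLink, candidate: bytes) -> bool:
    """True if `candidate` is exactly the content that was originally linked."""
    return hashlib.sha256(candidate).hexdigest() == link.sha256
```

A reader holding a `HashedLink` can accept a copy from any mirror or personal archive as long as `matches` returns True, which also catches the case where a URL still resolves but now serves something else.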

DELETED

@msteenhagen

Someone should whip up an extension for this.

When a 404 is hit, it could look for the same page on the wayback machine and if available, direct the user to that one.
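The lookup half of such an extension could be sketched as follows, assuming the Wayback Machine availability endpoint at `https://archive.org/wayback/available` and its `archived_snapshots.closest` JSON shape; the function names are made up for this example, and a real browser extension would be written in JavaScript rather than Python.

```python
from __future__ import annotations

import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url: str) -> str:
    """Build the availability lookup URL for `url`."""
    return AVAILABILITY_API + "?" + urlencode({"url": url})


def closest_snapshot(api_response: dict) -> str | None:
    """Pick the snapshot URL out of a parsed availability response, if any."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def resolve(url: str, timeout: float = 30) -> str | None:
    """Return `url` if it loads, an archived copy if it is a 404, else None."""
    try:
        with urlopen(url, timeout=timeout):
            return url
    except HTTPError as err:
        if err.code != 404:
            raise
    with urlopen(availability_query(url), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))
```

Only a hard 404 triggers the fallback here; pages that still load but have drifted to different content would need a separate check.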

Dan Connolly

@BigMcLargeHuge @msteenhagen Brave does that (or: offers to) by default.

I'm sure such plug-ins exist.

Lesley Frew

@BigMcLargeHuge @msteenhagen there’s a bot for wikipedia that does this!

Chumchum Tumtum

@msteenhagen this is why the internet archive is so vital, as are data hoarders

MylesRyden

@msteenhagen
Indeed and format rot might be even worse. Will the pdf specification be the same 20 years from now?

Or will you have to go to Internet Archive and download some OS emulator to run today's PDF (or JPEG or .xls or whatever) reader?

This Old Hiker

@msteenhagen I have saved a set of browser bookmarks that is about 20 years old, mainly to reuse the folder structure. They are mostly in home repair, hiking, careers, and child raising. Only about 15% of the links still work. Another 10% can be found by searching the sites they were on or in a larger search.

A* Ulven :verified_blobcat:

@Patrickoldhiker @msteenhagen and worst of all, is that saving websites while conserving interactivity and internal links is a difficult task even for people with vast experience with computers.

Best bet is to take screenshots of them but then, the text has to be OCRd for the data inside to be indexable again.

Honestly with the price of modern technology, anyone should have a right to permanently save entire webpages at the exact state you downloaded them.

Mike Taylor 🦕

@msteenhagen This is one of the (many) reasons why the lawsuit that "publishers" are bringing against the Internet Archive is so iniquitous.

jgg

@msteenhagen

There is the "Save page as" option in browsers. A pity it is totally broken for most modern sites: most of them lose their formatting, others even their content. And it is hard to be sure you stored it the right way, since the saved page very likely depends on content on other servers. Testing it offline can help, of course, but you will need to empty your cache.

It's getting worse every decade.

Mathias

@msteenhagen I once worked for a company where, whenever you used a link in an internal documentation, a PDF version of that website was automatically created.

Dan Connolly

@msteenhagen the term "link rot" suggests everyone who puts anything on the web is expected to keep it there forever. Why do folks expect that? Think of it as phone calls or chit-chat. Before 1990, *none* of that was preserved.

Dan Connolly

@msteenhagen conversely, it gives institutions of long-standing, who *should* be publishing info durably, an excuse to say "most links rot; it's no problem if ours do too"

Dennis Moser

@msteenhagen

All is maya.

The "backup" is not the "archive".

These days I feel like Cassandra, having spent much of my career trying to warn of the very problems you describe. The corporatization and commercialization of the Web has led us to this situation and I honestly see no good solutions. Everything available to us at this point is "patchwork". I suspect what we will see is pockets of things that endure...

AlienKnight

@msteenhagen I wonder if print archives of websites could be an option. I think that still isn't a great option, but it is less ephemeral than it all being a digital archive of a digital medium; it adds a physical layer to it

Lesley Frew

@msteenhagen you might be interested in:

Mohamed Aturban, Michael L. Nelson, and Michele C. Weigle, “Where Did the Web Archive Go?,”

The exact problem you mentioned!

Niclas Hedhman

@msteenhagen

This is well known, and Roy T Fielding (author of HTTP spec, Apache web server pioneer, ++) warned about this in the 1990s, and promoted that all links were "dated", i.e.

hedhman.org/1998/12/my-info-pa

and that one never removed old content, to preserve links forever.

Can't find the source, maybe it was just on a mailing list.

snep

@msteenhagen almost like it was never a good idea to make everything primarily digital in the first place.

Winchell Chung ⚛🚀

@msteenhagen

Agreed.
My website has been around for about 20 years and has lots of hot links. Every day more of them succumb to link rot. The best I can do is replace the dead links with equivalent links into the Internet Archive.
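Swapping a dead link for an archived copy can be partly automated. The following is a minimal sketch, assuming the Wayback Machine's public availability endpoint (`https://archive.org/wayback/available`) and its documented JSON shape; the function names are hypothetical and nothing here comes from the site described above.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_API = "https://archive.org/wayback/available"


def availability_query(url, timestamp=None):
    """Build the query URL that asks for the snapshot closest to `timestamp`."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # format: YYYYMMDD or YYYYMMDDhhmmss
    return AVAILABILITY_API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(payload):
    """Return the archived URL from a decoded API response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def find_replacement(url, timestamp=None, timeout=10):
    """Look up an archived copy to use in place of a dead link (needs network)."""
    query = availability_query(url, timestamp)
    with urllib.request.urlopen(query, timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

Passing a timestamp close to the date the link was first added should favour a snapshot that resembles what was originally cited. A `None` result means no snapshot was reported, so the link still needs manual attention.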

Jess👾

@msteenhagen Link rot is awful, and also, data rot in general spooks the hell out of me - like just how quickly most data storage mediums can decay. Consumer grade HDDs and CD-R and DVD-R die within a decade a lot of times. Even when you back stuff up to a cloud service, who the hell knows if/when your cloud service will randomly terminate your account, have a data failure and delete it all, or whatever. I often wonder if the archeologists and librarians of the 22nd century (if human civilization still exists by then) are going to consider much of the data of the early 21st century simply "lost".

Botahamec

@msteenhagen Maybe we need a physical building that's just filled with tape drives of the archives

bookandswordblog

@msteenhagen and the Internet has destroyed the commercial incentive to publish factual books on paper.

Henry Cobb

@msteenhagen That's how planets get forgotten from galactic empires, copyright enforcement!
en.wikipedia.org/wiki/Hachette

Bob Tregilus 🧠📷

@msteenhagen I've experienced this as well and it seems to be getting worse. I don't know, but is link rot a problem at sites such as the Internet Archive, Wikipedia, Library of Congress? I hope not. But as one commenter wrote, "electronic stuff is ephemeral." And that truth is indeed "terrifying."

Cy
Problem with link rot is the antithought known as Intellectual Property. The IPtards are the ones trying to get us linking to everything instead of copying it. (They're also the ones trying to bury the archive sites in frivolous lawsuits.)

So the solution to link rot is to copy the files you link to. And f*ck IP.

As for centuries old books, keep in mind that 99% of all literature written back then has been totally destroyed. Books can last for centuries, in good conditions, but there's always something lost over time. That's not an entirely bad thing, because some ideas really shouldn't be preserved. Copying the files you want to save is akin to an oral tradition, where files that lots of people like will last longer. So I think that's a good tradeoff. I still have files from over 20 years ago, and nobody can link rot those away as long as they're precious enough for me to keep a lot of copies of them.

Alternatively, put your stuff on an M-disc. Those are supposed to last a while.

schrotthaufen

@msteenhagen @blogdiva According to this protocol.com/internet-archive-, the Wayback Machine alone is about 45PB, which isn’t that much in HDDs these days. About $1m. But servers, energy cost, as well as bandwidth, and most importantly maintenance, are not cheap. The legal risk is probably the biggest challenge, though. Do you delete everything archive dot org deletes when they get a DMCA complaint, or do you ignore it in order to preserve more?

tadpole

@msteenhagen It takes strong institutions to pass the test of time. web.archive.org/ is the strongest we have at the moment.

Julie Webgirl

@msteenhagen Get the Wayback Machine app for your phone (Android has it, not sure about iPhone). You just paste the link into the app and it goes and scans it and you get a link directly to the page saved on the Wayback Machine. I don't think it's going anywhere for a long time. Or is that the "archives" site you were talking about??

Dennis Moser

@msteenhagen @Et_Alia

Repeat after me: "The backup is NOT the archive. The backup is NOT the archive..."

Francis 🏴‍☠️ Gulotta

@msteenhagen I’ve got a 20-year-old blog and everything is dead. Embeds are the worst, but most links go away within a few years! Never mind 20!

happyborg

@msteenhagen
One of the goals of is to solve this by storing public data forever.

Meta Wish

@msteenhagen as a kid, if there was a particular article I wanted to save I'd copy and paste it into a Word document, and recipes I'd copy and paste into my own digital notepad. Both are applications, so as long as they themselves don't break, I can access them when I want. Of course, applications can break suddenly, so perhaps that's not a great option either...

UnCoveredMyths

@msteenhagen

I have had this issue so many times. At one point, I started using Link Accessed on Specific Date as my guide. And save copies into my Scrivener. That wasn't always possible.

I also run a bookmarks check every few months. It isn't uncommon for several sites to have simply disappeared from the internet during that time.
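A periodic bookmarks check like this can be scripted. Here is a rough sketch using only Python's standard library; the function names, the `opener` parameter, and the User-Agent string are made up for illustration, and it assumes the bookmarks have already been exported as a plain list of URLs.

```python
import urllib.error
import urllib.request


def check_link(url, timeout=10, opener=urllib.request.urlopen):
    """Return (url, status): an HTTP status code, or a string describing the failure."""
    try:
        request = urllib.request.Request(
            url, method="HEAD", headers={"User-Agent": "bookmark-check/0.1"}
        )
        with opener(request, timeout=timeout) as response:
            return url, response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 410.
        return url, err.code
    except (urllib.error.URLError, TimeoutError, ValueError) as err:
        # DNS failure, refused connection, timeout, or a malformed URL.
        return url, f"unreachable: {err}"


def dead_links(urls, **kwargs):
    """Return only the (url, status) pairs that look broken."""
    results = (check_link(url, **kwargs) for url in urls)
    return [(u, s) for u, s in results if not (isinstance(s, int) and s < 400)]
```

Some servers reject `HEAD` requests even though the page is fine, so anything flagged is best rechecked by hand (or with a `GET`) before the bookmark is written off.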

Ed Beck

@msteenhagen I'm working on a project right now to archive some of our digital projects. We are going to use a few different tools, including webrecorder and httrack. Still working on our settings, but one of our first prototype runs is available here: archive.us.reclaim.cloud/2023-

My idea is to place digital humanities projects in our open-access institutional archive, just as we store our open-access papers.

Pete Prodoehl 🍕

@msteenhagen And yet the URLs on my non-commercial personal web site still work 25 years later. ;)

Medium Endian

@msteenhagen this is what makes @internetarchive so damn important.

As a kid I remember being told "Be careful, anything you put out on the internet is there forever."

It sounded scary at the time, but now I realize that the alternative is even scarier.

kasperd
That problem has been known for decades:
w3.org/Provider/Style/URI

I think the biggest challenge in fixing this problem is convincing people that it needs to be fixed. I think in some ways this problem is similar to the problems of DRM and planned obsolescence.

Claudius

@msteenhagen Store stuff on your hard drive or it will be gone. Heck store stuff on your hard drive and it might still be lost due to fuckups or hardware failures. Sadly, even archive.org (with their pretty awesome track record) might go down when funding dries up.

Impersonal

@msteenhagen This is incompetence on the part of the technical and/or content team. Every redesign, migration, or other major change I have been part of had redirects as a top priority.

On the better projects, we also had the resources to go in and update old links to minimise link rot.

There are no excuses not to do this.
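For illustration, the redirect idea can be sketched as a lookup table from old paths to new ones, answered with a permanent (301) redirect. This is only a sketch under assumptions: the paths in `REDIRECTS` are invented examples, and a real migration would more likely configure this in the web server or CMS than in a Python handler.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping of pre-migration paths to their new locations.
REDIRECTS = {
    "/blog/old-post.html": "/posts/old-post/",
    "/about.php": "/about/",
}


def resolve(path, redirects=REDIRECTS):
    """Return (status, location): 301 with the new path if moved, else 404 and None."""
    old_path = path.split("?", 1)[0]  # ignore any query string when matching
    target = redirects.get(old_path)
    if target is None:
        return 404, None
    return 301, target


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers requests for old URLs with a permanent redirect to the new ones."""

    def do_GET(self):
        status, location = resolve(self.path)
        self.send_response(status)
        if location:
            self.send_header("Location", location)
        self.end_headers()

    do_HEAD = do_GET
```

To try it locally, `http.server.HTTPServer(("127.0.0.1", 8000), RedirectHandler).serve_forever()` would serve the table on port 8000. Keeping the mapping as data makes it easy to review after each redesign and to test that no old URL has been left without a destination.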
