Tom Walker

People worry a lot about losing knowledge — about "burned-down libraries".

Comparatively few people seem to worry about what happens if you take a billion books full of auto-generated, often-untrue junk text and *add* them all to the library.

In theory, nothing is lost. In reality, everything is lost, because nothing useful can now be found.

191 comments
Adam Greenfield

@tomw …aaaaaand *that* is how Jorge Luis Borges came up with the plot of “The Library of Babel”!

DanCast

@adamgreenfield @tomw if you enjoyed that, I would highly recommend A Short Stay in Hell

Avram Grumer

@adamgreenfield @tomw I was just going to bring up that Borges story, only the Library of ChatGPT is even worse. Most of the texts in the Library of Babel are obvious nonsense. The Library of ChatGPT is filled with superficially-valid texts.

Ergative Absolutive

@avram @adamgreenfield @tomw
Of course, the Library of Babel also contains all the ChatGPT nonsense, as well as the alphabet soup and word salad. Does that make it worse or better?

Better, because the alphabet soup and word salad remind the reader that there's a lot of nonsense in the library?

Or worse, because in the midst of all the gibberish the superficially plausible ChatGPT books look like the real thing?

(I lean towards the latter. The parallels really are chilling.)

Adam Greenfield

@ergative @avram @tomw As I read it, yes, that was the real horror of the Borges story, the trillions of nearly-“perfect spurious interpolations.” If you thought we already lived in a universe of epistemic relativism, and suffered the consequences thereof, well…just wait for 2024.

Joshua

@ergative @avram @adamgreenfield @tomw there's stuff even more wrong, and more convincing, and more dangerous in the library of Babel than ChatGPT could ever generate

Avram Grumer

@zenten @ergative @adamgreenfield @tomw It occurs to me that the Library of Babel contains both all of the spells for summoning Cthulhu, Azathoth, Nyarlathotep, etc, and all of the spells for banishing them. As well as all of the spells for summoning them *mislabeled* as spells for banishing them.

Adam Greenfield

@mostly_harmless Yes, I also enjoyed his “Tlön, Uqbar, Orbis Tertius.”

ideaPDish

@tomw
That sounds like googles these days

nethope

Incorrect documentation is often worse than no documentation.

Bertrand Meyer

azquotes.com/quote/709604

Phracker

@tomw It actually gets worse, because the auto-generated junk text will be more optimized for search engines than anything written by humans, so pretty soon it will be impossible to find anything on the Internet outside of auto-generated junk and sponsored or corporate content. This genuinely worries me as a creator of blog content who understands the importance of SEO.

Tim Mackey 🦥

@Phracker2Art @tomw from what I can tell we’ve already reached that point. For the past year or two it’s become increasingly difficult to find real answers to many questions on Google. The only workable way I’ve found for certain types of questions is to use the “site:” modifier to restrict my questions to sites like Reddit or Stackoverflow. But even Stackoverflow is having problems now with AI-generated garbage.

Paul Cantrell

@tomw @BlackAzizAnansi
Zeynep Tufekci spoke passionately on this topic regarding Russian disinformation. Her argument was roughly that we should regard “flooding the zone” as a counterpart to censorship that achieves many of the same ends.

Kevin

@inthehands @tomw @BlackAzizAnansi Masha Gessen makes a similar point in (the book) Surviving Autocracy. If you look at anti-Ukraine propaganda, russia throws so much contradictory nonsense out there that many Westerners gave up trying to make sense out of it and have just moved on to other concerns.

yunchtime

@tomw

the signal to noise ratio is clearly something that Q-anon and its cohort of delusional cult ravings is designed to exploit. unfortunately for us the 'don't be evil' index that can winnow some of this chaff out, doesn't live up to its credo. so here we are.

ML2

@tomw
I've been using a lot more site specific searches and even specialized search engines like marginalia.nu for exactly this reason.

It also helps that my main search engine isn't Google.

whetstone

@tomw @chancerydaily I have a friend who works at the Internet Archive and I told him recently that their collection has become instantly more valuable because the imports are all time-stamped; they will be the only large digital source of information uncorrupted by plausible AI.

whetstone

@chancerydaily i’ll tell him! he’s really good people.

Tom Walker

@whetstone @chancerydaily Yes the pre-AI web, while hardly perfect, will be an important future source.

Andii אַנדִֽי

@whetstone @tomw @chancerydaily As long as the time stamping can't be 'hallucinated' ...

whetstone

@Andii @tomw @chancerydaily I don't think it can, because the stamp is something the archive does (not something it adds based on external time-stamping). The stamp is literally just a notation of when they crawled a page.

Andii אַנדִֽי

@whetstone @tomw @chancerydaily I hope so. I'm just a bit concerned that there could develop spoofs that look credible.

whetstone

@Andii @tomw @chancerydaily Well.... someone *outside* the Internet Archive could try to spoof, I suppose, and to fool people that way. But I don't think this is a concern internally. IA knows its own data. So the resource is there, people just have to make sure they're getting it directly from the source.

Andii אַנדִֽי

@whetstone @tomw @chancerydaily -That's the scenario that I'm wondering could become a thing. I guess it'd have to be an intercept or one of those 'single spelling mistake' URLs ... So probably not a big worry.

DELETED

@tomw I think you are against AI written stories in libraries. If this is the case I agree fully. I think it is already past time that AI generated materials should be permanently branded as AI created.

David Slifka

@tomw Steve Bannon would call that “flood the zone with sh*t”

Noah Cook

@tomw I remember seeing a reference, I believe in one of Charles C. Mann's books on pre-Columbian America, about how some late Mesoamerican ruins show that at some point they began using their glyphs in nonsense ways, as if for the art rather than the actual grammatical or syntactic meaning.

This is far outside my area of expertise, and it is possible that any parallels are apocryphal rather than directly related.

mekka okereke :verified:

@tomw

I suspect that writing certifiably produced before 2021 will be considered differently than writing produced afterwards, at least for some time.

Like low background steel for writing. 🤷🏿‍♂️

en.m.wikipedia.org/wiki/Low-ba

Stephen Hoffman

@tomw @ncweaver This is the foundation of modern censorship.

Not of stopping the flow of information, but of making the information difficult to find, burying it in mundane or contested or conflicting information. Of flooding the search indices and resources.

Dr. Angus Andrea Grieve-Smith

@tomw Thank you! I've been trying to warn people about this, but I haven't been getting through. Maybe they'll understand your formulation!

J. Steven York RESISTS

@tomw We're getting to that point with the internet as a whole. All the information, and misinformation, in the world, at your fingertips, and no way to be sure which is which.

DELETED

@tomw This is already happening on Wikipedia. The Pew Research CEO proposed an age range for Generation Z in a blog post to help their research, but the person who edited the Wikipedia page never told anyone why those numbers were chosen. Many news outlets and government agencies used the fake age range of 1997 - 2010 as Pew Research's stance on Gen Z. Now, in 2023, there is no agreement on when someone in Gen Z was born or when Gen Z ended. This is causing many misleading results and mistakes in data research.

mostly_harmless

@tomw Holy shit I never thought of that.

It's fucking depressing.

Can we make human libraries exclusively for human-generated content?

AGTMADCAT :verified:

@tomw That's what librarians are for!

Here's hoping we have enough of them to weather the storm.

Emma

@tomw especially if this happens in tandem with librarian jobs being cut / people being driven out of the profession due to harassment and low pay.

Vixen

@tomw and this is why the internet is so flawed

Stuart P. Bentley

I've been thinking about this a lot, calling it "the Infopocalypse"

Buggerlugs

@tomw in engineering it's called "Signal to noise ratio" or SNR. To extract signal, it has to be discernable within the noise. To suppress signal, increase the noise. The voice of truth can be suppressed by the babble of lies. The conspiracy theories aren't dangerous in themselves but pile them deep enough and people lose touch with reality.

Michael Roufa

@tomw from Too Much Coffee Man, years ago. The boogieman back then was the govt, but the idea is the same.

Kazii The Avali

@tomw it reminds me of how there is a digital library that has every book that could ever be written, including a combination of your home town and your full name in order, but no one will ever find it because it's just a bunch of random text. libraryofbabel.info/

Dhavan

@tomw @float13 even in theory things are lost because (information) entropy provably increases.

Donald Hobern

@tomw There is a similar risk for every field that increasingly relies on citizen science data. Biodiversity research relies heavily on field observations to document the occurrence of species. This can have implications for conservation and land use and hence be of interest to extractive industries, etc. It seems inevitable that we will start seeing fake AI-generated observations that seek to obfuscate actual distributions of threatened species and communities.

SaturniusMons :verified:

@tomw There was a brief moment in time that you could trust the internet, but that was quickly eclipsed by the swindlers, hoaxers, and rotten.com :)

We've not been able to trust text on the net for a whiles, nor (photoshopped) images, now deep faked videos and AI generated data

It's been a race to find a way to share information in such a way that it was impervious to those who lie.

And we've been losing every time

Howard Chu @ Symas

@tomw
A Colombian judge used chatGPT in a ruling.

So chatGPT makes up answers out of thin air, those answers become part of public record, search engines index them, then you're done: no longer able to search for factual answers to questions.

theguardian.com/technology/202

Hen Gymro Heb Wlad

@tomw That's the whole point of disinformation, not to convince you it's true, but to swamp the truth in so many lies you can no longer find the truth or recognise it when you find it.

DELETED

@tomw indeed, the principle of the purloined letter, applied at scale...

Humbird0 Fandom

@tomw
This is one of those bad ideas I'm reluctant to share because you just know it will inspire the assholes out there.

Simon Zerafa :donor: :verified:

@tomw

My friend Borges found this out the hard way. He's still searching for the accurate catalogue of all the books in his library 🫤🤷‍♂️

Dogzilla

@tomw Here’s a question: are our libraries *already* filled with inaccurate information? How do we know? Who judges?

Florian Schmidt

@tomw
In theory, this will happen *in the future*.
In practice...

Miss X

@tomw this is one of my main concerns when it comes to AI text generation. Just imagine the disinformation possibilities.

Tom Walker

(I appreciate people replying about actual books and libraries but this is, like, a metaphor y'know.)

JudithTrainiert

@tomw I must admit, I wasn't thinking about ChatGPT at first either, but about scientific papers - It's becoming increasingly hard to find good scientific literature, because you have to sift through literally hundreds of papers with poor methodology, poor english and no new results, just because funding agencies believe the number of publications is a good measure for scientific quality.

DELETED

@tomw Also not strictly related to knowledge or books, but this was the exact argument a lot of people used back when Steam opened the floodgates and allowed pretty much anyone who wanted to, to be listed on their service.

I think most agree there is value in curation, but that obviously depends on exactly who is doing the curation.

Random Walker 😷🇪🇺🍸

@tomw Is this where we use ChatGPT to generate Borges' Library of all Possible Books?

Emily M-O 🏃🏼‍♀️

@tomw You seem to be mixing up a library, whose contents are selected by humans, with a corpus. If you think libraries are synonymous with physical books, you're a few decades behind.

DELETED

@tomw I wrote a paper on a related issue during my masters degree - the impact of 'deepfakes' - same thing, different medium - being a breakdown of trust. And we've already seen the impact of that.

FoolishOwl

@tomw Somehow I'm reminded of Eden, by Stanislaw Lem, particularly a bit at the beginning about automated factories that produced bizarre objects that were recycled as soon as they were produced.

smial

@tomw
The vast majority of volunteer authors and proofreaders in Wikipedia try their best to prevent the addition of false, stupid or manipulative information.
But there are too few of us.

Renée

@tomw isn't this exactly what French intelligence did to render the political document dump unusable??

arialdo

@tomw reminds me of Borges’ Library of Babel.

Attila Kinali

@jesusmargar @tomw It is constantly happening! Science is so full of absolute sh*t papers, that finding the actual science is like looking for a needle in a haystack.

Today, unless you know someone from that very field who can guide you, you will spend weeks if not months going through crap papers trying to figure out what are actual results and what is just made-up stuff. Good luck finding anyone if that field has not seen any active research in 30 years.

Jesus Margar

@attilakinali mmmh that's not my personal experience as a researcher. I agree about the bad papers but Q1 JCR journals hardly ever have any. I am yet to see an auto-generated book or paper in mathematics.

Attila Kinali

@jesusmargar Yes, math is much better in that regard. My general experience is the more applied a field is, the more you have these need-to-churn-out-paper people. And there are plenty of venues to publish things where the reviewers just don't care. Heck, I reviewed papers in CS that just did what had been done 20 years earlier, just in a shitty way, without understanding what they are doing, and my co-reviewers were like "great results!".

gerald

@tomw
... and false balance is part of the game 🤷‍♂️

Jake Rayson

@tomw Bannon's “Flood the zone with shit” at scale.

Maybe somebody can design an app that finds the useful stuff?

happyborg

@tomw You are also describing advertising, except it isn't junk but worse: corporate misinformation and propaganda whose aim is to maximise profit at the expense of humans.

Carlos Rodríguez

@tomw but if you added infinite auto-generated books we would get Shakespeare back. So there’s that. Just keep adding.

Tom Walker

@carlosrodriguez Sadly AI is not even a good 'infinite monkey' because it produces the most statistically likely, ie. the most mediocre, text. At least the monkey might produce something unusual.

Carlos Rodríguez

@tomw I disagree strictly from a probabilistic point of view, since at least the AI is producing real words rather than random characters.

But that’s not the point, the point is, at this stage AI will produce an immense amount of nonsense which will corrupt the body of knowledge we currently have. I agree with you.

But I wonder if that’s different from the nonsense humans have been producing online for the past few decades. Maybe in volume.

bapril

@tomw Keeping me up at night as we speak..

mtjm

@tomw This made me realize a thing I wasn't able to understand before: people say that Stanisław Lem predicted the Web via a science fiction story about a Daemon of the Second Kind (producing an equivalent of Borges' Library of Babel).

I knew Web as of when spam was obviously recognizable and linking allowed finding a lot of good information. Now I see that 1965 story as being more about using ChatGPT-like tools to get a lot more useless information.

Adrian Segar

@tomw Every day now, I am seeing my somewhat distinctive name incorporated into random nonsense web pages. The frequency of these spam posts is now about the same as genuine mentions. If the plausibility of these junk pages continues to improve, it may soon be difficult to find genuine mentions of me on the web.

Ray Gulick 💗🌛 ⭐️ 🍀

@tomw

I'm old enough to remember when we thought the internet was going to usher in a golden age of knowledge. So naive...

Tom Walker

@rgulick I think it, very broadly speaking, did – it's just that it doesn't feel like it, first because of social media mis/disinfo and now because of AI.

Ray Gulick 💗🌛 ⭐️ 🍀

@tomw

For sure there has been an upside, and many instances of sharing knowledge that have benefitted us.

But the downside has dominated to date because the ratio of idiots to reasonable and thoughtful people is so much higher than anyone would've guessed.

#PearlsBeforeSwine

mick tobin

@tomw yeah but early people wrote books

Richard

@tomw This is the central theme of The Library of Babel by Jorge Luis Borges.

Mikie

@tomw That is a SEO problem. A decent AI system with librarian support will fix that problem

Andrew Davies

@tomw I get that it's a metaphor, and yes it points to a big problem internet-wise.

But libraries are curated. And the Florida outrage is about who should get to do the curation - teachers or politicians.

Charlotte Eowyn

@tomw this is stupid but like, way back in college when our physics department wanted to throw out all the old physics books, I took every single one. I still have them in plastic tubs.

I had this absurd idea to bury them, like the Rosetta stone.

ChrisJ

@tomw I was never convinced by the whole PGP signing web-of-trust thing that was (more) popular about 15-20 years ago. I wonder if we should bring it back, and use it for actual trust -- I'd be happy to mark pages I trust and am an expert on, and people could trust friends and particular experts, then (weakly) trust friends of friends, etc.

Not perfect, but we probably need something against the storm of AI generated nonsense.

Tom Walker

@chrisj I think PGP falls into "too hard" but working web search in future does feel like it will require some kind of trust mechanism. Pagerank was a weak and by now completely subverted version of that via links.

Alfred Poor

@tomw +1 It's all about "signal to noise ratio" and dirty data.

The sticky point is the question of who gets to decide what is "untrue" and "junk"....

Graham Lester

@tomw Another way of looking at this is that propagandists try to make *everything* a library so that you can 'do your own research' by watching YouTube videos from people who have no expertise in what they are discussing and would never be able to get an old-fashioned book published by any respectable publisher.

DeterioratedStucco

@tomw That's called "the Internet", IIRC.

Dr. Heather Etchevers

@tomw This worries biomedical scientists on a daily basis, which is why they've made incentives to have perverse ranking systems devised and sold back to us to help us rank one another in terms of "productivity", all of which systems, too, have been gamed or diluted themselves. #Pubmed #Clarivate #googleScholar

Upthorn

@tomw I have been noticing this about the internet for a few years now. It is getting very difficult to find useful information from among the AI-generated listicles these days.

hilaryjohn

@tomw
Beyond Borges and into the Dark Age Ahead ...
The majority of we poor primates are blindly heading towards mass extinction when overwhelmed by heat and associated climate catastrophes having been 'controlled' by the dishonest political, economic and media elites whose strategy of drawing public attention away from important issues by flooding the media with continuous distractions, deflections, 1/3

hilaryjohn

@tomw
diversions and mendacious denials (the signal to noise overload) that will increase with AI generated claptrap and poppycock to hide the calamity that is close upon us. 2/3

hilaryjohn

@tomw
As Carl Sagan said:-
“If we’ve been bamboozled long enough, we tend to reject any evidence of the bamboozle. We’re no longer interested in finding out the truth. The bamboozle has captured us. It’s simply too painful to acknowledge, even to ourselves, that we’ve been taken."

Southern Liberal

@tomw
I hate fear, but I fear people who hate knowledge.

I'm petrified of book burnings.
I'm anxious of library shutdowns.
I'm perturbed by those who want to censor, ban, and leave the people without eyes, ears, or a cranium.

Nick Matthews

@tomw I've seen internal document systems rendered completely useless by storing absolutely everything, regardless of how useful it is. Trying to search for anything just reveals garbage. I'm sure the real document I was looking for was there, because the person who created it saved everything, but it's just impossible to find.

Mike McCaffrey :pdx_badge:

@tomw Someone needs to develop a search engine to sort through the endless text being created by AI tools. You could call it the "Better Optimized Robot-Generated Exclusionary Search", or B.O.R.G.E.S.

Sasha Laundy

@tomw if you want a preview of this, try searching for *anything* to do with cooking, baking, or gardening on DuckDuckGo. Entirely links to autogenerated, conglomerated crawled content, turduckened together with tons of ads

And because it’s readable and mostly plausible info, it often takes me halfway through the page to realize it’s junk, remember DuckDuckGo is unusable for these topics, and switch to google

Peter J. Welcher

@tomw you just described Amazon shopping lately. 5000 alleged matches, no useful filter to narrow the search. (Try finding a 3/4” wide belt for example. China seems to think 1.25 to 1.5” is what all US men want.)

Tom Walker

@jkfecke No, I had no idea the story existed before a bunch of people told me about it in response to this!

Kate Nyhan

@tomw You are right, even in the most literal interpretation of "library." Librarians license journal and ebook packages for our collections but can exercise very little control over their contents. I don't know how to fix it - #CollectionDevelopment is not my specialty - but I see this dynamic in the catalogs of the academic library where I work and the public library where I'm a patron. Lots of content that no one would buy on its own merits, drowning out high-quality material.

Falcon Darkstar

@tomw this is why libraries the institutions are so important, and different in quality from libraries the buildings. It's the job of a librarian to fill the library with information, refresh it, and remove (or archive, if important) what's stale.

Xebulun EnEssEitch

@tomw unless there were some way of making a list of the ones before the polluting. but in the absence of clear marking and/or heuristics to detect churned content going forward it is indeed a tragedy of the commons...

Xebulun EnEssEitch

@tomw but fine art did not end when cheap printmaking became possible. cultures adapt.

Neil Craig

@tomw That's some properly good red team thinking!

Kluthulhu' XOR 1=1--

@tomw
About 15-20 years ago, we stood on the cusp of the information age, but collectively decided that drowning in the trash of the data age was the way to go...

Irenes (many)

@tomw yeah we've been quite concerned about this, personally.

Clifton Royston

@tomw

This process started quite a long time ago but it's been accelerating steadily, much of it driven by the way Google ranks search results and pays sites for ads.

ChatGPT is just throwing cans full of gasoline on the existing fire.

chrismac4u

@tomw thanks for the new anxiety. Now I won’t be able to sleep tonight.

Jeff Dillon

@tomw Vernor Vinge's novel A Fire Upon the Deep features a malevolent AI bent on galactic domination; the messages warning of its spread are buried in a sea of conflicting and incoherent babble.

Aleatha

@tomw there is a really prescient short story along these lines, by Borges about eighty years ago. (“The Library of Babel”.) I think about it a lot lately.

Space Catitude 🚀

@tomw

This is why open source software documentation has a reputation for being poor quality, IMHO.

There's actually good documentation for most of it, but also documentation that is obsolete, unofficial, etc. And that dilutes the quality that the searcher experiences. If you get the wrong source, it'll be subtly wrong a lot of the time.

One reason that it's really important to include version numbers and dates.

Joe Ganley

@tomw This is a subplot in the Neal Stephenson novel "Fall; or, Dodge In Hell". There (spoiler alert) for all practical purposes it destroyed the Internet.
