Alyssa Rosenzweig 💜

Trusting LLMs threatens your credibility.

I read a bogus claim about GPU instruction sets, cited to GPT-4 and an anonymous "expert". This is my area of expertise, and I know the claim is demonstrably false. And now I know the author is relying on bullshit generators. Now I doubt every other claim the author makes, because with egregious errors in the parts I know about, how could I trust the parts I don't?

(Edit: narrowed the scope of the lead.)

82 comments
mnl

@alyssa In a way, using LLMs for facts in earnest is going to expose a lot of people who benefited from the doubt before. Now everybody is on the lookout, and the easiest to fool are the fools themselves.

Hayley
@mnl @alyssa I have to correct first years who were convinced ChatGPT got the right answers for their math homework, when it's consistently dead wrong. "But it gave a different answer" is really unconvincing with that in mind.

(Unrelated gripe - first years in CS, so writing a program to do it would be much less likely to fail.)
Janne Moren

@alyssa
This is something I frequently ran into as a researcher already years and years ago, way before LLMs:

Some publication or columnist confidently spouts absolute nonsense about stuff in my own field; and now I can never trust what they say about any other subject.

Sheep

@jannem @alyssa

There is a term that relates to your experience called "Gell-Mann Amnesia"

Sindastra♀️✅

@alyssa Reminds me a bit of "journalism" in general. The amount of nonsense I read in credible newspapers, speaking about "tech(nology)".

Makes me wonder how much nonsense they report in areas I'm not very knowledgeable about.

Anafabula

@sindastra @alyssa There is a name for that. Gell-Mann Amnesia

Gen X-Wing

@sindastra @alyssa I have the same feeling. Basically LLMs are simply amplifying what has been and making it even more apparent.

Ezekiel :swift:

@alyssa what are your thoughts on isolated decision making behind a fixed input/output layer? For example, imagine if Siri etc. used an LLM to interpret and do language parsing but NOT to build the response

Alex Celeste

@ezekiel

cool but you could make it better by getting rid of the bit where it does language parsing

(serious answer, i cannot think of a justification to myself for why this is an improvement over just _not_ having that step)

Ezekiel :swift:

@erisceleste I mean, isn't one of the most common issues with virtual assistants that they misinterpret a query? For example, consider the following exchange with ChatGPT that Siri (and probably others??) would've completely failed at.

Note that it properly parsed out the request AND corrected my spelling.

— Query —
Pretend to be a virtual assistant. You will receive a query for adding things to reminder lists. Consider the available lists: [Groceries, To-do, Packing]. The user may specify a location trigger, time trigger, and will specify the text to add to the list. Please respond with the list they're trying to add to, the time and/or location if specified, and the text to add.

Prompt: add at two oclock apples to my groceirries list

— Response —
Sure, I can help you with that. Based on your request, you would like to add "apples" to your "Groceries" list at 2 o'clock. Is that correct?
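
A minimal sketch of the "fixed input/output layer" idea from this exchange, assuming the language parser (an LLM or anything else) only proposes a structured record and plain deterministic code decides whether to accept it. The names `ReminderRequest` and `validate_proposal`, the dictionary keys, and the `HH:MM` time format are hypothetical choices for illustration, not anything Siri or ChatGPT actually exposes.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# The lists named in the query above.
AVAILABLE_LISTS = ("Groceries", "To-do", "Packing")


@dataclass(frozen=True)
class ReminderRequest:
    list_name: str
    text: str
    time_24h: Optional[str] = None  # "HH:MM" (assumed format)


def validate_proposal(raw: dict) -> ReminderRequest:
    """Accept a parser's proposal only if it fits the fixed schema."""
    by_lower = {name.lower(): name for name in AVAILABLE_LISTS}
    list_name = by_lower.get(str(raw.get("list", "")).strip().lower())
    if list_name is None:
        raise ValueError(f"unknown list: {raw.get('list')!r}")

    text = str(raw.get("text", "")).strip()
    if not text:
        raise ValueError("empty reminder text")

    time_24h = raw.get("time")
    if time_24h is not None:
        # strptime raises ValueError on anything that is not HH:MM.
        time_24h = datetime.strptime(str(time_24h), "%H:%M").strftime("%H:%M")

    return ReminderRequest(list_name, text, time_24h)
```

Under these assumptions the parser can misread the request, but it cannot invent a list that does not exist or smuggle free-form text into the response; anything outside the schema is rejected before it is acted on.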
Alex Celeste

@ezekiel

perhaps i speak from too much of a cultural bubble when i say that

groceries:append(apples, 1400)

seems better to me than using english to describe a task that doesn't need to be described in english, in the first place

this can be three button-presses and a swipe of a drop-down

Ezekiel :swift:

@erisceleste I appreciate why you would feel that way, but I hope you understand that the general population just wants to be able to make a request and have it understood, not following a syntax

Omega Jimes

@alyssa It's a little disheartening that the mass response to public LLM availability seems to be "Well now I get to think EVEN LESS!".

Autumn

@omegajimes @alyssa
Especially with the fact that they can make automation much more accessible

They are an excellent tool for those who know how to use them, and unfortunately, misunderstood by most

Konrad Kołakowski

@alyssa LLMs work best on popular topics with a huge corpus of data. GPU driver programming is for sure faaar from that. They might be useful, as a tool, for helping with some tedious, repetitive work, but for sure not for such niche work.

It should always be used as a tool, with a large amount of scrutiny.

A huge problem is that if they „don’t know” something, LLMs simply confabulate 🙃 Extremely dangerous for beginners or less ML-literate people 🫤

Jennifer Kayla | Theogrin 🦊

@kkolakowski @alyssa

Even with a huge corpus of data, LLMs are useless, and here's why:

They generate text which looks like, but is not, equivalent to a researched and reviewed paper.

They will take bits and pieces from the entire set of articles and chunk them together into something which is functionally meaningless but looks acceptable at a casual glance.

And I mean individual words! Sentence fragments! Syllables!

They don't know ANYTHING. But they give the illusion of doing so.

Autumn

@theogrin @kkolakowski @alyssa LLMs don't necessarily need to generate stuff

There are signs of promise for LLMs that avoid hallucination by paraphrasing and permutating instead.

I recommend checking out perplexity.ai

LLMs are also quite helpful for automation; the base training data is just to get the relations right in the first place, then constraints, checks, temperature, and human validation can help vet things out

Autumn

@theogrin @kkolakowski @alyssa
Base LLMs like GPT are useless to your average Joe but great for developers (i.e., you need to know prompt engineering and AI model tooling); ChatGPT is fun for conversations and the public but useless otherwise, and Perplexity is only good for taking multiple raw articles and quoting them directly

Autumn

@theogrin @kkolakowski @alyssa
I might be a tad bit quick to defend LLM development because I believe the best way to go about adopting AI and general technology is to educate the public and encourage them to understand the tools they use, both their devices and their software

Yes, I am a Linux nerd who wishes people wanted to know why and how their systems work

But the curiosity AI has garnered from people could be a great opportunity, even if it's not particularly helpful for most people

Ryan

on a slightly related note: the annoyances of hearing clearly-unqualified people tell you your job is worthless, your passion is not worth pursuing and your work is unnecessary because of LLMs and AIs in general...

you know they're wrong, but it's disheartening still.

Kofi Loves Efia :verified:

@ryanc be prepared to be very much more depressed. LLMs are very exciting to the Venture Capital set.

Ryan

@Seruko at least i'll be in school during the AI bubble, my condolences to everyone who's about to get/already got replaced by AI because of corporate

cbsnews.com/news/eating-disord

Demi Marie Obenour

@alyssa Could LLMs be suitable for generating proofs which are then checked by a _sound_ proof checker that is secure against malicious input? If the proof is wrong, the checker will catch it, so no harm has been done.

Jennifer Kayla | Theogrin 🦊

@alwayscurious @alyssa

Depends how you define harm. The requirement for checkers becomes exponentially greater with the use of bots and large language models. Of course, one should perform one's best efforts to check the validity of any resource, but the need for more robust and cautious checks increases the time requirements greatly.

Also, it feels like generating noise at random and then checking for anything which could be a sonnet.

ShadSterling

@alwayscurious @alyssa only if
1. The work required for repeated checking is less than the work required to get the same result without the LLM, and
2. Default access only includes post-check results

But even if those conditions can be met without undermining the viability of the product, it could only be used in contexts for which an ~infallible checker has been integrated, which could not include general use
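
The two conditions above can be sketched as a generate-and-check loop. This is a toy illustration under stated assumptions, not a proof checker: the "proof" is a claimed factorization, `check_factorization` stands in for a sound checker, and `first_verified` is a hypothetical wrapper that caps the checking work and only ever returns post-check results.

```python
from typing import Iterable, Optional, Tuple

Candidate = Tuple[int, int]


def check_factorization(n: int, candidate: Candidate) -> bool:
    """Trusted checker: cheap, deterministic, independent of the generator."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == n


def first_verified(
    n: int, candidates: Iterable[Candidate], max_attempts: int
) -> Optional[Candidate]:
    """Return the first candidate that passes the checker, or None.

    Unchecked candidates never leave this function (condition 2), and
    max_attempts bounds the work spent on repeated checking (condition 1).
    """
    for attempt, candidate in enumerate(candidates):
        if attempt >= max_attempts:
            break
        if check_factorization(n, candidate):
            return candidate
    return None


# Stand-in for an untrusted generator; the first two outputs are wrong.
untrusted_output = [(3, 5), (1, 91), (7, 13)]
print(first_verified(91, untrusted_output, max_attempts=5))
```

The sketch only works because checking a factorization is far cheaper than finding one and the checker cannot be fooled; as noted above, most general-purpose uses have no such checker.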

morgan

@alwayscurious @alyssa There are automated systems to generate mathematical proofs, but I don't think those work anything like LLMs.

Alex Celeste

@alwayscurious

if you know you're trying to generate something specific like proofs, you probably don't need an overwhelmingly overpowered tool like GPT-4 to do that

the use of LLMs is that, through gigantic amounts of wasted computing power, they can appear to emulate a near-infinity of simple tasks; but if you have a specific task in mind the chances are _extremely_ good that you don't need the Swiss-Army-Chainsaw to do it

22

@alyssa do you find the code it emits too buggy to use, or the chat interface too time-consuming to use?

Dawn

@22 it's usually utterly irrelevant. I talked with someone trying to use an LLM to write a mod for a game. The code they offered up from the LLM was nonsensical and irrelevant to the game, mashed together multiple game engines and in no way would've interacted with existing systems.

22

@funky7monkey intriguing, thank you. In my experience as a JavaScript and Python dev, it is extremely helpful with those languages, plus things like ImageMagick and jq and command-line things, so I wonder if it’s better at those than game dev because of maybe more training data or whatever? I value others’ counterexamples so thank you for this.

Matt Hodges

@alyssa @anildash “cited to GPT-4” … ooof … reminds me of spam blogs that put “Source: Reddit” on their content regurgitation mill.

Jennifer Kayla | Theogrin 🦊

@alyssa

One of the best articles I've read isn't about specific types of highly specialized work, or even loosely specialized. It's something a five-year-old can typically explain:

Tic-Tac-Toe.

And ChatGPT is excellent at coming up with a seemingly convincing explanation for its tactics. Long-winded and verbose. But it's pants at playing, and I think that perfectly illustrates the difference between the illusion of intelligence, and actual brains.

aiweirdness.com/optimum-tic-ta

Aris Adamantiadis :verified:💲Paid

@alyssa LLMs are an excellent way of getting quick answers to some problems, and if you're open, learn a thing or two that you haven't considered. It's useful. But taking any LLM output at face value without double checking means you're foolish and naive.

BlueWinds

@aris

They're not actually useful for that either. They *look* useful for that, but they're actually just as garbage at that as they are at other tasks beyond "stringing together reasonable-seeming English text."

Elliott

@alyssa I asked ChatGPT some fairly basic math questions (not computations) and it very confidently gave wrong answers. Have to be really careful with it.

coupland

@alyssa

Me: "In what year was the Battle of Hastings?"

AI: "The Battle of Hastings took place on October 14, 1066."

I'm sorry but it's your credibility on the subject that's suspect. Wild general statements like "LLMs are only for entertainment" are ridiculous. Here are some *reasonable* statements:

"Be skeptical of everything you read, and when it really matters always verify."

"Don't use a hammer to fix your plumbing. Every tool has an ideal use, choose wisely."

Lotus

@coupland But what's the point in using ChatGPT to answer questions like these? I don't know when that battle took place, so I have no way of telling if it's making things up.

The uncertainty isn't worth it for me. I would rather just make a few Google searches and use websites I know are reliable.

coupland

@LotusHopper Because for questions that have a simple, deterministic answer LLMs are generally quite reliable and it's WAY FASTER than doing a web search.

As I said, right tool for the job. A search engine isn't really the best tool for simple questions with a deterministic answer anymore. There's a new tool in town that's great if you use it right.

Lotus

@coupland What is a deterministic answer? Things that don't change with time, unlike demographic numbers?

coupland

@LotusHopper Questions that can be definitively answered and that require no interpretation nor change depending on your perspective.

"What's 2+2?" or "What year was the Battle of Hastings" or "What is Miley Cyrus' birthday?"

Questions like "what's the best pizza recipe" or "why is Iran/America/Russia so evil" are not well suited to LLMs.

Tröglödÿt

@coupland @LotusHopper

what year something happened is a statement that requires a lot of interpretation and presuppositions

like, what is an event in the context? what calendar is used? what are the commonly agreed upon limits of the type of event?

just because interpretation seems easy to you and you don't recognise that it happens, doesn't mean it isn't necessary

there are no simple facts, because human language is quite complicated

coupland

@troglodyt @LotusHopper Sorry Troglodyt but that's a whole pile of pseudo-intellectual horseshit. There is zero... ZERO... ambiguity to asking what Miley Cyrus' birthday is or what year man landed on the moon. Come back to earth space man.

Tröglödÿt

@coupland @LotusHopper

ok, looking at your feed and your nick i must make a jump and conclude that trying to make conversation with you is an absolute waste of time, you're much too gullible and already totally occupied by less sophisticated forces than my mind

good luck with your faiths little fellow

flere-imsaho

@coupland lol. you're coming from crypto-dorks instance and have dot eth in display name.

wakame

@alyssa
I read a text (a blog entry? a rant?) a few years ago that annoyed me (a lot).

It was about a researcher who basically stated that people shouldn't criticize or review his papers, because he was "right".
Paraphrasing: "The probability that someone reviewing one of my papers is not understanding it or getting it wrong is vastly higher than me making a mistake."

Maybe LLMs finally have an effect that people don't take everything at face value. In the past, a text was very likely written by a human. We can't say that anymore.

(Of course, the effect could be the opposite: "Our new FactGPT makes sure to tell only 'the truth'. If you see a text with the green FactGPT checkmark™️, you can be sure that it only contains 'truth'.")

Panegyr 🤡🎪

@alyssa I find they can be occasionally useful if you want exactly what they generate, which is a regression to the mean of subjective answers. Queries like “what is a typical naming scheme for a node in a kubernetes cluster” have a very low likelihood of causing actual harm. Just don’t trust them for anything more complicated than incredibly general questions that don’t have wrong answers. Which, to be clear, is most things. You shouldn’t trust them for most things

jz.tusk

@alyssa

Is this the first instance of someone being "chatsplained" to?

Wendell Bell

@alyssa I ‘almost’ told the parties recently that ‘no AI was used in the preparation of this (arbitration) Award,’ which was true, but I finally figured it would be worse to say it: most wouldn’t yet get it, and those who did would think I was making a jk, that might not land right.

Kara Goldfinch

@alyssa Yeah. I've used it for daft things like "write me a Dire Straits song about Macbeth". I thought, considering they did one about Romeo and Juliet, it'd be interesting to see what it'd do.
Using it for anything serious, not a chance.

BrianOnBarrington

@alyssa Gosh, I think what you really need right now is an overconfident straight white guy who got his online learning certificate in GPT4 from LinkedIn to wade into the topic and “educate” you. 🤣

Mark - Ottawa on Tundra 🇨🇦 :mstdnca: :flag_ON:

@alyssa “In a time of deceit telling the truth is a revolutionary act.”
― George Orwell

Alexis :verifiedtransbian:

@alyssa This 1000%. ChatGPT should never be used for serious work, especially anything like legal defense. I'm sorry you have a bunch of ChatGPT apologists in your replies, so I'd like to let you know, there are a lot of us who fully agree with you

Robert Buchberger

They're also good for any time when the output is easily tested/verified. I've used GPT for little scripts in unfamiliar languages for example.

They're good for manipulating information you give them, but can't be trusted to go out and find it in the first place.

Steffen Christensen

@alyssa I use LLMs for research, for economics, for summarizing, and for programming. It's all fine. LLMs are highly useful tools.

Publishing LLM-produced output without extensive checking and editing is dumb.

Slayerranger/Crackamphetamine

@alyssa Yeah I had a friend mess with OpenAI to write a fake vulnerability report. It lied to him multiple times, and as he continued correcting it, ChatGPT started making dead links to non-existent vulnerabilities so he LOL’d hard at it because it was just citing unrelated CVEs in his troll report 😂

Val Packett

@alyssa@social.treehouse.systems in 👏 this 👏 house 👏 we 👏 only 👏 trust 👏 LLVMs

Gecko

@alyssa I have to admit I find it quite useful for generating initial PoC code.

Though usually I still have to do edits before the code even runs.

That being said, one should never use LLMs as a knowledge source. Today I had it tell me that `let` in Rust is used to declare mutable variables xD

Gecko

@bluewinds @alyssa I'm well aware, hence I try to only use the generated code when I fully understand it.

The heads-up is still very much appreciated nevertheless <3

BlueWinds

@gecko

That's my real point: it's not actually good for anything. It's all hype. If you're understanding it fully before using it, you'd have been better off just doing the work yourself to begin with!

Anything that chatgpt *seems* to be good at, it's more likely to be harmful than helpful.

DELETED

@alyssa Feel like a whole looooooooooooota mothafuckers are about to learn in real time how trust / journalistic integrity works

Brian Grinter

@alyssa LLM is mansplaining-as-a-service - confidently given completely wrong answers 🤣

Space Cowboy

@alyssa Yeah I just use it like google. For some reason people know when they click a link on google they understand the information can be unreliable. With Chat GPT they implicitly trust it.

Where do people think Chat GPT gets its data from?

In this case it's probably just quoting something from a stack overflow question from someone that didn't know what they were doing (which is why they were there). But sounding really confident while doing it.

Sean :nivenly: 🦬

@alyssa I've heard some interesting arguments for using it to come up with project names or to expand on an email you're having trouble writing.

Of course the asterisks to that are you need to edit it to your own after gpt goes at it and you need to verify everything contained within is true.

aebrer - Andrew E. Brereton

@alyssa I mean I use copilot for coding and it's an LLM and in that context I find it very helpful. It's not doing the decision making though, mostly just remembering obscure syntax for me

Feoh

@alyssa Respectfully, I don't agree. LLMs are super at helping out when you can *know* without the tiniest sliver of doubt that the results are correct, and when you treat the results like a suggestion to be vetted, corrected and massaged and not a completed final deliverable.

SnoopJ

@feoh @alyssa Respectfully, the list of people I trust to actually do this post-facto vetting when using one is very short.

Feoh

@SnoopJ @alyssa I don't wish to argue, but let me give you a very concrete example:

"Write pytest unit tests for this code".

It spews out a page full of code, including all the necessary boilerplate for test setup, database setup, etc. etc.

I then take that and add the higher value tests that the LLM doesn't write.

For another example, I am a bit of a windbag. I take a block of business prose, pass it to the LLM, and say "Rewrite this for conciseness and professional tone."

If you *know english* you can validate the correctness of the prose it generates in terms of conveying intent, and if you care you can even use other tools to validate grammatical correctness.

Alyssa Rosenzweig 💜

@feoh @SnoopJ Personally, I am uncomfortable using (current) LLMs for those.

For boilerplate - If a system requires large amounts of boilerplate, that's a red flag to me (as an Opinionated developer) about the solution. I would prefer to improve the ergonomics than repeat boilerplate. I realize that's not always possible, but there's enough crap software out there, I'd rather we didn't generate more. The affordance of IntelliJ proliferation is variable and function names becoming more verbose (that may or may not be good). I suspect the affordance of boilerplate generating tools is... systems requiring more boilerplate. ("It's so easy to generate, what's wrong? You don't like code audits that are needlessly difficult? Upset that defect counts are roughly proportional to quantity of code?")

For both - the issue @SnoopJ raises - the current UIs and marketing work together to discourage vetting and instead trust the generated output. Would you catch a subtle bug in the generated boilerplate that caused tests to pass unconditionally? Would you catch a subtle shift in message from the professionalized text?
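
To make the "pass unconditionally" worry concrete, here is a small hypothetical sketch; none of these names come from the thread or from any real generated output. `swallowing_check` looks like a real test but catches its own assertion failure, so it passes even against a deliberately broken implementation, while `value_check` does not.

```python
def apply_discount(price: float, percent: float) -> float:
    """Reference implementation."""
    return price * (1 - percent / 100)


def broken_apply_discount(price: float, percent: float) -> float:
    """Deliberately wrong: ignores the discount entirely."""
    return price


def swallowing_check(fn) -> None:
    """Looks like a test, but the broad except hides every failure."""
    try:
        assert fn(200.0, 25) == 150.0
    except Exception:  # AssertionError is an Exception, so it is swallowed
        pass


def value_check(fn) -> None:
    """Fails loudly when the result is wrong."""
    assert fn(200.0, 25) == 150.0


def passes(check, fn) -> bool:
    """Run a check against an implementation and report pass/fail."""
    try:
        check(fn)
        return True
    except AssertionError:
        return False


print(passes(swallowing_check, broken_apply_discount))  # vacuous pass
print(passes(value_check, broken_apply_discount))       # real failure
```

Running each check against a known-bad implementation, as `passes` does here, is one cheap way to notice a test that cannot fail; it does nothing for the plagiarism and licensing concerns.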

For both - would you catch plagiarism or open source license violations in the unattributed generated output?

Maybe your eye is more keen than mine. But I suspect with Copilot my brain would be on Autopilot.

I can't trust the output of these tools, the way I can trust YouCompleteMe and proselint. That's reason enough for me to stay away. If I can't trust them for my own work, I don't know how I could trust what people who do trust the output claim / commit / send.

It's tempting to say the problem is misuse. As an expert on GPUs (but not LLMs), I know that the query in question is unanswerable for current LLMs. The honest response I'd expect asking a human is "I don't know, sorry". Instead, apparently GPT confidently spewed wrong info. Was the asker misusing the LLM? Maybe, but it seems that's what the UX encourages.

The point of this thread isn't a moral judgement. It's just that, looking at other people's use of the tools (and the creative ways it can go terribly wrong), it's becoming clear to me that the emperor has no clothes.

SnoopJ

@alyssa @feoh to me, the larger UX threat is the knowing misrepresentation of LLMs as expert systems for every use case.

I do see the use-case @feoh is talking about, and I've given it a try a few times at the encouragement of others. It's… fine.

But I agree that the overall effect of these systems is corrosive on trust, because as you say, it only takes one such failure to cast a shadow on everything else, even the stuff that isn't LLM output.

Feoh

@SnoopJ @alyssa Oh I totally agree, but I think the onus for that falls squarely on the people using and relying on these tools in WILDLY inappropriate contexts where they have no business.

I suspect you folks might agree with that :)

Autumn

@alyssa There are some LLMs specifically designed for permutative writing instead of generation, such as Perplexity.ai, which do show promise for being a viable AI search engine replacement, but I feel AI is currently misunderstood and misused by the public, especially those who confuse chatbots for base models and general AI

tldr; we need better AI education for the general public to solve for false credibility

TheDoctor

@alyssa maybe I’m wrong here, but I think it’s not bad to use LLMs to ask them questions as long as you order them to give you their sources, too. So that you check again what they tell you.
Did that when asking for dog facts for a friend. The specific question I had wasn’t answered properly by any other search engine. So I asked ChatGPT but also asked it to provide me its sources.

Jacob Rowe-Lane

@alyssa Had a similar experience a couple of times. ChatGPT straight up doesn't understand pointers - I was debugging some code and ran it through to see if it could find the error and it very confidently told me that actually I should allocate space for a double pointer to a data structure, and assign the result (a pointer) to a double pointer - in which case I'm returning a pointer to a double pointer and assigning that pointer to a double pointer and ending up with a triple pointer

Minty

@alyssa Thank you for the post. This sort of thing is going to be a huge issue going forward.

DrYak

@alyssa Yes! That!

Trusting what boils down to an "autocomplete on steroids" to give you accurate information is completely asinine.

At best, use it to reformulate nicely information that you know and you're feeding to it.

Or don't use it in scientific context at all.

Sigma

@alyssa@social.treehouse.systems
I sort of agree.
The issue is that some people think of LLMs as knowledge systems, which they aren't.
But I don't think this means that they're just for entertainment either. There are legitimate use cases for making sense of garbled data for example. There is also emergent behavior, like problem solving, that will be really useful in the future, I think.
