Charlie Stross

Study Finds That 52 Percent of ChatGPT Answers to Programming Questions Are Wrong

futurism.com/the-byte/study-ch

Hugo Mills

@cstross ... but how many human answers to programming questions are wrong?

(OK, probably not 52%, but I bet you it's higher than you first thought...)

Emma Builds 🚀

@darkling @cstross the thing is that we're being sold the lie that 99.9% of the answers from the glorified logistic regression are correct.

And that 0.1% is still big enough to kill billions of people.

Cheradenine Zakalwe

@darkling@mstdn.social @cstross@wandering.shop It's the wrong question. The correct question is, "What proportion of answers to programming questions given by programmers who understand the language and the question are wrong."

Ask almost ANY question of someone who doesn't actually understand the question or the subject, and the answer you receive is overwhelmingly likely to be wrong. (This is doubly true in America, where it seems to be considered almost a mortal sin to ever be heard to say "I don't know".)

Cybarbie

@darkling @cstross Indeed I wonder what the StackOverflow accepted answer fail rate is. It's quite subjective. Usually the second or third answer on SO is the correct one, the first usually being the product of some diseased brain that doesn't do real work.

mathew

@darkling @cstross An important difference is that on Stack Overflow, volunteers will usually have posted corrections.

Whereas with ChatGPT, people turn up in forums to ask someone else to do the work for them of determining whether the slop from the bot is correct or not.

unlucio 🌍 :mastodon:

@darkling @cstross I'd argue that if you hire a software engineer and they're wrong 52% of the time, that wasn't a good hire.

sabik

@unlucio @darkling @cstross
If you hire a software engineer and they're wrong 52% of the time, they may still be an excellent hire if they're a good learner, open to feedback, conscientious, etc

ChatGPT is not: it has no facility to learn or handle feedback beyond the session (if that), nothing

sabik

@unlucio @darkling @cstross
If I spend longer than it would have taken me to do it myself helping a junior engineer through a problem, I've helped them grow, to the benefit of them and the team

If I spend longer than it would have taken me to do it myself helping ChatGPT through a problem, I've wasted my time

Jargoggles

@darkling @cstross
The critical difference, something I don't think I saw anyone mention in this thread, is that human beings understand how to say "I don't know."

An LLM is an even worse version of some asshole that weighs in on *everything* and asserts wrong answers just as confidently as right answers

FeralRobots

@cstross
how do we explain to folks that 52% wrong doesn't mean 48% correct?

Zeno

@cstross Still 2% worse than my previous answer machine

Cheradenine Zakalwe

@ezeno@mastodon.uno @cstross@wandering.shop I once knew someone who managed to achieve a grade of 18% on a five-options multiple-choice exam...

Cheradenine Zakalwe

@cstross@wandering.shop Color me soooooooooo shocked.

My last (as in both most recent, and final) tech employer went whole hog on using ChatGPT to write code and DB queries.

TomGregory

@cstross

I worked for over 30 years as a geophysicist in the oil business, interpreting seismic data. Going back as far as the 90s, companies were always trying to sell us artificial intelligence software to let the computer do the interpretation for us. Up until my retirement about 7 years ago, I found that geophysicists would spend more time correcting the "interpretation" the machine produced than it would have taken to do the interpretation themselves. I do not trust this "artificial" intelligence.

Cheradenine Zakalwe

@cstross@wandering.shop I see upon reading that the article states, "AI platforms like ChatGPT often hallucinate totally incorrectly [sic] answers out of thin air."

While this is true as far as it goes, I believe it misstates — and understates — the problem. A more accurate statement of the problem is, "Large language models hallucinate ALL of their responses. Some of the hallucinations merely happen to coincide well with reality." And there is no obvious way to tell those from the ones that don't.

They do not understand anything. They are not designed for understanding. What they are designed to do is very specifically to generate grammatically correct output that looks convincing.

Justin Derrick

@cstross Every piece of sample code ever provided to me by a project manager or non-technical co-worker that used functions that didn’t exist, had obvious syntax errors, or did things I considered insane… turned out to be from an LLM. Their attempt to show me how easy it was to write the necessary code turned into a lesson in why programmers should just be left alone to do the necessary thing.

Cheradenine Zakalwe

@JustinDerrick@mstdn.ca @cstross@wandering.shop this is a sufficiently well-known problem that there is now an established class of software attacks that is based upon predicting fictitious library names likely to be generated by ChatGPT or other LLMs, then publishing libraries under those names containing malicious code.
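
The attack works because suggested dependencies get installed sight unseen. A minimal first-line check, sketched here under the assumption of Python and PyPI's public JSON API (the second package name below is made up for illustration; note this catches hallucinated names but not registered malicious ones):

    import urllib.error
    import urllib.request

    def exists_on_pypi(package: str) -> bool:
        """Return True if `package` is a registered project on PyPI."""
        try:
            with urllib.request.urlopen(f"https://pypi.org/pypi/{package}/json") as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            return False  # 404: no such project; the name may be hallucinated

    # Screen every dependency an LLM suggests before installing it. Existence
    # alone is not safety -- the attack described above is precisely to register
    # these names -- but a missing project is a sure sign of a hallucination.
    for name in ("requests", "definitely-not-a-real-package-xyz"):
        print(name, exists_on_pypi(name))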

Sky UwU

@cstross I find it surprisingly good for rubber ducking, or just for getting search terms to chase up elsewhere. I'd really like to see these tools be a bit more open. Is there anything that is good at that kind of problem that is properly open?

Gracious Anthracite

@cstross

I am thinking about all the people I see on Hacker News raving about how EFFICIENT talking to AIs is making them and giggling.

I am also hoping I never have to deal with any system they were involved in building...

Weekend Editor

@cstross

Honestly, isn't it surprising it's that low?

I'd've thought it would bork a lot more questions than that.

Pēteris Krišjānis

@cstross so basically at best it's a coin flip. No, thank you.

JdeBP

@kithrup @cstross

Real Programmers don't use languages that computers can *output*. Systems that don't require a keyboard with at least 15 extra non-USB keys, and a specialized foot-pedal connected to the GPIO pins, are mere children's toys used by web developers and JavaScript vapers.

And Real Programmers don't interpret figures like 52/100 in anything other than octal.

(Heh! It has been a while since someone set up a Real Programmers joke.)

#RealProgrammers
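
(For anyone whose octal is rusty, the 52 turns into a rather famous number when read in base 8. A Python one-liner, purely for illustration:)

    # "52" read as octal is 5*8 + 2 = 42 -- the Answer to Everything
    print(int("52", 8))  # prints 42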

P J Evans

@JdeBP @kithrup @cstross
I've known people who had to fix their checkbooks, because there weren't any 8s or 9s in the numbers they saw.

Bee O'Problem

@JdeBP @kithrup @cstross A Real Programmer works by rewriting the flash memory on the SSD directly with a precise touch of their handheld electron tunneling device

KanaMauna

@cstross

So flipping a coin is more accurate? Awesome.

aadmaa

@cstross It is useful for figuring out syntax of fairly popular languages that you are just learning or don't use often. It is not useful for writing code.

E.g., I have had good luck asking about basic Rust questions, including debugging and explaining my borrow-checker-fights; help writing basic RegEx (since I can never ever remember anything RegEx).

If I were learning TS or JS or SQL I'm sure it would be helpful. It can probably help write a tricky TS type for example, but I don't really need help with those things.

I found it quite bad at languages with less StackOverflow coverage, like Elixir. And forget about, say, the 2023 version of Elixir LiveView - the AI can't help you there.

Also, once you get deeper than the basics, it doesn't keep up with the times very well. So one thing I haven't heard discussed much is how it's likely to create a tendency towards stagnation in the ecosystem.

By "it" I mean the GPT versions, 3.5 and 4.0. The Google ones are still only good at feeding you rocks.
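
In that spirit, a suggested regex is cheap to verify before trusting it. A minimal sketch using Python's standard re module (the pattern and the test cases here are invented for illustration):

    import re

    # Hypothetical LLM-suggested pattern for ISO dates (YYYY-MM-DD)
    pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

    should_match = ["2024-05-23", "1999-12-31"]
    should_reject = ["24-05-23", "2024/05/23", "2024-5-23"]

    # Run the suggestion against cases we control before using it anywhere.
    assert all(pattern.fullmatch(s) for s in should_match)
    assert not any(pattern.fullmatch(s) for s in should_reject)
    print("pattern behaves as expected on all test cases")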

Killa Koala

@cstross ChatGPT 5 will reduce that to 49%. Then everything will be fine.

⚛️Revertron :straight:

@cstross If you ask it to write a program in #Rustlang it fails 90% of tasks :)

Radio Resistance

@cstross that's the fun part. companies are going to go deep on ai coding only to absolutely fuck themselves over. the code ai can generate is often remedial code that i would never run on a production server. i've never seen it write code that isn't shit. companies think engineers are expensive, they're about to fuck around and find out.

CubeThoughts

@cstross In my (admittedly very limited) experience, I'm not even getting internally consistent answers: variables change names partway through a snippet, among other errors.

But the value, such as it is, has been in getting suggestions for new ways of solving something, which I can then do something with using actual reference documentation.
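
A purely hypothetical Python fragment of the inconsistency being described, invented here for illustration:

    # The accumulator is introduced as `total` but returned under a different
    # name, so the function raises NameError the moment it is called.
    def sum_positive_buggy(xs):
        total = 0
        for x in xs:
            if x > 0:
                total += x
        return result  # `result` was never defined

    # An internally consistent version of the same function:
    def sum_positive(xs):
        return sum(x for x in xs if x > 0)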

enoch_exe_inc

@cstross …which is why I don’t use it for that purpose.

Of course, I’ve asked ChatGPT for nonessential programming, like making a Quine in 6502 assembly, and it succeeded in doing that. But for normal work, I don’t dare touch it because if it makes a mistake, then I will have no idea how to fix it.

Pepperbike

@cstross i'm surprised it is only 52% and not much higher.

Pooblemoo

@cstross I hope healthcare and transportation aren't using AI for this, because that's a whole lot of risk. They'd better lawyer up for the bugs that cause accidents and deaths.
