Mx Autumn :blobcatpumpkin:

@david_chisnall your experience is how I expected mine to be if I had actually given the technology a chance.

Machine learning has been useful for decades, mostly quietly. The red flag against LLMs for me (aside from the authorship laundering and mass poaching of content) was the scramble by all companies to shoehorn it into their products, like a solution looking for a problem I've yet to see it actually solve.

David Chisnall (*Now with 50% more sarcasm!*)

@carbontwelve I used machine learning in my PhD. The use case there was data prefetching. This was an ideal task for ML, because the benefits of a correct answer were high and the cost of an incorrect answer was low. In the worst case, your prefetching evicts something from cache that you need later, but a 60% accuracy in predictions is a big overall improvement.

Programming is the opposite. The benefits of being able to generate correct code faster 80% of the time are small but the costs of generating incorrect code even 1% of the time are high. The entire shift-left movement is about finding and preventing bugs earlier.
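The asymmetry described above can be sketched as a toy expected-value calculation. The payoff numbers here are purely illustrative assumptions, not measurements from either domain:

```python
# Toy expected-value comparison of the two use cases described above.
# Benefit/cost magnitudes are made up for illustration only.

def expected_value(p_correct, benefit, cost):
    """Expected payoff per prediction, given hit probability and payoffs."""
    return p_correct * benefit - (1 - p_correct) * cost

# Prefetching: a hit saves a cache miss (big win); a miss wastes one
# cache line (small loss). Even 60% accuracy pays off handsomely.
prefetch = expected_value(p_correct=0.60, benefit=100, cost=5)

# Code generation: a correct snippet saves minutes; a subtle bug can
# cost hours of debugging. Even 99% accuracy can be a net loss.
codegen = expected_value(p_correct=0.99, benefit=5, cost=600)

print(prefetch)  # 58.0 -> strongly positive despite 40% misses
print(codegen)   # roughly -1.05 -> negative despite 99% hits
```

The point of the sketch is not the particular numbers but the shape: when failures are cheap, modest accuracy is enough; when failures are expensive, even rare ones can swamp the gains.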

Mx Autumn :blobcatpumpkin:

@david_chisnall that’s a nicely eloquent way to put both into perspective.

David Clarke

@david_chisnall @carbontwelve this is what has been gnawing at the back of my brain. The purveyors of LLMs have been talking up the latest improvements in reasoning. A calculator that isn't 100% accurate at returning correct answers to inputs is 100% useless. We're being asked to conflate the utility of LLMs with the same kind of utility as a calculator. Would we choose to drive over a bridge designed using AI? How will we know?

David Chisnall (*Now with 50% more sarcasm!*)

@zebratale @carbontwelve Calculators do make mistakes. Most pocket calculators do arithmetic in binary and so propagate errors converting decimal to binary floating point, for example not being able to represent 0.1 accurately. They use floating point to approximate rationals, so collect rounding errors for things like 1/3.

The difference is that you can create a mental model of how they fail and make sure that the inaccuracies are acceptable within your problem domain. You cannot do this with LLMs. They will fail in exciting and surprising ways. And those failure modes will change significantly across minor revisions.
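The rounding behavior described above is easy to demonstrate. Python floats use IEEE 754 binary64, so they exhibit the same decimal-to-binary conversion errors the post refers to:

```python
# Demonstrating the rounding errors described above: 0.1 has no exact
# binary representation, so decimal-looking arithmetic quietly drifts.

print(0.1 + 0.2 == 0.3)        # False: both operands carry rounding error
print(f"{0.1:.20f}")           # shows the true stored value of "0.1"

print(sum([0.1] * 10) == 1.0)  # False: ten small rounding errors accumulate
print(1 / 3)                   # a rounded binary approximation of 1/3
```

These failures are exactly the predictable kind the post describes: they follow directly from the representation, so you can model them and decide whether they matter for your problem domain.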

Glitzersachen.de

@david_chisnall @zebratale @carbontwelve

"do make mistakes": I wouldn't call that a mistake. The calculator does what it should according to its spec for approximating real numbers with a finite number of bits.

It's (as you explain) a rounding error. A "mistake" is what Pentiums with the famous Pentium bug made.

But maybe it's my understanding of English (as a second language) that is at fault here.

Pendell

@glitzersachen @david_chisnall @zebratale @carbontwelve the calculator /is/ doing exactly what it's been programmed to... and it is programmed to make specific and defined "mistakes" or errors in predictable and clear cut ways in order to make the pocket calculator run on as little power as possible.

An LLM, likewise, is also doing exactly what it was programmed to do... and that is to spew regurgitated nonsense it read off the internet.

pasta la vida

@pendell @glitzersachen @david_chisnall @zebratale @carbontwelve programmers and CPU designers are just a tad sensitive and insecure when someone points out the calculator makes a mistake and isn't mathematically perfect 😅

Martijn Faassen

@david_chisnall

@zebratale @carbontwelve

I do find myself building up intuitions for what an LLM does. It's far less reliable than a calculator but humans can build intuitions for other unreliable things that can fail excitingly.

Haelwenn /элвэн/ :triskell:
@david_chisnall @carbontwelve Well one thing where LLMs can make sense is spam filtering (sadly also for generating it, as we probably all know by now…).

Like rspamd tried GPT-3.5 Turbo and GPT-4o models against Bayes and got pretty interesting results: https://rspamd.net/misc/2024/07/03/gpt.html

Although, as the conclusion puts it, one should use local LLMs for data-privacy reasons and likely performance reasons (elapsed time for GPT being ~300s vs. 12s and 30s for Bayes), which would also likely change the results.
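For context on the "Bayes" baseline in that comparison, here is a minimal naive-Bayes spam score. This is a toy sketch with a made-up two-message corpus, not rspamd's actual implementation:

```python
# Toy naive-Bayes spam scorer: word counts per class, Laplace smoothing,
# uniform prior. Illustrative only; real filters use far larger corpora.
from collections import Counter

spam = ["win money now", "free money offer"]
ham = ["meeting notes attached", "lunch tomorrow"]

spam_words = Counter(w for msg in spam for w in msg.split())
ham_words = Counter(w for msg in ham for w in msg.split())

def spam_probability(message):
    """P(spam | words), assuming word independence (the 'naive' part)."""
    vocab = len(set(spam_words) | set(ham_words))
    p_spam = p_ham = 1.0
    for w in message.split():
        p_spam *= (spam_words[w] + 1) / (sum(spam_words.values()) + vocab)
        p_ham *= (ham_words[w] + 1) / (sum(ham_words.values()) + vocab)
    return p_spam / (p_spam + p_ham)

print(spam_probability("free money"))   # > 0.5: leans spam
print(spam_probability("lunch notes"))  # < 0.5: leans ham
```

Unlike an LLM call, this runs in microseconds locally and its failure modes (unseen words, correlated features) are well understood, which is part of why the elapsed-time comparison above is so lopsided.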
PR ☮ ♥ ♬ 🧑‍💻

@david_chisnall

This 👉 “The benefits of being able to generate correct code faster 80% of the time are small but the costs of generating incorrect code even 1% of the time are high.”

Arttu Iivari

@david_chisnall @carbontwelve This sounds like a classic quality management problem: the cost of the shit.

nebulossify

@david_chisnall @carbontwelve part of the AI hype brainrot seems to be a complete disregard for the possibility of failure. Or, if failure is considered at all, the next bigger, hungrier model will fix it...

jincy quones

@nebulos @david_chisnall @carbontwelve That's capitalism 101: Create a problem that didn't exist before; conjure a solution you can build a business model around; kneecap whatever regulatory mechanisms that would eliminate the problem for good; leech off society for life!

Alaric Snell-Pym

@carbontwelve @david_chisnall this also matches my expectations, and I've seen people mention studies in teams showing no productivity gain, too.

So I'm intrigued by the few people who DO report that LLMs help them code, though (eg @simon ). Is there something different about how their brains work so LLMs help? Or (cynically) are they jumping on the bandwagon and trying hard to show the world they've cracked how to use them well, to sell themselves as consultants or something?

jt7d

@kitten_tech @carbontwelve @david_chisnall @simon Something I've found LLMs useful for, and I've seen Simon say something similar, is writing code in a language or situation that I _kinda_ know. I might not have bothered writing it if I had to climb boilerplate mountain first, but the LLM serves as guardrails and crack-filler. And since I _kinda_ know the thing, I don't fall for appalling hallucinations, but I get a chance to learn more about the thing fairly painlessly.

Dana Fried

@jt7d @kitten_tech @carbontwelve @david_chisnall @simon this is a really good point - if you treat it as an intern who knows a little about the topic but might be wrong, it can still point you to the correct solution.

This is, perhaps, the meatspace equivalent of the caching application described elsewhere in the replies to this thread.

This is how I also use the (mandatory 🤷🏻‍♀️) LLM results in Google search - as a way to find links and/or search terms that will get me to the actual answer.

Martijn Faassen

@tess

@jt7d @kitten_tech @carbontwelve @david_chisnall @simon

Yeah, earlier I mentioned Copilot can help me with the flow. It generates some bs, and I go: that's not what I want, I want this; and it does it. Or it gives me ideas I hadn't had before. (Besides autocompleting shorter or simpler bits I can easily approve.)

Kris Hardy 🧐

@kitten_tech @carbontwelve @david_chisnall @simon This matches my experience, and I gave up after trying on four small nontrivial projects. The people I personally know using it successfully only use it in limited cases as a kind of hint system to help remind them of things or point out new ways of doing things.

Simon Willison

@kitten_tech @carbontwelve @david_chisnall I'm actually getting more coding work done directly in the Claude and ChatGPT web interfaces and apps vs using Copilot in my editor

The real magic for me at the moment is Claude Artifacts and ChatGPT Code Interpreter - I wrote a bunch about Artifacts here: simonwillison.net/tags/claude-

Here are all of my general notes on AI-assisted programming: simonwillison.net/tags/ai-assi

Stephen J. Anderson

@simon @kitten_tech @carbontwelve @david_chisnall How would you avoid or deal with the issues that David encountered? Specifically, subtle bugs where the debugging makes the whole process less efficient than writing it yourself. Is there one of your notes that deals with that already?

Alaric Snell-Pym

@utterfiction @carbontwelve @david_chisnall

He has a few examples where he felt something in the output didn't look right, or ran it and found bugs, and had the LLM try again.

Most of his examples are relatively simple things of the form "I didn't want to spend time reading API docs for this quick task", though. I don't find that sort of thing a bottleneck in what I do - and I quite enjoy reading docs, and building a mental model of a tool I can then use to know what its...

Alaric Snell-Pym

@utterfiction @carbontwelve @david_chisnall ... limitations and capabilities are.

The bits of programming that eat my time, which I'd love a tool to help with, are usually understanding a bug in an undocumented and under-commented ball of hundreds of kloc of code, too big for an LLM's context window, and where going and quizzing the people who wrote bits of it is essential to success.

The bits Simon gets LLMs to do look like the tasks I do to cheer myself up after that :-)

Stephen J. Anderson

@kitten_tech @carbontwelve @david_chisnall Yeah. A lot of my professional time is spent extending logic, adding new features that follow an existing pattern, refactoring when re-usable abstractions are discovered… so far, they’re just not very good at that. And I don’t think pure LLMs ever will be - limited token windows and no genuine symbolic representation of knowledge.

Martijn Faassen

@kitten_tech

@utterfiction @carbontwelve @david_chisnall

If you can suddenly create small throwaway applications far more quickly than before, applications that might be too boring or bothersome to create otherwise, that might allow new ways of working altogether.

Simon Willison

@utterfiction @kitten_tech @carbontwelve @david_chisnall you have to assume that the LLM will make weird mistakes all the time, so your job is all about code review and meticulous testing

I still find that a whole lot faster than writing all the code myself

Here's just one of many examples where I missed something important: simonwillison.net/2023/Apr/12/

Simon Willison

@utterfiction @kitten_tech @carbontwelve @david_chisnall but honestly, the disappointing answer is that most of this comes down to practice and building intuition for tasks the models are likely to do well vs mess up

Manipulating some elements in the HTML DOM with JavaScript? They'll nail that every time

Implementing something involving MDIO registers? My guess is there are FAR fewer examples relating to that in the (undocumented, unlicensed) training data, so it's much more likely to make mistakes

Major Denis Bloodnok

@kitten_tech @carbontwelve @david_chisnall Wouldn't touch it with a bargepole myself, but I think a third possibility is that at least some people reporting that haven't had it write a sufficiently hilarious bug _yet_. After all, the OP hit one every four months - one could easily get lucky if that's a typical frequency.

Martijn Faassen

@denisbloodnok

@kitten_tech @carbontwelve @david_chisnall

I am a couple of years in with copilot. No such bugs yet. Context: Rust. I write lots of tests along with my code (as OP appears to do too), and currently can rely on a massive external test suite.

The one ridiculously hard to debug bug I got is when I had to debug a codebase I ported from Java to Rust and I transliterated bits wrong as the human, and had no incremental tests built up along with the code. No LLM to blame.

Moof is on Sabbatical

@kitten_tech @carbontwelve @david_chisnall @simon so, I’ve found a 10-20% productivity boost with Copilot, mostly when dealing with boilerplate and small stuff. I mostly code python and AL (an ERP-specific language, which doesn’t get much if any boost).

What does work: ending statements when you start them, it infers enough for me to want to let it finish the line, maybe the next two-three lines. Sometimes it gets the logic completely wrong, but then you don’t accept the suggestion. Sometimes it comes up with edge cases I hadn’t considered.

What is more dodgy: explaining what you want and getting it to write the code. That can get quite dodgy, and I rarely accept those suggestions, unless it’s boilerplate.
1/3

Moof is on Sabbatical

That being said, I have some experience working with code submitted by less skilled programmers who blindly copy and paste stack exchange for a living, from before the prevalence of LLMs, and I am somewhat used to reviewing code of that standard. I find the longer LLM-built code is similar to review as that style of code, and in some cases is approaching that level of code quality.

I am tempted to try one of these “code your own mobile app” demo things, as it’s a platform I’m unfamiliar with, and I have some itches to scratch.

I believe both my coding style and my speed have been affected by using Copilot, both with modest boosts to productivity.

Could I work without Copilot? Absolutely! Would I want to? I think I’d miss the speed boost in a long python project

2/3

Moof is on Sabbatical

One place where my colleagues (and to a lesser extent, myself) have found LLMs to be useful is in multilingual situations.

When English is not your first language, but your coding standard requires things to be programmed in English, sometimes you can struggle to use the correct name for a variable, especially when those words are false friends, or are concepts that aren't one word in English. The LLM can make sensible suggestions for variable and function names and the like. I've had to do fewer refactorings of colleagues' work due to inappropriate name use since they started with Copilot.

Similarly, pasting a description in Spanish into one of these things and asking for an outline on which to hang your code in English has helped, with proper code review.

This stuff is not a panacea, but it can help when applied with a healthy dose of scepticism. @kitten_tech's OP conclusion is valid, as the benefits are still marginal.

3/3

David Chisnall (*Now with 50% more sarcasm!*)

@moof

I am tempted to try one of these “code your own mobile app” demo things, as it’s a platform I’m unfamiliar with, and I have some itches to scratch.

I wrote my first Android app a couple of months ago. I did it in Android Studio, which didn't have Copilot set up. It took half a day (having not touched Java for 6-8 years, and then mostly only to write test cases when hacking on the internals of a JVM). I went from nothing to a working app in under a day.

The things that took time were:

- Google’s CADT problem meant that a lot of things in the build system had changed since the tutorials were written, and figuring out the differences was always annoying.
- The MQTT library I was using needed some extra things for compatibility with older SDKs, and they were enabled by default. The instructions for turning them off were documented, but figuring out that this was the problem took time.
- I spent ages debugging a connection problem that I assumed was a permissions issue. It turned out that the MQTT server was down (but its status page was not).

I don’t think an LLM would have helped with any of these problems.

Android development is so much worse than OpenStep development in 1992 (iOS is a cleaned up version of OpenStep tuned for touchscreens and systems with more than 8 MiB of RAM, so I presume it’s better). Adding LLMs won’t fix that; thinking about APIs before you ship a thing that you likely have to support for a decade or so would. In spite of it being a truly terrible platform for developers, it was pretty easy to build something that worked.

Twenty years ago, we were building minimal-code platforms where you could build CRUD web and desktop apps with a few dozen lines of code for your business logic. A lot of frameworks seem to have massively regressed since then. If anything, relying on LLMs to fill in the code that shouldn’t be necessary in the first place will make this worse.

Moof is on Sabbatical

@david_chisnall I do miss the era when you could just code up an app with minimal thinking about the common cases that were covered by frameworks. The idea that everyone needs to have their own interface developed is something that the new web era has foisted on us, and is definitely a step back. Electron and company have just made it worse. And don’t get me started on my thoughts on WASM.

I agree that LLMs will not help there. Or if they do, it shouldn’t be like that.

Either way, I feel that the best way to get a feel for a tool is to use it. You have done so, and come to valid conclusions, and I thank you for sharing.

I expect my next job to be the sort where I will have to battle pressure both from above and below for use of LLMs as a way to accelerate or replace developers. I need to have arguments that sound authoritative in order to battle the massive propaganda^Wmarketing effort being made to sell this as the best thing since Jesus fed the 5k with sliced bread

David Chisnall (*Now with 50% more sarcasm!*)

@moof

I need to have arguments that sound authoritative

If you're looking for plausible and authoritative-sounding pronouncements, you've come to the right place!

jincy quones

@moof What kind of boilerplate are you having to write so often that any decent snippet engine couldn't handle it perfectly well, without the litany of issues of an LLM? We *already have* tools for boilerplate; I don't understand why people are so entranced by LLMs' ability to deal with it in an absurdly, grossly inefficient way.

Martijn Faassen

@kitten_tech

@carbontwelve @david_chisnall

Note that how @simon reports using this to generate little projects is an entirely different mode of working with them. I have used copilot for a few years now and like it myself, which is mostly context sensitive autocomplete.

A Q&A session to create code for a CLI tool or web app is a very different way of working I started exploring more recently. It's surprisingly capable for little projects and requires a different approach.
