I am Director of System Architecture at SCI Semiconductor and a Visiting Researcher at the University of Cambridge Computer Laboratory. I remain actively involved in the #CHERI project, where I led the early language / compiler strand of the research, and am the maintainer of the #CHERIoT Platform.
I was on the FreeBSD Core Team for two terms, have been an LLVM developer since 2008, am the author of the GNUstep Objective-C runtime (libobjc2 and associated clang support), and am responsible for libcxxrt and the BSD-licensed device tree compiler.
Opinions expressed by me are not necessarily opinions. In all probability they are random ramblings and should be ignored. Failure to ignore may result in severe boredom and / or confusion. Shake well before opening. Keep refrigerated.
Warning: May contain greater than the recommended daily allowance of sarcasm.
A lot of the current hype around LLMs revolves around one core idea, which I blame on Star Trek:
Wouldn't it be cool if we could use natural language to control things?
The problem is that this is, at the fundamental level, a terrible idea.
There's a reason that mathematics doesn't use English. There's a reason that every professional field comes with its own flavour of jargon. There's a reason that contracts are written in legalese, not plain natural language. Natural language is really bad at being unambiguous.
When I was a small child, I thought that a mature civilisation would evolve two languages. A language of poetry, that was rich in metaphor and delighted in ambiguity, and a language of science that required more detail and actively avoided ambiguity. The latter would have no homophones, no homonyms, unambiguous grammar, and so on.
Programming languages, including the ad-hoc programming languages that we refer to as 'user interfaces', are all attempts to build languages like the latter. They allow the user to unambiguously express intent so that it can be carried out. Natural languages are not designed and end up being examples of the former.
When I interact with a tool, I want it to do what I tell it. If I am willing to restrict my use of natural language to a clear and unambiguous subset, I have defined a language that is easy for deterministic parsers to understand with a fraction of the energy requirement of a language model. If I am not, then I am expressing myself ambiguously and no amount of processing can possibly remove the ambiguity that is intrinsic in the source, except a complete, fully synchronised, model of my own mind that knows what I meant (and not what some other person saying the same thing at the same time might have meant).
The hard part of programming is not writing things in some language's syntax, it's expressing the problem in a way that lacks ambiguity. LLMs don't help here, they pick an arbitrary, nondeterministic, option for the ambiguous cases. In C, compilers do this for undefined behaviour and it is widely regarded as a disaster. LLMs are built entirely out of undefined behaviour.
There are use cases where getting it wrong is fine. Choosing a radio station or album to listen to while driving, for example. It is far better to sometimes listen to the wrong thing than to take your attention away from the road and interact with a richer UI for ten seconds. In situations where your hands are unavailable (for example, controlling non-critical equipment while performing surgery, or cooking), a natural-language interface is better than no interface. It's rarely, if ever, the best.
I finally turned off GitHub Copilot yesterday. I’ve been using it for about a year on the ‘free for open-source maintainers’ tier. I was skeptical but didn’t want to dismiss it without a fair trial.
It has cost me more time than it has saved. It lets me type faster, which has been useful when writing tests where I’m testing a variety of permutations of an API to check error handling for all of the conditions.
I can recall three places where it has introduced bugs that took me more time to debug than the total time saving:
The first was something that initially impressed me. I pasted the prose description of how to communicate with an Ethernet MAC into a comment and then wrote some method prototypes. It autocompleted the bodies. All very plausible looking. Only it managed to flip a bit in the MDIO read and write register commands. MDIO is basically a multiplexing system. You have two device registers exposed, one sets the command (read or write a specific internal register) and the other is the value. It got the read and write the wrong way around, so when I thought I was writing a value, I was actually reading. When I thought I was reading, I was actually seeing the value in the last register I thought I had written. It took two of us over a day to debug this. The fix was simple, but the bug was in the middle of correct-looking code. If I’d manually transcribed the command from the data sheet, I would not have got this wrong because I’d have triple checked it.
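A minimal sketch of the two-register scheme described above, assuming a made-up layout: `MdioOp`, `FakeMac`, `mdio_write` and `mdio_read` are hypothetical names, and the simulated device is far simpler than any real MAC. The opcode values mirror the IEEE 802.3 Clause 22 frame opcodes, but a particular MAC's command register may encode them differently, so a real driver has to take them from its data sheet. The point is only that naming the opcodes once and round-tripping a value through a simulated device would expose a read/write swap.

```python
from enum import IntEnum


class MdioOp(IntEnum):
    # Hypothetical encodings for illustration; copy the real ones from the data sheet.
    READ = 0b10
    WRITE = 0b01


class FakeMac:
    """Simulated MAC exposing a command register and a data register (assumed layout)."""

    def __init__(self):
        self.phy_regs = [0] * 32  # internal PHY registers behind the multiplexer
        self.data = 0             # the single exposed value register

    def write_command(self, op, reg):
        # Writing the command register triggers the transfer.
        if op == MdioOp.READ:
            self.data = self.phy_regs[reg]
        elif op == MdioOp.WRITE:
            self.phy_regs[reg] = self.data
        else:
            raise ValueError(f"unknown MDIO opcode: {op!r}")


def mdio_write(mac, reg, value):
    """Place the value in the data register, then issue a WRITE command."""
    mac.data = value & 0xFFFF
    mac.write_command(MdioOp.WRITE, reg)


def mdio_read(mac, reg):
    """Issue a READ command, then fetch the result from the data register."""
    mac.write_command(MdioOp.READ, reg)
    return mac.data
```

In this model, swapping `MdioOp.READ` and `MdioOp.WRITE` in the two helpers makes a write-then-read-back check fail immediately, which is the kind of test that is cheap to run against a simulator before touching hardware.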
In another case, it had inverted the condition in an if statement inside an error-handling path. The error handling was a rare case and was asymmetric. Hitting the if case when you wanted the else case was okay but the converse was not. Lots of debugging. I learned from this to read the generated code more carefully, but that increased cognitive load and eliminated most of the benefit. Typing code is not the bottleneck and if I have to think about what I want and then read carefully to check it really is what I want, I am slower.
Most recently, I was writing a simple binary search and insertion-deletion operations for a sorted array. I assumed that this was something that had hundreds of examples in the training data and so would be fine. It had all sorts of corner-case bugs. I eventually gave up fixing them and rewrote the code from scratch.
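For reference, a minimal sketch of the kind of code in question, written in Python rather than whatever language the original used. The function names are illustrative, not taken from the original code. It uses a half-open search interval, which is one common way to avoid the usual corner cases (empty array, key smaller or larger than every element, duplicates).

```python
def lower_bound(items, key):
    """Return the first index i with items[i] >= key, or len(items) if there is none."""
    lo, hi = 0, len(items)  # search the half-open range [lo, hi)
    while lo < hi:
        mid = lo + (hi - lo) // 2
        if items[mid] < key:
            lo = mid + 1    # everything up to and including mid is too small
        else:
            hi = mid        # mid may be the answer, so keep it in range
    return lo


def contains(items, key):
    """Report whether key is present in the sorted list."""
    i = lower_bound(items, key)
    return i < len(items) and items[i] == key


def insert_sorted(items, key):
    """Insert key so that items stays sorted; duplicates are allowed."""
    items.insert(lower_bound(items, key), key)


def remove_sorted(items, key):
    """Remove one occurrence of key; return False if it was not present."""
    i = lower_bound(items, key)
    if i < len(items) and items[i] == key:
        del items[i]
        return True
    return False
```

Routing search, insertion and deletion through the same `lower_bound` helper means there is only one loop whose boundary conditions need checking.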
Last week I did some work on a remote machine where I hadn’t set up Copilot and I felt much more productive. Autocomplete was either correct or not present, so I was spending more time thinking about what to write. I don’t entirely trust this kind of subjective judgement, but it was a data point. Around the same time I wrote some code without clangd set up and that really hurt. It turns out I really rely on AST-aware completion to explore APIs. I had to look up more things in the documentation. Copilot was never good for this because it would just bullshit APIs, so something showing up in autocomplete didn’t mean it was real. This would be improved by using a feedback system to require autocomplete outputs to type check, but then they would take much longer to create (probably at least a 10x increase in LLM compute time) and wouldn’t complete fragments, so I don’t see a good path to being able to do this without tight coupling to the LSP server and possibly not even then.
Yesterday I was writing bits of the CHERIoT Programmers’ Guide and it kept autocompleting text in a different writing style, some of which was obviously plagiarised (when I’m describing precisely how to implement a specific, and not very common, lock type with a futex and the autocomplete is a paragraph of text with a lot of detail, I’m confident you don’t have more than one or two examples of that in the training set). It was distracting and annoying. I wrote much faster after turning it off.
So, after giving it a fair try, I have concluded that it is both a net decrease in productivity and probably an increase in legal liability.
Discussions I am not interested in having:
- You are holding it wrong. Using Copilot with this magic config setting / prompt tweak makes it better. At its absolute best, it was a small productivity increase, if it needs more effort to use, that will be offset.
- This other LLM is much better. I don’t care. The costs of the bullshitting far outweighed the benefits when it worked, to be better it would have to not bullshit, and that’s not something LLMs can do.
- It’s great for boilerplate! No. APIs that require every user to write the same code are broken. Fix them, don’t fill the world with more code using them that will need fixing when the APIs change.
- Don’t use LLMs for autocomplete, use them for dialogues about the code. Tried that. It’s worse than a rubber duck, which at least knows to stay silent when it doesn’t know what it’s talking about.
The one place Copilot was vaguely useful was hinting at missing abstractions (if it can autocomplete big chunks then my APIs required too much boilerplate and needed better abstractions). The place I thought it might be useful was spotting inconsistent API names and parameter orders but it was actually very bad at this (presumably because of the way it tokenises identifiers?). With a load of examples with consistent names, it would suggest things that didn't match the convention. After using three APIs that all passed the same parameters in the same order, it would suggest flipping the order for the fourth.
@david_chisnall
I wanted to thank you for sharing a considered criticism based on experience, which is refreshing after so many a priori arguments against copilot.
The negative trade-offs you sketch involve review time and cost of bugs. The positives involve faster typing, test generation (tests are a special case where I think boilerplate costs less and has benefits).
@david_chisnall I've also found that having Copilot write the algorithm was not helpful. But for me it can do the boring parts of writing code, which makes me much more productive. It's good at writing log statements. Sometimes comments are good. It can write short helper functions. When I have to do those things I get so annoyed and bored I have to take longer breaks.
- You disproportionately lose women.
- You disproportionately lose your best people.
- You find it hard to replace the good people that you lose.
For a long time, companies were excited about offshoring because it meant that they could take advantage of a global labour force to increase supply and drive down labour costs. Now they’re starting to see the flip side: with a lot of places offering remote work, the demand is also global and they are competing with companies worldwide. If you don’t offer a good work-life balance for your best people, someone else will and they may be anywhere in the world.
After running a successful research project at Microsoft with a team spanning several thousand miles between the furthest members, I was quite surprised to be told that we all needed to come back into the lab because research requires people to be face to face. Especially by people who had spent two years doing nothing to promote collaboration and who were pushing policies that would exclude my close collaborators in other countries.
At SCI, we’re remote first. My most recent hire is on a boat in the South Pacific. As long as people can communicate and have a decent Internet connection, I don’t care where they are (the accountants may, for tax purposes). You need to actively build teams when people are remote, just as when they’re local. Most of the people who felt RTO was important were the ones who weren’t doing this in either setting and were relying on similarity biases to create teams (I’d love to see a correlation between managers who advocate RTO and managers who have a higher turnover for folks who are not white cishet males: I suspect it would be strong).
To put a billionaire in perspective, it's more interesting to think about the income than the capital. You can get a 5% annual return from some fairly low-risk investments (and you can further reduce risk by spreading your money across a load of these). You can often get higher, but 5% is a fairly good baseline.
If you start with $1bn and do nothing clever, you get an income of $50M per year. Even with the kind of 95% tax rate that we had for the highest income levels when The Beatles sang about it, that leaves you with $2.5M/year in income.
That's enough to buy a nice house every year, with no mortgage. It's a daily disposable income of almost $7K. It's enough to take a first-class transatlantic flight every day.
And that's just one billion.
Even with a 95% tax rate on investment income and a conservative investment strategy, someone with a billion dollars would have a daily disposable income that's more than the monthly income of anyone in the bottom 99% of earners, without having to work.
Now try these calculations again with real tax rates and Musk or Bezos' wealth.
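The arithmetic is easy to rerun with other numbers. A small sketch, with the 5% return and 95% tax rate from above as defaults; the function name and parameters are illustrative, and a single flat tax rate on all investment income is a simplifying assumption.

```python
def passive_income(capital, return_pct=5, tax_pct=95):
    """Return (gross per year, net per year, net per day) for a lump sum.

    Assumes one flat return and one flat tax rate on all investment income.
    Percentages are integers so the yearly figures stay exact.
    """
    gross = capital * return_pct // 100
    net = gross * (100 - tax_pct) // 100
    per_day = net / 365
    return gross, net, per_day


gross, net, per_day = passive_income(1_000_000_000)
print(f"gross: ${gross:,}/year, net: ${net:,}/year, net: ${per_day:,.0f}/day")
```

Changing `capital` or `tax_pct` is all it takes to redo the calculation for a larger fortune or a lower tax rate.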
There seems to be a very strong correlation between tasks that can be accomplished by generative AI and tasks that are necessary solely as a result of bad choices made earlier.
@david_chisnall how quickly can you emulate CHERI on x86-64? (Does qemu support it for example?) If you compile your C code, or language bindings/runtime to a CHERI architecture could that act as a better valgrind/ASAN?
Probably not what you had in mind when designing the hardware, but might be a useful intermediate step until CHERI hardware is widely available.
Someone recently suggested to me that AI systems bring the users' ability closer to the average. I was intrigued by this idea because it reflects my experience. I am, for example, terrible at any kind of visual art, but with something like Stable Diffusion I can produce things that are merely quite bad, whereas without it I can produce things that are absolutely terrible. Conversely, with GitHub Copilot I can write code with more bugs that's harder to read. Watching non-programmers use it and ChatGPT with Python, they can produce fairly mediocre code that mostly works.
I suppose it shouldn't surprise anyone that a machine that's trained to produce output from a statistical model built from a load of examples would tend towards the mean.
An unflattering interpretation of this would suggest that the people who are most excited by AI in any given field are the people with the least talent in that field.
I agree with this take on the AI hype: pretty obviously, what we have in general AI applications is a regression to the mean.
The real shame is that there are quite a few fields in science where machine learning is really extremely useful in finding patterns in data that might be too subtle for ordinary perception. The hype covers up (for people in the general public) this actual advance.
It’s really not fair to blame CrowdStrike for the outages today. The blame lies with the people in a position to make procurement decisions, who saw a product that added a load of additional code that runs in Ring 0 and thought ‘yes, this will make us more secure, I will mandate this must be deployed across the entire company’. A large part of the blame lies with the people who created auditing frameworks that would lead people to believe that this was necessary for compliance.
@david_chisnall 🔥🔥🔥 indictment of LLMs 👆
"In C, compilers do this for undefined behaviour and it is widely regarded as a disaster. LLMs are built entirely out of undefined behaviour."
@david_chisnall
“Fruit flies like a banana.”
It is indeed unsolvable.
@david_chisnall
Spot on.