Glyph

Until there is a definitive adjudication of the copyright status of LLM training data, it is *deeply irresponsible* to use GitHub Copilot for open source. I will refuse contributions created with it on any project I'm involved with, as well as permanently ban any user caught sneaking in Copilot-generated code in defiance of this rule. I would strongly encourage all maintainers to take this stand as well. License headaches are already bad enough without secret poison pills being injected.

43 comments
Feoh

@glyph Agreed. We're using Tab9 - tabnine.com/ which trains only on the code in your repository and doesn't treat your code like a publicly exploitable commodity.

Also? I think it produces vastly more useful, if less audacious in certain terms, results.

I've found it saves me probably around 30-40m a day in boilerplate I don't have to type.

mort

@feoh @glyph I seriously doubt that it only trains on the code already in the repo… these LLM style networks take an absolute ton of training data.

In fact, their privacy page explicitly says that it *does not* use your code for training.

It seems exactly identical to Copilot in terms of copyright ramifications.

Stephan

@mort @feoh @glyph how it is usually done is: the model is trained on a bunch of publicly available and/or private data, and later fine-tuned based on your own data to yield more relevant results for your own use case. I doubt that anyone's own code output is enough to train a good large language model.

Gerbrand van Dieyen

@durchaus @mort @feoh @glyph according to their website "Trained exclusively on permissive open-source repositories"
It can optionally be adapted to your own code base, and they promise that your code won't be exposed.

I must say it does seem useful and legit: tabnine.com/

mort

@gerbrand @durchaus @feoh @glyph As was pointed out already (mastodon.gamedev.place/@Doomed), “permissively licensed” doesn’t mean public domain. Permissive licenses still have terms, such as the requirement to include a copyright notice.

Callie

@feoh "Tabnine models only train on open source code with permissive licenses"

Daniel Gibson

@pidgeon_pete @feoh
Even most "permissive" licenses require you to keep the copyright header in the code intact (e.g. zlib license, boost license), and often also in the documentation (BSD, MIT, ...).
Or is it exclusively trained on public domain/CC0/Unlicense/WTFPL/... code?

Glyph

If you don't like this rule, I'm happy to make exceptions for anyone who can secure a binding contractual blanket copyright infringement indemnification from Microsoft. As long as they'll cover the full costs of both any legal defense and hiring contractors to do clean-room rewrites with full copyright assignment to the relevant person or org, then I would consider at least the show-stopper issue to be addressed.

mirabilos

@glyph no, I wouldn’t even take that.

I promise to my users to include code under a good licence. I’d not willingly add anything bad or vague (even if indemnified) and work to get rid of inherited instances of vagueness.

Philippe Jadin

@glyph do you think at some point it will be possible to prove that code has been generated by an LLM, and after that, to find the sources used? Really wondering 🙂

Leszek

@glyph We're not even at a "Microsoft allows the use of copilot in their own software" stage. So I think you don't have to check your mailbox too often in search of the indemnification letter.

See opensource.microsoft.com/cla/

David Zaslavsky

@glyph Generally I look suspiciously at strongly-worded hot takes like this, but I do agree with the underlying point.

Glyph

@diazona Given that the whole industry is sleepwalking into an open sewer with the way that both LLM output and input are being treated both legally and ethically, a certain level of stridency is required to emphasize the seriousness of the point.

Glyph

@diazona But also it's not a particularly hot take. Copilot reproduces its training data verbatim on a somewhat regular basis. hackernoon.com/legal-issues-su

Glyph

@diazona In some sense it is not even a novel policy. If you were caught stealing code from another open source project, or from your employer's proprietary codebase, you'd be banned in the same way. Using an extremely slow probabilistic token generator to file off the copyright notices is not meaningfully different than using a text editor to do the same thing.

Glyph

@diazona The only reason I even need to say anything is that contributors are being somewhat deliberately misled by GitHub's speculative legal reasoning here.

Eric Carroll

@glyph
You are doing exactly the right thing.

I speak as a former technical expert witness before the Copyright Board of Canada during the Internet copyright wars.

My advice to clients is stay far away from this technology until litigation has settled its copyright status. You can't afford to be the test case. Or the loser of the test case.

David Zaslavsky

@glyph Well true, I suppose I didn't really mean "hot take" in the sense of controversial. I was thinking more of a strongly worded opinion and couldn't come up with the right word in the moment.

mirabilos

@diazona @glyph I think the words here are not even strong enough. They must arrive at the audience despite “comfort” and other blockades.

JayF

@glyph Can you make sure this is documented in an obvious place for contributors? e.g. PR templates on packages you maintain?

Glyph

@jay I am tooting it out here because it is a truly obnoxious amount of unpaid work that just got dumped on me, to do this across a hundred or so repos (not to mention hashing it out with other contributors on bigger projects) but yeah, I am going to need to do this.

Glyph

@jay also it's not really particular to copilot, other LLMs may not have that indemnification policy you mentioned and are therefore arguably worse, but also there isn't an ad for those LLMs on every single UI element in github now

Glyph

@jay so I've got to write a *general* policy

JayF

@glyph honestly, I don't contribute to many of your open source projects so I don't think it'll actually impact me, but I do leave copilot enabled on my IDE for most work that I do.

I would hate for someone to characterize an accidental PR pushed by me as trying to sneak AI generated code through just because I forgot that this was a repo where submitting code with this add-on enabled in my IDE was forbidden.

Mark Eichin

@jay
Have you signed any CLAs? They often have traceability clauses that implicitly have the same result.
@glyph

Eleanor Saitta

@jay
I mean, that's your choice though, yes? Maintainers have an obligation to the viability of the project.
@glyph

James Bennett

@glyph The hierarchy of open source, laid bare by this post.

AlgoCompSynth by znmeb

@glyph I have no reason to use *any* AI tool based on scraped data for *any* reason. Life is too short to learn how to use a new syntax / semantics or do free QA for a vendor.

BJ Swope

@glyph @AlgoCompSynth right. Everybody saying coders are doomed because of AI. The way I see it, it’s just another layer of abstraction that you have to master to actually achieve your goal. And this abstraction has random results to boot.

AlgoCompSynth by znmeb

@cybeej @glyph The effort to payoff ratio for mastering that layer of abstraction is way too high for someone like me who's written code both for a living and on hobby projects for over five decades.

Hubert Figuière

@glyph how do you detect it comes from Copilot?

Glyph

@hub Maybe I don’t! Many lies are believed, many crimes go unpunished

Glyph

@hub I similarly have no mechanism to detect any other form of plagiarism or copyright infringement; open source mostly runs on the honor system.

Farce Majeure

@glyph @hub tbf, not seeing how closed source isn't exactly the same in this regard. Either you trust your contributors to not be stealing their work, or you don't. Either they are stealing it or they aren't.

Astro

@glyph@mastodon.social out of curiosity - how would you see with 100% certainty that code is Copilot-generated?

Chris Johnson

@glyph Thank you for taking the correct moral stand. While as far as I know, virtually zero people use my open source projects, I will implement a similar policy.

Jet Balsa

@glyph I want to see the fallout from it in the closed source world as well. What happens when half your game is pieces of Copilot code?!

Marcin 🍔

@glyph Makes sense. Our lawyers told us to stay away from LLM generated code as well.
