Until there is a definitive adjudication of the copyright...

Until there is a definitive adjudication of the copyright status of LLM training data, it is *deeply irresponsible* to use GitHub Copilot for open source. I will refuse contributions created with it on any project I'm involved with, as well as permanently ban any user caught sneaking in Copilot-generated code in defiance of this rule. I would strongly encourage all maintainers to take this stand as well. License headaches are already bad enough without secret poison pills being injected.

9 Nov 2023 at 22:03 | Open on mastodon.social

43 comments

Feoh

@glyph Agreed. We're using Tabnine (https://www.tabnine.com/), which trains only on the code in your repository and doesn't treat your code like a publicly exploitable commodity.

Also? I think it produces vastly more useful, if less audacious in certain terms, results.

I've found it saves me probably around 30-40m a day in boilerplate I don't have to type.

9 Nov 2023 at 22:07 | Open on oldbytes.space

mort

@feoh @glyph I seriously doubt that it only trains on the code already in the repo… these LLM style networks take an absolute ton of training data.

In fact, their privacy page explicitly says that it *does not* use your code for training.

It seems exactly identical to Copilot in terms of copyright ramifications.

10 Nov 2023 at 0:22 | Open on fosstodon.org

Stephan

@mort @feoh @glyph how it is usually done is: the model is trained on a bunch of publicly available and/or private data, and later fine-tuned based on your own data to yield more relevant results for your own use case. I doubt that anyone's own code output is enough to train a good large language model.

10 Nov 2023 at 3:34 | Open on mastodon.social

Gerbrand van Dieyen

@durchaus @mort @feoh @glyph according to their website: "Trained exclusively on permissive open-source repositories".
It can optionally be adapted to your own code base, and they promise the code won't be exposed.

I must say it does seem useful and legit: https://www.tabnine.com/

10 Nov 2023 at 8:01 | Open on fosstodon.org

mort

@gerbrand @durchaus @feoh @glyph As was pointed out already (https://mastodon.gamedev.place/@Doomed_Daniel/111383817293869390), “permissively licensed” doesn’t mean public domain. Permissive licenses still have terms, such as the requirement to include a copyright notice.

10 Nov 2023 at 8:05 | Open on fosstodon.org

Callie

@feoh "Tabnine models only train on open source code with permissive licenses"

10 Nov 2023 at 1:15 | Open on mas.to

Daniel Gibson

@pidgeon_pete @feoh
Even most "permissive" licenses require you to keep the copyright header in the code intact (e.g. zlib license, boost license), and often also in the documentation (BSD, MIT, ...).
Or is it exclusively trained on public domain/CC0/Unlicense/WTFPL/... code?

10 Nov 2023 at 2:09 | Open on mastodon.gamedev.place

Glyph

If you don't like this rule, I'm happy to make exceptions for anyone who can secure a binding contractual blanket copyright infringement indemnification from Microsoft. As long as they'll cover the full costs of both any legal defense and hiring contractors to do clean-room rewrites with full copyright assignment to the relevant person or org, then I would consider at least the show-stopper issue to be addressed.

9 Nov 2023 at 22:07 | Open on mastodon.social

mirabilos

@glyph no, I wouldn’t even take that.

I promise to my users to include code under a good licence. I’d not willingly add anything bad or vague (even if indemnified) and work to get rid of inherited instances of vagueness.

9 Nov 2023 at 23:37 | Open on toot.mirbsd.org

Philippe Jadin

@glyph do you think at some point it will be possible to prove that code has been generated by an LLM, and after that, to find the sources used? Really wondering 🙂

10 Nov 2023 at 16:14 | Open on tchafia.be

Leszek

@glyph We're not even at a "Microsoft allows the use of Copilot in their own software" stage. So I think you don't have to check your mailbox too often in search of the indemnification letter.

See https://opensource.microsoft.com/cla/

10 Nov 2023 at 17:32 | Open on chaos.social

David Zaslavsky

@glyph Generally I look suspiciously at strongly-worded hot takes like this, but I do agree with the underlying point.

9 Nov 2023 at 22:08 | Open on techhub.social

Glyph

@diazona Given that the whole industry is sleepwalking into an open sewer with the way that both LLM output and input are being treated both legally and ethically, a certain level of stridency is required to emphasize the seriousness of the point.

9 Nov 2023 at 22:10 | Open on mastodon.social

Glyph

@diazona But also it's not a particularly hot take. Copilot reproduces its training data verbatim on a somewhat regular basis. https://hackernoon.com/legal-issues-surrounding-copilots-use-of-training-data

9 Nov 2023 at 22:10 | Open on mastodon.social

Glyph

@diazona In some sense it is not even a novel policy. If you were caught stealing code from another open source project, or from your employer's proprietary codebase, you'd be banned in the same way. Using an extremely slow probabilistic token generator to file off the copyright notices is not meaningfully different than using a text editor to do the same thing.

9 Nov 2023 at 22:13 | Open on mastodon.social

Glyph

@diazona The only reason I even need to say anything is that contributors are being somewhat deliberately misled by GitHub's speculative legal reasoning here.

9 Nov 2023 at 22:14 | Open on mastodon.social

Eric Carroll

@glyph
You are doing exactly the right thing.

I speak as a former technical expert witness before the Copyright Board of Canada during the Internet copyright wars.

My advice to clients is to stay far away from this technology until litigation has settled its copyright status. You can't afford to be the test case. Or the loser of the test case.

10 Nov 2023 at 17:46 | Open on cosocial.ca

David Zaslavsky

@glyph Well true, I suppose I didn't really mean "hot take" in the sense of controversial. I was thinking more of a strongly worded opinion and couldn't come up with the right word in the moment.

9 Nov 2023 at 22:36 | Open on techhub.social

mirabilos

@diazona @glyph I think the words here are not even strong enough. They must arrive at the audience despite “comfort” and other blockades.

9 Nov 2023 at 23:39 | Open on toot.mirbsd.org

JayF

@glyph Can you make sure this is documented in an obvious place for contributors? e.g. PR templates on packages you maintain?

9 Nov 2023 at 22:53 | Open on oldos.me

Glyph

@jay I am tooting it out here because it is a truly obnoxious amount of unpaid work that just got dumped on me, to do this across a hundred or so repos (not to mention hashing it out with other contributors on bigger projects) but yeah, I am going to need to do this.

9 Nov 2023 at 23:05 | Open on mastodon.social

Glyph

@jay also it's not really particular to Copilot. Other LLMs may not have that indemnification policy you mentioned and are therefore arguably worse, but there also isn't an ad for those LLMs on every single UI element in GitHub now

9 Nov 2023 at 23:07 | Open on mastodon.social

Glyph

@jay so I've got to write a *general* policy

9 Nov 2023 at 23:07 | Open on mastodon.social

JayF

@glyph honestly, I don't contribute to many of your open source projects so I don't think it'll actually impact me, but I do leave copilot enabled on my IDE for most work that I do.

I would hate for someone to characterize an accidental PR pushed by me as trying to sneak AI generated code through just because I forgot that this was a repo where submitting code with this add-on enabled in my IDE was forbidden.

9 Nov 2023 at 23:19 | Open on oldos.me

Mark Eichin

@jay
Have you signed any CLAs? They often have traceability clauses that implicitly have the same result.
@glyph

10 Nov 2023 at 1:06 | Open on mastodon.mit.edu

Eleanor Saitta

@jay
I mean, that's your choice though, yes? Maintainers have an obligation to the viability of the project.
@glyph

10 Nov 2023 at 17:03 | Open on infosec.exchange

James Bennett

@glyph The hierarchy of open source, laid bare by this post.

9 Nov 2023 at 23:14 | Open on infosec.exchange

Kevin Riggle

@ubernostrum @glyph all it’s missing is “you get *useful* contributors?”

10 Nov 2023 at 10:26 | Open on ioc.exchange

John Mark :blobcatverified: ☑️

@ubernostrum @glyph Hahahaha this is brilliant

10 Nov 2023 at 18:29 | Open on freeradical.zone

AlgoCompSynth by znmeb

@glyph I have no reason to use *any* AI tool based on scraped data for *any* reason. Life is too short to learn how to use a new syntax / semantics or do free QA for a vendor.

9 Nov 2023 at 23:51 | Open on ravenation.club

BJ Swope :verified:➖

@glyph @AlgoCompSynth right. Everybody saying coders are doomed because of AI. The way I see it, it’s just another layer of abstraction that you have to master to actually achieve your goal. And this abstraction has random results to boot.

10 Nov 2023 at 22:05 | Open on infosec.exchange

AlgoCompSynth by znmeb

@cybeej @glyph The effort to payoff ratio for mastering that layer of abstraction is way too high for someone like me who's written code both for a living and on hobby projects for over five decades.

10 Nov 2023 at 22:13 | Open on ravenation.club

BJ Swope :verified:➖

@AlgoCompSynth @glyph amen!

10 Nov 2023 at 22:14 | Open on infosec.exchange

DELETED

@glyph

Thank you for not being a thief.

10 Nov 2023 at 1:09 | Open on mastodon.social

Hubert Figuière

@glyph how do you detect it comes from Copilot?

10 Nov 2023 at 3:07 | Open on cosocial.ca

Glyph

@hub Maybe I don’t! Many lies are believed, many crimes go unpunished

10 Nov 2023 at 3:14 | Open on mastodon.social

Glyph

@hub I similarly have no mechanism to detect any other form of plagiarism or copyright infringement; open source mostly runs on the honor system.

10 Nov 2023 at 3:15 | Open on mastodon.social

Farce Majeure

@glyph @hub tbf, not seeing how closed source isn't exactly the same in this regard. Either you trust your contributors to not be stealing their work, or you don't. Either they are stealing it or they aren't.

10 Nov 2023 at 17:50 | Open on better.boston

Michał "rysiek" Woźniak · 🇺🇦

@glyph great reminder to make it clear in the context of my side-project, thanks!
https://gitlab.com/rysiekpl/libresilient/-/commit/e6f03e7dbdf10c257b110e65b61f5ffe9f51bdb6

10 Nov 2023 at 15:54 | Open on mstdn.social

Astro

@glyph@mastodon.social out of curiosity - how would you see with 100% certainty that code is Copilot-generated?

10 Nov 2023 at 16:03 | Open on firefish.intragon.org

Chris Johnson

@glyph Thank you for taking the correct moral stand. While as far as I know, virtually zero people use my open source projects, I will implement a similar policy.

10 Nov 2023 at 16:50 | Open on phpc.social

Jet Balsa

@glyph I want to see the fallout of it from the closed source world as well. What happens when half your game is pieces of Copilot code?!

10 Nov 2023 at 18:17 | Open on hackers.town

Marcin 🍔:pocket:

@glyph Makes sense. Our lawyers told us to stay away from LLM-generated code as well.

11 Nov 2023 at 10:37 | Open on mozilla.social
