mcc

…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem becomes that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)

40 comments
mcc

So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.

mcc

Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.

We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?

Graham Spookyland🎃/Polynomial

@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets.

The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.

Mark T. Tomczak

@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.

Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that" we ask why any of these companies get to keep their data proprietary when it's built on ours?

Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly-available?"

mcc replied to Mark T. Tomczak

@mark @gsuberland In my opinion, a trapdoor like "okay, well if copyright doesn't apply to the training data you stole, your model isn't copyrightable either" is no good. The US Gov has already said GenAI images and text are not copyrightable. It doesn't help. The thing about generative AI is it inherently takes heavy computational resources (disk space, CPU time, often-unacknowledged low-wage tagging work). Therefore, as a tool, it is inherently biased toward capital and away from individuals.

mcc replied to mcc

@mark @gsuberland If we say "AI is a new class of thing that is outside the copyright regime entirely", that is not a level playing field. The tool is designed in a way it inherently serves the powerful. "Machine learning models are inherently open" is the exact model I am afraid of— a world where copyright is something that applies to actors who have less than some specific amount of money, and anyone with more than that specific amount of money is liberated from it.

Graham Spookyland🎃/Polynomial replied to mcc

@mcc @mark yes. the only real push back solution that levels the playing field would be to say that you are not allowed to unilaterally make money off it, which essentially just falls back to enforcing copyright law against the rich, which... yeah, exactly the problem.

jbaggs replied to mcc

@mcc @mark @gsuberland Every x number of years we get business people trying to circumvent law by claiming the old laws don't apply, because computers.

datarama replied to mcc

@mcc @mark @gsuberland Exactly.

Even if, say, GPT-4 wasn't covered by copyright, so what? Even if you could get it out of OpenAI's data centres in the first place, you couldn't run it with reasonable performance. And you *certainly* couldn't retrain it.

Oblomov replied to datarama

@datarama @mcc @mark @gsuberland there is one upside to forcing these models to be open and it's that it removes one of the, if not the primary, incentives for developing them in the first place. Yes, they could still sell its execution as a service, but if they lose control of the model itself, it becomes a considerably less profitable endeavor.

datarama replied to Oblomov

@oblomov @mcc @mark @gsuberland How, though?

Let's say that tomorrow, a judge rules that GPT-4 is not covered by copyright. What has actually changed? OpenAI isn't compelled to share it with anyone, and it's too big for anyone except large and wealthy corporations to actually do anything with.

Sure, you couldn't get sued if you got a bittorrent of it somehow. But you're not getting a bittorrent of a 1.76 trillion parameter neural network anyway.

Graham Spookyland🎃/Polynomial replied to datarama

@datarama @oblomov @mcc @mark and you sure as shit can't afford a whole rack of H200 cards to make use of it, even if you and all your friends pitch in. it's only useful to people who have the capital to wield it.

crzwdjk ✅ replied to datarama

@datarama @oblomov @mcc @mark @gsuberland 1.76 trillion parameters is about a hard drive's worth of data, no?

datarama replied to crzwdjk ✅

@crzwdjk @oblomov @mcc @mark @gsuberland It is, but that's *still* beside the point. You can't actually do anything with it unless you have the resources of a large corporation.

And my other point was that just because it isn't copyrighted, they can still keep it secret.

datarama

@gsuberland @mcc This isn't why the AI craze has made me anxious, but it *is* why I have become terribly depressed.

I like writing code and making various weird computer programs, and sharing them with people for mutual entertainment and occasional enlightenment. Now I can't do that without accepting that everything I do will be appropriated and commoditized by some of the most horrible people in tech, unless I do it in secret.

And then what's the point?

asmaloney (Andy) 🌎 replied to datarama

@datarama @gsuberland @mcc > And then what's the point?

💯 I'm feeling exactly the same way and I'm really struggling with it.

Not just code but blog posts/tutorials as well. I've "lost" my main creative outlets.

datarama replied to asmaloney (Andy) 🌎

@asmaloney @gsuberland @mcc That's where I'm at too.

And I have never been as depressed as I have this last year. For every other awful period in my life, I always had creative computer things to fall back on - literally, that has been how I kept from going too crazy in the entire story from "tiny bullied autistic kid" to "middle-aged guy holed up all alone during a pandemic". There was always coding and writing.

datarama replied to datarama

@asmaloney @gsuberland @mcc Coding feels especially meaningless now. I try to convince myself that even after we all get fired and replaced with shitty AI, we could still do it for fun - but it's not fun when you know all you're really doing is providing more free training data for the same assholes who are actively working to destroy your life.

asmaloney (Andy) 🌎 replied to datarama

@datarama "middle-aged guy holed up all alone during a pandemic"

I feel seen (as the kids say these days). 😆

datarama replied to asmaloney (Andy) 🌎

@asmaloney I sometimes think about how much that particular experience has coloured the rest of my experience of this bleak, bleak decade. I sat at home with nearly no social contact for 1½ years (except what came in through Teams), and even if I'm a bit of an introvert, I'm sure it made me a bit crazy.

asmaloney (Andy) 🌎 replied to datarama

@datarama Me too. Really struggling to "dig out" of that and then all this other shit ("AI", wars, climate, politics, layoffs for even more profit, the shoddy state of software in general, etc.) just piles on.

I think we're very much in the same situation, so you aren't alone. I hope you find some peace or at least some outlet to move things in a positive direction.

I'm still lookin'... 😀

datarama replied to asmaloney (Andy) 🌎

@asmaloney I've been looking for a long, long time too. And I don't know the way out.

Every crisis is immediately followed by the next, without any of them being resolved. I am so tired.

margot

@mcc have we considered starting a secret society with arcane rites devoted to preserving and protecting open source code

Mark T. Tomczak

@emaytch @mcc So there's a lot of stuff that Paul Graham says that I don't agree with (these days; used to be pretty bought in), but I think the point he made about the nature of copyright and patent protection ages ago rings true.

Paraphrasing without citation because I'm not going to go crawling around to find it right now: the alternative to IP protection isn't a magical utopia of shared ideas... It's guilds and secret knowledge protected with violence. We already tried society without intellectual property protection.

✧✦✶✷Catherine✷✶✦✧ replied to Mark T. Tomczak

@mark @emaytch @mcc if this was true I could get documentation for any of the ASICs Broadcom sells and I can't

Peter Linss

@emaytch @mcc where each member chooses a repo to memorize. At the secret meetings in the woods we take turns reciting them back to each other…

StaringAtClouds

@emaytch @mcc Ossiris

Just 'cos it sounds fun & it's got OSS in it

Sorry it's a bit late here & brain isn't up to working out a proper acronym

bob

@mcc there would only be a cost to you as an open source author if LLM code generation worked, though

mcc

@bob Depends on what "works" means. I believe that LLMs are capable of substantially reproducing entire paragraphs, code functions or images from their training set under circumstances where the origin is not disclosed or easily traced back.

bob replied to mcc

@mcc in a world where people were already copy/pasting from stackoverflow all day does that make a difference?

Aedius Filmania ⚙️🎮🖊️

@mcc

Please don't opt out all your repositories; leave the ones that didn't work, didn't compile, or are full of security holes.

josh

@mcc i feel like we need llm opt out considerations in foss licenses tbh, then host code off github and nothing changes? Hard to enforce idk unlikely politicians will get it right, maybe the ftc will get lucky?

mcc

@josh I don't like this because (1) it means GPL2 is dead, and (2) it feels like admitting that an AI opt-out is something we specifically needed. Meanwhile, machine transformation of my work is something I generally want, I just want the license to be observed.

josh

@mcc yeah plus if the opt out for the stack stands it means they got everything in the past at least once so anyone with every version of it can combine them to get all the old stuff anyways. I HOPE that someone can get lucky and stop companies from shittifying everything, but it does kinda feel like this is the break in case of emergency that the clause in the gpl about adhering to future versions was made for

clacke: looking for something 🇸🇪🇭🇰💙💛

@josh @mcc Either copyright doesn't apply and then whatever you put in your license doesn't matter, or copyright does apply and then the existing copyleft licenses are enough.

sebastian

@mcc The other problem is that they copied stuff that isn't code. They have multiple repos of mine, containing CAD files or PCB layouts misclassified as Prolog code (of all things...). About half of the stuff they scraped from me is like that, so joke's on them I guess.

What pisses me off, at least a little, is that they took stuff that did not have a license added to it yet. According to German copyright law, that means they can look at those repos all day long, but not reproduce them or distribute them without my permission. As a German citizen I have to play by those rules, otherwise I'll get letters from smug lawyers within weeks. They don't seem to have to.

datarama

@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.

(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)
