mcc

…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem becomes that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)

3 April at 20:36 | Open on mastodon.social

40 comments

mcc

So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.

3 April at 20:39 | Open on mastodon.social

mcc

Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no.

We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?

3 April at 20:39 | Open on mastodon.social

Graham Spookyland🎃/Polynomial

@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets.

The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.

3 April at 20:48 | Open on chaos.social

Mark T. Tomczak

@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.

Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that" we ask why any of these companies get to keep their data proprietary when it's built on ours?

Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly-available?"

3 April at 20:53 | Open on mastodon.fixermark.com

mcc replied to Mark T. Tomczak

@mark @gsuberland In my opinion, a trapdoor like "okay, well if copyright doesn't apply to the training data you stole, your model isn't copyrightable either" is no good. The US Gov has already said GenAI images and text are not copyrightable. It doesn't help. The thing about generative AI is it inherently takes heavy computational resources (disk space, CPU time, often-unacknowledged low-wage tagging work). Therefore, as a tool, it is inherently biased toward capital and away from individuals.

3 April at 20:59 | Open on mastodon.social

mcc replied to mcc

@mark @gsuberland If we say "AI is a new class of thing that is outside the copyright regime entirely", that is not a level playing field. The tool is designed in a way it inherently serves the powerful. "Machine learning models are inherently open" is the exact model I am afraid of— a world where copyright is something that applies to actors who have less than some specific amount of money, and anyone with more than that specific amount of money is liberated from it.

3 April at 20:59 | Open on mastodon.social

Graham Spookyland🎃/Polynomial replied to mcc

@mcc @mark yes. the only real push back solution that levels the playing field would be to say that you are not allowed to unilaterally make money off it, which essentially just falls back to enforcing copyright law against the rich, which... yeah, exactly the problem.

3 April at 21:04 | Open on chaos.social

jbaggs replied to mcc

@mcc @mark @gsuberland Every x number of years we get business people trying to circumvent law by claiming the old laws don't apply, because computers.

3 April at 21:10 | Open on infosec.exchange

datarama replied to mcc

@mcc @mark @gsuberland Exactly.

Even if, say, GPT-4 wasn't covered by copyright, so what? Even if you could get it out of OpenAI's data centres in the first place, you couldn't run it with reasonable performance. And you *certainly* couldn't retrain it.

3 April at 21:15 | Open on hachyderm.io

Oblomov replied to datarama

@datarama @mcc @mark @gsuberland there is one upside to forcing these models to be open and it's that it removes one of the, if not the primary, incentives in developing them in the first place. Yes, they could still sell its execution as a service, but if they lose control of the model itself, it becomes a considerably less profitable endeavor.

3 April at 21:23 | Open on sociale.network

datarama replied to Oblomov

@oblomov @mcc @mark @gsuberland How, though?

Let's say that tomorrow, a judge rules that GPT-4 is not covered by copyright. What has actually changed? OpenAI isn't compelled to share it with anyone, and it's too big for anyone except large and wealthy corporations to actually do anything with.

Sure, you couldn't get sued if you got a bittorrent of it somehow. But you're not getting a bittorrent of a 1.76 trillion parameter neural network anyway.

3 April at 21:28 | Open on hachyderm.io

Graham Spookyland🎃/Polynomial replied to datarama

@datarama @oblomov @mcc @mark and you sure as shit can't afford a whole rack of H200 cards to make use of it, even if you and all your friends pitch in. it's only useful with people who have the capital to wield it.

3 April at 21:31 | Open on chaos.social

crzwdjk ✅ replied to datarama

@datarama @oblomov @mcc @mark @gsuberland 1.76 trillion parameters is about a hard drive's worth of data, no?

4 April at 0:23 | Open on mastodon.social
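
For a rough sense of scale, the "hard drive's worth" estimate can be checked with a back-of-envelope sketch. It takes the 1.76 trillion parameter figure quoted in the thread at face value (it is not an official number), and the bytes-per-parameter widths are just common numeric formats assumed for illustration, not anything known about the actual model.

```python
# Back-of-envelope storage estimate for a model's raw weights.
# ASSUMPTIONS: the parameter count is the figure quoted in the thread
# (not an official number); the per-parameter widths are common numeric
# formats chosen for illustration only.

PARAMS = 1.76e12  # parameter count as quoted in the thread

BYTES_PER_PARAM = {
    "fp32": 4,  # 32-bit floats
    "fp16": 2,  # 16-bit floats
    "int8": 1,  # 8-bit quantized weights
}


def weights_size_tb(params: float, bytes_per_param: int) -> float:
    """Return raw weight storage in decimal terabytes (1 TB = 1e12 bytes)."""
    return params * bytes_per_param / 1e12


for fmt, width in BYTES_PER_PARAM.items():
    print(f"{fmt}: {weights_size_tb(PARAMS, width):.2f} TB")
```

Under these assumptions the weights alone come to a few terabytes, which does fit on a single large hard drive. The sketch covers storage only; it says nothing about the accelerator memory or compute needed to actually run or retrain such a model.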

datarama replied to crzwdjk ✅

@crzwdjk @oblomov @mcc @mark @gsuberland It is, but that's *still* beside the point. You can't actually do anything with it unless you have the resources of a large corporation.

And my other point was that just because it isn't copyrighted, they can still keep it secret.

4 April at 5:33 | Open on hachyderm.io

datarama

@gsuberland @mcc This isn't why the AI craze has made me anxious, but it *is* why I have become terribly depressed.

I like writing code and making various weird computer programs, and sharing them with people for mutual entertainment and occasional enlightenment. Now I can't do that without accepting that everything I do will be appropriated and commoditized by some of the most horrible people in tech, unless I do it in secret.

And then what's the point?

3 April at 21:38 | Open on hachyderm.io

asmaloney (Andy) 🌎 replied to datarama

@datarama @gsuberland @mcc > And then what's the point?

💯 I'm feeling exactly the same way and I'm really struggling with it.

Not just code but blog posts/tutorials as well. I've "lost" my main creative outlets.

4 April at 3:16 | Open on fosstodon.org

datarama replied to asmaloney (Andy) 🌎

@asmaloney @gsuberland @mcc That's where I'm at too.

And I have never been as depressed as I have this last year. For every other awful period in my life, I always had creative computer things to fall back on - literally, that has been how I kept from going too crazy in the entire story from "tiny bullied autistic kid" to "middle-aged guy holed up all alone during a pandemic". There was always coding and writing.

4 April at 5:38 | Open on hachyderm.io

datarama replied to datarama

@asmaloney @gsuberland @mcc Coding feels especially meaningless now. I try to convince myself that even after we all get fired and replaced with shitty AI, we could still do it for fun - but it's not fun when you know all you're really doing is providing more free training data for the same assholes who are actively working to destroy your life.

4 April at 5:40 | Open on hachyderm.io

asmaloney (Andy) 🌎 replied to datarama

@datarama "middle-aged guy holed up all alone during a pandemic"

I feel seen (as the kids say these days). 😆

4 April at 13:05 | Open on fosstodon.org

datarama replied to asmaloney (Andy) 🌎

@asmaloney I sometimes think about how much that particular experience has coloured the rest of my experience of this bleak, bleak decade. I sat at home with nearly no social contact for 1½ years (except what came in through Teams), and even if I'm a bit of an introvert, I'm sure it made me a bit crazy.

4 April at 19:46 | Open on hachyderm.io

asmaloney (Andy) 🌎 replied to datarama

@datarama Me too. Really struggling to "dig out" of that and then all this other shit ("AI", wars, climate, politics, layoffs for even more profit, the shoddy state of software in general, etc.) just piles on.

I think we're very much in the same situation, so you aren't alone. I hope you find some peace or at least some outlet to move things in a positive direction.

I'm still lookin'... 😀

4 April at 23:24 | Open on fosstodon.org

datarama replied to asmaloney (Andy) 🌎

@asmaloney I've been looking for a long, long time too. And I don't know the way out.

Every crisis is immediately followed by the next, without any of them being resolved. I am so tired.

5 April at 16:28 | Open on hachyderm.io

margot

@mcc have we considered starting a secret society with arcane rites devoted to preserving and protecting open source code

3 April at 20:49 | Open on mastodon.social

Hugo Mills

@emaytch @mcc I propose "The IlluminFTP".

3 April at 20:51 | Open on mstdn.social

Mark T. Tomczak

@emaytch @mcc So there's a lot of stuff that Paul Graham says that I don't agree with (these days; used to be pretty bought in), but I think the point he made about the nature of copyright and patent protection ages ago rings true.

Paraphrasing without citation because I'm not going to go crawling around to find it right now: the alternative to IP protection isn't a magical utopia of shared ideas... It's guilds and secret knowledge protected with violence. We already tried society without intellectual property protection.

3 April at 20:55 | Open on mastodon.fixermark.com

✧✦✶✷Catherine✷✶✦✧ replied to Mark T. Tomczak

@mark @emaytch @mcc if this was true I could get documentation for any of the ASICs Broadcom sells and I can't

3 April at 21:01 | Open on mastodon.social

Foone🏳️‍⚧️

@emaytch @mcc why would we need two of those?

3 April at 20:57 | Open on digipres.club

gkrnours

@emaytch @mcc 🧙‍♀️

3 April at 21:18 | Open on mastodon.gamedev.place

Peter Linss

@emaytch @mcc where each member chooses a repo to memorize. At the secret meetings in the woods we take turns reciting them back to each other…

3 April at 23:33 | Open on social.linss.com

StaringAtClouds

@emaytch @mcc Ossiris

Just 'cos it sounds fun & it's got OSS in it

Sorry it's a bit late here & brain isn't up to working out a proper acronym

4 April at 0:52 | Open on mastodon.social

bob

@mcc there would only be a cost to you as an open source author if LLM code generation worked, though

3 April at 21:27 | Open on feed.hella.cheap

mcc

@bob Depends on what "works" means. I believe that LLMs are capable of substantially reproducing entire paragraphs, code functions or images from their training set under circumstances where the origin is not disclosed or easily traced back.

3 April at 21:35 | Open on mastodon.social

bob replied to mcc

@mcc in a world where people were already copy/pasting from stackoverflow all day does that make a difference?

3 April at 21:36 | Open on feed.hella.cheap

Aedius Filmania ⚙️🎮🖊️

@mcc

Please don't opt out all your repositories; leave the ones that didn't work, didn't compile, or are full of security holes.

3 April at 20:40 | Open on lavraievie.social

josh

@mcc i feel like we need llm opt out considerations in foss licenses tbh, then host code off github and nothing changes? Hard to enforce, idk. Unlikely politicians will get it right, maybe the FTC will get lucky?

3 April at 20:42 | Open on wetdry.world

mcc

@josh I don't like this because (1) it means GPL2 is dead, and (2) it feels like admitting that an AI opt-out is something we specifically needed. Meanwhile, machine transformation of my work is something I generally want, I just want the license to be observed.

3 April at 20:43 | Open on mastodon.social

josh

@mcc yeah plus if the opt out for the stack stands it means they got everything in the past at least once so anyone with every version of it can combine them to get all the old stuff anyways. I HOPE that someone can get lucky and stop companies from shittifying everything, but it does kinda feel like this is the break in case of emergency that the clause in the gpl about adhering to future versions was made for

3 April at 20:48 | Open on wetdry.world

clacke: looking for something 🇸🇪🇭🇰💙💛

@josh @mcc Either copyright doesn't apply and then whatever you put in your license doesn't matter, or copyright does apply and then the existing copyleft licenses are enough.

3 April at 20:47 | Open on libranet.de

sebastian

@mcc The other problem is that they copied stuff that isn't code. They have multiple repos of mine, containing CAD files or PCB layouts misclassified as Prolog code (of all things...). About half of the stuff they scraped from me is like that, so joke's on them I guess.

What pisses me off, at least a little, is that they took stuff that did not have a license added to it yet. According to German copyright law, that means they can look at those repos all day long, but not reproduce them or distribute them without my permission. As a German citizen I have to play by those rules, otherwise I'll get letters from smug lawyers within weeks. They don't seem to have to.

3 April at 20:57 | Open on schottkydio.de

datarama

@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text.

(Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)

3 April at 20:39 | Open on hachyderm.io
