(The exact value of "N" is not known yet; I assume it will be solidly fixed by some upcoming court case.)
@pinkdrunkenelephants The licenses already bar this because they govern derivative works. If the derivative work can be made non-derivative just by calling it "AI", then adding a nonsense clause banning "AI" accomplishes nothing: the AI companies can simply rename "AI" to "floopleflorp" and say "Ah, but your license only bans 'AI'; it doesn't ban 'floopleflorp'!"

@mcc They can rename it Clancy for all it matters. AI is still AI, and actions don't just lose meaning because of evil people playing with language.

@pinkdrunkenelephants But AI is not AI. The things they're calling "AI" are just machine learning statistical models. Ten years ago this wouldn't have been considered "AI".

@mcc Doesn't matter; what matters is the definition behind the word. That is what licenses ought to ban outright. It's like saying rape is perfectly legal so long as we call it forced sex. Who would believe that, other than someone already predisposed to rape? Don't fall for other people's manipulative mind games.

@pinkdrunkenelephants The definition behind the law is, again, decided by humans, who are capable of inconsistency and poor decisions. In New York, an act most people would call rape is not legally "rape", because the statute defines rape by the use of certain specific genitals. See E. Jean Carroll v. Donald J. Trump.

@mcc And no one accepts that, because of what I'm saying. A rose by any other name would smell as sweet. People need to start recognizing that fact. That's the only way things will change.

@pinkdrunkenelephants Well, per my belief as to the meaning of words, ML statistical models are derivative works like any other, and my licenses which place restrictions on derivative works already apply to those ML statistical models.

@pinkdrunkenelephants @mcc That doesn't work if copyright *itself* doesn't apply to AI training, which is what all those court cases are about. Licenses start from the assumption that the copyright holder reserves all rights, and then the license explicitly waives some of those rights under a set of given conditions. But with AI, it's up in the air whether the copyright holder has any rights at all.

@pinkdrunkenelephants @datarama Because humans are also the ones who interpret and enforce laws, and if the government does not enforce copyright against companies which market their products as "AI", then copyright does not apply to those companies.

@pinkdrunkenelephants @mcc In the EU, there actually is some legislation. Copyright explicitly *doesn't* protect works from being used in machine learning for academic research, but ML training for commercial products must respect a "machine-readable opt-out". That's easy enough to get around, though; it's why e.g. Stability funded an "independent research lab" that did the actual data gathering for them.

@datarama I consider this illegitimate and fundamentally unfair, because I have already released large amounts of work under Creative Commons/open source licenses. I can't retroactively add terms to some of them just because the plain language somehow no longer applies. If I added such opt-outs now, it would be like admitting the licenses previously didn't apply to statistics-based derivative works.

@pinkdrunkenelephants @mcc I think if there were a simple clear-cut answer to that, the world would be a *very* different place.

@mcc it's kinda gross that the only (current) way to meaningfully and tangibly refuse to be exploited by the mass commercialised theft of the commons is to, well, commercialise the commons.
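For readers unsure what the "machine-readable opt-out" mentioned a few posts up can look like in practice, here is a minimal sketch, my own illustration rather than anything from the thread: one common mechanism today is a robots.txt rule aimed at AI-training crawlers, which a cooperating crawler checks before fetching a page. The helper name `may_train_on` is made up for the example; GPTBot (OpenAI), CCBot (Common Crawl), and Google-Extended are real crawler user agents. Whether honouring robots.txt actually satisfies the EU rule, or whether scrapers bother to run such a check at all, is exactly what the thread is disputing.

```python
# A minimal sketch, not from the thread: a crawler that chooses to honour a
# robots.txt-based opt-out would check permission roughly like this.
# `may_train_on` is an illustrative name; GPTBot is OpenAI's real crawler
# user agent (CCBot and Google-Extended are other real examples).
import urllib.robotparser

AI_USER_AGENT = "GPTBot"

def may_train_on(page_url: str, robots_url: str) -> bool:
    """Return True if the site's robots.txt allows AI_USER_AGENT to fetch page_url."""
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the site's robots.txt
    return parser.can_fetch(AI_USER_AGENT, page_url)

if __name__ == "__main__":
    print(may_train_on("https://example.com/some-post",
                       "https://example.com/robots.txt"))
```

A site opting out would add a "User-agent: GPTBot" / "Disallow: /" stanza to its robots.txt; as the posts above note, nothing compels a scraper to perform this check, which is why the opt-out is "easy enough to get around".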
@mcc although if there's an angstrom-thick silver lining to this whole thing, it's that it has proved incontrovertibly that copyright law was only ever intended to be used as a cudgel by the wealthy and powerful, and never to protect the rights of the individual artist.

@gsuberland @mcc The artists occasionally tried using the cudgel, but the opponents brought an AK47 to the courtroom...

@mcc Literally zero. I have a thing I've been hacking on for a while, niche shit, probably not interesting to many others. I was planning on releasing it, but once I realized it'd probably have 0-1 other human users, yet end up in every LLM training set, I decided not to.

@mcc "the legal system is ultimately a weapon wielded by those with more capital against those with less" is of course the punchline after every movement that has tried to use legal mechanisms like licenses to enact social change. it'd be nice if there were some deep pan-institutional awareness of and correction for this.

Did you see this? The whole thing with "the stack". https://post.lurk.org/@emenel/112111014479288871 Some jerks did mass scraping of open source projects, putting them in a collection called "the stack", which they specifically recommend other people use as machine learning sources. If you look at their "GitHub opt-out repository" you'll find just page after page of people asking to have their stuff removed: https://github.com/bigcode-project/opt-out-v2/issues (1/2)

…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't actually doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, putting it on microfilm in a seed bank: that is the kind of thing I want to encourage. The problem is that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)

So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "the stack", not have any of those opted-out repositories to draw from.

Like, heck, how am I *supposed* to rely on my code getting preserved after I lose interest, I die, BitBucket deletes every bit of Mercurial-hosted content it ever hosted, etc.? Am I supposed to rely on *Microsoft* to responsibly preserve my work? Holy crud no. We *want* people to want their code widely mirrored and distributed. That was the reason for the licenses. That was the social contract. But if machine learning means the social contract is dead, why would people want their code mirrored?

@mcc I have generally come to the conclusion that this is an intended effect. All the things you feel compelled to do for the good of others, in an ordinarily altruistic sense, are essentially made impossible unless you accept that your works and your expressions will be repackaged, sold, and absorbed into commercialised datasets. The SoaD line "manufacturing consent is the name of the game" has been in my head a lot lately.

@gsuberland @mcc One almost wonders if the end-game is to stop pulling and try pushing.
Maybe instead of trying to claw back data we've made publicly crawlable because "I wanted it visible, but not like that", we should ask why any of these companies get to keep their data proprietary when it's built on ours. Would people be more okay with all of this if the rule were "You can build a trained model off of publicly-available data, but that model must itself be publicly available"?

Please don't opt out all of your repositories; leave in the ones that didn't work, or didn't compile, or are full of security holes.

@mcc yeah, plus if the opt-out for the stack stands, it means they got everything in the past at least once, so anyone with every version of it can combine them to get all the old stuff anyway. I HOPE that someone can get lucky and stop companies from shittifying everything, but it does kinda feel like this is the break-in-case-of-emergency situation that the clause in the GPL about adhering to future versions was made for.

@mcc That's also basically how LAION made the dataset for Stable Diffusion. They collected a bunch of links to images with descriptive alt-text. (Are you taking time to write good alt-text because you respect disabled people? Congratulations, your good work is being exploited by the worst assholes in tech. Silicon Valley never lets a good deed go unpunished.)

@mcc Did copyleft licenses ever meaningfully restrict the behavior of large corporations? Licenses are effectively a statement of intent with respect to future litigation, and if the copyright holder is not willing or able to actually *perform* that litigation, everyone gradually understands that this is a Mexican standoff where one side's guns aren't loaded.

@mcc IIRC, Sony did it much earlier. I cannot even find any record of this, but as I recall, Sony distributed a modified version of GCC as part of their early PlayStation SDKs, in a way which clearly violated the GPL. The FSF found out somehow, and the result was just that Sony said "oops, our bad, we forgot to contractually forbid members of our SDK program from talking to you" and then later switched to LLVM.

@mcc one thing that gives me hope is that we are reaching an inflection point in internet curatorship, in which AI is so pervasive that you have to actually dig into internet archives to find valuable information. This gives me the impression that we are looking at a future where, yes, AI will be very pervasive, but niche communities built on trust and self-curation (stuff like web rings) will be more common. Users will look for stuff written by humans, not AI.
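To make concrete what "collected a bunch of links to images with descriptive alt-text" means mechanically, here is a minimal sketch, my own illustration rather than LAION's actual pipeline: harvesting (image URL, alt text) pairs from public HTML with Python's standard-library parser. The class name ImgAltCollector and the sample page are invented for the example.

```python
# A minimal sketch, assuming only the description in the post above: datasets of
# this kind are built from (image URL, alt text) pairs scraped from public HTML.
# This illustrates the idea; it is not LAION's actual tooling.
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs for <img> tags that carry descriptive alt text."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            a = dict(attrs)
            if a.get("src") and a.get("alt", "").strip():
                self.pairs.append((a["src"], a["alt"].strip()))

page = '<img src="/cat.jpg" alt="A tabby cat asleep on a windowsill">'
collector = ImgAltCollector()
collector.feed(page)
print(collector.pairs)  # [('/cat.jpg', 'A tabby cat asleep on a windowsill')]
```

Every carefully written alt attribute becomes a free caption for an image-text training pair, which is the irony the post above is pointing at.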
In a world where copyleft licenses turn out to restrict only the small actors they were meant to empower, and don't apply to big bad-actor "AI" companies, what is the incentive to put your work out under a license that will only serve to make it a target for "AI" scraping?
With NFTs, we saw people taking their work private because putting it behind a clickwall/paywall was the only way to keep it from being stolen for NFTs. I assume the same process will accelerate in an "AI" world.