…but wait! If you look at what they actually did (correct me if I'm wrong), they aren't doing any machine learning in the "stack" repo itself. The "stack" just collects zillions of repos in one place. Mirroring my content as part of a corpus of open source software, torrenting it, or putting it on microfilm in a seedbank is the kind of thing I want to encourage. The problem is that they then *suggest* people create derivative works of those repos in contravention of the license. (2/2)
So… what is happening here? All these people are opting out of having their content recorded as part of a corpus of open source code. And I'll probably do the same, because "The Stack" is falsely implying people have permission to use it for ML training. But this means "The Stack" has put a knife in the heart of publicly archiving open source code at all. Future attempts to preserve OSS code will, if they base themselves on "The Stack", not have any of those opted-out repositories to draw from.