@TheDcoder @Codeberg I think the main reason people oppose the training of LLMs on GitHub repos is that they totally ignored the licences under which these projects were published. They just took all they could get and now make piles of money on the back of the community.
@ascendo @Codeberg AFAIK GitHub requires all public projects to be under a "forkable" license, so technically they're all free range.
I don't think the model can generate verbatim code samples from those projects... if it is doing something like that, then it's clearly theft (since it does not attribute).
I also get that it's somewhat unfair for them to be making money off of this, but it is also one of the beautiful things about FOSS: you can make money and sustain yourself from it!