@yabellini SciHub makes papers public that are behind paywalls. I agree that they shouldn't be behind paywalls, but that's completely different from what OpenAI did.
I think they mostly used sources that are public anyway, like Wikipedia. They also didn't publish those sources; they trained an AI on them, and it creates new texts. So in a way they did a remix. Remixes are handled differently in copyright law.
"The corpus [GPT-2] was trained on, […] 40 [GB] of text from URLs shared in Reddit" https://en.wikipedia.org/wiki/OpenAI
@duco
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html#:~:text=1.3k-,The%20Times%20Sues%20OpenAI%20and%20Microsoft%20Over%20A.I.,with%20it%2C%20the%20lawsuit%20said.
I recommend reading the lawsuit. Not only was it written by lawyers who know the law, it is also very clear:
https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf