@llimllib the idea of “model collapse” is almost irresistible, because it’s a story of LLMs being brought down by their dual sins of polluting the web and then training on unverified and unlicensed scraped data
If AI labs continued to train indiscriminately, it might be a problem, but those researchers are smarter than that: their whole game is sourcing (and often deliberately generating) high-quality training data
@simon that's pretty dystopian: the only source of consistently un-slopped data is locked up in the AI companies' vaults; the rest of us make do with the crap that's on the web