@davidrevoy @chengdulittlea @sepia also it claims they didn't scrape anything at all. Common Crawl is an old project that created dumps of the ~entire internet, and they seem to have simply formatted whatever can be formatted there into a txt-img parallel corpus.