Most generative AI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development. #commoncrawl #ai #generativeAI #llm #datagovernance #sts https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl (1/10)
Common Crawl is created by a small nonprofit of the same name founded in 2007. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have. Providing data for AI training has never been its primary goal (2/10)
@tootbaack Can't wait to read!! Your earlier blogs on how data trains algorithms are the only reason I understand how the back end works!