Stefan Baack



Stefan Baack

Homepage:

https://sbaack.com

Github:

https://github.com/sbaack

Publications:

https://scholar.google.com/citations?user=cphFdLUAAAAJ&hl=en&oi=ao

Bluesky:

https://bsky.app/profile/sbaack.bsky.social

Personal info

About:

Research and data analyst @mozilla. Studies journalism and tech activism and, more recently, alternative data governance. he/him

Wall 1 post

Stefan Baack

Most generative AI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development #commoncrawl #ai #generativeAI #llm #datagovernance #sts https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl (1/10)

6 February at 16:18

Stefan Baack

Common Crawl is created by a small nonprofit of the same name, founded in 2007. Its mission is to level the playing field for technology development by giving free access to data that only companies like Google used to have. Providing data for AI training has never been its primary goal (2/10)

6 February at 16:21


seher

@tootbaack Can't wait to read!! Your earlier blogs on how data trains algorithms are the only reason I understand how the back end works!

6 February at 16:23
