Stefan Baack has published an insightful research article on Common Crawl, the nonprofit organization behind the huge archive of web crawl data that most generative AI models are trained on. Stefan spoke to a number of people at Common Crawl and explored its influence on large language models, AI research, and Common Crawl’s mission to “enable others to work like Google”. Here’s a quick excerpt:
When AI builders use Common Crawl uncritically as though its data covers all or a majority of the entire internet, they essentially declare a relatively small subsection of primarily English web pages as being representative of the entire world, even though it includes proportionately little content from other languages and cultures. Moreover, given the imperfections of automated filtering techniques that leave a lot of problematic content in Common Crawl untouched (and unannotated), including in popular filtered versions like C4 (Schaul, Chen, and Tiku 2023), AI builders have to rely on containing this toxicity by adapting LLMs after pre-training, for example, with fine-tuning techniques like RLHF. To what extent this is possible is questionable (Birhane et al. 2023), but even if we assume that toxic pre-training data can be sufficiently contained with fine-tuning and other techniques, as of today, keeping LLMs “clean” requires the continued efforts of data workers in often precarious working conditions (Williams, Miceli, and Gebru 2022).

(via)
Common Crawl is not the entirety of the Web, but the people behind the models we see and use give the impression that it’s close enough, or that they can access the entire Web to use in conjunction with those models. Those claims are either flawed or outright lies. Research like this helps demystify them, so it’s worth your time.