Building language models is no small task. Sourcing the material takes a long time, as does getting it into a machine-learnable state and labelling it (which requires internal or external resources, depending on your approach). Then you have to keep updating it to stay current. Is there a possible solution? A paper led by DeepMind researchers says there might be: what if the data could be scraped from the web dynamically?
The paper proposes giving language models access to search engines so they can find the information themselves. This would alleviate the problem of models becoming outdated very quickly. But there are caveats, as Kyle Wiggers from VentureBeat noted:
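The idea can be sketched as a simple retrieval-augmented loop: the question goes to a search engine, the top snippets are prepended to the prompt, and the model answers from that evidence. Below is a minimal, self-contained illustration in Python, with hypothetical stand-ins (`stub_search`, `stub_generate`) for a real search API and language model; it is a sketch of the general technique, not DeepMind's actual system.

```python
def stub_search(query, k=2):
    """Pretend search engine: ranks canned snippets by word overlap with the query."""
    corpus = [
        "The 2022 Winter Olympics were held in Beijing.",
        "Beijing hosted the Summer Olympics in 2008.",
        "Paris will host the 2024 Summer Olympics.",
    ]
    words = set(query.lower().split())
    ranked = sorted(corpus, key=lambda s: -len(words & set(s.lower().split())))
    return ranked[:k]

def stub_generate(prompt):
    """Pretend language model: echoes the top evidence line from the prompt."""
    evidence = [line for line in prompt.splitlines() if line.startswith("- ")]
    return evidence[0][2:] if evidence else "I don't know."

def answer_with_search(question):
    """Retrieve snippets, build an evidence-grounded prompt, and 'generate'."""
    snippets = stub_search(question)
    prompt = "Answer using the evidence below.\n"
    prompt += "\n".join(f"- {s}" for s in snippets)
    prompt += f"\nQuestion: {question}\nAnswer:"
    return stub_generate(prompt)

print(answer_with_search("Where were the 2022 Winter Olympics held?"))
# → The 2022 Winter Olympics were held in Beijing.
```

In a real system, `stub_search` would call a commercial search API and `stub_generate` a trained model, but the flow is the same: fresh evidence is fetched at query time instead of being frozen into the training set.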
The “live” web is less curated than the static datasets historically used to train language models and, by implication, less filtered. Most labs developing language models take pains to identify potentially problematic content in the training data to minimize potential future issues. For example, in creating an open source text dataset containing hundreds of gigabytes of webpages, research group EleutherAI claims to have performed “extensive bias analysis” and made “tough editorial decisions” to exclude data they felt were “unacceptably negatively biased” toward certain groups or views.
The live web can be filtered to a degree, of course. And as the DeepMind researchers note, search engines like Google and Bing use their own “safety” mechanisms to reduce the chances unreliable content rises to the top of results. But these results can be gamed — and aren’t necessarily representative of the totality of the web. As a recent piece in The New Yorker notes, Google’s algorithm prioritizes websites that use modern web technologies like encryption, mobile support, and schema markup. Many websites with otherwise quality content get lost in the shuffle as a result.
And scraping search engines like Google means language models are subject to different types of bias. Much of the popular content you find on search engines is optimised for them (all hail SEO), and Google has its own form of curated content in its "featured snippets" and various search algorithms. And with Alphabet owning both DeepMind and Google, is there a conflict of interest with regard to fair and unbiased data?