Remember IBM’s Watson, the AI Jeopardy! champion? A 2010 promotion proclaimed, “Watson understands natural language with all its ambiguity and complexity.” However, as we saw when Watson subsequently failed spectacularly in its quest to “revolutionize medicine with artificial intelligence,” a veneer of linguistic facility is not the same as actually comprehending human language.
Natural language understanding has long been a major goal of AI research. At first, researchers tried to manually program everything a machine would need to make sense of news stories, fiction or anything else humans might write. This approach, as Watson showed, was futile — it’s impossible to write down all the unwritten facts, rules and assumptions required for understanding text. More recently, a new paradigm has been established: Instead of building in explicit knowledge, we let machines learn to understand language on their own, simply by ingesting vast amounts of written text and learning to predict words. The result is what researchers call a language model. When based on large neural networks, like OpenAI’s GPT-3, such models can generate uncannily humanlike prose (and poetry!) and seemingly perform sophisticated linguistic reasoning.
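To make "learning to predict words" concrete, here's a minimal sketch of that training objective in action. It assumes the open-source Hugging Face transformers library and the small GPT-2 model (GPT-3's weights aren't public), so take it as an illustration rather than anything GPT-3-specific:

```python
# Minimal sketch of the word-prediction objective behind language models.
# Assumes the Hugging Face `transformers` library and the small GPT-2 model
# as a stand-in for larger systems like GPT-3, whose weights aren't public.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Natural language understanding has long been a major goal of"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits  # shape: (1, sequence_length, vocab_size)

# The distribution over the *next* token is all the model is ever trained on.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values.tolist(), top.indices.tolist()):
    print(f"{tokenizer.decode([token_id])!r}  p={prob:.3f}")
```

Everything these models generate, prose or poetry, comes from repeating that one prediction step, token by token.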
But has GPT-3 — trained on text from thousands of websites, books and encyclopedias — transcended Watson’s veneer? Does it really understand the language it generates and ostensibly reasons about?
I’ve spent the last year or so getting into natural language processing, and it’s been a fascinating journey, but I’ve always been wary of the text generation problem. Language models like GPT-3 are expansive, but they’re riddled with biases. It’s up to the people using these models to fine-tune them for their own purposes and weed those biases out, and that takes time, money, and resources that companies don’t want to provide, even when it means stopping the models from being racist, sexist, or homophobic.
One of the best-known attempts at testing whether an AI really understands language, proposed as a more objective alternative to the Turing test, is the Winograd schema challenge:
In a 2012 paper, the computer scientists Hector Levesque, Ernest Davis and Leora Morgenstern proposed a more objective test, which they called the Winograd schema challenge. This test has since been adopted in the AI language community as one way, and perhaps the best way, to assess machine understanding — though as we’ll see, it is not perfect. A Winograd schema, named for the language researcher Terry Winograd, consists of a pair of sentences, differing by exactly one word, each followed by a question.
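To make the format concrete, here's how one schema might be represented in code. The trophy/suitcase pair is the classic example from the Winograd schema literature; the dataclass and its field names are just my own illustration, not any official format:

```python
# One Winograd schema: two sentences differing by a single word, each
# followed by the same question about what the pronoun refers to.
# The trophy/suitcase pair is the classic example from the Winograd schema
# literature; the dataclass and field names are my own illustration.
from dataclasses import dataclass

@dataclass
class WinogradSchema:
    sentence_a: str    # version with the first special word ("big")
    sentence_b: str    # identical except that word is swapped ("small")
    question: str      # asks what the ambiguous pronoun refers to
    candidates: tuple  # the two possible referents
    answer_a: str      # correct referent for sentence_a
    answer_b: str      # correct referent for sentence_b

trophy = WinogradSchema(
    sentence_a="The trophy doesn't fit in the brown suitcase because it is too big.",
    sentence_b="The trophy doesn't fit in the brown suitcase because it is too small.",
    question="What does 'it' refer to?",
    candidates=("the trophy", "the suitcase"),
    answer_a="the trophy",
    answer_b="the suitcase",
)
```

Flipping one word flips the correct answer, which is the whole point of the design.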
[…] These schemas were the subject of a competition held in 2016 in which the winning program was correct on only 58% of the sentences — hardly a better result than if it had guessed. Oren Etzioni, a leading AI researcher, quipped, “When AI can’t determine what ‘it’ refers to in a sentence, it’s hard to believe that it will take over the world.”
But then the language models got smarter and soon started scoring 90% and above. Rejoice for AI… except there was a flaw: the Winograd schema challenge was supposed to involve questions that couldn’t be Googled.
These challenges, like many other current tests of AI language understanding, sometimes permit shortcuts that allow neural networks to perform well without understanding. For example, consider the sentences “The sports car passed the mail truck because it was going faster” and “The sports car passed the mail truck because it was going slower.” A language model trained on a huge corpus of English sentences will have absorbed the correlation between “sports car” and “fast,” and between “mail truck” and “slow,” and so it can answer correctly based on those correlations alone rather than by drawing on any understanding.
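One simple way to see that shortcut at work: substitute each candidate noun phrase for the pronoun and let a language model score the two resulting sentences, picking whichever it finds more probable. This is a rough probe I'm sketching with the Hugging Face transformers library and GPT-2, not the official Winograd or WinoGrande evaluation:

```python
# Rough probe of the correlation shortcut: swap each candidate in for the
# pronoun and ask a language model which full sentence it finds more likely.
# Assumes the Hugging Face `transformers` library and GPT-2; this is an
# illustration, not the official Winograd/WinoGrande evaluation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def total_log_prob(sentence: str) -> float:
    """Total log-probability the model assigns to the sentence's tokens."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=ids the model returns the mean cross-entropy loss,
        # i.e. the average negative log-probability per predicted token.
        loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

template = "The sports car passed the mail truck because {} was going faster."
for candidate in ("the sports car", "the mail truck"):
    score = total_log_prob(template.format(candidate))
    print(f"{candidate:>15}: log-prob = {score:.1f}")
```

A model that has absorbed the "sports car"/"fast" correlation will tend to score the first substitution higher, and it will get the "slower" version right for the same statistical reason, no understanding required.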
In 2019, a group of researchers created a larger, crowdsourced version called WinoGrande. And therein lay another problem: human error and bias. AI and human scores diverged depending on whether a system had to answer just one sentence of a pair correctly or both. It finally boiled down to what constitutes gaining knowledge in the first place. An AI can easily process data to answer questions, but it’s a long way from understanding human idiosyncrasies like sarcasm, idioms, and double entendres. Even humans struggle with those, since they rarely translate between languages, and we can’t learn them without experience, something an AI cannot do (as distinct from having more data added for training). You can’t train human experience. But lord knows the data overlords will try.
Filed under: linguistics, natural language processing, neural networks