If you believe the sensational headlines from January 2018, both Microsoft and Alibaba have developed computer programs that outperform humans on reading comprehension. Can this be correct? Well, it really depends on how you define “comprehension.” Let’s start with the claims.
The test given to the programs is actually a dataset compiled by a group of computer scientists at Stanford University. It’s known as the Stanford Question Answering Dataset (or SQuAD for short), and it’s made up of more than 100,000 pairs of questions and answers based on paragraph-length excerpts from 536 Wikipedia articles. The program or the person reads the excerpt and then answers questions about it.
At a surface glance, SQuAD can appear formidable. The topics cover a wide range of knowledge, including historical trivia (“When did Martin Luther die?”), pop culture (“Which Doctor Who enemy is also a Time Lord?”), and basic chemistry (“What do you need to make combustion happen?”). The source paragraphs make for dense reading, often focusing on arcane topics like the EU’s legislative protocol and the concept of civil disobedience.
Faced with SQuAD’s passages and questions, humans got around 82.3 percent of the answers right. Alibaba and Microsoft’s AIs barely edged this out — getting 82.4 percent and 82.6 percent, respectively. It might be close, but a win’s a win.
But if you look below the surface, the test itself is actually quite easy. For each question, both the programs and the humans knew that the answer had to be located somewhere in the source paragraph, and not just the information but the exact wording of the answer. When you read “Whose authority did Luther’s theology oppose?”, it may seem difficult, but when the source text includes the sentence “[Luther’s] theology challenged the authority and office of the Pope,” it’s quite clear what the answer should be. You don’t need to understand what “authority” means; you just need to look for the subject and object of a sentence. No comprehension of the passage is required.
In fact, AI experts say the test is far too limited to compare with real reading. The test used only cleanly formatted Wikipedia articles — not the wide-ranging styles and complexities of books, news articles, and other reading materials that humans encounter every day. And the answers generated by the programs come from finding specific patterns in the text and matching terms in the questions and answers.
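To make the idea of term matching concrete, here is a minimal sketch in Python. It is not how the Microsoft or Alibaba systems work (those are trained neural networks); it is a toy word-overlap baseline, and the function names, the stopword list, and the three-sentence passage are all made up for illustration. It simply returns the sentence that shares the most content words with the question.

```python
import re

# Small, hand-picked stopword list (an assumption for this toy example).
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "was",
    "did", "do", "does", "what", "who", "whose", "when", "which",
}

def content_words(text):
    """Lowercase the text and return its non-stopword tokens as a set."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def best_matching_sentence(passage, question):
    """Return the passage sentence with the largest word overlap with the question."""
    sentences = re.split(r"(?<=[.!?])\s+", passage.strip())
    question_words = content_words(question)
    return max(sentences, key=lambda s: len(question_words & content_words(s)))

# Illustrative passage written for this example, not taken from SQuAD.
PASSAGE = (
    "Luther was born in Eisleben. "
    "His theology challenged the authority and office of the Pope. "
    "He died in 1546."
)
QUESTION = "Whose authority did Luther's theology oppose?"

print(best_matching_sentence(PASSAGE, QUESTION))
```

The sketch finds the right sentence only because the question reuses the words “authority” and “theology.” At no point does it need to know what either word means.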
When researchers added gibberish to the passages, something a human would easily ignore, the AIs tended to get confused, spitting out the wrong result. And every passage in the test was guaranteed to include the answer, unlike with normal reading, meaning the models didn’t have to process concepts or reason with other ideas.
Stephen Merity, a research scientist who works on language AI, said that calling the programs “superhuman” was “madness…. There’s no built-in ability for the model to determine or signal that it thinks the paragraph is insufficient to answer the question … and it’ll always spit you back something.”
Oren Etzioni, chief executive of the Allen Institute for Artificial Intelligence, an AI research group funded by Microsoft co-founder Paul Allen, commented that “these systems are brittle, in that small changes to paragraphs result in very bad behavior [and misunderstandings].” And when it came to something like drawing conclusions from two sentences or understanding implied ideas, the models lagged even further behind. “These kind [sic] of implications that we do naturally, without even thinking about it, these systems don’t do,” he said.
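Merity’s point can be pictured with a small Python sketch. It assumes, purely for illustration, that a model assigns a score to each candidate answer in the passage; the scores, the function names, and the threshold below are invented and do not come from any real system. The first function always returns its top-scoring candidate, however weak. The second shows the kind of “no answer” option that Merity says is missing.

```python
def pick_answer(span_scores):
    """Return the highest-scoring candidate; there is no way to say 'no answer'."""
    return max(span_scores, key=span_scores.get)

def pick_answer_or_abstain(span_scores, threshold):
    """Hypothetical variant: return None when even the best candidate scores too low."""
    best = max(span_scores, key=span_scores.get)
    return best if span_scores[best] >= threshold else None

# Invented scores for a question the passage cannot actually answer.
weak_scores = {"the Pope": 0.05, "1546": 0.04, "Eisleben": 0.03}

print(pick_answer(weak_scores))                  # still returns a candidate
print(pick_answer_or_abstain(weak_scores, 0.5))  # returns None instead
```

Because every SQuAD passage was guaranteed to contain the answer, a system built like the first function was never penalized for lacking the second.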
The real beauty of human reading comprehension is the ability to read between the lines — connecting concepts, reasoning with ideas, and understanding implied messages that aren’t specifically detailed in the text. It takes practice, but once we learn how to do it, it becomes almost second nature.
So where does that leave AIs? Right now, Alibaba said its technology could be used for “customer service, museum tutorials, and online responses to medical inquiries — in other words, straight factual information that involves no doubt about the correct answer.” Microsoft said it’s using similar programs in its Bing search engine. Maybe some day AIs will be intelligent enough to truly outperform humans in reading comprehension, but right now it’s the stuff of science fiction.
Harwell, Drew. (January 16, 2018). “AI models beat humans at reading comprehension, but they’ve still got a ways to go.” Retrieved from https://www.washingtonpost.com/business/economy/ais-ability-to-read-hailed-as-historical-milestone-but-computers-arent-quite-there/2018/01/16/04638f2e-faf6-11e7-a46b-a3614530bd87_story.html.
Vincent, James. (January 17, 2018). “No, machines can’t read better than humans: Headlines have claimed AIs outperform humans at ‘reading comprehension,’ but in reality they’ve got a long way to go.” Retrieved from https://www.theverge.com/2018/1/17/16900292/ai-reading-comprehension-machines-humans.