Recent progress in automated question answering about facts in Wikipedia articles.
By David Talby.
One day – not next year, but within your lifetime – you will make a call to a customer service hotline and be greeted by:
- “This is Emma speaking from Boise. How may I help you today?”
… and your first thought will be:
- “Oh, great. Human customer service again. Why don’t they just let me speak to a machine?”
When that happens, for better or worse, remember March 2019. That was the month when new academic research results surpassed the human baseline in two widely studied natural language understanding challenges: GLUE and SuperGLUE.
The pace of progress is remarkable: on benchmarks like GLUE and SuperGLUE, the “best” result is usually outdated within 6-8 weeks. The goal of this article is to provide two use case examples that give an intuitive sense of how far we’ve come.
Use case: Passing school exams
The Stanford Question Answering Dataset (SQuAD) is a collection of crowdsourced questions and answers on Wikipedia articles. The questions resemble the reading comprehension exams you took at school. Here is a sample question excerpt about the University of Chicago:
Founded by the American Baptist Education Society with a donation from oil magnate and wealthiest man in history John D. Rockefeller, the University of Chicago was incorporated in 1890; William Rainey Harper became the university’s first president in 1891, and the first classes were held in 1892.
Q: What society founded the University of Chicago?
A: the American Baptist Education Society
Q: What person helped establish the school with a donation?
A: John D. Rockefeller
Q: What year was the university’s first president given his position?
A: 1891
The initial challenge was to train an automated software model on a set of training examples so that it could answer such questions, and then read new, previously unseen articles and answer questions about those. That is: can we train a computer to do reading comprehension?
The challenge was published in 2016 with a human baseline exact match (EM) score of 82.3 (out of 100). Initial automated models did far worse; by the end of 2016, the best model scored just over 70. By the end of 2017, it reached 82.1. By October 2018, it reached 87.4, and as of the end of 2019, it’s at 89.9.
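To make the exact match (EM) score concrete, here is a minimal sketch of how such a metric can be computed. The normalization steps (lowercasing, dropping punctuation and English articles) and the function names are assumptions made for illustration; the official evaluation script may differ in detail.

```python
from __future__ import annotations

import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], gold: list[list[str]]) -> float:
    """Percentage (0-100) of questions answered with an exact match."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return 100.0 * hits / len(gold)


# The three sample questions above, with hypothetical model predictions.
gold = [
    ["the American Baptist Education Society"],
    ["John D. Rockefeller"],
    ["1891"],
]
predictions = ["American Baptist Education Society", "Rockefeller", "1891"]

print(f"{em_score(predictions, gold):.1f}")  # prints 66.7
```

Note that EM is strict: “Rockefeller” alone does not match “John D. Rockefeller”, so a partially correct answer earns no credit.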
Human readers have shown no such progress during these three years. Unfortunately, the reading comprehension of US students has not improved at all in the past 20 years.
What the current scores mean is that state-of-the-art automated systems make a little over half as many errors as humans on this type of exam. The humans in question were adults, based either in the United States or Canada, who answered at least 1,000 questions with a 97% accuracy rate (see the SQuAD paper for details).
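The comparison of error rates follows directly from the EM scores quoted above. A quick sketch of the arithmetic, assuming that every question without an exact match counts as one error:

```python
# EM scores quoted in this article (out of 100).
human_em = 82.3
model_em = 89.9

# Questions missed per 100, treating every non-match as an error.
human_errors = 100 - human_em
model_errors = 100 - model_em

ratio = model_errors / human_errors

print(f"human errors per 100: {human_errors:.1f}")  # 17.7
print(f"model errors per 100: {model_errors:.1f}")  # 10.1
print(f"model/human error ratio: {ratio:.2f}")      # 0.57
```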
Use case: Knowing what you don’t know
In 2018, the Stanford natural language processing (NLP) group published SQuAD 2.0: a new and tougher task. The problem to address was that there were ways to “cheat” the previous test. For example, suppose the previous passage about the University of Chicago were written in German, which you cannot read. You would still be able to recognize dates, people’s names, and place names in the text. So you’d have a one-in-three chance of correctly answering a question starting with “What year…” (there are only three years in the text above), or a one-in-two chance of correctly answering a question starting with “What person…” or “Who…” (only two people are named).
In a sense, software will learn the same tricks school teachers teach as “test taking strategies,” which is cool but doesn’t really reflect reading comprehension.
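The “What year…” shortcut described above can be sketched in a few lines. This is only an illustration of the trick, not a description of how any real model works; the regular expression and the function names are assumptions.

```python
import re

passage = (
    "Founded by the American Baptist Education Society with a donation from "
    "oil magnate and wealthiest man in history John D. Rockefeller, the "
    "University of Chicago was incorporated in 1890; William Rainey Harper "
    "became the university's first president in 1891, and the first classes "
    "were held in 1892."
)


def year_candidates(text):
    """Return distinct four-digit numbers that look like years, in order of appearance."""
    seen = []
    for match in re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", text):
        if match not in seen:
            seen.append(match)
    return seen


def guess_chance(question, text):
    """Chance that a blind guess answers a 'What year...' question; None for other questions."""
    if not question.lower().startswith(("what year", "in what year")):
        return None
    candidates = year_candidates(text)
    return 1 / len(candidates) if candidates else 0.0


print(year_candidates(passage))  # ['1890', '1891', '1892']
print(guess_chance("What year was the university's first president given his position?", passage))
```

No understanding of the sentence is involved: the guesser only matches the question type against the kinds of tokens that appear in the text.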
To address this, the 2.0 version of SQuAD extends the original 100,000 questions with 50,000 additional questions that do not have an answer in the text. The correct answer to those questions is to abstain from answering at all. Here are sample questions about our previous text:
Q: Who is known as the poorest man in all history?
A: <no answer>
Q: In what year did Robert Harper become the first president of the University?
A: <no answer>
These unanswerable questions were crowdsourced as well, with the goal of always being relevant to the article they were about and always having a possible answer (just not stated in the article). Typical types of unanswerable questions use negation, antonyms (rich -> poor), replacing a name (William -> Robert), asking about impossible or mutually exclusive conditions, or replacing verbs (start -> end).
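A small sketch shows why unanswerable questions penalize a system that always produces an answer. It assumes, for illustration only, that abstaining is represented by an empty string and that an unanswerable question has an empty list of gold answers; the predictions are hypothetical.

```python
NO_ANSWER = ""  # assumed convention: an empty string means "abstain"


def is_correct(prediction, gold_answers):
    """Exact-match check; an unanswerable question has no gold answers."""
    pred = " ".join(prediction.lower().split())
    if not gold_answers:
        return pred == NO_ANSWER
    return pred in {" ".join(g.lower().split()) for g in gold_answers}


# (question, gold answers) pairs taken from the examples in this article.
examples = [
    ("What society founded the University of Chicago?",
     ["the American Baptist Education Society"]),
    ("Who is known as the poorest man in all history?", []),
    ("In what year did Robert Harper become the first president of the University?", []),
]


def em_with_abstention(predictions):
    """Percentage of questions answered (or correctly declined) exactly."""
    hits = sum(is_correct(p, gold) for p, (_, gold) in zip(predictions, examples))
    return 100.0 * hits / len(examples)


# Hypothetical outputs of two systems.
always_answers = ["the American Baptist Education Society", "John D. Rockefeller", "1891"]
abstains_when_unsure = ["the American Baptist Education Society", NO_ANSWER, NO_ANSWER]

print(f"{em_with_abstention(always_answers):.1f}")        # prints 33.3
print(f"{em_with_abstention(abstains_when_unsure):.1f}")  # prints 100.0
```

The first system falls for the name swap (William -> Robert) and the antonym (wealthiest -> poorest), which is exactly the behavior the new questions are designed to expose.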
While this was initially a harder challenge, it turned out to have a shorter shelf life than its predecessor. The human performance baseline for the SQuAD 2.0 challenge, published in June 2018, was an EM score of 86.8. In March 2019, the first model to score above 87 was published. In November 2019, the first model to score above 90 was published.
Reading comprehension: Still improving
Researchers are still working to produce improved models, and there’s no indication that we’ve hit a wall yet: new models still reach the top of the SQuAD Leaderboard every few weeks. It will be interesting to see when the current crop of deep learning NLP techniques will hit a natural limit.
Meanwhile, other researchers are hard at work building more difficult language understanding challenges. The next article in this series will describe the subsequent question answering challenge, which was published in August 2018 (and, incidentally, was beaten for the first time in March 2019).