Recent progress in automated conversational question answering, with natural-sounding answers that follow the flow of a conversation.
By David Talby.
Part one of this series showed recent progress in automated question answering about facts in Wikipedia articles. You may have noticed three ways in which “real-world” reading comprehension is harder than that:
- Wikipedia is professionally edited to have correct grammar, correct spelling, and a neutral tone. That’s not the case in text coming from social media, opinion essays, conversations, or stories.
- Using only factual questions to evaluate a reader does not require them to understand emotions, situations, and behavior.
- Humans gather information by engaging in conversations, using a series of connected questions and answers. Readers should be able to answer questions in context.
With the goal of accelerating research in this space, in August 2018, Stanford’s natural language processing (NLP) group published CoQA: a Conversational Question Answering dataset.
Spoiler alert: it took 7 months and 8 days from publishing the paper until an automated system beat the human performance baseline.
Use case: Conversations and practical logic
The goal of CoQA was to require software to show pragmatic logic that goes beyond lexical understanding of text, reply with natural-sounding answers, and answer in context while tracking the flow of a conversation. It was also important that models generalize beyond just one type of text (like Wikipedia), so the 127,000 questions and answers from 8,000 conversations were crowdsourced from seven domains:
- Children’s stories from MCTest (Richardson, et al., 2013)
- Literature from Project Gutenberg
- Middle and high school English exams from RACE (Lai, et al., 2017)
- News articles from CNN (Hermann, et al., 2015)
- Articles from Wikipedia
- Reddit articles from the Writing Prompts dataset (Fan, et al., 2018)
- Science articles from AI2 Science Questions (Welbl, et al., 2017)
People writing the questions were asked not to reuse exact words from the passage. People answering the questions were encouraged to first highlight the sentence that contains the answer, and then write a free-form answer. In the example below, we ignore the highlighted sentences (although marking them correctly is a required part of the challenge) and include the four allowed answers, each given by a different person, for each question.
An example: Understanding behaviors and feelings
Here’s an excerpt from a short fictional story and four of the CoQA questions about it:
- To both the girl and the dog's surprise, there was a small brown bear resting in the bushes. The bear was not surprised and did not seem at all interested in the girl and her dog. The bear looked up at the girl and it was almost as if he was smiling at her. He then rested his head on his bear paws and went back to sleep. The girl and the dog kept walking and finally made it out of the woods.
Q How did the girl and the dog feel?
A surprised, surprised, confused, surprised
Q How did the bear react?
A not surprised, friendly, looked at the girl like he was smiling, smiled and went back to sleep
Q What did he do?
A Looked at the girl, looked up at the girl and it was almost as if he was smiling at her, went back to sleep, looked up at the girl
Q Was he mean?
A no, No, no, He smiled
From a language understanding perspective, there’s a lot going on here:
- “How did they feel?”–pragmatic logic dictates that surprise is a kind of feeling, so it applies here. Note that such simple deductions were not required in SQuAD.
- “How did the bear react?”–that is almost an open question, as evidenced by the different answers the four human answerers gave. The reader should know that “react” refers to a response or behavior.
- “What did he do?”–the reader should know that “he” refers to the bear, given the context of the previous question. It’s a natural follow-up question. The reader should also infer that since this is a fictional story, “he” is an acceptable pronoun for a bear here (“she” and “it” would have been possible, too).
- “Was he mean?”–the text of the story doesn’t mention the word “mean,” its synonyms, or its antonyms. Common sense implies that smiling is not a mean reaction.
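To make the role of context concrete, here is a minimal sketch of one way to represent a conversation and to build the input for a follow-up question by prepending earlier turns. The `Turn` class, the `build_model_input` function, and the `Q:`/`A:` text layout are hypothetical names chosen for illustration; they are not the CoQA file format or the method of any particular published model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    question: str
    answers: List[str]  # free-form answers written by different people


def build_model_input(passage: str, turns: List[Turn], index: int, history: int = 2) -> str:
    """Return the passage, the last `history` Q/A pairs, and the current question."""
    start = max(0, index - history)
    lines = [passage]
    for turn in turns[start:index]:
        # Assumption: the first listed answer stands in for the conversation history.
        lines.append(f"Q: {turn.question}")
        lines.append(f"A: {turn.answers[0]}")
    lines.append(f"Q: {turns[index].question}")
    return "\n".join(lines)


passage = (
    "The bear looked up at the girl and it was almost as if he was smiling at her. "
    "He then rested his head on his bear paws and went back to sleep."
)
turns = [
    Turn("How did the girl and the dog feel?",
         ["surprised", "surprised", "confused", "surprised"]),
    Turn("How did the bear react?",
         ["not surprised", "friendly", "looked at the girl like he was smiling",
          "smiled and went back to sleep"]),
    Turn("What did he do?",
         ["Looked at the girl", "went back to sleep"]),
    Turn("Was he mean?", ["no", "No", "no", "He smiled"]),
]

# With one turn of history, "he" in the third question can be tied to the bear.
prompt = build_model_input(passage, turns, index=2, history=1)
print(prompt)
```

Without the previous turn, “What did he do?” has no referent for “he”; with it, the question about the bear sits directly above the follow-up.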
Providing natural responses
Another requirement of CoQA answers is that they sound natural. This often means that “copying and pasting” a response from the passage isn’t possible. Here are a few examples of questions, the passages that contain the answer, and an acceptable answer:
Question: Is it played outside?
Passage: … AFL is the highest level of professional indoor American football
Answer: No
Question: How many languages is it offered in?
Passage: The service provides curated consumer health information in English
Question: What did she try just before that?
Passage: She would give her baby sister one of her toy horses.
Answer: She gave her a toy horse
Question: What else do they get for their work?
Passage: ...paid well, both in potatoes, carrots.
Answer: Potatoes and carrots
Question: Who was messing up the neighborhoods?
Passage: ...vandalism in the neighborhoods
Answer: Vandals
Until as recently as five years ago, each of these questions belonged to a separate academic pursuit. Making the leap from “who does vandalism?” to “vandals” is a question of morphology. Other questions draw on syntax, semantics, and pragmatics in linguistics, or on knowledge representation, reasoning, ontologies, machine learning, and search in computer science. Recent advances in deep learning and transfer learning have produced breakthroughs in what is achievable in natural language understanding (NLU).
When CoQA was first published in August 2018, the baseline automated system had an F1 score of 65.4%, well below the human performance of 88.8%. In March 2019, a published model surpassed human performance with a score of 89.4%, and as of the end of 2019 the best-performing model scores 90.7%.
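For readers unfamiliar with F1 on free-form answers, the sketch below shows a simplified word-overlap F1 that scores a prediction against several human answers and keeps the best match. It is an illustration only: the normalization rules and the `best_f1` helper are assumptions, and the official CoQA evaluation handles multiple references and aggregation in its own way.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Lowercase, drop punctuation and English articles, and split into words."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, references: List[str]) -> float:
    """Score against every reference answer and keep the highest F1."""
    return max(token_f1(prediction, ref) for ref in references)


references = ["no", "No", "no", "He smiled"]  # answers to "Was he mean?"
print(best_f1("No.", references))               # exact match after normalization
print(best_f1("he smiled at her", references))  # partial credit for extra words
print(best_f1("yes", references))               # no overlap with any reference
```

Partial credit is what lets differently worded but overlapping answers, such as the four human answers to “How did the bear react?”, each earn some score.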
This means that the current best model makes roughly 17% fewer mistakes than a (native-speaking, adult, vetted) human. Another major point to consider is that automated models will keep improving, since this is a new and very active area of research. Improving human performance is tied to improving the education system and general population literacy, a far more difficult undertaking.
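The 17% figure can be reproduced from the scores quoted above, under the assumption that “mistakes” means 100 minus the F1 score. The function name below is hypothetical and used only for this worked example.

```python
def relative_error_reduction(baseline_score: float, new_score: float) -> float:
    """Fraction by which the error rate (100 - score) shrinks."""
    baseline_error = 100.0 - baseline_score
    new_error = 100.0 - new_score
    return (baseline_error - new_error) / baseline_error


human_f1 = 88.8       # human performance
best_model_f1 = 90.7  # best model at the end of 2019

# Error rates are about 11.2 (human) and 9.3 (model) points.
reduction = relative_error_reduction(human_f1, best_model_f1)
print(f"{reduction:.1%}")  # prints 17.0%
```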
If you’re interested in learning more about NLP benchmarks, take a look at a third breakthrough in natural language processing from 2019: mastering of the GLUE benchmark. Humans started 2019 as the top performers in this challenge and finished the year outside the top 10.