Gradient Flow

An intuitive overview of recent advances in automated reading comprehension, Part II

Recent progress in automated conversational question answering, with natural-sounding answers that follow the flow of a conversation.

By David Talby.

[Part I: Recent progress in automated question answering about facts in Wikipedia articles.]

Part one of this series showed recent progress in automated question answering about facts in Wikipedia articles. You may have noticed three ways in which “real-world” reading comprehension is harder than that:

  1. Wikipedia is professionally edited to have correct grammar, correct spelling, and a neutral tone. That’s not the case in text coming from social media, opinion essays, conversations, or stories.
  2. Evaluating a reader with only factual questions does not require them to understand emotions, situations, and behavior.
  3. Humans gather information by engaging in conversations, using a series of connected questions and answers. Readers should be able to answer questions in context.

With the goal of accelerating research in this space, in August 2018, Stanford’s natural language processing (NLP) group published CoQA: a Conversational Question Answering dataset. 

Spoiler alert: it took 7 months and 8 days from publishing the paper until an automated system beat the human performance baseline.


Use case: Conversations and practical logic

The goal of CoQA was to require software to show pragmatic reasoning that goes beyond a lexical understanding of text; to reply with natural-sounding answers; and to answer in context while tracking the flow of a conversation. It was also important that models generalize beyond a single type of text (like Wikipedia), so the 127,000 questions and answers from 8,000 conversations were crowdsourced from seven domains:

  1. Children’s stories from MCTest (Richardson et al., 2013)
  2. Literature from Project Gutenberg
  3. Middle and high school English exams from RACE (Lai et al., 2017)
  4. News articles from CNN (Hermann et al., 2015)
  5. Articles from Wikipedia
  6. Reddit articles from the Writing Prompts dataset (Fan et al., 2018)
  7. Science articles from AI2 Science Questions (Welbl et al., 2017)

People writing the questions were asked not to reuse exact words from the passage they were asking about. People answering were encouraged to first highlight the sentence containing the answer, and then write a free-form answer. In the example below, we ignore the highlighted sentences (although marking them correctly is a required part of the challenge) and include the four accepted answers, each written by a different person, for each question.

An example: Understanding behaviors and feelings

Here’s an excerpt from a short fictional story and four of the CoQA questions about it:

From a language understanding perspective, there’s a lot going on here:

Providing natural responses

Another requirement from CoQA answers is to provide natural-sounding responses. This often means that “copying and pasting” responses from the passage isn’t possible. Here are a few examples of questions, the passages that contain the answer, and an acceptable correct answer:

Until as recently as five years ago, each of these questions was a separate academic pursuit. Making the leap from “who does vandalism?” to “vandals” is a question of morphology. Other relevant fields of study include syntax, semantics, and pragmatics in linguistics, along with knowledge representation, reasoning, ontologies, machine learning, and search in computer science. Recent advances in deep learning and transfer learning have produced breakthrough leaps in what is achievable in natural language understanding (NLU).

What’s next

When first published in August 2018, the CoQA baseline automated system had an F1 score of 65.4%, well below human performance of 88.8%. In March 2019, a published model surpassed human performance with a score of 89.4%, and by the end of 2019 the best-performing model stood at 90.7%.
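The F1 scores above measure word-level overlap between a predicted answer and the human-written reference answers, in the same family of metrics popularized by SQuAD. Here is a minimal sketch of that idea; it is simplified, omitting CoQA's text normalization and its averaging over multiple reference answers per question:

```python
from collections import Counter

def f1_score(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer (simplified)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Count tokens that appear in both answers (multiset intersection).
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# An exact match scores 1.0; a longer answer containing the reference scores less.
print(f1_score("vandals", "vandals"))      # 1.0
print(f1_score("the vandals", "vandals"))  # ~0.67
```

This is why free-form answers can still be scored automatically: a response need not match a reference word for word, but it is rewarded for sharing its vocabulary.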

This means that, currently, the best model makes roughly 17% fewer mistakes than a (native-speaking, adult, vetted) human. Automated models will also keep improving, since this is a new and very active area of research. Improving human performance, by contrast, is tied to improving the education system and general literacy, a far more difficult undertaking.
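The "roughly 17% fewer mistakes" figure follows directly from the reported F1 scores, treating 100 minus F1 as an error rate:

```python
# Error rates implied by the reported CoQA F1 scores.
human_f1 = 88.8
model_f1 = 90.7

human_error = 100 - human_f1  # 11.2
model_error = 100 - model_f1  # 9.3

# Relative reduction in mistakes: (11.2 - 9.3) / 11.2 ≈ 0.17
reduction = (human_error - model_error) / human_error
print(f"{reduction:.0%}")  # 17%
```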

If you’re interested in learning more about NLP benchmarks, take a look at a third 2019 breakthrough in natural language processing: the mastering of the GLUE benchmark. Humans started 2019 as the top performers on that challenge and ended the year outside the top 10.
