AREX

AMI Requests for Explanations and Relevance Judgments for their Answers.

Abstract

The AREX dataset was designed to evaluate a question answering system to be used during meetings. The dataset contains 74 excerpts of the AMI Meeting Corpus, with a request for explanation inserted at the end of each excerpt, targeting an acronym mentioned in the excerpt (e.g., “I need more information about RSI”). The goal of the system is to retrieve Wikipedia pages that enable the users to find more information about the acronym and the correct underlying notion in the context of the meeting. The dataset provides relevance judgments from human judges for sets of about thirty Wikipedia pages retrieved for each request by a pool of four systems. Moreover, an evaluation metric comparing two sets of answers based on the human judgments is provided, though other evaluation strategies are possible too.

Description

The AREX dataset contains the relevance judgments for sets of documents retrieved as an answer to requests for explanation (such as “Tell me more about NTSC”) during group meetings recorded in the AMI Meeting Corpus. Since the number of requests for explanation that occur naturally in the AMI Meeting Corpus is relatively small, we created new requests using the following procedure. We identified sentences containing an acronym X, and appended to them a request such as “I want more information about X”. We expect a computer system to return in real time a set of Wikipedia pages providing the requested information, including the correct definition of the acronym (based on the meeting context) and additional information about the concept.

The seven following acronyms were searched for in the corpus: LCD, VCR, PCB, RSI, TFT, NTSC, and IC (their correct definitions in context are given in the Appendix below). These were selected because they are related to the domain of remote controls, and the AMI Meeting Corpus contains conversations on designing remote controls.

Based on the occurrences of these acronyms in various places of the AMI Meeting Corpus, the AREX dataset consists of 74 different conversation fragments (for which only the timing, but not the actual words, is given), each followed by the request for explanation (or query) that was created at its end. For each query, we used four different systems (described in the Appendix) to retrieve potentially relevant articles from the English Wikipedia. Merging the lists of top-10 results and discarding duplicates, we found that each fragment had at least 31 different document candidates, and we decided to keep this number constant for all fragments.
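
The pooling step described above can be sketched as follows. This is only an illustration, not the procedure actually used to build AREX: the function name pool_results, the round-robin order in which the four lists are merged, and the truncation of the pool to its first 31 entries are all assumptions, since the description does not state how the 31 candidates were chosen when more were available.

```python
from typing import List, Sequence


def pool_results(ranked_lists: Sequence[Sequence[str]],
                 depth: int = 10,
                 pool_size: int = 31) -> List[str]:
    """Merge the top-`depth` results of several systems, dropping duplicates.

    ASSUMPTION: lists are interleaved rank by rank (round-robin) and the
    pool is cut to `pool_size` entries; the source does not specify this.
    """
    pooled: List[str] = []
    seen = set()
    for rank in range(depth):
        for results in ranked_lists:
            if rank < len(results):
                doc = results[rank]
                if doc not in seen:      # discard duplicates across systems
                    seen.add(doc)
                    pooled.append(doc)
    return pooled[:pool_size]
```

With four systems and a depth of 10, the pool holds at most 40 distinct documents before truncation.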

The relevance judgments were made for each list of retrieved documents by human judges recruited through the Amazon Mechanical Turk crowdsourcing platform. The relevance value of each document was assessed by showing a human subject, in a web browser, the transcript of a conversation fragment, followed by a control question about its content, the request for explanation, and the list of 31 documents (Wikipedia pages). The judges had to read the conversation transcript, answer the control question, and then decide on the relevance value of each document by selecting one of the following three options: ‘irrelevant’, ‘somewhat relevant’, or ‘relevant’.

We collected judgments over a large number of subjects and conversation fragments. The evaluation of each of the 74 conversation fragments (with 31 documents per fragment) was crowdsourced via Amazon’s Mechanical Turk as a “human intelligence task” (HIT). For each HIT we recruited ten workers. The average time spent per HIT was around 90 seconds (hence about 3 seconds per document, to read its title and decide on its relevance). For qualification control, we only accepted workers with an approval rate above 95% and with more than 1000 previously approved HITs. We only kept answers from the workers who correctly answered our control questions about each HIT, and these answers (total per option for each fragment and query) are provided in the AREX dataset.

Format of AREX dataset files

Each file corresponds to a fragment of the AMI Meeting Corpus, indicated using the meeting name (AMI codes) and the timing (start and end time) included in the file name. Then, the first line indicates the acronym for which a request for explanation is made (at the end of the conversation fragment), using the following format: “Query: I want more information about X.”

The second line consists of the timing of the conversation fragment before this query from the AMI Meeting Corpus (around 400 words) as follows: “history: [startTime-endTime]”. The actual transcript of the conversation from the AMI Corpus is not disclosed, as it requires a separate license; the AMI Corpus can be obtained from http://corpus.amiproject.org.

The rest of each file is in tabular format, with one line per document (NB: the first line of the table, hence the third line of the file, is the header of the table). The first three columns give the total number of judgments for each relevance option, and the fourth column names the document. The four columns are:

"irrel." – number of workers who considered the document in the 4th column "irrelevant".
"SW rel." – number of workers who considered the document "somewhat relevant".
"rel." – number of workers who considered the document "relevant".
"doc" – the name of the document (Wikipedia article).

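The file layout described above could be read with a sketch like the following. The class and function names (Judgment, ArexFile, parse_arex) are hypothetical, and two points are assumptions: that blank lines carry no meaning, and that the three counts on each table line are separated by whitespace (tabs or spaces) with the rest of the line being the document name.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Judgment:
    irrelevant: int          # workers who chose 'irrelevant'
    somewhat_relevant: int   # workers who chose 'somewhat relevant'
    relevant: int            # workers who chose 'relevant'
    doc: str                 # name of the Wikipedia article


@dataclass
class ArexFile:
    query: str               # text after "Query:"
    history: str             # text after "history:", e.g. "[startTime-endTime]"
    judgments: List[Judgment]


def parse_arex(text: str) -> ArexFile:
    """Parse the content of one AREX file (ASSUMED whitespace-separated counts)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    query = lines[0].split(":", 1)[1].strip()
    history = lines[1].split(":", 1)[1].strip()
    judgments = []
    for row in lines[3:]:    # lines[2] is the table header and is skipped
        # First three tokens are the counts; the remainder is the document name.
        irrel, sw, rel, doc = row.split(None, 3)
        judgments.append(Judgment(int(irrel), int(sw), int(rel), doc.strip()))
    return ArexFile(query, history, judgments)
```

Splitting only on the first three whitespace runs keeps document names containing spaces intact.
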
Evaluating a question answering system using the AREX dataset

The goal of the AREX dataset is to allow replicable evaluation of systems that are capable of answering the requests for explanation for each conversation fragment, over Wikipedia articles. By pooling a selection of retrieval systems and providing human judgments of relevance, we designed a TREC-like resource. Several metrics can be applied using the judged Wikipedia pages as ground truth (e.g. recall and precision at N). However, we propose a more nuanced metric which compares two lists of Wikipedia articles and indicates which one is “better” in terms of proximity with the human judgments.
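
As an illustration of the simpler metrics mentioned above (not of the proposed comparison metric), precision and recall at N could be computed from the judgment counts as sketched below. The rule that turns the three counts into a binary ground truth is an assumption made for this example only: a document is treated as relevant when the 'relevant' and 'somewhat relevant' votes together outnumber the 'irrelevant' votes. The function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (irrelevant, somewhat relevant, relevant) counts for one document
Counts = Tuple[int, int, int]


def relevant_set(judgments: Dict[str, Counts]) -> Set[str]:
    """ASSUMED binarization: relevant if (somewhat relevant + relevant) > irrelevant."""
    return {doc for doc, (irrel, sw, rel) in judgments.items() if sw + rel > irrel}


def precision_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n retrieved documents that are relevant."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc in ranked[:n] if doc in relevant)
    return hits / n


def recall_at_n(ranked: List[str], relevant: Set[str], n: int) -> float:
    """Fraction of all relevant documents found in the top-n retrieved ones."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:n] if doc in relevant)
    return hits / len(relevant)
```

Other binarization rules (for instance, counting only 'relevant' votes) are equally possible and would change the scores.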

The Matlab code to compare two retrieval results is provided with the AREX dataset. This code compares two sets of retrieved documents obtained in two different ways. To perform the comparison, run MainFile.m, which requires the following information:

"query": the path of a directory containing one file per acronym, each listing the names of the dataset files that correspond to that acronym
"rootRef": the path of the directory containing the dataset described above
"rootDoc1", "rootDoc2": the paths of the two directories containing the names of the documents retrieved by the two methods to be compared.

Appendix

Brief description of the four versions of the system. The assessed documents were retrieved from the English Wikipedia using the Apache Lucene search engine, with four different types of queries. The four query types used for retrieval are as follows:

the acronym appearing in the request for explanations;
the acronym from the request, plus keywords extracted from the conversation fragment, with equal weights;
the acronym from the request, plus keywords extracted from the conversation fragment, weighted in proportion to their topical similarity to the acronym (using LDA over Wikipedia);
keywords extracted from the conversation fragment, but not the acronym.
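
To make the four query types concrete, the sketch below builds one query string per type. It is an illustration under several assumptions, not the code used to build AREX: the function name build_queries is hypothetical, keywords are assumed to be single words, the similarity weights are assumed to be given as numbers, and the weights are expressed with the term^boost notation of the Lucene classic query parser. How the acronym itself was weighted in the third query type is not stated, so it is left unboosted here.

```python
from typing import Dict, List


def build_queries(acronym: str, keyword_weights: Dict[str, float]) -> List[str]:
    """Return the four query strings for one request.

    `keyword_weights` maps each keyword extracted from the fragment to its
    (ASSUMED, precomputed) topical similarity to the acronym.
    """
    keywords = list(keyword_weights)
    acronym_only = acronym
    equal_weights = " ".join([acronym] + keywords)
    # term^boost: boosting notation of the Lucene classic query parser
    weighted = " ".join(
        [acronym] + [f"{kw}^{w:.2f}" for kw, w in keyword_weights.items()]
    )
    keywords_only = " ".join(keywords)
    return [acronym_only, equal_weights, weighted, keywords_only]
```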

Definitions of the seven acronyms in the senses in which they appear in the AMI Corpus Meetings (text and URL of Wikipedia page, excluding redirects or disambiguation pages):

LCD: Liquid crystal display
VCR: Videocassette recorder
PCB: Printed circuit board
RSI: Repetitive strain injury
TFT: Thin-film transistor liquid crystal display
NTSC: National Television System Committee
IC: Integrated circuit

Please contact us for any additional information.