TableArXiv: Dataset for Scientific Table Search

Kyle Yingkai Gao and Jamie Callan
Tian Tian and Ed Hovy

Dataset Overview

This dataset was created to support research on scientific table retrieval and was used in the project A Method to Retrieve Non-Textual Data From Widespread Repositories. It contains 341,573 tables extracted from physics e-prints on arXiv.org, along with 105 information needs and corresponding relevance judgements in TREC format. arXiv's policy does not permit us to redistribute arXiv articles; please obtain the original article data through arXiv Bulk Data Access.

Download TableArXiv

Version	Link
TableArXiv v1.0	download

Information Need

Following TREC format, each information need contains four fields: query identifier, description, narrative, and query. Inspired by work on IR task design, we add an extra field, category, to indicate the task category of the information need. The role of each field is as follows:

Query Identifier (QID): An integer that uniquely identifies the information need;
Description: A detailed description of the information need written in natural language;
Narrative: The criteria for a table to be relevant or not relevant;
Query: The actual query issued to search engines;
Category: One of (A single fact, List of facts, Comparison, Summary, Other), indicating the task category of the information need.
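As a minimal sketch, the five fields above could be held in a small Python record such as the one below. The class name, attribute names, and sample values are our own illustrative assumptions; they are not the on-disk format of the dataset files.

```python
from dataclasses import dataclass

# The five task categories listed above.
CATEGORIES = ("A single fact", "List of facts", "Comparison", "Summary", "Other")


@dataclass(frozen=True)
class InformationNeed:
    """Hypothetical in-memory record for one information need."""

    qid: int          # Query Identifier (QID)
    description: str  # natural-language description of the need
    narrative: str    # criteria for a table to be relevant or not
    query: str        # the actual query issued to search engines
    category: str     # one of CATEGORIES

    def __post_init__(self) -> None:
        # Reject categories outside the documented set.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


# Made-up example values, for illustration only.
example_need = InformationNeed(
    qid=1,
    description="Find the measured mass of a given particle.",
    narrative="Relevant tables report the measured mass.",
    query="particle mass measurement",
    category="A single fact",
)
```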

Relevance Judgements

We recruited 8 students (from undergraduates to PhD candidates) majoring in Physics or Physics-related fields to compose information needs in TREC format. For each information need, students assessed a pooled list of at most 100 tables returned by 8 simple ranking algorithms: one bag-of-words ranker for each table field, plus one bag-of-words ranker that treated tables as full-text documents. The order of the result list was randomized to eliminate ordering bias, and assessors were reminded frequently that the list was in random order. For each table, assessors chose a rating on a 4-point scale to describe its relevance to their information need. We removed duplicated information needs, those whose queries contained unsupported Unicode characters, and those that had no relevant results according to the assessor. The relevance scale is given below:

Key (3): The table represents the information entirely named by the query; the user may be searching for this specific table;
HRel (2): The table provides comprehensive information on the topic;
Rel (1): The table provides some information on the topic, which may be minimal;
Non (0): The table does not provide useful information on the topic.

The format of the relevance judgements is compatible with trec_eval, the standard evaluation script for TREC tasks. Each line is a judgement in the format
QID Q0 DOCNO REL
where QID and REL are the query identifier and relevance score as defined above, and DOCNO is the table identifier defined below.
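The line format above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset distribution; the function name and the sample lines (including their DOCNO values) are made up for illustration, and in practice the lines would come from the judgement file in the download.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Label names from the relevance scale above.
REL_LABELS = {3: "Key", 2: "HRel", 1: "Rel", 0: "Non"}


def parse_qrels(lines: Iterable[str]) -> Dict[int, Dict[str, int]]:
    """Map each QID to a {DOCNO: REL} dictionary."""
    qrels: Dict[int, Dict[str, int]] = defaultdict(dict)
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split()
        if len(parts) != 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(parts)}")
        qid, _second_field, docno, rel = parts  # second field is unused here
        qrels[int(qid)][docno] = int(rel)
    return dict(qrels)


# Made-up sample lines, for illustration only.
sample_lines = [
    "1 Q0 hep-th/0001001.0 3",
    "1 Q0 hep-th/0001002.1 0",
    "2 Q0 hep-th/0001001.0 1",
]
sample_qrels = parse_qrels(sample_lines)
```

Keying the result by QID first makes it easy to look up all judged tables for one information need, which is the access pattern most evaluation code needs.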

Tips on Downloading Data from arXiv.org

We encourage obtaining arXiv data through bulk data access on Amazon S3; a help document is available. The source files are grouped into chunk files of ~500MB each, and the complete list of chunks is provided in a manifest file. To reproduce the table corpus used by TableArXiv, please compare the latest manifest file with ours.
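One way to compare two manifest files is sketched below. It assumes the manifest is XML with one `file` element per chunk, each holding a `filename` and an `md5sum` child; these element names are an assumption about the manifest layout and should be checked against the actual file. The function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def load_manifest(xml_text: str) -> Dict[str, str]:
    """Return {chunk filename: md5sum} for every <file> entry (assumed layout)."""
    root = ET.fromstring(xml_text)
    chunks: Dict[str, str] = {}
    for node in root.iter("file"):
        name = node.findtext("filename")
        if name is not None:
            chunks[name] = node.findtext("md5sum", default="")
    return chunks


def compare_manifests(reference: Dict[str, str],
                      latest: Dict[str, str]) -> Dict[str, List[str]]:
    """List chunks that are missing, changed, or added relative to the reference."""
    ref_names, new_names = set(reference), set(latest)
    return {
        "missing": sorted(ref_names - new_names),
        "changed": sorted(n for n in ref_names & new_names
                          if reference[n] != latest[n]),
        "added": sorted(new_names - ref_names),
    }


# Made-up manifests, for illustration only.
reference_xml = (
    "<manifest>"
    "<file><filename>src/a.tar</filename><md5sum>111</md5sum></file>"
    "<file><filename>src/b.tar</filename><md5sum>222</md5sum></file>"
    "</manifest>"
)
latest_xml = (
    "<manifest>"
    "<file><filename>src/a.tar</filename><md5sum>111</md5sum></file>"
    "<file><filename>src/b.tar</filename><md5sum>999</md5sum></file>"
    "<file><filename>src/c.tar</filename><md5sum>333</md5sum></file>"
    "</manifest>"
)
manifest_diff = compare_manifests(load_manifest(reference_xml),
                                  load_manifest(latest_xml))
```

Chunks reported as "missing" or "changed" would yield a corpus that differs from ours, while "added" chunks were presumably published after our snapshot and can be skipped.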

Each table in the dataset is given a unique identifier (DOCNO). The naming convention is as follows:
ARXIV_ID.TABLE_ID
where ARXIV_ID is the paper ID assigned by arXiv and TABLE_ID is the position of the table in the paper, counting from 0. For example, 'astro-ph/0304562.4' represents the fifth table in document 'astro-ph/0304562'.
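A DOCNO can be split back into its two parts as sketched below. The function name is our own, and the sketch assumes TABLE_ID is always a plain decimal integer. Splitting on the last dot matters because some arXiv paper IDs contain a dot themselves.

```python
from typing import Tuple


def split_docno(docno: str) -> Tuple[str, int]:
    """Split 'ARXIV_ID.TABLE_ID' into (ARXIV_ID, TABLE_ID)."""
    # rpartition splits on the LAST dot, so dots inside ARXIV_ID are preserved.
    arxiv_id, sep, table_id = docno.rpartition(".")
    if not sep or not arxiv_id or not (table_id.isascii() and table_id.isdigit()):
        raise ValueError(f"not a valid DOCNO: {docno!r}")
    return arxiv_id, int(table_id)


arxiv_id, table_index = split_docno("astro-ph/0304562.4")
# table_index counts from 0, so index 4 is the fifth table of the paper.
```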

This research is sponsored by National Science Foundation grant IIS-1450545. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.

Updated on January 28, 2016

Kyle Gao