 
TableArXiv: Dataset for Scientific Table Search
Kyle Yingkai Gao, Jamie Callan, Tian Tian, and Ed Hovy
This dataset was created to support research on scientific table retrieval, and it was used in the project "A Method to Retrieve Non-Textual Data From Widespread Repositories". It contains 341,573 tables extracted from physics e-prints on arXiv.org, 105 information needs, and the corresponding relevance judgements in TREC format. arXiv's policy does not permit us to distribute arXiv articles; please obtain the original article data from arXiv Bulk Data Access instead.
| Version | Link | 
|---|---|
| TableArXiv v1.0 | download | 
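The relevance judgements mentioned above are in TREC format. As a minimal sketch of how such a file might be read, the following assumes the standard four-column TREC qrels layout (topic identifier, iteration, document identifier, relevance grade, separated by whitespace). The function name `parse_qrels` and the sample lines are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines into {topic_id: {docno: grade}}.

    Assumes four whitespace-separated columns per line:
    topic id, iteration (ignored), document id, relevance grade.
    """
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        topic_id, _iteration, docno, grade = parts
        qrels[topic_id][docno] = int(grade)
    return dict(qrels)


if __name__ == "__main__":
    # Made-up lines for illustration only; real grades come from the dataset.
    sample = [
        "1 0 astro-ph/0304562.4 2",
        "1 0 astro-ph/0304562.0 0",
    ]
    for topic, judged in parse_qrels(sample).items():
        print(topic, judged)
```

In practice the lines would come from the judgement file shipped with the download, for example by passing an open file object to `parse_qrels`.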
Following TREC format, each information need contains 4 fields: query identifier, description, narrative, and query. Inspired by prior work on IR task design, we add an extra field, category, to indicate the task category of the information need. The role of each field is given below.
We recruited 8 students (from undergraduates to PhD candidates) majoring in Physics or Physics-related fields to compose information needs in TREC format. For each information need, students assessed a pooled list of at most 100 tables returned by 8 simple ranking algorithms: one bag-of-words ranker for each table field, plus another bag-of-words ranker that treated tables as full-text documents. The order of the result list was randomized to eliminate ordering bias, and assessors were reminded frequently that the list was in random order. For each table, assessors chose a rating on a 4-point scale to describe its relevance to their information need. We removed information needs that were duplicated, queries that contained unsupported Unicode characters, and information needs that had no relevant results according to the assessor. The relevance scale is given below.
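The pooling step described above (merge the results of several rankers, cap the pool at 100 tables, shuffle before assessment) can be sketched as follows. This is only an illustration: the source does not say how the cap was enforced, so the round-robin merge used here is an assumption, and `build_pool` and its parameters are hypothetical names.

```python
import random


def build_pool(rankings, max_size=100, seed=None):
    """Merge ranked lists into one de-duplicated, shuffled assessment pool.

    rankings: list of ranked lists of table identifiers, one per ranker.
    Assumption: documents are taken round-robin by rank (all rank-1 results,
    then all rank-2 results, ...) until max_size distinct tables are pooled.
    """
    pool, seen = [], set()
    depth = max((len(r) for r in rankings), default=0)
    for rank in range(depth):
        for ranking in rankings:
            if len(pool) >= max_size:
                break
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                pool.append(ranking[rank])
    # Shuffle so that position in the list carries no relevance signal.
    random.Random(seed).shuffle(pool)
    return pool
```

Passing a fixed `seed` makes the shuffled order reproducible, which may be convenient when the same pool has to be shown to an assessor more than once.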
Bulk access to arXiv data through Amazon S3 is encouraged, and a help document is available. The source files are grouped into chunk files of ~500MB each. The complete list of all chunks is provided in a manifest file. To reproduce the table corpus used by TableArXiv, please compare the latest manifest file with ours.
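One possible way to compare two manifest files is sketched below. It assumes each manifest is an XML document whose `file` elements carry `filename` and `md5sum` children; if the manifests you obtain use a different layout, adjust the element names. The helper names `load_manifest` and `diff_manifests` are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_manifest(xml_text):
    """Map chunk filename -> md5sum for every <file> entry in a manifest.

    Assumption: entries look like
    <file><filename>...</filename><md5sum>...</md5sum>...</file>.
    """
    root = ET.fromstring(xml_text)
    chunks = {}
    for entry in root.iter("file"):
        name = entry.findtext("filename")
        if name is not None:
            chunks[name] = entry.findtext("md5sum")
    return chunks


def diff_manifests(reference, latest):
    """Compare two filename -> md5sum maps.

    Returns chunks missing from the latest manifest, chunks whose checksum
    changed, and chunks that exist only in the latest manifest.
    """
    return {
        "missing": sorted(set(reference) - set(latest)),
        "changed": sorted(
            name for name in reference
            if name in latest and reference[name] != latest[name]
        ),
        "added": sorted(set(latest) - set(reference)),
    }
```

Chunks reported as `added` were presumably published after the corpus was built and can be skipped; chunks reported as `missing` or `changed` indicate that the downloaded data may differ from the data the tables were extracted from.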
Each table in the dataset is given a unique identifier (DOCNO). The naming convention is as follows:
ARXIV_ID.TABLE_ID
where ARXIV_ID is the distinct paper ID provided by arXiv and TABLE_ID is the position of the table in the paper (starting from 0). For example, 'astro-ph/0304562.4' represents the fifth table in document 'astro-ph/0304562'.
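A DOCNO can be split back into its two parts with a few lines of code. The sketch below assumes that TABLE_ID is always the run of digits after the last dot, so that paper IDs which themselves contain a dot are handled as well; `split_docno` is a hypothetical helper name.

```python
def split_docno(docno):
    """Split 'ARXIV_ID.TABLE_ID' into (arxiv_id, table_index).

    The split is made at the LAST dot, so a paper ID that itself
    contains a dot is kept intact. table_index is zero-based.
    """
    arxiv_id, sep, table_id = docno.rpartition(".")
    if not sep or not arxiv_id or not table_id.isdigit():
        raise ValueError(f"not a valid DOCNO: {docno!r}")
    return arxiv_id, int(table_id)


if __name__ == "__main__":
    paper, index = split_docno("astro-ph/0304562.4")
    # The index is zero-based, so index + 1 is the ordinal position.
    print(f"table {index + 1} of {paper}")
```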
This research is sponsored by National Science Foundation grant IIS-1450545. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.