
A Method to Retrieve Non-Textual Data
From Widespread Repositories

Ed Hovy and Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University

Anita de Waard
Research Data Services
Elsevier, Inc.

Project Overview

In data-rich sciences, the volume of non-textual data produced by scholarly research, such as numerical data, continues to grow. This project explores, through a prototype, whether easy-to-use, universal, and low-cost alternatives can be found to large centralized repositories of non-textual data. The goal is to make non-textual information as easy to access as documents, without laborious additional work. Automatically created indices would provide a mechanism for indexing and retrieval, much as internet search engines do for text. If successful, it will be easy to post numerical data and have it be available through standard web search engines as readily as textual information is today.

This project is building a system that uses a wide variety of data sources and objects from a range of disciplines, and is designing representations that specify the characteristics of non-textual data, including numerical data. It is also developing enhanced indexing capabilities to handle queries in an extended search engine, along with procedures for annotating each type of data. A query and interaction interface for developers and pilot users is being created to test the viability of the approach and its effectiveness in creating descriptive additions for various types of data.

 

Project Personnel

Kyle Yingkai Gao and Huiying Li, Research Assistants

 

Demonstrations and Prototypes

Table arXiv
A search engine for tables extracted from arXiv.org publications.

Elsevier Science Direct Search
A search engine for tables extracted from academic publications provided by Elsevier.

Federated Search
A federated search engine for the Elsevier Science Direct data, DataDryad, Neuroscience Information Framework, Harvard Dataverse Network, and PubMed. Each query is distributed to the different search engines, and the results they return are merged and re-ranked for display.
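The distribute, merge, and re-rank flow described above can be illustrated with a minimal sketch. This is not the project's implementation: the page does not say how scores from different engines are combined, so the min-max normalization, the tie-breaking rule, and the names `federated_search`, `normalize`, `engine_a`, and `engine_b` are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Result = Tuple[str, float]            # (document id, engine-specific score)
Engine = Callable[[str], List[Result]]


def normalize(results: List[Result]) -> List[Result]:
    """Min-max scale one engine's scores into [0, 1] (an assumed merging rule)."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(doc, 1.0) for doc, _ in results]
    return [(doc, (score - lo) / (hi - lo)) for doc, score in results]


def federated_search(query: str, engines: Dict[str, Engine],
                     top_k: int = 10) -> List[Tuple[str, str, float]]:
    """Send the query to every engine, then merge and re-rank the results."""
    # Distribute: each engine is queried in its own thread.
    with ThreadPoolExecutor(max_workers=max(1, len(engines))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        per_engine = {name: future.result() for name, future in futures.items()}

    # Merge: put all normalized results into one list.
    merged = []
    for name, results in per_engine.items():
        for doc, score in normalize(results):
            merged.append((score, name, doc))

    # Re-rank: highest normalized score first; ties broken by engine name, then id.
    merged.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, doc, score) for score, name, doc in merged[:top_k]]


# Hypothetical stand-ins for real search back ends with incompatible score scales.
def engine_a(query: str) -> List[Result]:
    return [("a1", 12.0), ("a2", 3.0)]


def engine_b(query: str) -> List[Result]:
    return [("b1", 0.9), ("b2", 0.5), ("b3", 0.1)]


ranked = federated_search("neuron firing rate",
                          {"A": engine_a, "B": engine_b}, top_k=3)
for engine_name, doc_id, score in ranked:
    print(engine_name, doc_id, round(score, 3))
```

Normalizing per engine is one simple way to make scores comparable before merging; a production system would more likely use a learned or collection-aware merging strategy.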

Neuroscience Table Search
A search engine for tables collected from neuroscience publications provided by Elsevier and for curated data provided by NeuroElectro. NeuroElectro is a website that extracts information about the electrophysiological properties of diverse neuron types from the existing literature. We thank Elsevier and Shreejoy Tripathy for their assistance in obtaining this data.

Tables Like This
A numerical table search engine that takes tables or subtables as input and retrieves similar tables from neuroscience publications provided by Elsevier.

 

Datasets

TableArXiv
A dataset for evaluating table search on papers from arXiv.org. The dataset consists of 105 information needs, relevance judgments, and instructions for downloading the papers from arXiv.org.

 

Dissemination of Research Results

Research results are disseminated by research publications (see Ed's publications and Jamie's publications) and via the search demonstrations shown above.

 


This research is sponsored by the National Science Foundation under grant IIS-1450545. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.

Updated on January 25, 2016
Jamie Callan