Chenyan Xiong (Language Technologies Institute, Carnegie Mellon University) | Russell Power (Allen Institute for Artificial Intelligence) | Jamie Callan (Language Technologies Institute, Carnegie Mellon University)
This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine, SemanticScholar.org, reveals that a major error source is its inability to understand the meaning of research concepts in queries. To address this challenge, ESR represents queries and documents in the entity space and ranks them based on their semantic connections, computed from their knowledge graph embeddings. Experiments demonstrate ESR's ability to improve Semantic Scholar's online production system, especially on hard queries where word-based ranking fails.
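For intuition only, here is a minimal Python sketch of matching in the entity space: it scores a document for a query by comparing their Freebase entity annotations through embedding similarity. This is a toy illustration of the idea, not the actual ESR model (which is described in the paper); the function name and data layout are assumptions.

# Toy illustration of entity-space matching (not the full ESR model).
import numpy as np

def entity_match_score(query_entities, doc_entities, embeddings):
    # query_entities / doc_entities: lists of Freebase entity ids
    # embeddings: dict mapping entity id -> unit-normalized numpy vector
    scores = []
    for q in query_entities:
        if q not in embeddings:
            continue
        sims = [embeddings[q] @ embeddings[d] for d in doc_entities if d in embeddings]
        if sims:
            scores.append(max(sims))  # best-matching document entity for this query entity
    return float(np.mean(scores)) if scores else 0.0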
Download the whole dataset. It is 1.2 GB (compressed).
In the zip file you will find the following files and folders:
s2_query.json contains the queries used in this paper. Each line is a JSON dictionary with the following format:
{"qid": "the query id", "query": "the query string", "ana": {the annotated entity id and frequency}}
The entities are from Freebase. Please refer to the final Freebase dump for more information about these entities.
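For example, a minimal Python sketch for reading the query file, assuming it sits in the unpacked dataset directory:

# Read s2_query.json: one JSON object per line.
import json

queries = {}
with open("s2_query.json", encoding="utf-8") as f:
    for line in f:
        q = json.loads(line)
        # q["ana"] maps Freebase entity ids to their frequency in the query
        queries[q["qid"]] = {"query": q["query"], "entities": q.get("ana", {})}

print(len(queries), "queries loaded")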
s2.trec is a TREC-format ranking file. It contains the ranking lists produced by semanticscholar.org's production search engine (as of summer 2016).
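A small sketch for parsing it, assuming the standard six-column TREC run format (qid Q0 docno rank score run_name):

# Parse the TREC run file into per-query ranking lists.
from collections import defaultdict

rankings = defaultdict(list)
with open("s2.trec", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        qid, _, docno, rank, score, _ = line.split()
        rankings[qid].append((int(rank), docno, float(score)))

for qid in rankings:
    rankings[qid].sort()  # order by rank position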
s2_doc.json contains the candidate documents. Each line is a JSON dictionary with the following fields:
docno: the doc id
title
keyPhrase: the automatically extracted key phrases for this paper.
paperAbstract: paper abstract
venue
numCitedBy: number of citations
numKeyCitations: number of key citations. A key citation means the citing paper considers this one a very important related work. This signal comes from Semantic Scholar's production system.
Ana: the entity annotations of the title, paperAbstract, and body fields.
Due to copyright restrictions, we are not allowed to release the body text. Please check http://corpus.semanticscholar.org/ to get the full corpus and more information about each document.
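A minimal sketch for reading the document file; the field names follow the list above, and the citation-count example is only illustrative:

# Read s2_doc.json: one JSON object per line.
import json

docs = {}
with open("s2_doc.json", encoding="utf-8") as f:
    for line in f:
        d = json.loads(line)
        docs[d["docno"]] = d

# Example: titles of the most-cited candidate documents.
top = sorted(docs.values(), key=lambda d: d.get("numCitedBy", 0), reverse=True)[:5]
for d in top:
    print(d.get("numCitedBy", 0), d.get("title", ""))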
s2.qrel contains the relevance judgments for these queries, labeled by the first
two authors. Judging the relevance of computer science papers is hard: we had to
read many papers' abstracts, and sometimes their introductions, before making
reasonable judgments. The current set of labels is therefore limited in size.
Keep an eye on SemanticScholar.org for possible future benchmark releases.
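A small sketch for reading the judgments, assuming the standard four-column TREC qrel format (qid iteration docno relevance):

# Read s2.qrel into a nested dict: qid -> docno -> relevance grade.
from collections import defaultdict

qrels = defaultdict(dict)
with open("s2.qrel", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        qid, _, docno, rel = line.split()
        qrels[qid][docno] = int(rel)

print(sum(len(v) for v in qrels.values()), "judged query-document pairs")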
The ranking_res folder includes the ranking results of all baselines, developed methods, and alternative methods used in the experiments and analysis of this paper. Feel free to base future experiments on them.
The knowledge_graph_embedding folder contains the entity embeddings trained on our knowledge graph, stored in the Google word2vec format.
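For example, the embeddings can be loaded with gensim. The file name below is a placeholder for the actual file inside the folder, and binary= should be set according to whether the file is stored in text or binary word2vec format:

# Load the entity embeddings (word2vec format) with gensim.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format(
    "knowledge_graph_embedding/entity_embeddings.txt", binary=False)

# Entities are identified by their Freebase ids, e.g.:
# print(vectors.most_similar("/m/0blvg", topn=5))  # entity id is a hypothetical example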
The BibTeX entry for this paper is as follows:
@inproceedings{xiong2017ESR,
title={Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding},
author={Xiong, Chenyan and Power, Russell and Callan, Jamie},
booktitle={Proceedings of the 26th International Conference on World Wide Web (WWW 2017)},
note={To appear},
year={2017},
organization={ACM}
}
This research was supported by National Science Foundation (NSF) grant IIS-1422676 and a fellowship granted to the first author by the Allen Institute for Artificial Intelligence. Part of this work was done while the first author was interning at AI2. Any opinions, findings, and conclusions in this paper are the authors' and do not necessarily reflect those of the sponsors.
Updated on March 8, 2017