Query Expansion with Freebase

Chenyan Xiong

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

Jamie Callan

Language Technologies Institute

School of Computer Science

Carnegie Mellon University

Abstract

Large knowledge bases are being developed to describe entities, their attributes, and their relationships to other entities. Prior research mostly focuses on the construction of knowledge bases, while how to use them in information retrieval is still an open problem.

This paper presents a simple and effective method of using one such knowledge base, Freebase, to improve query expansion, a classic and widely studied information retrieval task. It investigates two methods of identifying the entities associated with a query, and two methods of using those entities to perform query expansion. A supervised model combines information derived from Freebase descriptions and categories to select terms that are effective for query expansion. Experiments on the ClueWeb09 dataset with TREC Web Track queries demonstrate that these methods are almost 30% more effective than strong, state-of-the-art query expansion algorithms. In addition to improving average performance, some of these methods have better win/loss ratios than baseline algorithms, with 50% fewer queries damaged.

Datasets

FbObjRank contains the linked Freebase objects of TREC Web Track 2009-2012 queries (those for ClueWeb09).

FbSearchObj and FbFaccObj contain the objects linked using the Google Freebase Search API and the FACC1 annotations of top retrieved documents, respectively.

Each line in the two files has the format:

Query id \t query text (stemmed) \t object id \t object name \t linking score
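
The line format above can be read with a few lines of Python. This is a minimal sketch, not code from the release: it assumes the five fields are separated by single tab characters and that the linking score parses as a floating-point number. The names LinkedObject and parse_linked_objects are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class LinkedObject(NamedTuple):
    """One linked Freebase object for a query (hypothetical helper type)."""
    object_id: str
    name: str
    score: float


def parse_linked_objects(lines: Iterable[str]) -> Dict[str, List[LinkedObject]]:
    """Group linked objects by query id, preserving file order.

    Assumes each non-blank line is:
    query id \t query text (stemmed) \t object id \t object name \t linking score
    """
    by_query: Dict[str, List[LinkedObject]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"expected 5 tab-separated fields, got {len(fields)}: {line!r}"
            )
        query_id, _query_text, object_id, name, score = fields
        by_query[query_id].append(LinkedObject(object_id, name, float(score)))
    return dict(by_query)
```

Any iterable of lines works as input, so an open file handle for FbSearchObj or FbFaccObj can be passed directly. The sketch keeps the objects in file order because the page does not say how the linking scores should be ranked.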

ExpansionTerms contains the final expansion terms produced by our methods.

FbSearchPRFExpTerm, FbSearchCatExpTerm, FbFaccPRFExpTerm, and FbFaccCatExpTerm are the expansion terms and their weights produced by our unsupervised expansion methods. FbSVMExpTerm is the result of our supervised expansion method. Please check the paper for more details about how these terms are selected.

Each line in these files has the format:

Query id \t query text (stemmed) \t expansion term \t expansion weight
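
The expansion term files can be loaded in the same way. The sketch below is illustrative only: it assumes tab-separated fields, a numeric expansion weight, and that a larger weight marks a more important term. The function names parse_expansion_terms and top_terms are hypothetical and not part of the release.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

TermWeight = Tuple[str, float]


def parse_expansion_terms(lines: Iterable[str]) -> Dict[str, List[TermWeight]]:
    """Collect (term, weight) pairs per query id.

    Assumes each non-blank line is:
    query id \t query text (stemmed) \t expansion term \t expansion weight
    """
    by_query: Dict[str, List[TermWeight]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(
                f"expected 4 tab-separated fields, got {len(fields)}: {line!r}"
            )
        query_id, _query_text, term, weight = fields
        by_query[query_id].append((term, float(weight)))
    return dict(by_query)


def top_terms(terms: List[TermWeight], k: int) -> List[TermWeight]:
    """Return the k highest-weighted terms (assumes larger weight is better)."""
    return sorted(terms, key=lambda tw: tw[1], reverse=True)[:k]
```

For example, calling top_terms(parse_expansion_terms(f)[query_id], 10) on an open file f would give the ten highest-weighted expansion terms for one query. How those terms are combined with the original query is described in the paper and is not reproduced here.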

EvaluationResults contains the evaluation results of our methods and of the baselines used in the paper. IndriLmEva, RmWikiEva, SDMEva, and SVMPRFEva are the evaluation results (on TREC Web Track 2009-2012 queries) of the baselines we implemented: the Indri language model, pseudo relevance feedback on the Wikipedia corpus, the sequential dependence model, and supervised expansion using terms from top retrieved documents, respectively. Files whose names start with Fb contain the evaluation results of our methods.

How to cite this paper:

The BibTeX entry for this paper is as follows:

@inproceedings{xiong2015fbexpansion,

title={Query Expansion with {F}reebase},

author={Xiong, Chenyan and Callan, Jamie},

booktitle={Proceedings of the Fifth ACM International Conference on the Theory of Information Retrieval},

note={To appear},

year={2015},

organization={ACM}

}

Acknowledgements

This research is sponsored by the National Science Foundation under grant IIS-1422676 and by Google through its support of the Worldly Knowledge and Using Freebase for Improved Information Retrieval projects. Any opinions, findings, conclusions, or recommendations expressed on this Web site are those of the author(s) and do not necessarily reflect those of the sponsors.

Updated on August 7, 2015

Chenyan Xiong