Deeper Text Understanding for IR with Contextual Neural Language Modeling
Zhuyun Dai, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, zhuyund@cs.cmu.edu
Jamie Callan, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, callan@cs.cmu.edu
Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but little exploration has been done of understanding the text content of a query or document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining this text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
The source code is in the GitHub repository. It covers:
The initial document rankings and the 5-fold cross-validation splits can be downloaded via the links listed below. Files are in trec_eval input format (.trec files):
Format: "qid Q0 docid rank score runname"
We also release the text contents of the ClueWeb09-B documents in the initial rankings, as well as the passages derived from these documents. These files can be used directly to train/test BERT re-rankers.
Files are in trec_eval input format, plus a JSON string at the end of each line containing the title and body text of the document/passage (.trec.with_json files).
Format: "qid Q0 docid rank score runname # {"doc": {"title": "title text...", "body": "body text..."}}"
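For illustration, one line of a .trec.with_json file can be parsed by splitting on the " # " separator and decoding the JSON payload. The sketch below assumes the field layout shown above and is not the repository's own loader; the docid in the example is a placeholder.

```python
# Illustrative sketch (assumed field layout, not the repository's loader):
# parse one .trec.with_json line into run fields plus the document text.
import json

def parse_with_json_line(line):
    run_part, json_part = line.split(" # ", 1)
    qid, _, docid, rank, score, runname = run_part.split()[:6]
    doc = json.loads(json_part)["doc"]  # {"title": ..., "body": ...}
    return qid, docid, int(rank), float(score), doc

example = ('51 Q0 clueweb09-en0000-00-00000 1 12.34 bert '  # placeholder docid
           '# {"doc": {"title": "title text...", "body": "body text..."}}')
print(parse_with_json_line(example))
```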
A login ID is required to access the data. If your organization has a ClueWeb09 dataset license, you can obtain a username and password by contacting Jamie Callan.
We augment the official pre-trained BERT model by training it on a ranking task with Bing search logs, following the domain adaptation setting used in Dai et al. The augmented BERT model can be downloaded via the link below:
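As a usage illustration (separate from the download link), the sketch below shows one way such a checkpoint could score a query-passage pair once it is available as a Hugging Face Transformers-compatible model directory. The model path, sequence length, and the assumption of a binary relevance classification head are ours, not the repository's, and the released checkpoint may require conversion before it can be loaded this way.

```python
# Illustrative sketch only: score a query-passage pair with a BERT re-ranker.
# "path/to/augmented-bert" is a placeholder for a converted model directory.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("path/to/augmented-bert")
model.eval()

query = "treatment of migraine headaches"
passage = "Migraines are commonly treated with ..."
inputs = tokenizer(query, passage, truncation=True, max_length=256,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
score = logits[0, 1].item()  # assumes label 1 = "relevant"
print(score)
```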