Deeper Text Understanding for IR with Contextual Neural Language Modeling
Zhuyun Dai, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, zhuyund@cs.cmu.edu
Jamie Callan, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, callan@cs.cmu.edu
Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but little exploration has been done of understanding the text content of a query or document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining this text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
The source code is in the GitHub repository. It covers:
The initial document rankings and the 5-fold cross-validation splits can be downloaded via the links listed below. Files are in trec_eval input format (.trec files):
Format: "qid Q0 docid rank score runname"
We also release the text contents of the ClueWeb09-B documents in the initial rankings, as well as the passages derived from these documents. These files can be used directly to train/test BERT re-rankers.
Files are in trec_eval input format, plus a JSON string at the end of each line containing the title and body text of the document/passage (.trec.with_json files).
Format: "qid Q0 docid rank score runname # {"doc": {"title": "title text...", "body": "body text..."}}"
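For illustration, one line of a .trec.with_json file can be parsed by splitting on the " # " separator and decoding the JSON payload. The sketch below assumes the field layout shown above and is not the repository's own loader; the docid in the example is a placeholder.

```python
# Illustrative sketch (assumed field layout, not the repository's loader):
# parse one .trec.with_json line into run fields plus the document text.
import json

def parse_with_json_line(line):
    run_part, json_part = line.split(" # ", 1)
    qid, _, docid, rank, score, runname = run_part.split()[:6]
    doc = json.loads(json_part)["doc"]  # {"title": ..., "body": ...}
    return qid, docid, int(rank), float(score), doc

example = ('51 Q0 clueweb09-en0000-00-00000 1 12.34 bert '  # placeholder docid
           '# {"doc": {"title": "title text...", "body": "body text..."}}')
print(parse_with_json_line(example))
```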
A login ID is required to access the data. If your organization has a ClueWeb09 dataset license, you can obtain a username and password by contacting Jamie Callan.
We augment the official pre-trained BERT model by training it on a ranking task with Bing search logs, following the domain adaptation setting used in Dai et al. The augmented BERT model can be downloaded via the link below:
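As a usage illustration (separate from the download link), the sketch below shows one way such a checkpoint could score a query-passage pair once it is available as a Hugging Face Transformers-compatible model directory. The model path, sequence length, and the assumption of a binary relevance classification head are ours, not the repository's, and the released checkpoint may require conversion before it can be loaded this way.

```python
# Illustrative sketch only: score a query-passage pair with a BERT re-ranker.
# "path/to/augmented-bert" is a placeholder for a converted model directory.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("path/to/augmented-bert")
model.eval()

query = "treatment of migraine headaches"
passage = "Migraines are commonly treated with ..."
inputs = tokenizer(query, passage, truncation=True, max_length=256,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
score = logits[0, 1].item()  # assumes label 1 = "relevant"
print(score)
```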