Context-Aware Document Term Weighting for Ad-Hoc Search

Zhuyun Dai
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
zhuyund@cs.cmu.edu

Jamie Callan
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
callan@cs.cmu.edu

Abstract

Bag-of-words document representations play a fundamental role in modern search engines, but their power is limited by the shallow frequency-based term weighting scheme. This paper proposes HDCT, a context-aware document term weighting framework for document indexing and retrieval. It first estimates the semantic importance of a term in the context of each passage. These fine-grained term weights are then aggregated into a document-level bag-of-words representation, which can be stored in a standard inverted index for efficient retrieval. This paper also proposes two approaches that enable training HDCT without relevance labels. Experiments show that an index using HDCT weights significantly improved retrieval accuracy compared to typical term-frequency and state-of-the-art embedding-based indexes.
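The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes passage-level term importance scores in [0, 1] have already been predicted by some model; the function names, the square-root scaling, the constant 100, and the uniform passage weights are illustrative assumptions, not necessarily the settings used in the paper or the released code.

```python
import math
from collections import defaultdict


def scale_weight(y, n=100):
    """Map an importance score y in [0, 1] to an integer, tf-like weight.

    The square-root scaling and n=100 are assumptions for illustration.
    """
    return round(n * math.sqrt(y))


def aggregate_passages(passage_scores, passage_importance=None):
    """Combine per-passage term scores into one document-level bag of words.

    passage_scores: list of dicts, one per passage, mapping term -> score in [0, 1].
    passage_importance: optional list of per-passage weights (hypothetical
    parameter); defaults to treating every passage equally.
    """
    if passage_importance is None:
        passage_importance = [1.0] * len(passage_scores)

    doc_weights = defaultdict(float)
    for importance, scores in zip(passage_importance, passage_scores):
        for term, y in scores.items():
            doc_weights[term] += importance * scale_weight(y)

    # Integer weights can be stored in an inverted index in place of raw
    # term frequencies; terms whose weight rounds to zero are dropped.
    return {t: round(w) for t, w in doc_weights.items() if round(w) > 0}


# Toy scores for a two-passage document (made-up values).
passages = [
    {"hdct": 0.81, "index": 0.25},
    {"hdct": 0.36, "retrieval": 0.49},
]
doc_bow = aggregate_passages(passages)
```

Because the result is an ordinary term-to-integer mapping, it can be indexed and searched with standard bag-of-words retrieval models without changing the retrieval engine.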

Source Code

The source code is in the DeepCT and HDCT GitHub repository.

Data

Rankings generated by HDCT for MS-MARCO-Doc: here

Coming Soon: Training data and HDCT term weights for ClueWeb09-B and MS-MARCO-Doc.

A login ID will be required to access ClueWeb09-B. If your organization has a ClueWeb09 dataset license, you can obtain a username and password by contacting Jamie Callan.

Hyperparameters

Coming Soon: Effects of the scaling function.

Citation

Z. Dai and J. Callan. Context-Aware Document Term Weighting for Ad-Hoc Search. In Proceedings of the Web Conference. 2020. Updated on Feb 18, 2020.

Zhuyun Dai