Efficient Distributed Selective Search

Abstract

Simulation and analysis have shown that selective search can reduce the cost of large-scale distributed information retrieval. By partitioning the collection into small topical shards, and then using a resource ranking algorithm to choose a subset of shards to search for each query, fewer postings are evaluated. In this paper we extend the study of selective search into new areas using a fine-grained simulation, examining the difference in efficiency when term-based and sample-based resource selection algorithms are used; measuring the effect of two policies for assigning index shards to machines; and exploring the benefits of index-spreading and mirroring as the number of deployed machines is varied. Results obtained for two large datasets and four large query logs confirm that selective search is significantly more efficient than conventional distributed search architectures and can handle higher query rates. Furthermore, we demonstrate that selective search can be tuned to avoid bottlenecks, and thus maximize usage of the underlying computer hardware.
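
The shard-selection idea summarized above can be illustrated with a minimal sketch. It is not the simulator or any of the resource selection algorithms evaluated in the paper: the summed document-frequency score is a simple term-based stand-in, and the names `build_shard_stats`, `rank_shards`, `selective_search`, and the toy `SHARDS` data are all hypothetical.

```python
from collections import Counter


def build_shard_stats(shards):
    """Collect per-shard term statistics: term -> number of documents containing it."""
    stats = {}
    for shard_id, docs in shards.items():
        df = Counter()
        for text in docs.values():
            df.update(set(text.split()))
        stats[shard_id] = df
    return stats


def rank_shards(query, stats, k):
    """Return up to k shard ids, ordered by summed document frequency of the query terms.

    Assumption: this score is only a placeholder for a real term-based
    resource selection algorithm.
    """
    terms = query.split()
    scores = {sid: sum(df[t] for t in terms) for sid, df in stats.items()}
    ranked = sorted(scores, key=lambda sid: (-scores[sid], sid))
    return [sid for sid in ranked[:k] if scores[sid] > 0]


def selective_search(query, shards, stats, k=1):
    """Evaluate the query only against the k selected shards."""
    terms = set(query.split())
    hits = []
    for sid in rank_shards(query, stats, k):
        for doc_id, text in shards[sid].items():
            overlap = len(terms & set(text.split()))
            if overlap:
                hits.append((overlap, doc_id))
    # Highest term overlap first; ties broken by document id.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for _, doc_id in hits]


# Toy collection, already partitioned into two topical shards (hypothetical data).
SHARDS = {
    "sports": {"d1": "football match score", "d2": "tennis match final"},
    "cooking": {"d3": "pasta sauce recipe", "d4": "bread recipe flour"},
}
STATS = build_shard_stats(SHARDS)

if __name__ == "__main__":
    print(rank_shards("match score", STATS, k=1))
    print(selective_search("match score", SHARDS, STATS, k=1))
```

With `k=1`, only one shard's documents are scanned per query, which is the source of the savings in evaluated postings; an exhaustive search corresponds to setting `k` to the number of shards.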

Citation

Published in the Information Retrieval Journal, 2016.

@Article{Kim2016,
  author="Kim, Yubin and Callan, Jamie and Culpepper, J. Shane and Moffat, Alistair",
  title="Efficient distributed selective search",
  journal="Information Retrieval Journal",
  year="2016",
  pages="1--32",
  issn="1573-7659",
  doi="10.1007/s10791-016-9290-6",
  url="http://dx.doi.org/10.1007/s10791-016-9290-6"
}

Code and artifacts

The simulator used in this paper can be downloaded from the git repository at http://boston.lti.cs.cmu.edu/appendices/jir17-yubink/loadsim

The shard maps used in this paper are available for two collections: ClueWeb09 Category A (1.9 GB) and GOV2 (133 MB).

Acknowledgements

This research is sponsored by National Science Foundation grant IIS-1302206 and by the Australian Research Council (DP140101587 and DP140103256). Shane Culpepper is the recipient of an Australian Research Council DECRA Research Fellowship (DE140100275). Yubin Kim is the recipient of the Natural Sciences and Engineering Research Council of Canada PGS-D3 (438411). Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.