The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
- Dataset Information : Information on the structure of the dataset on disk, the formatting of the data and extra information.
- Page Encodings : How the character encodings for the dataset are formatted
- Web Graph : Information on the web graph of nodes and outlinks for the dataset
- Redirects : Redirect information for the Category B dataset
- Sample Files : Sample files in various languages from the ClueWeb09 dataset
Acknowledgements


The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.