ClueWeb09 Wiki

The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.


  • Dataset Information : Information on the on-disk structure of the dataset, the formatting of the data, and additional information


  • Page Encodings : How the character encodings for the dataset are formatted

  • Web Graph : Information on the web graph of nodes and outlinks for the dataset

  • Redirects : Redirect information for the Category B dataset



  • Sample Files : Sample files in various languages from the ClueWeb09 dataset




Acknowledgements


The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.



Created by: admin. Last Modification: Monday 07 of December, 2009 10:34:00 EST by lezhao.