ClueWeb09 Category A (English Only) Dataset

The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The full dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.

This index covers only the English portion of the ClueWeb09 Category A dataset, which consists of roughly the first 500 million English web pages.

More information about the ClueWeb09 dataset can be found on the ClueWeb09 Homepage.

Source	Size (Number of Documents)	Dataset Size
ClueWeb09 Category A (English Only)	503,903,810	About 15.0 TB uncompressed (about 2.5 TB compressed)
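
For a rough sense of scale, the figures in the table can be turned into a compression ratio and an average page size. The following is only a back-of-the-envelope sketch: the table values are approximate ("about"), and treating a terabyte as 10^12 bytes is an assumption, as are the helper names.

```python
# Figures copied from the table above; the sizes are approximate.
NUM_DOCS = 503_903_810
UNCOMPRESSED_TB = 15.0
COMPRESSED_TB = 2.5

# Assumption: decimal terabytes (10**12 bytes). The page does not say
# whether TB or TiB is meant.
BYTES_PER_TB = 10**12


def compression_ratio(uncompressed_tb, compressed_tb):
    """Return how many times smaller the compressed collection is."""
    return uncompressed_tb / compressed_tb


def avg_bytes_per_doc(total_tb, num_docs, bytes_per_tb=BYTES_PER_TB):
    """Return the mean size of one document in bytes."""
    return total_tb * bytes_per_tb / num_docs


if __name__ == "__main__":
    ratio = compression_ratio(UNCOMPRESSED_TB, COMPRESSED_TB)
    avg_raw = avg_bytes_per_doc(UNCOMPRESSED_TB, NUM_DOCS)
    avg_gz = avg_bytes_per_doc(COMPRESSED_TB, NUM_DOCS)
    print(f"compression ratio: {ratio:.1f}x")
    print(f"average uncompressed page: {avg_raw / 1000:.1f} kB")
    print(f"average compressed page: {avg_gz / 1000:.1f} kB")
```

Under these assumptions the collection compresses about sixfold, and an average page is on the order of 30 kB uncompressed.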

This index was created using Indri version 5.11 in January 2017. Stopwords were removed at indexing time, and the index contains five fields: title, heading, url, body, and inlink. The parameter file used to create the index is located here: clueweb09 indexing parameters
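
Because the index exposes named fields, a query can be restricted to one of them. The sketch below builds such a query string, assuming the Indri query language's `term.field` restriction syntax and its `#combine` operator. The helper `field_query` and the constant `INDEXED_FIELDS` are hypothetical names used for illustration and are not part of Indri or of this index's tooling.

```python
# Field names as listed in the index description above.
INDEXED_FIELDS = ("title", "heading", "url", "body", "inlink")


def field_query(terms, field):
    """Build an Indri query string restricting every term to one field.

    Hypothetical helper: it only formats text and does not contact
    any index or validate the terms themselves.
    """
    if field not in INDEXED_FIELDS:
        raise ValueError(f"unknown field: {field!r}")
    if not terms:
        raise ValueError("at least one query term is required")
    # Indri's field restriction is written as term.field
    restricted = " ".join(f"{term}.{field}" for term in terms)
    return f"#combine({restricted})"


if __name__ == "__main__":
    print(field_query(["information", "retrieval"], "title"))
```

The resulting string can then be passed to whatever Indri query interface is in use. Whether field-restricted queries are appropriate for a given task is a separate retrieval design question.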