The 2013 TREC Crowdsourcing Track used subsets of the ClueWeb12 dataset. The track organizers extracted the documents from ClueWeb12 and distributed them via tgz files as a convenience to track participants. The files were prepared by Gaurav Baruah, Gabriella Kazai, and Mark D. Smucker, who organized the 2013 TREC Crowdsourcing Track.
There are two versions of the dataset. Version 1.0 is the original release; some of its documents were inadvertently truncated during the extraction process. Version 1.1 contains the complete versions of those documents and is the current version of the dataset.
Version 1.1:
A readme.txt file: The treccrowd2013-dataset-readme-v1.1-20130821.txt describes the two assessment pools.
The Basic assessment pool: The treccrowd2013-basic-subset-documents-pool-v1.1-20130821.tar.gz file contains 3,470 documents in 10 directories (for 10 topics). (46 MB)
The Standard assessment pool: The treccrowd2013-standard-full-documents-pool-v1.1-20130821.tar.gz file contains 17,796 documents in 50 directories (for 50 topics). (242 MB)
Version 1.0:
A readme.txt file: The treccrowd2013-dataset-readme-v1.0-20130809.txt describes the two assessment pools.
The Basic assessment pool: The treccrowd2013-basic-subset-documents-pool-v1.0-20130809.tar.gz file contains 3,470 documents in 10 directories (for 10 topics). (46 MB)
The Standard assessment pool: The treccrowd2013-standard-full-documents-pool-v1.0-20130809.tar.gz file contains 17,796 documents in 50 directories (for 50 topics). (241 MB)
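One way to check a downloaded archive against the document and directory counts listed above is to tally the files inside it without unpacking. The following is a minimal Python sketch, not a tool supplied by the track organizers. It assumes each document is stored as a regular file inside a per-topic directory; the function name count_documents_per_directory is hypothetical, and the readme.txt file remains the authoritative description of the actual layout.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def count_documents_per_directory(archive_path):
    """Count regular files in a tar archive, grouped by parent directory.

    Assumption (not confirmed by the dataset description): every document
    is a regular file stored inside the directory for its topic.
    """
    counts = Counter()
    # "r:*" lets tarfile detect the compression (gzip for .tar.gz files).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            # Skip directory entries, links, and other non-file members.
            if member.isfile():
                parent = str(PurePosixPath(member.name).parent)
                counts[parent] += 1
    return counts
```

If the layout assumption holds, calling count_documents_per_directory on treccrowd2013-basic-subset-documents-pool-v1.1-20130821.tar.gz should return 10 directories whose counts sum to 3,470, and the Standard pool archive should give 50 directories summing to 17,796. A different result may mean either an incomplete download or a layout that differs from the assumption.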