TREC 2013 Crowdsourcing Track Data
==================================

version: 1.1 (Aug 21, 2013)

Authors: Gaurav Baruah, Gabriella Kazai, Mark D. Smucker
Maintainer: Gaurav Baruah (gbaruah[at]uwaterloo.ca)
Date: August 9, 2013
Revised: Aug 21, 2013

The dataset for the TREC 2013 Crowdsourcing track consists of the following
two assessment pools:

1. Basic: The treccrowd2013-basic-subset-documents-pool-v1.0-20130809.tar.gz
   file (size 46M) contains 3,470 documents in 10 directories (for 10
   topics), as specified in basic-subset-document-pool.txt.

2. Standard: The treccrowd2013-standard-full-documents-pool-v1.0-20130809.tar.gz
   file (size 242M) contains 17,796 documents in 50 directories (for 50
   topics), as specified in standard-full-document-pool.txt.

Note that the basic pool is a subset of the standard pool.

The tar file for each pool contains all the documents to be judged; these
were extracted from the ClueWeb12 corpus. The pools contain all the data
that TREC Crowdsourcing participants need to take part in the track - there
is no need to obtain the full ClueWeb12 corpus. These datasets are
available only to participants of the TREC 2013 Crowdsourcing track,
subject to the license that covers the full ClueWeb12 collection (see the
disclaimer below). Participants may choose to use either or both pools and
submit crowdsourced relevance judgments to the track.

NOTE: The standard-full-document-pool.txt and basic-subset-document-pool.txt
files are available in the active participants section of the TREC website.
Both files list the topic-docno pairs for their respective topic assessment
pools.

The process for generating the dataset included the following steps:

1. The document pool from the TREC 2013 Web Track (which NIST assessors
   will be judging for relevance) was provided by the TREC organizers at
   NIST. This is the standard-full-document-pool.txt file.

2. Ten topics (202, 214, 216, 221, 227, 230, 234, 243, 246, and 250) were
   randomly selected from standard-full-document-pool.txt to create
   basic-subset-document-pool.txt.

3. The documents listed in each pool file were extracted from the ClueWeb12
   collection. Each document file is named by its clueweb12-id and contains
   the WARC header, the HTTP header, and the HTML content of that ClueWeb12
   document; each file is essentially a single WARC record. For details of
   the WARC record format, see
   http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf .
   (An example of reading such a file is sketched below, after the revision
   notes.)

Dataset File Organization Example:

basic-subset-documents-pool/
|- treccrowd2013-dataset-readme-v1.0.txt
|- basic-subset-documents/
   |- 202/
      |- clueweb12-dddddd-ff-rrrrr
      |- ...
   |- 214/
      |- ...
   |- ...

REVISION 1.1 Updates:

1. Some WARC records in the ClueWeb12 collection were found to contain
   null (^@) characters.

2. The extraction program was modified to better handle null characters in
   the middle of a document.

3. After a fresh extraction, 114 documents in the standard-full-document
   pool have been updated from v1.0 to v1.1. Correspondingly, 33 documents
   have been updated in the basic-subset-documents pool.
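Example: Reading a Document File (illustrative only)

The following Python sketch shows one way to split a single extracted
document file into its WARC header, HTTP header, and HTML content. It is
not part of the official track data or tooling; the default file path is a
hypothetical placeholder, and the simple double-CRLF split assumes a
well-formed record. A dedicated WARC parsing library could be used instead
for more robust handling.

    # Minimal sketch (not official tooling): split one extracted document
    # file into its WARC header, HTTP header, and HTML content.
    import sys

    def split_warc_record(path):
        """Return (warc_header, http_header, html) for one document file."""
        with open(path, "rb") as f:
            raw = f.read()

        # Some records contained null characters (see REVISION 1.1 above);
        # drop them defensively before decoding.
        raw = raw.replace(b"\x00", b"")

        # A WARC response record is laid out as: WARC header block,
        # blank line, HTTP header block, blank line, payload (HTML).
        warc_header, _, rest = raw.partition(b"\r\n\r\n")
        http_header, _, html = rest.partition(b"\r\n\r\n")

        decode = lambda b: b.decode("utf-8", errors="replace")
        return decode(warc_header), decode(http_header), decode(html)

    if __name__ == "__main__":
        # Hypothetical path; substitute a real clueweb12-id from a pool file.
        path = sys.argv[1] if len(sys.argv) > 1 else \
            "basic-subset-documents/202/clueweb12-dddddd-ff-rrrrr"
        warc_header, http_header, html = split_warc_record(path)
        print(warc_header)
        print("HTML length:", len(html))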
NO WARRANTY; DISCLAIMERS
========================

This dataset has been compiled by Gaurav Baruah of the University of
Waterloo. No warranty is given by the authors or by the University of
Waterloo. This dataset is covered by the "Organization Agreement to use
the ClueWeb12 Web Research Collections"; please review your copy of the
agreement and its section "No Warranty; Disclaimers". While we have made
our best effort to accurately extract documents from the ClueWeb12
collection, we may have made errors.

RESEARCHERS WHO WANT TO BE CERTAIN THAT THEY HAVE THE CORRECT DOCUMENTS
MUST OBTAIN AN ORIGINAL COPY OF CLUEWEB12 FROM CMU.