TREC 2013 Crowdsourcing Track Data
==================================

version: 1.1 (Aug 21, 2013)

Authors: Gaurav Baruah, Gabriella Kazai, Mark D. Smucker
Maintainer: Gaurav Baruah (gbaruah[at]uwaterloo.ca)
Date: August 9, 2013
Revised: Aug 21, 2013

The dataset for the TREC 2013 Crowdsourcing track consists of the following
two assessment pools:

1. Basic: The treccrowd2013-basic-subset-documents-pool-v1.0-20130809.tar.gz
   file (size 46M) contains 3,470 documents in 10 directories (for 10
   topics), as specified in basic-subset-document-pool.txt.

2. Standard: The treccrowd2013-standard-full-documents-pool-v1.0-20130809.tar.gz
   file (size 242M) contains 17,796 documents in 50 directories (for 50
   topics), as specified in standard-full-document-pool.txt.

Note that the basic pool is a subset of the standard pool.

The tar file for each pool contains all the documents to be judged; these
were extracted from the ClueWeb12 corpus. The pools contain all the data
that TREC Crowdsourcing participants need to take part in the track - there
is no need to obtain the full ClueWeb12 corpus. These datasets are
available only to participants of the TREC 2013 Crowdsourcing track,
subject to the license that covers the full ClueWeb12 collection (see the
disclaimer below). Participants may choose to use either or both pools and
submit crowdsourced relevance judgments to the track.

NOTE: The standard-full-document-pool.txt and basic-subset-document-pool.txt
files are available in the active participants section of the TREC website.
Both files list the topic-docno pairs for their respective topic assessment
pools.

The process for generating the dataset included the following steps:

1. The document pool from the TREC 2013 Web Track (which NIST assessors
   will be judging for relevance) was provided by the TREC organizers at
   NIST. This is the standard-full-document-pool.txt file.

2. Ten topics (202, 214, 216, 221, 227, 230, 234, 243, 246, and 250) were
   randomly selected from standard-full-document-pool.txt to create
   basic-subset-document-pool.txt.

3. The documents listed in each pool file were extracted from the ClueWeb12
   collection. Each document file is named by its clueweb12-id and contains
   the WARC header, the HTTP header, and the HTML content of that ClueWeb12
   document; each file is essentially a single WARC record. For details of
   the WARC record format, see
   http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf .
   (An example of reading such a file is sketched below, after the revision
   notes.)

Dataset File Organization Example:

basic-subset-documents-pool/
|- treccrowd2013-dataset-readme-v1.0.txt
|- basic-subset-documents/
   |- 202/
      |- clueweb12-dddddd-ff-rrrrr
      |- ...
   |- 214/
      |- ...
   |- ...

REVISION 1.1 Updates:

1. Some WARC records in the ClueWeb12 collection were found to contain
   null (^@) characters.

2. The extraction program was modified to better handle null characters in
   the middle of a document.

3. After a fresh extraction, 114 documents in the standard-full-document
   pool have been updated from v1.0 to v1.1. Correspondingly, 33 documents
   have been updated in the basic-subset-documents pool.
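Example: Reading a Document File (illustrative only)

The following Python sketch shows one way to split a single extracted
document file into its WARC header, HTTP header, and HTML content. It is
not part of the official track data or tooling; the default file path is a
hypothetical placeholder, and the simple double-CRLF split assumes a
well-formed record. A dedicated WARC parsing library could be used instead
for more robust handling.

    # Minimal sketch (not official tooling): split one extracted document
    # file into its WARC header, HTTP header, and HTML content.
    import sys

    def split_warc_record(path):
        """Return (warc_header, http_header, html) for one document file."""
        with open(path, "rb") as f:
            raw = f.read()

        # Some records contained null characters (see REVISION 1.1 above);
        # drop them defensively before decoding.
        raw = raw.replace(b"\x00", b"")

        # A WARC response record is laid out as: WARC header block,
        # blank line, HTTP header block, blank line, payload (HTML).
        warc_header, _, rest = raw.partition(b"\r\n\r\n")
        http_header, _, html = rest.partition(b"\r\n\r\n")

        decode = lambda b: b.decode("utf-8", errors="replace")
        return decode(warc_header), decode(http_header), decode(html)

    if __name__ == "__main__":
        # Hypothetical path; substitute a real clueweb12-id from a pool file.
        path = sys.argv[1] if len(sys.argv) > 1 else \
            "basic-subset-documents/202/clueweb12-dddddd-ff-rrrrr"
        warc_header, http_header, html = split_warc_record(path)
        print(warc_header)
        print("HTML length:", len(html))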
NO WARRANTY; DISCLAIMERS
========================

This dataset has been compiled by Gaurav Baruah of the University of
Waterloo. No warranty is given by the authors or by the University of
Waterloo. This dataset is covered by the "Organization Agreement to use
the ClueWeb12 Web Research Collections"; please review your copy of the
agreement and its section "No Warranty; Disclaimers". While we have made
our best effort to accurately extract documents from the ClueWeb12
collection, we may have made errors.

RESEARCHERS WHO WANT TO BE CERTAIN THAT THEY HAVE THE CORRECT DOCUMENTS
MUST OBTAIN AN ORIGINAL COPY OF CLUEWEB12 FROM CMU.