The ClueWeb09 Dataset
The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. The dataset is used by several tracks of the TREC conference.
Please see our project planning document. It is slightly outdated, but still approximately correct. For information about the Sapphire web crawler, please see the Sapphire FAQ.
Web Pages:
Web Graph:
Information on how the crawl progressed is also available.
The dataset is distributed as tarred/gzipped files on four 1.5 terabyte (TB) hard disk drives, in Linux ext3 format. The physical drives are standard SATA 3 Gbit/sec (SATA/300) 3.5" drives that should be compatible with any SATA/300 interface, including external USB to SATA/300 enclosures.
Web pages are stored in the WARC file format. Each WARC file is about 1 gigabyte uncompressed, contains several tens of thousands of web pages (e.g., 40,000), and is compressed with gzip.
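As a rough sanity check, the figures quoted above can be combined into a back-of-the-envelope size estimate. This is only a sketch: it treats the "e.g., 40,000" pages per file and "about 1 gigabyte" per file as exact, and assumes decimal units (1 TB = 1000 GB), so the real file counts and sizes will differ.

```python
# Back-of-the-envelope sizing from the approximate figures on this page.
TOTAL_PAGES = 1_000_000_000       # "1 billion web pages"
PAGES_PER_WARC = 40_000           # example figure, treated as exact (assumption)
GB_PER_WARC_UNCOMPRESSED = 1      # "about 1 gigabyte, uncompressed"
DISK_COUNT = 4                    # four hard disk drives
DISK_TB = 1.5                     # 1.5 TB each

warc_files = TOTAL_PAGES // PAGES_PER_WARC
uncompressed_tb = warc_files * GB_PER_WARC_UNCOMPRESSED / 1000  # decimal TB (assumption)
disk_capacity_tb = DISK_COUNT * DISK_TB
# Compression ratio gzip would need to reach for the data to fit on the disks.
min_compression_ratio = uncompressed_tb / disk_capacity_tb

print(f"approx. WARC files:       {warc_files}")
print(f"approx. uncompressed TB:  {uncompressed_tb}")
print(f"total disk capacity TB:   {disk_capacity_tb}")
print(f"required compression:     {min_compression_ratio:.2f}x")
```

Under these assumptions, the full dataset is on the order of 25,000 WARC files and roughly 25 TB uncompressed, which has to compress by a little more than 4x to fit on 6 TB of disk.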
Please see the Dataset Information and Sample Files page for a detailed description of the contents of the dataset including the format of the dataset and sample files.
The ClueWeb09 datasets are distributed by Carnegie Mellon University for research purposes only. A dataset may be obtained from Carnegie Mellon by signing a data license agreement with Carnegie Mellon University, and paying a fee that covers the cost of maintaining and distributing the dataset.
The process for obtaining a ClueWeb09 dataset is described below.
Sign an Organizational Agreement. This agreement must be signed by a person with the authority to sign agreements on behalf of your organization. The person signing must also initial each page of the agreement in the bottom right corner.
The organizational data license typically applies to a single research group or unit within a larger legal entity. For example, in a university, the organizational license might apply to a research group consisting of a few professors, and the students and staff doing research with them. In this case, the organization would be the name of the research group (e.g., the Information Retrieval Laboratory), and the Corporation/Legal Entity would be the name of the university.
Fax the complete copy (all five pages) of the signed organizational agreement to Dana Houston at the Language Technologies Institute. The fax number is +1 412-268-6298.
After you have faxed the organizational agreement, please notify Dana Houston by email (dhouston at cs dot cmu dot edu) that the signed dataset license was faxed to her. Please also provide the following information:
We will send you an email confirmation that we have received your order.
We will send you an invoice for payment, by mail and/or email.
The costs of each dataset are shown below.
| Item | Cost | Notes and Explanations |
|---|---|---|
| ClueWeb09: the full dataset of about 1 billion pages (TREC 2009 "Category A" dataset) | $790 | Includes four 1.5 terabyte hard disks |
| ClueWeb09-T09B: a subset of about 50 million English pages (TREC 2009 "Category B" dataset) | $240 | Includes one 1.0 terabyte hard disk |
| Shipping | (varies) | US options: 1 day, 2 day, 7 day. International options: 1 week |
Payment information will be included on the invoice. Payment must be made in U.S. dollars only.
The dataset will not be shipped until your payment is confirmed. Payment can only be made via check or wire transfer.
Cash and credit card payments are not accepted.
If you are in a hurry, wire transfer is faster than checks.
If you use wire transfer, please be aware that we are not automatically notified when wired funds arrive in CMU's bank account. After you wire your payment, please notify Dana Houston by email (dhouston at cs dot cmu dot edu) so that we know to watch for it.
We ship the dataset to the mailing address that you specified. Please note: It typically takes 1 week from the time we receive your payment until we ship your dataset. We make every effort to ship data quickly, but i) distributing data is not our only job, and ii) other groups may be ahead of you in the data distribution queue.
Each individual who will use or have access to the dataset must sign an Individual Agreement. You must retain these signed individual agreements within your organization.
To stay informed about the latest information and updates to the ClueWeb09 Dataset, you can subscribe to the ClueWeb09 Mailing List. Please note that when you browse to this page, you may receive a warning stating that the security certificate for the domain is invalid. The certificate is not invalid; it is self-signed by the list maintainers at Carnegie Mellon University. It is safe to accept the certificate.
Additional information on updates to the ClueWeb09 Dataset may also be found in the ClueWeb09 Wiki.
The creation of the ClueWeb09 dataset was sponsored by National Science Foundation grant IIS-0841275, under its Cluster Exploratory program. We thank Google and IBM for the use of the CluE computer cluster. We thank Nick Craswell, Dennis Fetterly, Don Metzler, NIST's ITL Retrieval Group, and Yahoo! for their assistance and advice. We thank the Wikimedia Foundation for enabling the inclusion of the English Wikipedia. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s) of this site, and do not necessarily reflect those of the sponsors.
If you still have questions, please see the Frequently Asked Questions web page.