The ClueWeb09 Dataset: Frequently Asked Questions
Why is the dataset named ClueWeb09? The U.S. National Science Foundation's Cluster Exploratory (CluE) program provided computational resources and funding that enabled creation of the dataset. The data was gathered from the web in 2009.
Why is the dataset so expensive? Most of the cost of each dataset covers the hard disk drive(s) used to ship the data to you, which are yours to keep. The remainder covers the staff time required to process dataset licenses and invoices, buy and copy disks, buy packing materials, and prepare disks for shipping, plus a small fee that helps us maintain the hardware used for duplicating disks.
What is the "Category B" subset? The TREC 2009 "Category B" dataset consists of the data in the "ClueWeb09_English_1" directory of the full dataset. This is roughly the first 50 million documents of the English corpus.