The Lemur Project is creating a new web dataset, tentatively called ClueWeb12, that will be a companion or successor to the ClueWeb09 web dataset. This new dataset is expected to be ready for distribution in November 2012. Dataset construction consists of crawling the web for about 1 billion pages, filtering the crawled pages, and organizing the result into a research-ready dataset.
The crawl began on February 10, 2012. We used five instances of the Internet Archive's open source web crawler, Heritrix, running on five Dell PowerEdge R410 machines with 64 GB of RAM. The crawler was configured to follow typical crawling guidelines. There is a FAQ page for the crawler, in case you are curious.
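As one illustration of such a guideline, the sketch below shows how a crawler can check robots.txt before fetching a URL. It is not the actual Heritrix configuration used for this crawl; the user-agent string and the per-host caching scheme are hypothetical.

```python
# Minimal sketch of robots.txt compliance, one common crawling guideline.
# Illustrative only; this is not the Heritrix configuration used for the crawl.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/1.0"   # hypothetical user-agent string

_parsers = {}  # cache one RobotFileParser per host

def allowed_by_robots(url):
    """Return True if robots.txt for the URL's host permits fetching it."""
    parsed = urlparse(url)
    host = parsed.scheme + "://" + parsed.netloc
    if host not in _parsers:
        rp = RobotFileParser()
        rp.set_url(host + "/robots.txt")
        try:
            rp.read()                 # fetch and parse robots.txt
        except OSError:
            rp = None                 # unreachable robots.txt: treat as allowed here
        _parsers[host] = rp
    rp = _parsers[host]
    return True if rp is None else rp.can_fetch(USER_AGENT, url)

# Example: only enqueue URLs that the site's robots.txt allows.
# if allowed_by_robots("http://example.com/page.html"): fetch(...)
```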
The crawl was initially seeded with 2,820,500 unique URLs. This list was generated by taking the 10 million ClueWeb09 URLs that had the highest PageRank scores, and then removing any page that was not in the top 90% of pages as ranked by Waterloo spam scores (i.e., the pages least likely to be spam). Two hundred sixty-two (262) seeds were added from the most popular sites in English-speaking countries, as reported by Alexa. The number of sites selected from each country depended on its relative population size, for example, United States (71.0%), United Kingdom (14.0%), Canada (7.7%), Australia (5.2%), Ireland (3.8%), and New Zealand (3.7%). Finally, Charles Clark, University of Waterloo, provided 5,950 seeds specific to travel sites.
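A rough sketch of how a seed list like this could be assembled is shown below. The file names, the tab-separated "url&lt;TAB&gt;score" formats, and the percentile cutoff are assumptions for illustration; they are not the exact scripts or data files used.

```python
# Sketch of the seed-selection step: take the highest-PageRank ClueWeb09 URLs,
# then drop any URL whose Waterloo spam score falls in the spammiest 10%.
# File names and formats here are hypothetical (tab-separated "url<TAB>score").
import heapq

def top_by_pagerank(pagerank_file, n=10_000_000):
    """Return the n URLs with the highest PageRank scores."""
    with open(pagerank_file) as f:
        rows = (line.rstrip("\n").split("\t") for line in f)
        top = heapq.nlargest(n, rows, key=lambda r: float(r[1]))
    return [row[0] for row in top]

def load_spam_percentiles(spam_file):
    """Map URL -> Waterloo spam percentile (lower = more likely spam)."""
    scores = {}
    with open(spam_file) as f:
        for line in f:
            url, pct = line.rstrip("\n").split("\t")
            scores[url] = int(pct)
    return scores

def select_seeds(pagerank_file, spam_file, spam_cutoff=10):
    """Keep high-PageRank URLs outside the spammiest `spam_cutoff` percent."""
    spam = load_spam_percentiles(spam_file)
    # URLs without a spam score are dropped in this sketch.
    return [u for u in top_by_pagerank(pagerank_file)
            if spam.get(u, 0) >= spam_cutoff]

# seeds = select_seeds("clueweb09_pagerank.tsv", "waterloo_spam.tsv")
```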
A blacklist was used to avoid sites that are reported to distribute pornography, malware, and other material that would not be useful in a dataset intended to support a broad range of research on information retrieval and natural language understanding. The blacklist was obtained from a commercial managed URL blacklist service, URLBlacklist.com, and was downloaded on 2012-02-03. The crawler blacklist consists of URLs in the malware, phishing, spyware, virusinfected, filehosting, and filesharing categories. Also included in the blacklist is a small number (currently fewer than a dozen) of sites that opted out of the crawl.
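The sketch below shows one way such a blacklist could be applied when deciding whether a URL may be crawled. The per-category `domains` file layout and the opt-out file are assumed for illustration and may not match the exact format used by the crawler.

```python
# Sketch of applying a category-based URL blacklist before crawling.
# Assumes one plain-text "domains" file per category, e.g. blacklist/malware/domains;
# the layout and the opt-out file are illustrative assumptions.
import os
from urllib.parse import urlparse

BLOCKED_CATEGORIES = ["malware", "phishing", "spyware",
                      "virusinfected", "filehosting", "filesharing"]

def load_blacklist(root, categories=BLOCKED_CATEGORIES, optout_file=None):
    """Return a set of blocked domains from the selected categories plus opt-outs."""
    blocked = set()
    for cat in categories:
        path = os.path.join(root, cat, "domains")
        with open(path) as f:
            blocked.update(line.strip().lower() for line in f if line.strip())
    if optout_file:  # small list of sites that asked to be excluded from the crawl
        with open(optout_file) as f:
            blocked.update(line.strip().lower() for line in f if line.strip())
    return blocked

def is_blocked(url, blocked):
    """True if the URL's host, or any parent domain of it, is blacklisted."""
    host = urlparse(url).netloc.lower().split(":")[0]
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))

# blocked = load_blacklist("blacklist/", optout_file="optout_domains.txt")
# if not is_blocked(candidate_url, blocked): enqueue(candidate_url)
```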
Additionally, URLs mentioned in English tweets from a Twitter Gardenhose stream are harvested each day. Those pages are downloaded using a separate instance of the Heritrix crawler. The domains of tweeted URLs are injected into the main web crawl on a regular basis, which is intended to create a more connected graph between the web crawl and the tweeted URLs.
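A sketch of the harvesting side of this process is shown below. It assumes tweets arrive as line-delimited JSON with `lang` and `entities.urls` fields; that layout is an assumption for illustration rather than a description of the actual pipeline.

```python
# Sketch of harvesting URLs from an English-tweet stream and collecting their
# domains for injection into the main crawl. The line-delimited JSON format and
# field names ("lang", "entities" -> "urls" -> "expanded_url") are assumptions.
import json
from urllib.parse import urlparse

def harvest_tweeted_urls(tweet_stream):
    """Yield expanded URLs found in English tweets (one JSON object per line)."""
    for line in tweet_stream:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        if tweet.get("lang") != "en":
            continue
        for entity in tweet.get("entities", {}).get("urls", []):
            url = entity.get("expanded_url") or entity.get("url")
            if url:
                yield url

def domains_to_inject(urls):
    """Collect the distinct domains of tweeted URLs for the main web crawl."""
    return {urlparse(u).netloc.lower() for u in urls if urlparse(u).netloc}

# with open("gardenhose_sample.json") as f:   # hypothetical daily dump
#     urls = list(harvest_tweeted_urls(f))
#     new_domains = domains_to_inject(urls)
```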
The crawler is configured to capture page text, CSS, XML, and JavaScript files, any images on a page, and HTTP response headers. The crawler skips multimedia, for example, Flash, audio, and video files, as well as compressed files (e.g., zip, tar, gz, sit, hqx). The crawler will also clip (truncate) any file that is larger than 10 MB.
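The sketch below illustrates this kind of fetch policy: skip URLs with multimedia or archive extensions and clip any response at 10 MB. It uses the `requests` library for brevity and stands in for, rather than reproduces, the Heritrix rule set; the extension list is an assumption based on the description above.

```python
# Sketch of the fetch policy described above: skip multimedia and compressed
# files by extension, and clip anything larger than 10 MB. This stands in for
# the Heritrix rules; it is not the actual crawler configuration.
import os
from urllib.parse import urlparse
import requests

SKIP_EXTENSIONS = {".swf", ".mp3", ".wav", ".avi", ".mpg", ".mp4", ".mov",
                   ".zip", ".tar", ".gz", ".sit", ".hqx"}
MAX_BYTES = 10 * 1024 * 1024  # 10 MB cap; larger responses are clipped

def should_fetch(url):
    """Skip URLs whose path ends in a multimedia or archive extension."""
    ext = os.path.splitext(urlparse(url).path)[1].lower()
    return ext not in SKIP_EXTENSIONS

def fetch_clipped(url):
    """Download at most MAX_BYTES of the response body, plus its headers."""
    with requests.get(url, stream=True, timeout=30) as resp:
        body = b""
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            body += chunk
            if len(body) >= MAX_BYTES:
                body = body[:MAX_BYTES]   # clip oversized files at 10 MB
                break
        return dict(resp.headers), body

# if should_fetch(url):
#     headers, content = fetch_clipped(url)
```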
The crawled web pages will be filtered to remove certain types of pages, for example, pages that a text classifier identifies as non-English, pornography, or spam. The dataset will contain a file that identifies each url that was removed and why it was removed. The web graph will contain all pages visited by the crawler, and will include information about redirected links.
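One way this filtering pass might be organized is sketched below. The classifier functions are toy placeholders for the real text classifiers, and the tab-separated removal-log format is an assumption.

```python
# Sketch of the post-crawl filtering pass: drop non-English, pornographic, or
# spam pages and record each removed URL with the reason it was removed.
# The classifiers below are toy placeholders, not the actual models used.

def is_english(text):
    """Toy placeholder: treat mostly-ASCII text as English."""
    return sum(ch.isascii() for ch in text) > 0.9 * max(len(text), 1)

PORN_TERMS = {"porn", "xxx"}            # toy keyword lists, not real classifiers
SPAM_TERMS = {"viagra", "free money"}

def is_porn(text):
    low = text.lower()
    return any(term in low for term in PORN_TERMS)

def is_spam(text):
    low = text.lower()
    return any(term in low for term in SPAM_TERMS)

def filter_pages(pages, removal_log_path):
    """Yield (url, text) for pages that pass all filters; log the rest with a reason."""
    with open(removal_log_path, "w") as log:
        for url, text in pages:
            if not is_english(text):
                reason = "non-English"
            elif is_porn(text):
                reason = "pornography"
            elif is_spam(text):
                reason = "spam"
            else:
                yield url, text
                continue
            log.write(f"{url}\t{reason}\n")   # record what was removed and why

# kept = list(filter_pages(crawled_pages, "removed_urls.tsv"))
```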
The crawler captures an average of 10-15 million pages (and associated images, etc.) per day. Its progress is documented in a daily progress report.