Our group is preparing to collect a dataset of blog, forum, and microblog data. The dataset is an extension of ClueWeb12, a set of about 1 billion English web pages of the kind that commercial search engines such as Google, Bing, and Yahoo would index and rank highly for some query. The dataset will be used by scientists around the world for research purposes.
The project will begin with a phase of crawler testing and customization in October 2012. During this phase, we will run small crawls to test crawler behavior and performance. We expect the actual dataset collection to begin sometime between November 2012 and January 2013.
Our crawler obeys the Robot Exclusion Standard (the robots.txt file), so you can exclude it from part or all of your site.
Specifically, it obeys the first record whose User-agent line contains mandalay. If there is no such record, it obeys the first record with a User-agent of *. Disallowed documents are not crawled, nor are links in those documents followed. If the crawler is unable to retrieve a robots.txt file from your site, it assumes that your site has no restrictions on being crawled.
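The record-selection rule above can be sketched as follows. This is an assumed, simplified parser for illustration only, not the crawler's actual code; it handles only User-agent and Disallow lines.

```python
# Simplified sketch of the rule described above: use the first record whose
# User-agent contains "mandalay"; otherwise use the first record for "*";
# otherwise assume no restrictions. (Illustrative only.)
def select_record(robots_txt: str, agent: str = "mandalay"):
    records = []            # list of (user_agents, disallow_rules)
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blanks
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if rules:                         # a new record is starting
                records.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field == "disallow":
            rules.append(value)
    if agents:
        records.append((agents, rules))
    # First record naming this agent wins; otherwise the first "*" record.
    for record_agents, record_rules in records:
        if any(agent in a for a in record_agents):
            return record_rules
    for record_agents, record_rules in records:
        if "*" in record_agents:
            return record_rules
    return []               # no applicable record: no restrictions
```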
The user-agent string for our crawler should appear in your log files as:
"Mozilla/5.0 (compatible; mandalay email@example.com; +http://boston.lti.cs.cmu.edu/crawler/clueweb12pp/)"
For example, if you want to exclude all the documents within the /private directory, you can use the following robots.txt file:
User-agent: mandalay
Disallow: /private
robots.txt is the only way to prevent the crawler from downloading a Web page. (This is true for all web crawlers.) However, modifying robots.txt on a file-by-file basis is not always possible or practical, so many people prefer an alternative that instructs the crawler to discard the file after it is downloaded. Place the following META tags in the <HEAD> section of your web page:
<META NAME="robots" CONTENT="noindex">
<META NAME="mandalay" CONTENT="noindex">
The Lemur crawler will discard this document after it is downloaded.
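The post-download check implied above can be sketched as follows. This is an assumed helper for illustration, not the crawler's actual code: it flags a fetched page whose META tags request noindex under either the robots or mandalay name.

```python
# Detect a robots/mandalay "noindex" META tag in a downloaded page,
# signaling that the page should be discarded. (Illustrative sketch.)
from html.parser import HTMLParser

class NoindexDetector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = {name: (value or "").lower() for name, value in attrs}
        if a.get("name") in ("robots", "mandalay") and "noindex" in a.get("content", ""):
            self.noindex = True

def should_discard(html: str) -> bool:
    parser = NoindexDetector()
    parser.feed(html)
    return parser.noindex
```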
The crawler tries to guess what is an acceptable rate for your site based on how quickly your site responds. You can use the Crawl-Delay directive in your robots.txt to explicitly specify how long to wait between page downloads. For example, to allow the crawler to only fetch a page from your site once every ten seconds:
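User-agent: mandalay
Crawl-delay: 10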
Yes. You can use our opt-out form to exclude specific web pages, a portion of a web site, or an entire web site from our crawl.
After you make an opt out request, the crawler will be instructed to exclude the specified page(s) from future crawling activity, and to discard any page(s) that it has collected already. Please note that the opt-out process is not immediate. Although we try to process requests quickly, it may take some time before the crawler is informed. During that time, your site may continue to be accessed.
The crawler keeps track of what it has fetched previously, so it usually will not try to fetch the same document more than once. Occasionally the crawler might need to be restarted, in which case you might see repeated requests for the same page; however, this is rare.
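The "fetched previously" bookkeeping described above amounts to checking each candidate URL against a set of already-seen URLs. A minimal sketch, under the assumption of simple in-memory state (the real crawler, Heritrix, persists this state so it survives restarts):

```python
# Skip URLs that have already been fetched, after light normalization so
# trivially different spellings of a URL count as the same page.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

seen = set()

def should_fetch(url: str) -> bool:
    key = normalize(url)
    if key in seen:
        return False        # already fetched: do not request it again
    seen.add(key)
    return True
```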
Like other web crawlers, the crawler finds pages by following links from web pages on both your site and from other sites.
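The link-following step above can be sketched as a small extractor. This is an assumed minimal version, not Heritrix itself: it pulls href targets out of a fetched page and resolves them against the page's URL, so both same-site and cross-site links can enter the crawl frontier.

```python
# Extract absolute link targets from a page's <a href="..."> tags.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(base_url: str, html: str):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```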
No. The dataset is being created according to a set of guidelines developed after consultation with the scientific community. We are trying to follow those guidelines closely, so that others may reproduce this research in the future, e.g., to compare a snapshot of the web in 2012 with a future snapshot of the web.
We are using the open source web crawler Heritrix, developed by the Internet Archive. The crawler is running on the computing facilities of the School of Computer Science at Carnegie Mellon University; all the domains and IP addresses used to run the crawler are part of the University's network.