The Lemur Web Crawler

The Lemur web crawler is operated by a research project at the Language Technologies Institute, a computer science department within Carnegie Mellon University's School of Computer Science.

The project is creating a dataset of blog, forum, and microblog data that will be used for research purposes by scientists around the world.

The project is supervised by Professor Jamie Callan, and is sponsored by the U.S. National Science Foundation.

Frequently Asked Questions

Why are you crawling my site?
How do I prevent part or all of my site from being crawled?
Is there an easier way to prevent you from crawling certain pages?
Can you crawl my site more slowly?
Can I opt out of having part or all of my site crawled without doing anything on my end?
Why do I see repeated download requests?
How does your crawler find pages on my site?
Can I request that certain pages or sites be crawled?
Why is the crawler hitting my site from multiple domains?
Where can I get more information?

Why are you crawling my site?

Our group is preparing to collect a dataset of blog, forum, and microblog data. The dataset is an extension of ClueWeb12, a set of about 1 billion English web pages that contain the kinds of web documents that commercial search engines such as Google, Bing, and Yahoo would contain and rank highly for some query. The dataset will be used by scientists around the world for research purposes.

The project will begin with a phase of crawler testing and customization in October, 2012. During this phase, we will run small crawls to test crawler behavior and performance. We expect the actual dataset collection to begin sometime between November, 2012 and January, 2013.

How do I prevent part or all of my site from being crawled?

Our crawler obeys the Robot Exclusion Standard (the robots.txt file), so you can exclude it from part or all of your site.

Specifically, it obeys the first record whose User-agent line contains mandalay. If there is no such record, it obeys the first record with a User-agent of *. Disallowed documents are not crawled, nor are links in those documents followed. If the crawler is unable to retrieve a robots.txt file from your site, it assumes that your site has no restrictions on being crawled.

The user-agent string for our crawler should appear in your log files as:

"Mozilla/5.0 (compatible; mandalay admin@lemurproject.org;
    +http://boston.lti.cs.cmu.edu/crawler/clueweb12pp/)"

For example, if you want to exclude all the documents within the /private directory, you can use the following robots.txt file:

User-agent: mandalay
Disallow: /private
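
If you would like to check how a record like this is interpreted before deploying it, you can test it locally. The following is a minimal sketch that uses Python's standard urllib.robotparser module. It is not the crawler's own parser, so it only approximates the matching behavior described above. The helper name is_allowed and the example.com URLs are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt record from the example above.
ROBOTS_TXT = """\
User-agent: mandalay
Disallow: /private
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# example.com is a placeholder host, not a real crawl target.
private_ok = is_allowed(ROBOTS_TXT, "mandalay", "http://example.com/private/notes.html")
home_ok = is_allowed(ROBOTS_TXT, "mandalay", "http://example.com/index.html")
print("private page allowed:", private_ok)
print("home page allowed:", home_ok)
```

With the record above, the first check should report that the page under /private is disallowed, and the second that the home page is allowed.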

Is there an easier way to prevent you from crawling certain pages?

robots.txt is the only way to prevent the crawler from downloading a web page. (This is true for all web crawlers.) However, modifying robots.txt on a file-by-file basis is not always possible or convenient, so many people prefer an alternative that instructs the crawler to discard the file after it is downloaded. Place one of the following META tags in the <HEAD> section of your web page:

<META NAME="robots" CONTENT="noindex">

Or:

<META NAME="mandalay" CONTENT="noindex">

The Lemur crawler will discard this document after it is downloaded.
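
To confirm that a page actually carries one of these tags, you can scan its HTML yourself. The sketch below uses Python's standard html.parser module. It illustrates the check only and is not the crawler's implementation; the names NoIndexFinder and has_noindex are hypothetical, and for simplicity it accepts the tag anywhere in the document rather than only inside <HEAD>.

```python
from html.parser import HTMLParser

# META names that this FAQ says the crawler honors.
HONORED_NAMES = {"robots", "mandalay"}


class NoIndexFinder(HTMLParser):
    """Records whether a page carries a noindex META tag."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = (attributes.get("content") or "").lower()
        # CONTENT may hold several comma-separated directives.
        directives = {part.strip() for part in content.split(",")}
        if name in HONORED_NAMES and "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if html contains a robots or mandalay noindex META tag."""
    finder = NoIndexFinder()
    finder.feed(html)
    finder.close()
    return finder.noindex


page = '<html><head><META NAME="mandalay" CONTENT="noindex"></head><body>Hi</body></html>'
print("noindex present:", has_noindex(page))
```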

Can you crawl my site more slowly?

The crawler tries to guess an acceptable rate for your site based on how quickly your site responds. You can use the Crawl-delay directive in your robots.txt to specify explicitly how long to wait between page downloads. For example, to limit the crawler to one page fetch from your site every ten seconds:

Crawl-delay: 10
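
Crawl-delay is conventionally placed inside a User-agent record. As a minimal sketch, assuming Python 3.6 or later and using urllib.robotparser as a stand-in for the crawler's own parser, you can confirm the delay that a record advertises. The combined robots.txt below and the helper name advertised_delay are illustrative assumptions, not files or functions supplied by the project.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt combining the exclusion and delay examples.
ROBOTS_TXT = """\
User-agent: mandalay
Crawl-delay: 10
Disallow: /private
"""


def advertised_delay(robots_txt: str, user_agent: str):
    """Return the Crawl-delay (in seconds) that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(user_agent)


delay = advertised_delay(ROBOTS_TXT, "mandalay")
print("seconds between fetches:", delay)
```

A result of None means that the parser found no Crawl-delay applying to that user agent.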

Can I opt out of having part or all of my site crawled without doing anything on my end?

Yes. You can use our opt-out form to exclude specific web pages, a portion of a web site, or an entire web site from our crawl.

After you make an opt-out request, the crawler will be instructed to exclude the specified page(s) from future crawling activity, and to discard any page(s) that it has already collected. Please note that the opt-out process is not immediate. Although we try to process requests quickly, it may take some time before the crawler is informed. During that time, your site may continue to be accessed.

Why do I see repeated download requests?

The crawler keeps track of what it has fetched previously, so it usually will not fetch the same document more than once. Occasionally, the crawler might need to be restarted, in which case you might see repeated requests for the same page. However, this is rare.

How does your crawler find pages on my site?

Like other web crawlers, the crawler finds pages by following links from web pages on both your site and other sites.

Can I request that certain pages or sites be crawled?

No. The dataset is being created according to a set of guidelines developed after consultation with the scientific community. We are trying to follow those guidelines closely, so that others may reproduce this research in the future, e.g., to compare a snapshot of the web in 2012 with a future snapshot of the web.

Why is the crawler hitting my site from multiple domains?

We are using the open source web crawler Heritrix, developed by the Internet Archive. The crawler is running on the computing facilities of the School of Computer Science at Carnegie Mellon University; all the domains and IP addresses used to run the crawler are part of the University's network.

Where can I get more information?

Please see the abstracts for National Science Foundation grants NSF IIS-1160894 and NSF IIS-1160862, which sponsor this research. If you need additional information, you may send email to Professor Jamie Callan.