The Lemur Web Crawler |
The Lemur web crawler is operated by a research project at the Language Technologies Institute, a computer science department within Carnegie Mellon University's School of Computer Science. The project is creating a dataset of 1-2 billion web pages that will be used for research purposes by scientists around the world. The project is supervised by Professor Jamie Callan, and is sponsored by the U.S. National Science Foundation.
Q: | Why are you crawling my site? |
A: |
Our group is preparing to collect a dataset of about 1-2 billion web pages. The dataset is intended to contain the kinds of web pages that commercial search engines such as Google, Bing, and Yahoo would contain and rank highly for some query. The dataset will be used by scientists around the world for research purposes. The project begins with a phase of crawler testing and customization in August 2011, during which we will run small crawls to test crawler behavior and performance. We expect the actual dataset collection to begin sometime between November 2011 and June 2012. This dataset will augment an earlier web dataset, ClueWeb09, that was created in 2009. More information about the ClueWeb09 dataset is available on the ClueWeb09 web page. |
Q: | How do I prevent part or all of my site from being crawled? |
A: | Our crawler obeys the Robot Exclusion Standard (the robots.txt file), so you can exclude it from part or all of your site. Specifically, it obeys the first entry whose User-agent contains "lemurwebcrawler". If there is no such entry, it obeys the first entry with a User-agent of "*". Disallowed documents are not crawled, nor are links in those documents followed. If the crawler is unable to retrieve a robots.txt file from your site, it assumes that your site has no restrictions on being crawled. The user-agent string for our crawler should appear in your log files as: "Mozilla/5.0 (compatible; lemurwebcrawler admin@lemurproject.org; +http://boston.lti.cs.cmu.edu/crawler_12/ )" |
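For example, a robots.txt along the following lines would exclude the Lemur crawler from one directory while leaving other crawlers unrestricted (the /private/ path is only a placeholder; substitute the parts of your site you want excluded):

User-agent: lemurwebcrawler
Disallow: /private/

User-agent: *
Disallow:

To exclude the crawler from your entire site, use "Disallow: /" in the lemurwebcrawler entry instead.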
Q: | Is there an easier way to prevent you from crawling certain pages? |
A: | Robots.txt is the only way to prevent the crawler from downloading a Web page. (This is true for all web crawlers.) However, modifying robots.txt on a file-by-file basis is not always possible or convenient, so many people prefer an alternative that instructs the crawler to discard the file after it is downloaded. Place the following META tag in the <HEAD> section of your web page: <META NAME="robots" CONTENT="noindex"> The Lemur crawler will discard this document after it is downloaded. |
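As an illustration, the tag goes inside the document's <HEAD> section like this (the title and body here are placeholders):

<HTML>
<HEAD>
<TITLE>Example page</TITLE>
<META NAME="robots" CONTENT="noindex">
</HEAD>
<BODY> ... </BODY>
</HTML>

Note that the page is still downloaded; the crawler simply discards it afterwards, so it will not appear in the dataset.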
Q: | Can you crawl my site more slowly? |
A: |
The crawler tries to guess an acceptable rate for your site based on how quickly your site responds. You can use the Crawl-delay directive in your robots.txt to specify explicitly how long to wait between page downloads. For example, to instruct the crawler to fetch at most one page from your site every ten seconds: Crawl-delay: 10 |
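A complete robots.txt entry that applies this limit only to the Lemur crawler might look like the following sketch (adjust the delay value to suit your site):

User-agent: lemurwebcrawler
Crawl-delay: 10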
Q: | Can I opt-out of having part or all of my site crawled without doing anything on my end? |
A: |
Now that the crawl is complete, the opt-out capability is no longer available. The information below is retained for historical completeness. |
Q: | Why do I see repeated download requests? |
A: | The crawler keeps track of what it has fetched previously, so usually it will not try to re-fetch the same document more than once. Occasionally, the crawler might need to be restarted, in which case you might see repeated requests for the same page. However, this behavior is rare. |
Q: | How does your crawler find pages on my site? |
A: | Like other web crawlers, our crawler finds pages by following links from web pages on your site and on other sites. |
Q: | Can I request that certain pages or sites be crawled? |
A: | No. The dataset is being created according to a set of guidelines developed after consultation with the scientific community. We are trying to follow those guidelines closely, so that others may reproduce this research in the future, e.g., to compare a snapshot of the web in 2011/2012 with a future snapshot of the web. |
Q: | Why is the crawler hitting my site from multiple domains? |
A: | We are using five instances of the Internet Archive's open source web crawler, Heritrix, running on five Dell PowerEdge R410 machines with 64 GB of RAM. We are crawling from machines at the Carnegie Mellon University School of Computer Science and at the Pittsburgh Supercomputing Center. |
Q: | Where can I get more information? |
A: | Please see the abstract for National Science Foundation grant CNS-0934322, which sponsors this research. If you need additional information, you may send email to Professor Jamie Callan. |
This research is sponsored in part by National Science Foundation grant CNS-0934322. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors. |