The Sapphire Web Crawler

The Sapphire web crawler is operated by a research project at the Language Technologies Institute, a computer science department within Carnegie Mellon University's School of Computer Science. The project is creating a dataset of 1 billion web pages in 10 languages that will be used for research purposes by scientists around the world. The project is supervised by Professor Jamie Callan and is sponsored by the U.S. National Science Foundation.


Frequently Asked Questions


Q: Why are you crawling my site?
A:

The Sapphire crawler is a web robot that is building a dataset of about 1 billion web pages in 10 languages. The dataset is intended to contain the kinds of web pages that commercial search engines such as Google, Yahoo, and Live Search would contain and rank highly for some query. The dataset will be used by scientists around the world for research purposes.

More information on what the data will be used for can be found on the ClueWeb09 Home Page.


Q: Why does your crawler appear to be crawling only images and not web pages?
A:

Our crawl of web pages took place in January and February of 2009. Nearly all of the scientists and students using this dataset will use only the text portions of each page. A small number of scientists - for example, at the U.S. government's National Institute of Standards and Technology (NIST) - will also use the images on some pages so that web pages can be rendered accurately for user studies (e.g., to determine whether web page X is a good match for query Y, it is often helpful to see the images as well as the text). The image portion of the dataset is huge - many terabytes - so we expect that there will be only a few copies of it, perhaps only one.

Our initial web page crawl was used to compile a complete list of the unique image URLs that we needed to gather. If you see us crawling only images on your website now, chances are that at least one of your web pages was picked up in that initial crawl.

If you wish, you may exclude your images from our dataset simply by informing us that you don't want to be included and, optionally, blocking our crawler via your robots.txt file. However, we hope that you will allow us to collect this data from your web site. It helps the scientific community, which currently lags far behind industry in its detailed understanding of the web. We hope that this dataset, which is larger and more realistic than what was available previously, will help the scientific community narrow the gap a bit.


Q: How do I prevent part or all of my site from being crawled?
A:

Our crawler obeys the Robot Exclusion Standard (the robots.txt file), so you can exclude it from part or all of your site. Specifically, it obeys the first record whose User-agent line contains "SapphireWebCrawler" or "Nutch". If there is no such record, it obeys the first record with a User-agent of "*". Disallowed documents are not crawled, nor are links in those documents followed. If the crawler is unable to retrieve a robots.txt file from your site, it assumes that your site has no restrictions on being crawled.
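For example, a robots.txt file such as the following (the /private/ path is only an illustration) would exclude the Sapphire crawler from a single directory, or from your entire site:

User-agent: SapphireWebCrawler
Disallow: /private/
    or
User-agent: SapphireWebCrawler
Disallow: /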

For reference, the user-agent string for our crawler should appear in your log files as:
"SapphireWebCrawler/Nutch-1.0-dev (Sapphire Web Crawler using Nutch; http://boston.lti.cs.cmu.edu/crawler/; <admin_email>)"


Q: Is there an easier way to prevent you from crawling certain pages?
A:

Robots.txt is the only way to prevent the crawler from downloading a web page. (This is true for all web crawlers.) However, modifying robots.txt on a file-by-file basis is not always possible or convenient, so many people prefer an alternative that instructs the crawler to discard the page after it is downloaded.

Place one of the following META tags in the <HEAD> section of your web page:

<META NAME="robots" CONTENT="noindex">
    or
<META NAME="SapphireWebCrawler" CONTENT="noindex">
The Sapphire crawler will discard this document after it is downloaded.


Q: Can you crawl my site more slowly?
A:

The crawler tries to guess an acceptable request rate for your site based on how quickly your site responds. You can also use the Crawl-delay directive in your robots.txt file to specify explicitly how long the crawler should wait between page downloads. For example, to allow the crawler to fetch a page from your site only once every ten seconds:

Crawl-delay: 10
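Note that, like other robots.txt directives, Crawl-delay applies within a User-agent record. If you want the delay to apply only to our crawler, a complete entry might look like this:

User-agent: SapphireWebCrawler
Crawl-delay: 10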


Q: Can I opt out of having part or all of my site crawled without doing anything on my end?
A:

Now that the crawl is complete, the opt-out capability is no longer available. The information below is retained for historical completeness.

Yes. You can use our opt-out form to exclude specific web pages, a portion of a web site, or an entire web site from our crawl. After you make an opt-out request, the crawler will be instructed to exclude the specified page(s) from future crawling activity and to discard any page(s) that it has already collected. Please note that the opt-out process is not immediate; although we try to process requests quickly, it may take some time before the crawler is informed, and during that time your site may continue to be accessed.


Q: Why do I see repeated download requests?
A:

The crawler keeps track of what it has fetched previously, so it will not normally try to fetch the same document more than once. Occasionally the crawler must be restarted, in which case you may see repeated requests for the same page; however, this is rare.


Q: How does your crawler find pages on my site?
A:

Like other web crawlers, our crawler finds pages by following links from web pages, both on your site and on other sites.

For downloading images, our crawler works from a list of absolute image URLs that was compiled from the web pages gathered in the initial crawl. If your site's images are being downloaded, chances are that one or more of your web pages was included in that crawl.


Q: Can I request that certain pages or sites be crawled?
A:

No. The dataset is being created according to a set of guidelines developed in consultation with the scientific community. We are trying to follow those guidelines closely so that others may reproduce this research in the future, e.g., to compare this snapshot of the web in early 2009 with a future snapshot.


Q: Where can I get more information?
A:

Please see the abstract for National Science Foundation grant IIS-0841275, which sponsors this research. If you need additional information, you may send email to Professor Jamie Callan.


This research is sponsored in part by National Science Foundation grant IIS-0841275. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsors.


Updated on May 22, 2009.
Maintained by Jamie Callan