The web09-bst Dataset

Jamie Callan, Mark Hoy, Changkuk Yoo, and Le Zhao

The web09-bst dataset is a 25 terabyte dataset of about 1 billion web pages crawled in January and February 2009. The crawl order was best-first search, using the OPIC metric. The crawl was started from about 28 million URLs that either i) had high OPIC values in a web graph produced from an earlier 200 million page crawl, or ii) were ranked highly by a commercial search engine for one of 4,000 sample queries in one of 10 languages. This dataset covers web content in English, Chinese, Spanish, Japanese, French, German, Arabic, Portuguese, Korean, and Italian.
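
As a rough illustration of what best-first crawling with an OPIC-style score means, here is a minimal Python sketch. It is not the crawler used to build the dataset. It assumes a hypothetical `get_outlinks` function that returns the links found on a page, gives each seed an equal share of "cash", always fetches the uncrawled URL holding the most cash, and splits that cash equally among the page's not-yet-crawled out-links. The full OPIC algorithm also keeps a per-page history and handles pages without out-links, both of which this sketch omits.

```python
def opic_crawl_order(seeds, get_outlinks, max_pages):
    """Return URLs in the order a simplified OPIC-style crawler would fetch them."""
    # Every seed starts with an equal share of the total cash.
    cash = {url: 1.0 / len(seeds) for url in seeds}
    crawled = set()
    order = []
    while cash and len(order) < max_pages:
        # Best-first: fetch the frontier URL that currently holds the most cash.
        url = max(cash, key=cash.get)
        amount = cash.pop(url)
        crawled.add(url)
        order.append(url)
        # Pass the page's cash on to its distinct out-links that are still uncrawled.
        targets = [v for v in dict.fromkeys(get_outlinks(url)) if v not in crawled]
        for v in targets:
            cash[v] = cash.get(v, 0.0) + amount / len(targets)
    return order


# Toy link graph (hypothetical URLs) standing in for real page fetches.
toy_links = {"a": ["b", "c"], "d": ["c"], "b": [], "c": []}
print(opic_crawl_order(["a", "d"], lambda u: toy_links.get(u, []), max_pages=10))
```

In the toy graph, page `c` is linked from both seeds, so it accumulates more cash than `b` and is fetched first. A real crawler would replace the linear `max` scan with a priority queue.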

More information about the dataset is available in our project planning document. This document is slightly outdated (for example, our plans for the seed URLs changed), but it is still the best description of our plans.

Dataset construction is in progress now. Current progress on the crawler and statistics for pages fetched can be found here. The dataset is expected to be available to other researchers by April 2009, under a TREC-style data license, for a small fee.

Crawl Seeds

The web crawler was seeded with 29 million URLs that were obtained using two different techniques.

Twenty million of the seed URLs were the URLs with the highest OPIC scores in a 200 million page crawl of the English web done during January to June, 2008. That earlier crawl covered only the English language and produced an unrepresentative sample of the web due to the very specific requirements of the project that collected it, so only the top 10% of the crawled pages were used as seeds.
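
The selection of the top 10% of pages by score can be sketched in a few lines of Python. The function name and the `scores` mapping (URL to OPIC score) are hypothetical; this only illustrates the selection step, not how the scores were computed.

```python
import heapq


def top_fraction_by_score(scores, fraction=0.10):
    """Return the URLs whose scores fall in the top `fraction` of all scored pages."""
    k = int(len(scores) * fraction)
    # nlargest avoids sorting the whole collection when k is small relative to it.
    return heapq.nlargest(k, scores, key=scores.get)
```

With a 200 million page crawl and `fraction=0.10`, this would keep 20 million URLs, matching the count given above.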

The other nine million seed URLs were top-ranked results returned by commercial web search engines for queries submitted during October 2008. Several different techniques were used to develop the queries.

  1. AOL Query Log: The 1050 most frequent queries were selected from the AOL query log. Another 1050 queries were sampled randomly from the log in proportion to their relative frequency. These queries are mostly in English, and produce mostly English seeds. The top 500 results per query returned by Google, Yahoo!, or MSN were retained as seeds.

    The AOL queries ensure that the dataset includes the pages ranked highly by commercial web search engines for a set of real (albeit slightly dated) queries. Many of these pages were also expected to have high PageRank.

  2. DMOZ Category Names: 2000 queries were created from DMOZ category names. The 2000 largest DMOZ categories up to depth 3 were used, with the root node (named TOP) at depth 0. The size of each category was measured by its total number of descendants, and counts were summed if the same name appeared in different parts of the tree. These queries are mostly in English, and produce mostly English results. The top 500 results per query returned by Google, Yahoo!, or MSN were retained as seeds.

    The DMOZ queries are fairly generic queries. We expected the search engines to return high PageRank pages covering a broad range of topics.

  3. Translated Queries: The AOL and DMOZ queries were automatically translated from English into 9 other languages using Google Translate. We queried up to 3 major search engines for each of these 9 languages, depending on each engine's support for the language and how easily its interface could be used by a program. Three search engines (Baidu, Google, and Yahoo!) were used for Chinese; two (Google and Yahoo!) for Spanish, Japanese, German, French, and Italian; and one (Google) for Korean, Portuguese, and Arabic. Search engine features were used to restrict the search to pages in the desired language. The top 200 results per query were retained as seeds.

    Although the translated queries may contain errors, when querying for the other 9 languages we restricted the search engines to return only pages in those languages. Because of that, we hope that for most of the frequent AOL queries and most DMOZ categories, the results will still be high PageRank pages in the target language.

  4. Yahoo! Multilingual Queries: In addition to the translated queries above, and as a more realistic source of multilingual queries, seed URLs were also generated from the 1000 most frequent queries in each of the 9 languages (except Arabic) from the Yahoo! Research Webscope program. The queries (ydata-search-queries-multiple-langs-v1_0) were collected by Yahoo! over a three month period in 2008. We thank Yahoo! for providing the data through http://research.yahoo.com/Academic_Relations.

  5. Sogou Chinese Queries: Chinese is the second largest language in our collection, but the Yahoo! queries are mainly traditional Chinese queries, which are biased toward documents from Hong Kong or Taiwan rather than mainland China. To reduce this bias, we included queries from Sogou, a search engine based in mainland China. These queries were collected during November 2008 and were made available through the Sogou Labs program. We also thank the Tsinghua-Sogou joint lab, specifically the THUIR group, for their query extraction and processing efforts.

    For the Yahoo! and Sogou queries, we queried the same search engines for each language as described above and retained the same number of top-ranked URLs per query.
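
The query selection described in items 1 and 2 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts actually used: all function names are hypothetical, the frequency-weighted sample is assumed to exclude queries already chosen as most frequent, the category tree is assumed to be a nested dictionary of `{name: {child_name: {...}}}`, and the root category is assumed not to become a query itself.

```python
import random
from collections import Counter


def select_log_queries(log_queries, n_top, n_sampled, seed=0):
    """Return (most frequent queries, further queries drawn by log frequency)."""
    counts = Counter(log_queries)
    top = [q for q, _ in counts.most_common(n_top)]
    rng = random.Random(seed)
    queries = list(counts)
    weights = [counts[q] for q in queries]
    chosen = set(top)
    sampled = []
    # Draw in proportion to frequency, skipping queries that were already chosen.
    while len(sampled) < n_sampled and len(chosen) < len(queries):
        q = rng.choices(queries, weights=weights, k=1)[0]
        if q not in chosen:
            chosen.add(q)
            sampled.append(q)
    return top, sampled


def category_sizes(name, children, depth, max_depth, sizes):
    """Count descendants of a category; add the count to `sizes` under its name."""
    total = 0
    for child, grandchildren in children.items():
        total += 1 + category_sizes(child, grandchildren, depth + 1, max_depth, sizes)
    if 1 <= depth <= max_depth:  # assumption: the root (depth 0) is not a query
        sizes[name] += total  # same name elsewhere in the tree adds to this count
    return total


def largest_categories(root_name, tree, k, max_depth=3):
    """Return the k category names with the most descendants, summed across the tree."""
    sizes = Counter()
    category_sizes(root_name, tree, 0, max_depth, sizes)
    return [name for name, _ in sizes.most_common(k)]
```

For the selection described above, the calls would look like `select_log_queries(log, 1050, 1050)` and `largest_categories("TOP", tree, 2000)`. Categories deeper than `max_depth` still count toward their ancestors' sizes but are not returned as queries.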

English and Chinese queries were filtered to remove objectionable material. If the search snippets for the top 100 results of a search engine contained more than 50 sexually explicit words, the query was dropped.
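
A minimal Python sketch of this filter is shown below. The function name and the `explicit_words` set are hypothetical, and the sketch assumes that every occurrence of a listed word counts toward the total. It splits text on word boundaries, which suits English; Chinese snippets would need a word segmenter first.

```python
import re


def is_objectionable(snippets, explicit_words, max_results=100, threshold=50):
    """Return True if the top snippets contain more than `threshold` explicit words."""
    hits = 0
    for snippet in snippets[:max_results]:
        # Count every occurrence of a listed word in this snippet.
        for token in re.findall(r"\w+", snippet.lower()):
            if token in explicit_words:
                hits += 1
    return hits > threshold
```

A query would be dropped when `is_objectionable` returns True for the snippets returned by a search engine.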

The queries are listed below. The Chinese queries are encoded in GB2312. All others are in UTF-8.
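
When reading the query files, the two encodings need to be handled separately. A minimal Python sketch, assuming one query per line and a hypothetical language code `"zh"` for Chinese:

```python
def decode_query(raw_bytes, language):
    """Decode one raw query line; the language code "zh" is a hypothetical label."""
    # Chinese queries are GB2312-encoded; every other language uses UTF-8.
    encoding = "gb2312" if language == "zh" else "utf-8"
    return raw_bytes.decode(encoding).strip()
```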

Due to data license restrictions, we cannot publish the queries from Yahoo! or Sogou. Interested parties are referred to the research program provided by either company to obtain the query data.

The AOL and DMOZ queries produced about 4 million unique seed URLs for English and roughly 5 million URLs for the other 9 languages. The Yahoo! and Sogou queries yielded 2.4 million and 0.5 million unique URLs respectively.

Dataset Format

Dataset construction is still in progress, so the final dataset format is uncertain. However, we are currently considering the following:

The data will be distributed in Web Archive (WARC) format [1, 2, 3, 4, 5]. A small sample dataset is available now, so that people can see and comment on the proposed format. The final format may (probably will) differ somewhat from this sample.
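
For readers unfamiliar with WARC, each record consists of a version line, a block of `Name: value` header lines, a blank line, and then a payload whose size is given by the `Content-Length` header. The Python sketch below reads records from an uncompressed byte stream using only the standard library. It is a simplified illustration with a hypothetical function name: it does not handle compression or folded header lines, and since the final dataset format may differ from the sample, it should not be taken as a reader for the released data.

```python
import io


def read_warc_records(stream):
    """Yield (headers, payload) pairs from an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8", "replace")
        if not version.startswith("WARC/"):
            raise ValueError("expected a WARC version line, got %r" % version)
        headers = {"version": version}
        # Header lines run until the first blank line.
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()
        # The payload is exactly Content-Length bytes long.
        payload = stream.read(int(headers.get("Content-Length", "0")))
        yield headers, payload


sample = (
    b"WARC/1.0\r\nWARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n"
)
for headers, payload in read_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Target-URI"], len(payload))
```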

Dataset Versions


Acknowledgements

    We gratefully acknowledge comments and advice from about two dozen scientists at Microsoft, Yahoo!, Google, NIST, and universities around the world. This research is sponsored by National Science Foundation grant IIS-0841275. Any opinions, findings, conclusions or recommendations expressed on this Web site are those of the author(s), and do not necessarily reflect those of the sponsor.


Updated on February 4, 2009.
Jamie Callan