The ClueWeb09 Dataset - Dataset Information and Sample Files

The ClueWeb09 dataset was created by the Language Technologies Institute at Carnegie Mellon University to support research on information retrieval and related human language technologies. The dataset consists of 1 billion web pages, in ten languages, collected in January and February 2009. This page describes the dataset organization and format.

Dataset Organization
Dataset Format
Checksum Files
Record Counts
Language Identifiers
Language Encodings
TREC 2009 Category A and Category B
Sample Files

Dataset Organization

The dataset is organized into segments. Each segment contains approximately 50 million records (web pages), although the last segment for a language may contain fewer. Each segment is stored in a directory named:

ClueWeb09_<language>_<segment #>

where <language> is the language of the pages in the segment (e.g. English) and <segment #> is the segment number.

Each segment contains a set of directories named:

<language><directory #>

where <language> is a 2-letter standard language identifier (see Language Identifiers below), and <directory #> is the sequence number for that language.

Each directory contains up to 100 files named:

<file #>.warc.gz

where <file #> is the sequence number of the file within its directory from "00.warc.gz" up to "99.warc.gz".

Each file contains approximately 40,000 web pages in WARC file format, as described below. An uncompressed file requires about 1 GB of storage.

For example, the first English pages downloaded by the crawler are stored in:

ClueWeb09_English_1/en0000/00.warc.gz

There is one exception to this format. At the end of the first English segment (ClueWeb09_English_1), there are four directories that contain a complete copy of the English Wikipedia. These directories are named:

enwp<wikipedia directory #>

where <wikipedia directory #> is 00, 01, 02, or 03.
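
As an illustration of the naming scheme above, the following minimal sketch assembles the relative path of a WARC file from its components. The function name warc_path and its parameters are hypothetical, and the 4-digit directory padding is taken from the WARC-TREC-ID description in the Dataset Format section. The Wikipedia directories (enwp00 to enwp03) are not handled here.

```python
def warc_path(language: str, segment: int, lang_code: str,
              directory: int, file_no: int) -> str:
    """Build the relative path of one WARC file (hypothetical helper)."""
    if not 0 <= file_no <= 99:
        raise ValueError("file number must be between 0 and 99")
    segment_dir = f"ClueWeb09_{language}_{segment}"
    # Directory numbers are zero-padded to 4 digits, file numbers to 2 digits.
    return f"{segment_dir}/{lang_code}{directory:04d}/{file_no:02d}.warc.gz"


print(warc_path("English", 1, "en", 0, 0))
# ClueWeb09_English_1/en0000/00.warc.gz
```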

Dataset Format

Web pages are stored in gzipped files in WARC format. The format conforms to the WARC ISO 28500 final draft (as of June 18th, 2008), version 0.18.

Specifications for the format can be found at:

One custom field, "WARC-TREC-ID", is added to the WARC response header. It is a globally unique identifier that describes the location of the individual record within the entire ClueWeb09 dataset. The WARC-TREC-ID value has the format:

clueweb09-<directory>-<file>-<record>

The <directory> corresponds to the individual directory as specified in the Dataset Organization section above. It has the format <language><directory #>, where <language> is the 2-letter standard language code and <directory #> is a 4-digit (zero-padded) sequence number.

The <file> is a 2-digit (padded) number that corresponds to the file number within the <directory>.

The <record> is a 5-digit (padded) number that corresponds to this record's sequence within the individual file.
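
The identifier format above can be split back into its parts, which also gives the file a record lives in. The sketch below is one possible way to do that; the names TrecId, parse_trec_id and relative_file are hypothetical. Accepting the Wikipedia directories (enwp plus two digits) in the pattern is an assumption, since this page does not show an identifier for those directories.

```python
import re
from typing import NamedTuple


class TrecId(NamedTuple):
    directory: str  # e.g. "en0042"
    file: str       # 2-digit file number, e.g. "15"
    record: int     # record sequence number within the file


# Assumption: Wikipedia directories ("enwp00".."enwp03") use the same layout.
_TREC_ID = re.compile(r"clueweb09-([a-z]{2}\d{4}|enwp\d{2})-(\d{2})-(\d{5})")


def parse_trec_id(trec_id: str) -> TrecId:
    """Split a WARC-TREC-ID into directory, file and record number."""
    match = _TREC_ID.fullmatch(trec_id)
    if match is None:
        raise ValueError(f"not a ClueWeb09 identifier: {trec_id!r}")
    return TrecId(match.group(1), match.group(2), int(match.group(3)))


def relative_file(trec_id: str) -> str:
    """Path of the WARC file, relative to its segment directory."""
    parsed = parse_trec_id(trec_id)
    return f"{parsed.directory}/{parsed.file}.warc.gz"
```

For example, parse_trec_id("clueweb09-en0042-15-00123") yields directory "en0042", file "15" and record 123. The segment directory cannot be recovered from the identifier alone, so it has to come from elsewhere.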

Checksum Files

The files named "ClueWeb_*.md5" contain the MD5 sums of the individual WARC files in the dataset. These MD5 sums are in the format:

<md5 checksum hash> <file>

with multiple lines in the file - one line for each file in the dataset. For example, the following line:

98f91370de2dbc9c6d358f6251e591d6 *./en0000/22.warc.gz

denotes the md5 checksum for the file 22.warc.gz under the en0000 directory.
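
A checksum file in this format can be checked against the files on disk roughly as follows. This is a minimal sketch, not an official tool: the function names are hypothetical, and reading the leading "*" before the path as the binary-mode marker of md5sum output is an assumption.

```python
import hashlib
from pathlib import Path
from typing import Iterator, Tuple


def parse_md5_line(line: str) -> Tuple[str, str]:
    """Split '<md5 checksum hash> <file>' into (digest, relative path)."""
    digest, name = line.split(maxsplit=1)
    # Assumption: the leading "*" is the binary-mode marker written by md5sum.
    return digest.lower(), name.strip().lstrip("*")


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large WARC file is never fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_segment(md5_file: Path, segment_root: Path) -> Iterator[Tuple[str, bool]]:
    """Yield (relative path, checksum matches) for each line of an .md5 file."""
    with open(md5_file, "r", encoding="ascii") as lines:
        for line in lines:
            if line.strip():
                expected, name = parse_md5_line(line)
                yield name, file_md5(segment_root / name) == expected
```

Here segment_root is the segment directory that the relative paths in the checksum file refer to, for example the directory ClueWeb09_English_1.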

The checksum files (by segment) are in the individual directories on the disks, or can be downloaded here:

Record Counts

The files named "ClueWeb09_*_counts.txt" contain the per-file record counts for the individual WARC files in the dataset. The record count files are in the format:

<file> <# of records>

with multiple lines in the file - one line for each file in the dataset. For example, the following line:

*./en0042/15.warc.gz 34618

denotes that the file 15.warc.gz under the en0042 directory has 34,618 individual page records in it.
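
A record count file in this format can be parsed and totalled with a few lines of code. The sketch below uses hypothetical function names and strips the leading "*" from the path so that it can be joined to a segment directory.

```python
from typing import Iterable, Tuple


def parse_count_line(line: str) -> Tuple[str, int]:
    """Split '<file> <# of records>' into (relative path, record count)."""
    name, count = line.rsplit(maxsplit=1)
    return name.lstrip("*"), int(count)


def total_records(lines: Iterable[str]) -> int:
    """Sum the record counts over all non-empty lines of a counts file."""
    return sum(parse_count_line(line)[1] for line in lines if line.strip())


print(parse_count_line("*./en0042/15.warc.gz 34618"))
# ('./en0042/15.warc.gz', 34618)
```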

The record counts (by language) are as follows:

Language	# Records
English	503,903,810 pages
Chinese	177,489,357 pages
Spanish	79,333,950 pages
Japanese	67,337,717 pages
French	50,883,172 pages
German	49,814,309 pages
Portuguese	37,578,858 pages
Arabic	29,192,662 pages
Italian	27,250,729 pages
Korean	18,075,141 pages

The record counts (by segment on disk) are as follows (note that the individual count files are on the disks, but can also be downloaded here):

Segment Identifier	# Records	Record Count File
ClueWeb09_English_1	50,220,423 pages	ClueWeb09_English_1_counts.txt
ClueWeb09_English_2	51,577,077 pages	ClueWeb09_English_2_counts.txt
ClueWeb09_English_3	50,547,493 pages	ClueWeb09_English_3_counts.txt
ClueWeb09_English_4	52,311,060 pages	ClueWeb09_English_4_counts.txt
ClueWeb09_English_5	50,756,858 pages	ClueWeb09_English_5_counts.txt
ClueWeb09_English_6	50,559,093 pages	ClueWeb09_English_6_counts.txt
ClueWeb09_English_7	52,472,358 pages	ClueWeb09_English_7_counts.txt
ClueWeb09_English_8	49,545,346 pages	ClueWeb09_English_8_counts.txt
ClueWeb09_English_9	50,738,874 pages	ClueWeb09_English_9_counts.txt
ClueWeb09_English_10	45,175,228 pages	ClueWeb09_English_10_counts.txt
ClueWeb09_Chinese_1	50,325,079 pages	ClueWeb09_Chinese_1_counts.txt
ClueWeb09_Chinese_2	49,764,419 pages	ClueWeb09_Chinese_2_counts.txt
ClueWeb09_Chinese_3	50,359,421 pages	ClueWeb09_Chinese_3_counts.txt
ClueWeb09_Chinese_4	27,040,438 pages	ClueWeb09_Chinese_4_counts.txt
ClueWeb09_Spanish_1	49,841,221 pages	ClueWeb09_Spanish_1_counts.txt
ClueWeb09_Spanish_2	29,492,729 pages	ClueWeb09_Spanish_2_counts.txt
ClueWeb09_Japanese_1	50,634,640 pages	ClueWeb09_Japanese_1_counts.txt
ClueWeb09_Japanese_2	16,703,077 pages	ClueWeb09_Japanese_2_counts.txt
ClueWeb09_German_1	49,814,309 pages	ClueWeb09_German_1_counts.txt
ClueWeb09_French_1	50,883,172 pages	ClueWeb09_French_1_counts.txt
ClueWeb09_Korean_1	18,075,141 pages	ClueWeb09_Korean_1_counts.txt
ClueWeb09_Italian_1	27,250,729 pages	ClueWeb09_Italian_1_counts.txt
ClueWeb09_Portuguese_1	37,578,858 pages	ClueWeb09_Portuguese_1_counts.txt
ClueWeb09_Arabic_1	29,192,662 pages	ClueWeb09_Arabic_1_counts.txt
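
The two tables can be cross-checked by summing the segment counts per language. The sketch below does this for the three smaller multi-segment languages, using figures copied from the tables above; the function name totals_by_language is hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Figures copied from the per-segment table.
segment_counts = {
    "ClueWeb09_Chinese_1": 50_325_079,
    "ClueWeb09_Chinese_2": 49_764_419,
    "ClueWeb09_Chinese_3": 50_359_421,
    "ClueWeb09_Chinese_4": 27_040_438,
    "ClueWeb09_Spanish_1": 49_841_221,
    "ClueWeb09_Spanish_2": 29_492_729,
    "ClueWeb09_Japanese_1": 50_634_640,
    "ClueWeb09_Japanese_2": 16_703_077,
}

# Figures copied from the per-language table.
language_totals = {
    "Chinese": 177_489_357,
    "Spanish": 79_333_950,
    "Japanese": 67_337_717,
}


def totals_by_language(segments: Dict[str, int]) -> Dict[str, int]:
    """Group segment counts by the language embedded in the segment name."""
    totals: Dict[str, int] = defaultdict(int)
    for segment, count in segments.items():
        _, language, _ = segment.split("_")  # ClueWeb09_<language>_<segment #>
        totals[language] += count
    return dict(totals)


assert totals_by_language(segment_counts) == language_totals
```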

Language Identifiers

All 2-letter language identifiers for the dataset conform to the ISO 639 language ID list. The languages used in the ClueWeb09 dataset are:

en - English
zh - Chinese
es - Spanish
ja - Japanese
de - German
fr - French
ko - Korean
it - Italian
pt - Portuguese
ar - Arabic

Language Encodings

English content is encoded in UTF-8 format (where proper UTF-8 character encodings apply). The content for all other languages is encoded in the encoding given by the web server that supplied the web page. When available, the content encoding appears in the individual HTTP header information for the record in the key/value pair "Content-Type".
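
Because non-English pages keep the encoding reported by the web server, a reader typically has to take the charset from the "Content-Type" value before decoding. The following sketch shows one way to do this with the Python standard library. The function names are hypothetical, and falling back to UTF-8 with replacement characters is an assumption rather than part of the dataset specification.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type value."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_charset()


def decode_body(body: bytes, content_type: Optional[str],
                fallback: str = "utf-8") -> str:
    """Decode page bytes using the declared charset, else the fallback."""
    charset = charset_from_content_type(content_type) if content_type else None
    try:
        return body.decode(charset or fallback, errors="replace")
    except LookupError:
        # The server declared an encoding label that Python does not know.
        return body.decode(fallback, errors="replace")
```

A header value such as "text/html; charset=GB2312" gives the charset "gb2312", while a bare "text/html" gives None and triggers the fallback.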

TREC 2009 Category A and Category B

For TREC 2009, the entire dataset is referred to as the TREC 2009 "Category A" dataset. The TREC 2009 "Category B" dataset consists of the first English segment of Category A, which is roughly the first 50 million pages of the entire dataset. This corresponds to the first segment directory of English documents, "ClueWeb09_English_1".

Sample Files

Sample files (in gzipped WARC format) can be found below. Each sample contains 100 records, so it is far smaller than a full dataset file, which is roughly 1 GB when uncompressed.

ClueWeb09_English_Sample_File.warc.gz (322K download, 100 records)
ClueWeb09_Chinese_Sample_File.warc.gz (473K download, 100 records)
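
A sample file can be explored with a small reader such as the sketch below. It is a simplified illustration, not a complete WARC parser: it assumes each record starts with a "WARC/" version line, is followed by "Name: value" header lines and a blank line, and carries a "Content-Length" header giving the size of the content block. The function names are hypothetical.

```python
import gzip
from typing import BinaryIO, Dict, Iterator, Tuple


def iter_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (WARC headers, content block) for each record in the stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip separator lines between records
        headers: Dict[str, str] = {}
        for raw in iter(stream.readline, b""):
            text = raw.decode("utf-8", errors="replace").strip()
            if not text:
                break  # a blank line ends the header block
            name, _, value = text.partition(":")
            headers[name.strip()] = value.strip()
        length = int(headers.get("Content-Length", 0))
        yield headers, stream.read(length)


def trec_ids(path: str) -> Iterator[str]:
    """Yield the WARC-TREC-ID of every record that carries one."""
    with gzip.open(path, "rb") as stream:
        for headers, _ in iter_warc_records(stream):
            if "WARC-TREC-ID" in headers:
                yield headers["WARC-TREC-ID"]
```

Calling list(trec_ids("ClueWeb09_English_Sample_File.warc.gz")) on a downloaded sample would list the identifiers it contains. For a response record, the content block holds the HTTP headers followed by the page itself.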

Updated on April 6, 2009.

Maintained by Jamie Callan