Carnegie Mellon University

ClueWeb Crawler

An academic research crawler operated by Professor Chenyan Xiong's research group at the Language Technologies Institute

Purpose

Academic research only

Respects

robots.txt & crawl-delay

Origin

CMU SCS IP range

User-Agent

ClueWeb-Crawler/1.0
(+https://boston.lti.cs.cmu.edu/CMU-ClueWeb-Crawler/;
 mailto:cmu-clueweb-crawler@andrew.cmu.edu)

Scope

We crawl

  • Public, unauthenticated pages
  • HTML and static resources
  • Content permitted by robots.txt

We don't crawl

  • Private or paywalled content
  • Forms, carts, or POST endpoints
  • User accounts or dashboards

Behavior

Conservative per-host rate limiting

Honors all robots.txt directives

Auto back-off on 429 / 503 responses

Designed to minimize server load
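
For operators curious what backing off on 429 / 503 can look like, here is a minimal sketch. It is not the crawler's actual implementation: the function name `backoff_delay`, the one-second base, and the five-minute cap are all assumptions chosen for illustration. It prefers a numeric Retry-After value when the server sends one, and otherwise doubles the wait on each attempt.

```python
from typing import Optional

# Status codes that signal "slow down" (taken from the list above).
RETRYABLE_STATUSES = {429, 503}


def backoff_delay(
    status: int,
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,   # hypothetical starting delay, in seconds
    cap: float = 300.0,  # hypothetical upper bound, in seconds
) -> Optional[float]:
    """Return seconds to wait before retrying, or None if no back-off applies."""
    if status not in RETRYABLE_STATUSES:
        return None

    # Prefer the server's own hint when it is given as a number of seconds.
    if retry_after is not None:
        try:
            return min(cap, max(0.0, float(retry_after)))
        except ValueError:
            pass  # HTTP-date form of Retry-After is not handled in this sketch

    # Otherwise fall back to capped exponential growth: base, 2*base, 4*base, ...
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    for attempt in range(4):
        print(attempt, backoff_delay(503, attempt))
```

A real crawler would typically track this delay per host, so that one overloaded server does not slow requests to others.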

Data & Privacy

Data is collected for research only. We store only what's necessary, make reasonable efforts to avoid personal data, and comply with university policies. Incidentally encountered personal data is never used for identification or tracking.

Want to opt out?

Block the crawler in your robots.txt or submit an opt-out request. Opt-outs are honored promptly and permanently.

Opt-Out Form
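
For the robots.txt route, the sketch below shows two example rule sets and a way to check them locally with Python's standard `urllib.robotparser` before deploying. It assumes the crawler matches the product token `ClueWeb-Crawler` from its User-Agent string; this page does not state the token explicitly, so treat that as an assumption. The paths, the 10-second delay, and `example.org` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumption: the robots.txt token is the product name from the User-Agent.
CRAWLER_UA = "ClueWeb-Crawler/1.0"

# Example 1: opt the whole site out.
BLOCK_ALL = """\
User-agent: ClueWeb-Crawler
Disallow: /
"""

# Example 2: allow crawling, but slow it down and exclude one directory.
SLOW_DOWN = """\
User-agent: ClueWeb-Crawler
Crawl-delay: 10
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    blocked = load_rules(BLOCK_ALL)
    print("blocked site, /page.html allowed:",
          blocked.can_fetch(CRAWLER_UA, "https://example.org/page.html"))

    slowed = load_rules(SLOW_DOWN)
    print("slowed site, /public.html allowed:",
          slowed.can_fetch(CRAWLER_UA, "https://example.org/public.html"))
    print("slowed site, crawl delay:", slowed.crawl_delay(CRAWLER_UA))
```

Checking rules this way only confirms how Python's parser reads them; other crawlers may interpret edge cases differently.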

Contact

cmu-clueweb-crawler@andrew.cmu.edu

Chenyan Xiong's Research Group · Language Technologies Institute · CMU