Carnegie Mellon University
Academic research crawler by Professor Chenyan Xiong's research group at the Language Technologies Institute
Purpose: Academic research only
Respects: robots.txt & crawl-delay
Origin: CMU SCS IP range
ClueWeb-Crawler/1.0 (+https://boston.lti.cs.cmu.edu/CMU-ClueWeb-Crawler/; mailto:cmu-clueweb-crawler@andrew.cmu.edu)
We crawl
We don't crawl
Conservative per-host rate limiting
Honors all robots.txt directives
Auto back-off on 429 / 503 responses
Designed to minimize server load
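As an illustration of the back-off behavior listed above, here is a minimal sketch of how a polite crawler might compute its wait time after a 429 or 503 response. It is not the crawler's actual implementation: the function name `backoff_delay`, the exponential schedule, the 1-second base, and the 300-second cap are all assumptions chosen for the example, and only the delta-seconds form of `Retry-After` is handled.

```python
# Hypothetical sketch: the name, schedule, and constants are assumptions,
# not the crawler's real configuration.
RETRYABLE_STATUSES = {429, 503}


def backoff_delay(status, attempt, retry_after=None, base=1.0, cap=300.0):
    """Return seconds to wait before retrying a host, or None if no back-off applies.

    status      -- HTTP status code of the last response
    attempt     -- number of consecutive throttled responses so far (0-based)
    retry_after -- value of a Retry-After header in seconds, if the server sent one
    """
    if status not in RETRYABLE_STATUSES:
        return None
    # Exponential back-off: base, 2*base, 4*base, ... limited by the cap.
    delay = min(cap, base * (2 ** attempt))
    # Never wait less than the server explicitly asked for.
    if retry_after is not None:
        delay = max(delay, float(retry_after))
    return delay
```

For example, a third consecutive 503 (`attempt=2`) with no `Retry-After` header would wait 4 seconds under these assumed defaults, while a 429 carrying `Retry-After: 30` would wait at least 30 seconds.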
Data is collected for research only. We store only what's necessary, make reasonable efforts to avoid personal data, and comply with university policies. Incidentally encountered personal data is never used for identification or tracking.
To opt out, block the crawler via robots.txt or submit a request. Opt-outs are honored promptly and permanently.
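The sketch below shows what a robots.txt opt-out could look like and checks it with Python's standard `urllib.robotparser`. The product token `ClueWeb-Crawler` is an assumption taken from the User-Agent string above; the page does not state which token the crawler matches on, and the crawler's own matching logic may differ from the standard-library parser. The `example.com` URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Full User-Agent string as published on this page.
USER_AGENT = (
    "ClueWeb-Crawler/1.0 (+https://boston.lti.cs.cmu.edu/CMU-ClueWeb-Crawler/; "
    "mailto:cmu-clueweb-crawler@andrew.cmu.edu)"
)

# Option 1 (assumed token): block the crawler from the whole site.
BLOCK_ALL = """\
User-agent: ClueWeb-Crawler
Disallow: /
"""

# Option 2 (assumed token): allow crawling but request a slower pace.
SLOW_DOWN = """\
User-agent: ClueWeb-Crawler
Crawl-delay: 10
Disallow: /private/
"""

block_parser = RobotFileParser()
block_parser.parse(BLOCK_ALL.splitlines())

slow_parser = RobotFileParser()
slow_parser.parse(SLOW_DOWN.splitlines())

# With option 1, every path is off limits to this crawler only.
blocked = not block_parser.can_fetch(USER_AGENT, "https://example.com/page")
others_allowed = block_parser.can_fetch("SomeOtherBot", "https://example.com/page")

# With option 2, public pages stay fetchable and a 10-second delay is requested.
public_ok = slow_parser.can_fetch(USER_AGENT, "https://example.com/page")
private_ok = slow_parser.can_fetch(USER_AGENT, "https://example.com/private/x")
delay = slow_parser.crawl_delay(USER_AGENT)
```

Either file would be served at the site root as `/robots.txt`. The second variant relies on `Crawl-delay`, which this page says the crawler respects.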
cmu-clueweb-crawler@andrew.cmu.edu
Chenyan Xiong's Research Group · Language Technologies Institute · CMU