Scaling Up Language Technologies
at CMU and U. Maryland

May 8, 2008
9:30 am - 2:45 pm, 4615A Wean Hall
3:00 pm - 4:00 pm, 1305 Newell-Simon Hall

Sponsored by Yahoo

Language technologies research groups at Carnegie Mellon and the University of Maryland have begun using computing clusters and the Hadoop software environment to address larger problems and datasets than were practical previously. This mini-workshop is an opportunity for the research groups to share preliminary information about problems encountered, solutions developed, and lessons learned.

The workshop will be small and informal. A few brief talks are planned, primarily to provide context for longer discussions. The workshop culminates in an LTI Seminar by Jimmy Lin.

The workshop is free and open to anyone, as long as there is space in the room. However, the main audience is students and faculty who are using, or expect to use soon, either the Yahoo M45 or the Google/IBM computing clusters, as well as LTI students and faculty.


9:30 - 10:00 Introductions, Goals
10:00 - 11:00 Observations From Projects
"Construction of Statistical Machine Translation Models with MapReduce" - Chris Dyer, Aaron Cordova, Alex Mont, and Jimmy Lin
"MT Research on M45" - Ashish Venugopal, Anthony D'Auria, and Stephan Vogel
"Lessons Learned from Crawling and Annotating 200 Million Documents" - Le Zhao, Changkuk Yoo, Mark Hoy, and Jamie Callan
11:00 - 11:15 Break
11:15 - 12:00 Discussion: Porting legacy code into Hadoop / M45
12:00 - 12:45 Lunch: General discussion
12:45 - 1:00 Break
1:00 - 1:40 Observations From Projects
"Distributed Iterative Training" - Kevin Gimpel, Shay Cohen, Severin Hacker, and Noah Smith
"Pairwise Document Similarity in Large Collections with MapReduce" - Tamer Elsayed, Jimmy Lin, and Douglas Oard
1:40 - 2:30 General Discussion: Debugging, diagnosis, tuning experiences, and best practices
2:30 - 2:45 Final thoughts ...
3:00 - 4:00 LTI Seminar
Fast, Easy, and Cheap: Scalable Text Processing with MapReduce - Jimmy Lin
(1305 Newell-Simon Hall)

Talk Guidelines

Talks should give a 1-2 slide description of the core scientific issues, but focus on how the problem was framed as a MapReduce problem, Hadoop issues and hurdles, and Hadoop-related observations and conclusions. It is fine to discuss how a well-known algorithm was mapped onto a cluster architecture, as long as the solution is not the obvious one that would occur to any well-informed person. Focus on the large-scale issues, not the language technologies issues.

Talks are 15 minutes, followed by 5 minutes of Q&A.

Updated on May 1, 2008
Jamie Callan