Search Engines:
11-442 / 11-642

Description: This course studies the theory, design, and implementation of text-based search engines. The core components include statistical characteristics of text, representation of information needs and documents, several important retrieval models, and experimental evaluation. The course also covers common elements of commercial search engines, for example, integration of diverse search engines into a single search service ("federated search", "vertical search"), personalized search results, diverse search results, and sponsored search. The software architecture components include design and implementation of large-scale, distributed search engines.

This is a full-semester lecture-oriented course worth 12 units.
Eligibility: This course is open to all students who meet the prerequisites except students in the LTI's MLT and PhD programs. Students in the LTI's MLT and PhD programs can take 11-741, Information Retrieval, which focuses more on research. This course focuses more on current practice.
Learning Objectives: By the end of the course, students are expected to have developed the following skills. Skills are assessed by the homework assignments and the final exam.
  • Recall and discuss well-known search engine architectures, methods of representing text documents, methods of representing information needs, and methods of evaluating search effectiveness;
  • Implement well-known retrieval algorithms and test them on standard datasets; and
  • Apply information retrieval techniques discussed in class to solve problems faced by governments and companies.
Prerequisites: This course requires good programming skills and an understanding of computer architectures and operating systems (e.g., memory vs. disk trade-offs). A basic understanding of probability, statistics, and linear algebra is helpful. Thus, students should have preparation comparable to the following CMU undergraduate courses.
  • 15-210, Parallel and Sequential Data Structures and Algorithms (required)
  • 15-213, Introduction to Computer Systems (required)
  • 15-451, Algorithm Design and Analysis (helpful)
  • 21-241, Matrix Algebra or 21-341, Linear Algebra (required)
  • 21-325, Probability (required)
  • 36-202, Basic statistics (helpful)
Time & Location: Tu/Th 10:30-11:50, GHC 4211. (The room will probably change.)
Instructor: Jamie Callan
Teaching Assistant: TBD
Office hours: By request. Send email to schedule a meeting.
Instructional Materials: One of the following textbooks.
  • B. Croft, D. Metzler, and T. Strohman. Search Engines: Information Retrieval in Practice. Addison-Wesley. 2010. Price: $92 at Amazon.
  • S. Buettcher, C. L. A. Clarke, and G. Cormack. Information Retrieval: Implementing and Evaluating Search Engines. MIT Press. 2010. Price: $40 at Amazon.
There may be additional selected readings, which will be available through the class web page (this page).

Online access to some materials (additional readings, lecture notes, datasets, etc.) is restricted to the domain. CMU people can get access from outside (e.g., from home) using CMU's WebVPN Service.
Homework: 6 assignments that give hands-on experience with techniques discussed in class.
Grading: 6 homework assignments (60%), midterm exam (20%), final exam (20%).
Grading Scale: Grades are assigned using a curve.
Course policies: Attendance, Laptops & mobile devices, Late homework, Plagiarism & cheating, Recording & videotaping
Syllabus (subject to revision):  
  1. Aug 26, Course overview
  2. Aug 28, Web crawling
  3. Sep 2, Large-scale computing environments
    HW1 out
  4. Sep 4, Text representation
  5. Sep 9, Search engine indexes
  6. Sep 11, Index construction
  7. Sep 16, Large-scale and distributed index construction
    HW1 due, HW2 out
  8. Sep 18, Information needs and queries
  9. Sep 23, Evaluating search effectiveness
  10. Sep 25, Evaluating search effectiveness
  11. Sep 30, Vector space retrieval model
    HW2 due, HW3 out
  12. Oct 2, Probabilistic retrieval models
  13. Oct 7, Probabilistic retrieval models
  14. Oct 9, Hyperlink retrieval models
  15. Oct 14, Midterm Exam
  16. Oct 16, Feature-based retrieval models
  17. Oct 21, Learning to rank
    HW3 due, HW4 out
  18. Oct 23, Relevance and pseudo relevance feedback
  19. Oct 28, Diversity
  20. Oct 30, Personalization
  21. Nov 4, Search log analysis
    HW4 due, HW5 out
  22. Nov 6, Search log analysis
  23. Nov 11, Search engine optimization
  24. Nov 13, Sponsored search
  25. Nov 18, Sponsored search
    HW5 due, HW6 out
  26. Nov 20, Applications of search indexes and APIs
  27. Nov 25, Federated search
  28. Dec 2, Federated search
  29. Dec 4, Geographically-distributed search
    HW6 due
  30. Final exam

Updated on January 15, 2014
Jamie Callan