Search Engines:
11-442 / 11-642
Description: This course studies the theory, design, and implementation of text-based search engines. The core components include statistical characteristics of text, representation of information needs and documents, several important retrieval models, and experimental evaluation. The course also covers common elements of commercial search engines, for example, integration of diverse search engines into a single search service ("federated search", "vertical search"), personalized search results, diverse search results, and sponsored search. The software architecture components include design and implementation of large-scale, distributed search engines.

This is a full-semester lecture-oriented course worth 12 units.
Learning Objectives: By the end of the course, students are expected to have developed the skills listed below.
  • Recall and discuss well-known search engine architectures, methods of representing text documents, methods of representing information needs, and methods of evaluating search effectiveness;
  • Implement well-known retrieval algorithms and test them on standard datasets; and
  • Apply information retrieval techniques discussed in class to solve problems faced by governments and companies.
Skills are assessed by the homework assignments and the final exam.
Eligibility: This course is open to all students who meet the prerequisites.
Prerequisites: This course requires good programming skills and an understanding of computer architectures and operating systems (e.g., memory vs. disk trade-offs). A basic understanding of probability, statistics, and linear algebra is helpful. Thus students should have preparation comparable to the following CMU undergraduate courses.
  • 15-210, Parallel and Sequential Data Structures and Algorithms (required)
  • 15-213, Introduction to Computer Systems (required)
  • 15-451, Algorithm Design and Analysis (helpful)
  • 21-241, Matrix Algebra or 21-341, Linear Algebra (required)
  • 21-325, Probability (required)
  • 36-202, Basic statistics (helpful)
Time & Location: Tu/Th 10:30-12:00, DH A302
Instructor: Jamie Callan
Teaching Assistants: Kang Huang (kangh@andrew),
Zhoucheng Li (chouclee@cmu),
Claire Zhiyue Liu (zhiyuel@andrew),
Suruchi Shah (suruchis@cs),
Qinyu Tong (qtong@andrew),
Bingqing Wu (bingqinw@andrew),
Tyrone Taiyuan Zhang (taiyuanz@andrew)
Office hours:
Monday 12:00-1:00 (Oct 19 only) GHC 5417 Zhoucheng
Monday GHC 5417
Monday 6:00-7:00 (Oct 19 only) GHC 5417 Suruchi
Tuesday 1:00-2:30 (cancelled Oct 20) GHC 5417 Suruchi
Wednesday 2:00-3:00 (cancelled Oct 21) GHC 5417 Zhoucheng
Thursday 3:00-4:00 GHC 5417 Kang
Friday 1:00-2:00 GHC 5401 Bingqing
Instructional Materials: The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. You may use the printed copy or the online copy, but note that the reading instructions refer to the printed copy.

There are additional selected readings, which will be available through the class web page (this page).

Online access to some materials (additional readings, lecture notes, datasets, etc.) is restricted to the domain. CMU people can get access from outside (e.g., from home) using CMU's WebVPN Service.

A discussion forum is provided for students to ask questions, answer questions, and discuss class-related topics. You will need a Piazza account to use the discussion forum. Please provide a CMU email address when you join the 11-642 discussion (you can use other email addresses, too). We will periodically remove students who do not have CMU email addresses.
Homework: 5 assignments that give hands-on experience with techniques discussed in class.
Grading: Weekly reading summaries (10% total), 5 homework assignments (10% each, 50% total), midterm exam (20%), final exam (20%).
Grading Scale: Grades are assigned using a curve.
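The grading weights above can be combined into a single weighted score, as in the short Python sketch below. The function name, the 0-100 scale for each component, and the sample scores are assumptions made for illustration; they are not part of the course materials. Because grades are assigned using a curve, this score does not map directly to a letter grade.

```python
# Weights taken from the Grading line above; they sum to 100%.
WEIGHTS = {
    "reading_summaries": 0.10,
    "homework": 0.50,  # 5 assignments at 10% each
    "midterm": 0.20,
    "final": 0.20,
}


def weighted_score(scores):
    """Combine per-component scores (each on a 0-100 scale) into one 0-100 score.

    Hypothetical helper for illustration, not an official course tool.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Made-up sample scores, for illustration only.
    example = {"reading_summaries": 90, "homework": 80, "midterm": 70, "final": 85}
    print(round(weighted_score(example), 2))
```

The homework entry here is the average over the 5 assignments, which is equivalent to weighting each assignment at 10%.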
Course policies: Attendance, Auditing, Laptops & mobile devices, Late homework, Pass/Fail, Plagiarism & cheating, Recording & videotaping, Waitlist
Schedule (subject to revision):
Date Topic Readings
Sep 1, Course overview (pdf)
Sep 3, Introduction to search: Exact-match retrieval (pdf1, pdf2) Ch 1, Ch 5.1
Sep 8, Introduction to search: Indexes, query processing (pdf) Ch 2.4
Sep 10, Evaluating search effectiveness (pdf) [HW1 out] Ch 8-8.5
Sep 15, Evaluating search effectiveness (pdf)
Sep 17, Document representation (pdf) Ch 2-2.2
Sep 22, Best-match retrieval: VSM, BM25 (pdf) [HW1 due, HW2 out] Ch 6, Ch 11
Sep 24, Best-match retrieval: Language models (pdf) Ch 12
Sep 29, No class - TOC  
Oct 1, Query structure: Information needs and queries (pdf) Nguyen & Callan, 2011
Oct 6, Query structure: Relevance and pseudo relevance feedback (pdf) [HW2 due, HW3 out] Ch 9
Oct 8, Document structure (pdf) Ch 10
Oct 13, Index creation (pdf1, pdf2) Ch 4
Oct 15, Index creation (pdf) Ch 7
Oct 20, Midterm Exam Sample Midterm 1, Sample Midterm 2, Answers 1, Answers 2
Oct 22, Index creation (pdf)  
Oct 27, Ranked retrieval: Feature-based models (pdf) [HW3 due, HW4 out] Clarke Ch 11.7; Li, 2011
Oct 29, Authority metrics (pdf) Ch 21
Nov 3, Page quality, web spam (pdf)  
Nov 5, Diversity (pdf) Santos, Ch 1-5
Nov 10, Diversity (pdf) [HW4 due, HW5 out] Santos, Ch 6-7
Nov 12, Search log analysis (pdf)
Nov 17, Search log analysis (pdf) Eickhoff et al, 2014
Nov 19, Personalization (pdf1, pdf2) Bennett et al, 2012
Nov 24, Federated, aggregated, & vertical search (pdf1, pdf2) Si & Callan, 2003
Dec 1, Federated, aggregated, & vertical search [HW5 due] Arguello & Diaz, 2013
Dec 3, Selective search Kulkarni & Callan, 2010
Dec 8, Enterprise search  
Dec 10, Final exam Sample final

Copyright 2015, Carnegie Mellon University.
Updated on November 24, 2015
Jamie Callan