Search Engines:
11-442 / 11-642
Description: This course studies the theory, design, and implementation of text-based search engines. The core components include statistical characteristics of text, representation of information needs and documents, several important retrieval models, and experimental evaluation. The course also covers common elements of commercial search engines, for example, integration of diverse search engines into a single search service ("federated search", "vertical search"), personalized search results, diverse search results, and sponsored search. The software architecture components include design and implementation of large-scale, distributed search engines.

This is a full-semester lecture-oriented course worth 12 units.
Learning Objectives: By the end of the course, students are expected to have developed the skills listed below.
  • Recall and discuss well-known search engine architectures, methods of representing text documents, methods of representing information needs, and methods of evaluating search effectiveness;
  • Implement well-known retrieval algorithms and test them on standard datasets; and
  • Apply information retrieval techniques discussed in class to solve problems faced by governments and companies.
Skills are assessed by the homework assignments and the final exam.
Eligibility: This course is open to all students who meet the prerequisites.
Prerequisites: This course requires good programming skills and an understanding of computer architectures and operating systems (e.g., memory vs. disk trade-offs). A basic understanding of probability, statistics, and linear algebra is helpful. Thus students should have preparation comparable to the following CMU undergraduate courses.
  • 15-210, Parallel and Sequential Data Structures and Algorithms (required)
  • 15-213, Introduction to Computer Systems (required)
  • 15-451, Algorithm Design and Analysis (helpful)
  • 21-241, Matrix Algebra or 21-341, Linear Algebra (required)
  • 21-325, Probability (required)
  • 36-202, Basic statistics (helpful)
Time & Location: Tu/Th 10:30-11:50, DH A302.
Instructor: Jamie Callan
Teaching Assistants: Yubin Kim (yubink@andrew) (head TA)
Rachita Jain (rachitaj@andrew)
Preethi Sureshkumar (psureshk@andrew)
Chen Wang (chenwan1@andrew)
Chenyan Xiong (cx@andrew)
Office hours:
Monday, 3:00-4:00, Rachita, GHC 5417
Tuesday, 2:00-3:00, Chenyan, GHC 5417
Wednesday, 11:00-12:00, Chen, GHC 5417
Thursday, 2:00-3:00, Preethi, GHC 5417
Friday, 4:00-5:00, Yubin, GHC 6605
Instructional Materials: The textbook is Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. You may use the printed copy or the online copy, but note that the reading instructions refer to the printed copy.

There are additional selected readings, which will be available through the class web page (this page).

Online access to some materials (additional readings, lecture notes, datasets, etc.) is restricted to the CMU domain. CMU people can get access from outside (e.g., from home) using CMU's WebVPN Service.

A discussion forum is provided for students to ask questions, answer questions, and discuss class-related topics. You will need a Piazza account to use the discussion forum. Please provide a CMU email address when you join the 11-642 discussion (you can use other email addresses, too). We will periodically remove students who do not have CMU email addresses.
Homework: 6 assignments that give hands-on experience with techniques discussed in class.
Grading: 6 homework assignments (60%), midterm exam (20%), final exam (20%).
Grading Scale: Grades are assigned using a curve.
Course policies: Attendance, Laptops & mobile devices, Late homework, Plagiarism & cheating, Recording & videotaping
Schedule (subject to revision):
Date | Topic | Readings
Aug 26 | Course overview (pdf) |
Aug 28 | Introduction to search: Exact-match retrieval (pdf) | Ch 1, Ch 5.1
Sep 2 | Introduction to search: Indexes, query processing (pdf) [HW1 out] | Ch 2.4
Sep 4 | Evaluating search effectiveness (pdf) | Ch 8-8.5
Sep 9 | Evaluating search effectiveness (pdf) |
Sep 11 | Document representation (pdf) | Ch 2-2.2
Sep 16 | Best-match retrieval: VSM, BM25 (pdf) [HW1 due, HW2 out] | Ch 6, Ch 11
Sep 18 | Best-match retrieval: Language models (pdf) | Ch 12
Sep 23 | Query structure: Information needs and queries (pdf) |
Sep 25 | Query structure: Relevance and pseudo relevance feedback (pdf) | Ch 9
Sep 30 | Query structure: Relevance and pseudo relevance feedback (pdf) [Wednesday, Oct 1: HW2 due, HW3 out] | Ch 7
Oct 2 | Document structure (pdf) | Ch 10
Oct 7 | Index creation (pdf) | Ch 4
Oct 9 | Index creation (pdf) |
Oct 14 | Midterm Exam (Sample midterm; Odd exam answers, Even exam answers) |
Oct 16 | Index creation (pdf) |
Oct 21 | Ranked retrieval: Feature-based models (pdf) [HW3 due, HW4 out] |
Oct 23 | Authority metrics (pdf) | Ch 21
Oct 28 | Page quality, web spam (pdf) |
Oct 30 | Diversity (pdf) | Carbonell & Goldstein, 1998; Santos, et al., 2010
Nov 4 | Context (mobile) | TBD
Nov 6 | Search log analysis [HW4 due, HW5 out] |
Nov 11 | Search log analysis | TBD
Nov 13 | Personalization | TBD
Nov 18 | Federated, aggregated, & vertical search | Si & Callan, 2003
Nov 20 | Federated, aggregated, & vertical search | Arguello & Diaz, 2013
Nov 25 | Geographically-distributed search | TBD
Dec 2 | Enterprise search [HW5 due] |
Dec 4 | Web crawling | Ch 20-20.2
TBD | Final exam |

Copyright 2014, Carnegie Mellon University.
Updated on October 31, 2014
Jamie Callan