Search Engines:
11-442 / 11-642
 
 
Description: This course studies the theory, design, and implementation of text-based search engines. The core components include statistical characteristics of text, representation of information needs and documents, several important retrieval models, and experimental evaluation. The course also covers common elements of commercial search engines, for example, integration of diverse search engines into a single search service ("federated search", "vertical search"), personalized search results, diverse search results, and sponsored search. The software architecture components include design and implementation of large-scale, distributed search engines.

This is a full-semester lecture-oriented course worth 12 units.
Learning Objectives: By the end of the course, students are expected to have developed the skills listed below.
  • Recall and discuss well-known search engine architectures, methods of representing text documents, methods of representing information needs, and methods of evaluating search effectiveness;
  • Implement well-known retrieval algorithms and test them on standard datasets; and
  • Apply information retrieval techniques discussed in class to solve problems faced by governments and companies.
Skills are assessed by the homework assignments and the final exam.
Eligibility: This course is open to all students who meet the prerequisites.
Prerequisites: This course requires good programming skills and an understanding of computer architectures and operating systems (e.g., memory vs. disk trade-offs). A basic understanding of probability, statistics, and linear algebra is helpful. Thus students should have preparation comparable to the following CMU undergraduate courses.
  • 15-210, Parallel and Sequential Data Structures and Algorithms (required)
  • 15-213, Introduction to Computer Systems (required)
  • 15-451, Algorithm Design and Analysis (helpful)
  • 21-241, Matrix Algebra or 21-341, Linear Algebra (required)
  • 21-325, Probability (required)
  • 36-202, Basic statistics (helpful)
Time & Location: Tu/Th 1:30-2:50, NSH 1305.
Instructor: Jamie Callan
Teaching Assistants: Chenyan Xiong (cx@andrew), Zhoucheng Li, more TBD
Office hours: TBD
Instructional Materials: The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. You may use the printed copy or the online copy, but note that the reading instructions refer to the printed copy.

There are additional selected readings, which will be available through the class web page (this page).

Online access to some materials (additional readings, lecture notes, datasets, etc.) is restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

A discussion forum is provided for students to ask questions, answer questions, and discuss class-related topics. You will need a Piazza account to use the discussion forum. Please provide a CMU email address when you join the 11-642 discussion (you can use other email addresses, too). We will periodically remove students who do not have CMU email addresses.
Homework: 5 assignments that give hands-on experience with techniques discussed in class.
Grading: 5 homework assignments (12% each, 60% total), midterm exam (20%), final exam (20%).
Grading Scale: Grades are assigned using a curve.
Course policies: Attendance, Auditing, Laptops & mobile devices, Late homework, Pass/Fail, Plagiarism & cheating, Recording & videotaping, Waitlist
Syllabus (subject to revision):
Date Topic Readings
Jan 13, Course overview
Jan 15, Introduction to search: Exact-match retrieval Ch 1, Ch 5.1
Jan 20, Introduction to search: Indexes, query processing (HW1 out) Ch 2.4
Jan 22, Evaluating search effectiveness Ch 8-8.5
Jan 27, Evaluating search effectiveness
Jan 29, Document representation Ch 2-2.2
Feb 3, Best-match retrieval: VSM, BM25 (HW1 due, HW2 out) Ch 6, Ch 11
Feb 5, Best-match retrieval: Language models Ch 12
Feb 10, Query structure: Information needs and queries
Feb 12, Query structure: Relevance and pseudo relevance feedback Ch 9
Feb 17, Query structure: Relevance and pseudo relevance feedback (HW2 due, HW3 out) Ch 7
Feb 19, Document structure Ch 10
Feb 24, Index creation Ch 4
Feb 26, Index creation
Mar 3, Midterm Exam Sample midterm 1, Sample midterm 2
Mar 5, Index creation  
Mar 17, Ranked retrieval: Feature-based models (HW3 due, HW4 out)
Mar 19, Authority metrics Ch 21
Mar 24, Page quality, web spam  
Mar 26, Diversity Carbonell & Goldstein, 1998; Santos, et al., 2010
Mar 31, Diversity Dang & Croft, 2012; Dang & Croft, 2013
Apr 2, Search log analysis (HW4 due, HW5 out)
Apr 7, Search log analysis  
Apr 9, Personalization Eickhoff et al, 2014; Bennett et al, 2012
Apr 14, Federated, aggregated, & vertical search Si & Callan, 2003
Apr 21, Federated, aggregated, & vertical search Arguello & Diaz, 2013
Apr 23, Selective search Kulkarni & Callan, 2010
Apr 28, Enterprise search (HW5 due)
Apr 30, Web crawling Ch 20-20.2
Dec 8, Final exam, 1:00-4:00, DH 2210

Copyright 2014, Carnegie Mellon University.
Updated on December 19, 2014
Jamie Callan