11-741: Information Retrieval

Description:

This course studies the theory, design, and implementation of text-based information systems. The Information Retrieval core components of the course include statistical characteristics of text, representation of information needs and documents, several important retrieval models (Boolean, vector space, probabilistic, inference net, language modeling), clustering algorithms, automatic text categorization, and experimental evaluation. The software architecture components include design and implementation of high-capacity text retrieval and text filtering systems. A variety of current research topics are also covered, including cross-lingual retrieval, document summarization, machine learning, topic detection and tracking, and multi-media retrieval.

Prerequisites:

  • Programming and data structures at the level of 15-211 or higher.
  • Algorithms comparable to the undergraduate CS algorithms course (15-451) or higher.
  • Basic linear algebra (21-241 or 21-341).
  • Basic statistics (36-202) or higher.

Time & Location:

TR 12:00-1:20pm, Wean Hall 4623

Instructors:

Jamie Callan and Yiming Yang

Instructor Office Hours:

By appointment

Teaching Assistant(s):

Le Zhao

TA Office Hours:

By appointment

Textbook:

The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008.

Other Readings:

Selected papers or book chapters will be assigned reading for some lectures. All will be available online and/or on reserve in the Engineering and Science Library, 4th floor, Wean Hall. Some of the books used in the course are listed below.

  • Hastie: The Elements of Statistical Learning. T. Hastie, R. Tibshirani, and J. Friedman. (2001) Springer. New York.
  • MG: Managing Gigabytes. I.H. Witten, A. Moffat, and T.C. Bell. 2nd edition. (1999), Morgan Kaufmann.
  • SNLP: Foundations of Statistical Natural Language Processing, C. Manning and H. Schütze. (1999), MIT Press.

Course Notes:

Usually available online, occasionally distributed in lectures. Online access is restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using VPN or CMU's WebVPN Service.

Homework:

One brief reading summary per week (1/2 to 1 page), and 5 problem sets or programming assignments. This is subject to change (but it probably won't). Submission guidelines

Grading:

Grades will be based on 5 problem sets / programming assignments (10% each, 50% total), weekly summaries of readings (10% total), a midterm exam (20%), and a final exam (20%).
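As an illustration of how these weights combine, here is a minimal sketch. The function name `course_score` and the 0-100 score scale are assumptions made for this example, not part of the official grading procedure; only the weights come from the breakdown above.

```python
# Weights are taken from the grading breakdown; the 0-100 scale is an assumption.
HOMEWORK_WEIGHT_EACH = 0.10  # 5 assignments at 10% each (50% total)
SUMMARIES_WEIGHT = 0.10      # weekly reading summaries (10% total)
MIDTERM_WEIGHT = 0.20
FINAL_WEIGHT = 0.20


def course_score(homeworks, summaries, midterm, final):
    """Hypothetical helper: combine component scores (each 0-100) into one score."""
    if len(homeworks) != 5:
        raise ValueError("expected exactly 5 homework scores")
    return (HOMEWORK_WEIGHT_EACH * sum(homeworks)
            + SUMMARIES_WEIGHT * summaries
            + MIDTERM_WEIGHT * midterm
            + FINAL_WEIGHT * final)
```

For example, `course_score([80, 90, 70, 100, 60], 90, 85, 75)` adds 40 points from homework, 9 from summaries, 17 from the midterm, and 15 from the final.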

Course Policies:

Attendance, Cheating, Laptop computers, Late homework, Recording & videotaping, Bugs in homework

Sitting In:

Approval from the instructors is required.

Syllabus:

The anticipated syllabus is below. It is subject to change.
 

| Lecture | Day | Important Events | Topic | Readings |
|---|---|---|---|---|
| 1. | 1/13 | | Course overview (pdf) | |
| 2. | 1/15 | | Introduction to ad-hoc search: Boolean retrieval (pdf) Updated 1/28 | Ch 1 |
| 3. | 1/20 | | Text representation (pdf) | Ch 2.0-2.2 |
| 4. | 1/22 | | Text representation, index construction (pdf) Updated 1/27 | Ch 4 |
| 5. | 1/27 | | Index construction (pdf) Updated 1/29 | Ch 2.3-2.4, 3.2, 5.1, 5.3 |
| 6. | 1/29 | HW1 out | Index construction; web indexing (pdf) | |
| 7. | 2/3 | | Information needs and queries (pdf) | |
| 8. | 2/5 | | Evaluation (pdf) | Ch 8 |
| 9. | 2/10 | | Retrieval models: Vector space (pdf) | Ch 9 |
| 10. | 2/12 | HW1 due | Retrieval models: Probabilistic model (pdf) | Ch 11 |
| 11. | 2/17 | HW2 out | Retrieval models: Statistical language models (pdf) | Ch 12, Zhai & Lafferty |
| 12. | 2/19 | | Retrieval models: Structured documents, inference network (pdf) (Updated 3/2) | Ch 10; Metzler |
| 13. | 2/24 | | Retrieval models: Link-analysis based (pdf) | Ch 21; Ng, et al., IJCAI'01 |
| 14. | 2/26 | | Retrieval models: Collaborative filtering (pdf) (quiz1) | Shardanand, CHI'95; Si & Jin, ICML'03 |
| 15. | 3/3 | HW2 due, HW3 out | Search log analysis (pdf) | Agichtein |
| | 3/5 | | Midterm Exam | 2009 midterm, 2008 midterm, 2007 midterm, 2006 midterm |
| | 3/10 | | Spring Break! | |
| | 3/12 | | Spring Break! | |
| 16. | 3/17 | | Query classification, federated search (pdf) | Callan |
| 17. | 3/19 | | Retrieval models: Collaborative filtering (cont’d) | |
| 18. | 3/24 | | Dimensionality reduction (pdf) (pdf2) (quiz2) | Ch 18 |
| 19. | 3/26 | HW3 due, HW4 out | Learning empirical associations (pdf) (pdf2) (quiz3) | Yang ICML'97; Forman, JMLR'03 |
| 20. | 3/31 | | Document clustering I (pdf) | Ch 16 |
| 21. | 4/2 | | Document clustering I, II | Ch 17 |
| 22. | 4/7 | | Document clustering II (pdf) | |
| 23. | 4/9 | | Text categorization introduction | Yang & Liu SIGIR'99 |
| 24. | 4/14 | HW4 due, HW5 out | Significance Tests (pdf) | Ch 13 |
| | 4/16 | | Mid Semester Break | |
| 25. | 4/21 | | Naive Bayes methods (quiz 4) (pdf) | McCallum & Nigam, AAAI Workshop, 1998 |
| 26. | 4/23 | | Nearest neighbor (pdf) | Goldberger, NIPS '04 |
| 27. | 4/28 | HW5 due | Support Vector Machines (quiz 5) (pdf) | Ch 15 |
| 28. | 4/30 | | Large-scale text categorization | Yang et al. SIGIR'03; Liu et al. SIGKDD'05 |
| | 5/11 | | Final Exam: 1-4pm, Room: WeH 5302 | 2006 final |


Updated on January 5, 2009.

Jamie Callan, Yiming Yang