Search Engines and Web Mining
11-441 / 11-641

 

Description:

This course provides a comprehensive introduction to the theory and implementation of algorithms for organizing and searching large text collections. The first half of the course studies text search engines for enterprise and Web environments; the open-source Lucene and Indri search engines are used as working examples. The second half studies text mining techniques such as recommender systems, clustering, and categorization. Programming assignments give hands-on experience with document ranking, evaluation, categorizing documents into browsing hierarchies, and related topics.

Eligibility:

This course is open to all students who meet the prerequisites, except students in the LTI's MLT and PhD programs. Those students can instead take 11-741, Information Retrieval, which focuses more on research, whereas this course focuses more on current practice.

Prerequisites:

Required: 15-211, Fundamental Data Structures and Algorithms; and either 21-241, Matrix Algebra, or 21-341, Linear Algebra.
Recommended: 15-213, Introduction to Computer Systems.

Time & Location:

TBD

Instructor(s):

Jamie Callan and Yiming Yang

Teaching Assistant:

TBD

Instructional Materials:

The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. The textbook can be purchased at the CMU book store.

Selected additional readings are available online or on reserve in the Engineering and Science Library, 4th floor, Wean Hall.

Online access to some materials is restricted to the .cmu.edu domain. CMU affiliates can access these materials from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

Homework:

Homework consists of programming projects and/or problem sets.

Grading:

50% homework (4 programming, 1 written), 10% quizzes, 20% midterm, 20% final.

Course policies:

Late homework, Cheating, Laptops

Syllabus:

 

 

  1. Aug 28, Course overview and introduction
  2. Aug 30, Introduction to text search (Ch 1)
  3. Sep 4, Text representation (Ch 2)
      HW1 out, due Sep 18
  4. Sep 6, Index construction (Ch 4)
  5. Sep 11, Index construction (Ch 5)
  6. Sep 13, Index construction (Ch 20.3-20.4)
  7. Sep 18, Information needs and queries (Ch 8)
  8. Sep 20, Evaluating search effectiveness (Ch 8)
      HW2 out, due Oct 4
  9. Sep 25, Evaluating search effectiveness
  10. Sep 27, Vector space retrieval model (Ch 6-7)
  11. Oct 2, Statistical language models for IR (Ch 12)
  12. Oct 4, Clustering (Ch 16, 17)
  13. Oct 9, Clustering
  14. Oct 11, Structured documents, combination of evidence (Ch 10)
  15. Oct 16, Midterm exam (Sample midterm)
  16. Oct 18, Search log analysis
  17. Oct 23, Search log analysis
  18. Oct 25, Query classification and federated search (Callan, 2000)
  19. Oct 30, Query classification and federated search
  20. Nov 1, Collaborative filtering (Shardanand, CHI'95; Si & Jin, ICML'03)
      HW3 out, due Nov 15
  21. Nov 6, Collaborative filtering
  22. Nov 8, Clustering (cont’d)
  23. Nov 13, Link Analysis (Ch 21)
  24. Nov 15, Link Analysis
  25. Nov 20, Text categorization (Ch 13)
  26. Nov 27, Significance tests (Yang & Liu, SIGIR 1999)
      HW4 out, due Dec 4
  27. Nov 29, Text categorization (Ch 15)
  28. Dec 4, Text categorization
  29. Dec 6, Text categorization
  30. TBD, Final Exam (Sample Final)

 


Updated on March 29, 2012

Jamie Callan and Yiming Yang