Search Engines and Web Mining
11-441 / 11-641

Description:

This course provides a comprehensive introduction to the theory and implementation of algorithms for organizing and searching large text collections. The first half of the course studies text search engines for enterprise and Web environments; the open-source Lucene and Indri search engines are used as working examples. The second half studies text mining techniques such as recommender systems, clustering, and categorization. Programming assignments give hands-on experience with document ranking, evaluation, categorizing documents into browsing hierarchies, and related topics.

Eligibility:

This course is open to all students who meet the prerequisites, except students in the LTI's MLT and PhD programs. Those students can take 11-741, Information Retrieval, which focuses more on research; this course focuses more on current practice.

Prerequisites:

Required: 15-211, Fundamental Data Structures and Algorithms; 21-241, Matrix Algebra, or 21-341, Linear Algebra.
Recommended: 15-213, Introduction to Computer Systems.

Time & Location:

Tu/Th, 12:00-1:20, DH 1212

Instructors:

Jamie Callan and Yiming Yang

Teaching Assistants:

Siddharth Gopal and Juan Manuel Caicedo Carvajal

Instructional Materials:

The textbook is Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2008). It can be purchased at the CMU book store.

Selected additional readings are available online or on reserve in the Engineering and Science Library, 4th floor, Wean Hall.

Online access to some materials is restricted to the .cmu.edu domain. CMU affiliates can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

Homework:

Homework consists of programming projects and/or problem sets.

Grading:

60% homework (5 programming assignments), 20% midterm exam, 20% final exam.

Course policies:

Late homework, Cheating, Laptops

Syllabus:

  1. Aug 28, Course overview and introduction
  2. Aug 30, Introduction to text search (Ch 1)
  3. Sep 4, Text representation (Ch 2)
  4. Sep 6, Search engine indexes (Ch 4)
    HW1 out, due Sep 18
  5. Sep 11, Index construction (Ch 5)
  6. Sep 13, Index construction (Ch 20.3-20.4)
  7. Sep 18, Information needs and queries (Ch 8)
  8. Sep 20, Vector space retrieval model (Ch 6-7)
  9. Sep 25, Evaluating search effectiveness (Ch 8)
  10. Sep 27, Evaluating search effectiveness
  11. Oct 2, Probabilistic retrieval models (Ch 12)
    HW2 out, due Oct 23
  12. Oct 4, Probabilistic retrieval models (Ch 10)
  13. Oct 9, Search log analysis
  14. Oct 11, Search log analysis
  15. Oct 16, Midterm exam (Sample midterm, Answers)
  16. Oct 18, Text categorization (Ch 13, 15)
    Oct 19, HW3 out, due Nov 2
  17. Oct 23, Text categorization: logistic regression
  18. Oct 25, Text categorization
  19. Oct 30, Learning to rank (Joachims, SIGIR 2002)
  20. Nov 1, Learning to rank (Yue, SIGIR 2007)
    Nov 2, HW4 out, due Nov 14
  21. Nov 6, Clustering (Ch 16, 17)
  22. Nov 8, Clustering
  23. Nov 13, Collaborative filtering
    Nov 13, HW5 out, due Nov 29
  24. Nov 15, Collaborative filtering
  25. Nov 20, Link Analysis (Ch 21)
  26. Nov 27, Link Analysis
  27. Nov 29, Significance tests (Yang & Liu, SIGIR 1999)
  28. Dec 4, Query classification and federated search (Callan 2000)
  29. Dec 6, Query classification and federated search
  30. Dec 14, Final exam: 8:30am – 11:30am, POS MN AUD (Posner Hall, Mellon Auditorium) (Sample Final)

Updated on September 12, 2012

Jamie Callan and Yiming Yang