
Search Engines and Web Mining
11-441 / 11-641

 

Description:

This course provides a comprehensive introduction to the theory and implementation of algorithms for organizing and searching large text collections. The first half of the course studies text search engines for enterprise and Web environments; the open-source Lucene and Indri search engines are used as working examples. The second half studies text mining techniques such as recommender systems, clustering, and categorization. Programming assignments give hands-on experience with document ranking, evaluation, categorizing documents into browsing hierarchies, and related topics.

Eligibility:

This course is open to all students who meet the prerequisites, except students in the LTI's MLT and PhD programs. Those students can take 11-741, Information Retrieval, which focuses more on research; this course focuses more on current practice.

Prerequisites:

Required: 15-211, Fundamental Data Structures and Algorithms, and either 21-241, Matrix Algebra, or 21-341, Linear Algebra.
Recommended: 15-213, Introduction to Computer Systems.

Time & Location:

Tu/Th, 12:00-1:20, GHC 4215

Instructor(s):

Jamie Callan and Yiming Yang

Teaching Assistant:

Yubin Kim

Instructional Materials:

The textbook is Introduction to Information Retrieval, Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. The textbook can be purchased at the CMU book store.

Selected additional readings are available online or on reserve in the Engineering and Science Library, 4th floor, Wean Hall.

Online access to some materials is restricted to the .cmu.edu domain. CMU affiliates can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

Homework:

Homework consists of programming projects and/or problem sets.

Grading:

60% homework (6 programming assignments), 20% midterm, 20% final.

Course policies:

Late homework, Cheating, Laptops

Syllabus:

 

 

  1. Aug 27, Course overview and introduction
  2. Aug 29, Introduction to text search (Ch 1)
  3. Sep 3, Text representation (Ch 2)
  4. Sep 5, Search engine indexes (Ch 4)
                HW1 out, due Sep 17
  5. Sep 10, Index construction (Ch 5)
  6. Sep 12, Index construction (Ch 20.3-20.4)
  7. Sep 17, Information needs and queries (Ch 8)
  8. Sep 19, Vector space retrieval model (Ch 6-7)
                HW2 out, due Oct 1
  9. Sep 24, Evaluating search effectiveness (Ch 8)
  10. Sep 26, Evaluating search effectiveness
  11. Oct 1, Probabilistic retrieval models (Ch 12)
                HW3 out, due Oct 17
  12. Oct 3, Probabilistic retrieval models (Ch 10)
  13. Oct 8, Search log analysis
  14. Oct 10, Search log analysis
  15. Oct 15, Midterm exam (Sample midterm)
  16. Oct 17, Text categorization (Ch 13, 15)
                HW4 out, due Oct 31
  17. Oct 22, Text categorization: logistic regression
  18. Oct 24, Text categorization
  19. Oct 29, Learning to rank (Joachims, SIGIR 2002)
  20. Oct 31, Learning to rank (Yue, SIGIR 2007)
                HW5 out, due Nov 12
  21. Nov 5, Clustering (Ch 16, 17)
  22. Nov 7, Clustering
  23. Nov 12, Collaborative filtering
                HW6 out, due Nov 26
  24. Nov 14, Collaborative filtering
  25. Nov 19, Link analysis (Ch 21)
  26. Nov 21, Link analysis
  27. Nov 26, Significance tests (Yang & Liu, SIGIR 1999)
  28. Dec 3, Query classification and federated search (Callan 2000)
  29. Dec 5, Query classification and federated search
  30. TBA, Final exam (Sample final)

 


Updated on May 30, 2013

Jamie Callan and Yiming Yang