Search Engines and Web Mining
11-441 / 11-641

Description:

This course provides a comprehensive introduction to the theory and implementation of algorithms for organizing and searching large text collections. The first half of the course studies text search engines for enterprise and Web environments; the open-source Lucene and Indri search engines are used as working examples. The second half studies text mining techniques such as recommender systems, clustering, and categorization. Programming assignments give hands-on experience with document ranking, evaluation, categorizing documents into browsing hierarchies, and related topics.

Eligibility:

This course is open to all students who meet the prerequisites, except students in the LTI's MLT and PhD programs. Those students can take 11-741, Information Retrieval, which focuses more on research; this course focuses more on current practice.

Prerequisites:

Required: 15-211, Fundamental Data Structures and Algorithms, and either 21-241, Matrix Algebra, or 21-341, Linear Algebra.
Recommended: 15-213, Introduction to Computer Systems.

Time & Location:

Tu/Th, 12:00-1:20, Hamerschlag Hall B103

Instructor(s):

Jamie Callan and Yiming Yang

Teaching Assistants:

Yubin Kim, Reyyan Yeniterzi, Lu Jiang, Yuliang Yin (yyl0827@gmail.com)

Instructional Materials:

The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. The textbook can be purchased at the CMU book store.

There are selected additional readings, which are available online or placed on reserve in the Engineering and Science Library, 4th floor, Wean Hall.

Online access to some materials is restricted to the .cmu.edu domain. CMU people can get access from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

Homework:

Homework consists of programming projects and/or problem sets.

Grading:

60% homework (6 programming), 20% midterm, 20% final.

Course policies:

Late homework, Cheating, Laptops

Syllabus:

  1. Aug 27, Course overview and introduction
  2. Aug 29, Introduction to text search (Ch 1)
  3. Sep 3, Text representation (Ch 2)
  4. Sep 5, Search engine indexes (Ch 4)
      HW1 out, due Sep 17
  5. Sep 10, Information needs and queries (Ch 8)
  6. Sep 12, Vector space retrieval model (Ch 6-7)
  7. Sep 17, Probabilistic retrieval models (Ch 12)
  8. Sep 19, Probabilistic retrieval models (Ch 10)
      HW2 out, due Oct 3
  9. Sep 24, Evaluating search effectiveness (Ch 8)
  10. Sep 26, Evaluating search effectiveness
  11. Oct 1, Index construction (Ch 5)
  12. Oct 3, Index construction (Ch 20.3-20.4)
      HW3 out, due Oct 22
  13. Oct 8, Search log analysis
  14. Oct 10, Search log analysis
  15. Oct 15, Midterm exam (2012 midterm, 2009 midterm)
  16. Oct 17, Clustering (Ch 16, 17)
  17. Oct 22, Clustering
  18. Oct 24, Collaborative filtering
      HW4 out, due Nov 7
  19. Oct 29, Collaborative filtering
  20. Oct 31, Text categorization (Ch 13, 15)
  21. Nov 5, Text categorization
  22. Nov 7, Text categorization
      HW5 out, due Nov 20
  23. Nov 12, Text categorization
  24. Nov 14, Learning to rank (Joachims, SIGIR 2002)
  25. Nov 19, Learning to rank (Yue, SIGIR 2007)
      HW6 out, due Dec 3
  26. Nov 21, Significance tests (Yang & Liu, SIGIR 1999)
  27. Nov 26, Link analysis (Ch 21)
  28. Dec 3, Link analysis
  29. Dec 5, Query classification and federated search (Callan 2000)
  30. Dec 16, 8:30am – 11:30am, Final exam, GHC 4401 (Samples from 11-641 2012 Final; 11-741 2011; 11-741 2013)

Updated on Jul 30, 2013

Jamie Callan and Yiming Yang