Special Topic: Search Engines and Web Mining
15-493
(Renumbered as 11-441 for Fall 2010)

Description:

This course provides a comprehensive introduction to the theory and implementation of algorithms for organizing and searching large text collections. The first half of the course studies text search engines for enterprise and Web environments; the open-source Indri search engine is used as a working example. The second half studies text mining techniques such as clustering, categorization, and information extraction. Programming assignments give hands-on experience with document ranking algorithms, categorizing documents into browsing hierarchies, and related topics.

Prerequisites:

Required: 15-211, Fundamental Data Structures and Algorithms; and 21-241, Matrix Algebra, or 21-341, Linear Algebra.
Recommended: 15-213, Introduction to Computer Systems.

Time & Location:

Tu/Th 12:00 - 1:20, Wean Hall 5310

Instructor(s):

Jamie Callan and Yiming Yang

Teaching Assistant:

Abhay Harpale

Instructional Materials:

The textbook is Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. The textbook can be purchased at the CMU book store.

Selected additional readings are available online or on reserve in the Engineering and Science Library, 4th floor, Wean Hall.

Online access to some materials is restricted to the .cmu.edu domain. CMU affiliates can access these materials from outside .cmu.edu (e.g., from home) using CMU's WebVPN Service.

Homework:

Homework consists of programming projects and/or problem sets.

Grading:

50% homework (2 programming, 2 written), 10% quizzes, 20% midterm, 20% final.

Course policies:

Late homework, Cheating, Laptops

Syllabus:

  1. Aug 25, Course overview and introduction
  2. Aug 27, Introduction to text search (Ch 1)
  3. Sep 1, Text representation (Ch 2)
    HW1 out, due Sep 18
  4. Sep 3, Index construction (Ch 4)
  5. Sep 8, Index construction (Ch 5)
  6. Sep 10, Information needs and queries (Ch 8)
  7. Sep 15, Evaluating search effectiveness
  8. Sep 17, Evaluating search effectiveness
  9. Sep 22, All SCS classes cancelled - attend GHC opening events instead
  10. Sep 24, Vector space retrieval model (Ch 6-7)
  11. Sep 29, Statistical language models for IR (Ch 12)
    HW2 out, due Oct 20
  12. Oct 1, Structured documents, combination of evidence (Ch 10)
  13. Oct 6, Hypertext retrieval models (Ch 21), Quiz 1
  14. Oct 8, Hypertext retrieval models, Quiz 2
  15. Oct 13, Midterm exam (Sample midterm), Midterm answers
  16. Oct 15, Optimizations for large-scale search (Ch 20.3-20.4)
  17. Oct 20, Query classification and federated search
  18. Oct 22, Query classification and federated search
  19. Oct 27, Query classification and federated search

     HW3 out, due Nov 11 (Part 1) and Nov 19 (Part 2).

  20. Oct 29, Midterm make-up exam (the first 15 minutes), Clustering (Ch 16, 17)
  21. Nov 3, Clustering
  22. Nov 5, Clustering
  23. Nov 10, Collaborative filtering
  24. Nov 12, Collaborative filtering, Quiz 3
  25. Nov 17, Collaborative filtering, Significance tests (Yang & Liu, SIGIR 1999)
  26. Nov 19, Significance tests

      HW4 out, due Dec 1.

  27. Nov 24, Introduction to text categorization (Ch 13)
  28. Dec 1, Naive Bayes (Ch 15)
  29. Dec 3, Support Vector Machines
  30. Dec 8, Final exam, GHC 4215, 1:00-4:00pm (Sample final)


Updated on September 29, 2009

Jamie Callan and Yiming Yang