COSC 488: Introduction to Information Retrieval

Success Story:

The course was developed with NSF support at the Information Retrieval Lab and evaluated for 3 consecutive years. It has been taught since 2001 (17 times) and is regularly updated. Students who completed this course successfully have found jobs at Microsoft, Google, Yahoo, Facebook, Amazon, and other companies.

Course Description:

Information retrieval is the identification of textual components, be they web pages, blogs, microblogs, documents, medical transcriptions, mobile data, or other big data elements, that are relevant to the needs of the user. Relevance is determined either as a global absolute or within a given context or viewpoint. The course teaches the practical yet theoretically grounded algorithms, both foundational and advanced, needed to identify such relevant components. It covers information-retrieval techniques and theory, addressing both the effectiveness and the run-time performance of information-retrieval systems. The focus is on algorithms and heuristics used to find textual components relevant to the user request and to find them fast. The course covers the architecture and components of search engines, such as the parser, index builder, and query processor. In doing so, it covers various retrieval models, relevance ranking, evaluation methodologies, and efficiency considerations. Students learn the material by building a prototype of such a search engine. These approaches are in daily use by search and social media companies.
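As a rough illustration of the three components named above (parser, index builder, query processor), the following is a minimal sketch, not the course's project design. The function names `tokenize`, `build_index`, and `search`, the sample documents, and the choice of Boolean AND matching are all assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Parser: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Index builder: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Query processor: ids of documents containing every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample collection, for illustration only.
docs = {
    1: "Information retrieval finds relevant documents.",
    2: "Search engines build an inverted index.",
    3: "An index makes retrieval fast.",
}
index = build_index(docs)
print(search(index, "index retrieval"))  # prints [3]
```

A real engine would add relevance ranking, stemming, and compression on top of this structure; the sketch only shows how the three components hand data to one another.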

Prerequisite:

Data Structures, and comfort with programming.

Recommended Texts:

  • C. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008, ISBN: 978-0-521-86571-5; http://nlp.Stanford.edu/IR-book/

Handouts:

Handouts for most topics covered in class will be available on the class forum.

Grading & Due Dates (Tentative; will be finalized by the 1st day of class):

Project (40%)

3-4 projects that incrementally build the engine. Any programming language may be used (your choice). The projects require designing and implementing various components of a search engine per the assignment requirements, running experiments, and analyzing the results. Deliverables for each project part (details will be specified when the project is assigned) include: cover page, design document, software, results and analysis, and, potentially, a demo. Projects are individual (solo) tasks.

Research Presentation (8%)

Students must attend all presentations. A pool of papers will be made available, and students will be asked to pick several choices, one of which will be finalized for the student to present in class. Whether this is an individual (solo) or a group assignment will be announced.

Exams, 2-3 exams (52%)

Course Outline (Tentative!):

Introduction, Overview of IR

IR Utilities: Parser/Tokenizer, Phrase Recognition, Stemming, N-Grams

Efficiency: Indexing

IR Models: Boolean, Vector Space Model; Similarity Measures

IR Models: Probabilistic Model

IR Models: Language Model & Topic Model

Relational Approach

IR Evaluation

IR Utility: Passage-Based Retrieval

IR Utility: Relevance Feedback and Other Query Expansion Techniques

Efficiency: Compression

Efficiency: Top Docs, Query Threshold

Clustering

Web Search Ranking

Peer-to-Peer Search

Intro to Text Classification

Web Personalization and Recommender Systems

Research Paper Presentations

Student Presentations

Late Assignment Policy:

The policy will be posted on the syllabus, which will be given to the class by the 1st day of class.

Academic Integrity:

Visit the Honor System Website at http://gervaseprograms.georgetown.edu/honor/