Vorlesung „Information Retrieval und Web Search Engines“ (WS 2008/2009)

Information
Classification: 
Master Informatik, Master Wirtschaftsinformatik (module „Datenbanken und Informationssysteme“)
Credits: 
4
Exam: 
Oral (scoring 50% of total exercise points is required to take final exam)
Regular Dates: 
Wednesday, 10.30am–12.45pm; first lecture on November 5, 2008
Room MP 23.3 (Informatikzentrum, Mühlenpfordtstraße 23, room 160)
Contents
Contents: 

This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.

Summary: 

Introduction and fundamental notionsIndexingRetrieval models: Boolean, vector space, and probabilisticRelevance feedbackLanguage modelsLatent semantic indexingLearning to rankClusteringStruktur of the World Wide WebCrawlingLink analysisSearch engine optimization, spam, and countermeasures against spamUser interfaces

Materials

Lectures

Note that lectures and exercises will not be separated. Exercise parts will be interlaced into the lecture whereever adequate.

  Date Contents References Materials
1 5.11.2008 Introduction and fundamental notions (Bush, 1945) Slides
Video
Homework 1
2 12.11.2008 Fuzzy retrieval model
Coordination Level Matching
Vector space retrieval model
(Zadeh, 1965)
(Zadeh, 1978)
(Ogawa et al., 1991)
(Salton et al., 1975)
(Luhn, 1961)
(Spärck Jones, 1972)
(Robertson and Spärck Jones, 1976)
(Salton et al., 1983)
Slides
Video
Homework 2
3 19.11.2008 Probabilistic retrieval models (Robertson, 1977)
(Maron and Kuhns, 1960)
(van Rijsbergen, 1977)
(Croft and Harper, 1979)
(Greiff, 1998)
Slides
Video
Homework 3
4 26.11.2008 Latent Semantic Indexing (Salton and Buckley, 1988)
(Deerwester et al., 1990)
(Berry et al., 1995)
(Landauer and Dumais, 1997)
Slides
Video
Homework 4
5 3.12.2008 Document Clustering   Slides
Video
Homework 5
6 10.12.2008 Language Models and Evaluation (Saracevic, 2007) By popular request: MATLAB tutorials
Slides
Video
Homework 6 (Updated on Dec 12!)
7 17.12.2008 Feedback and Classification (Rocchio, 1971) Slides
Video
Homework 7
8 7.1.2009 Support Vector Machines (Burges, 1998)
(Ivanciuc, 2007)
(Joachims, 1998)
(Joachims, 2002)
Slides
Video
Homework 8 (Updated on Jan 9!)
9 14.1.2009 Indexing (Porter, 1980)
(Zobel and Moffat, 2006)
Slides
Video
Solution to Exercise 18 in MATLAB
Homework 9
10 21.1.2009 Introduction to Web Retrieval (Fetterly et al., 2004) Slides
Video
Homework 10
11 28.1.2009 Web Crawling   Slides
Video
Homework 11
12 4.2.2009 Link Analysis (Boyack et al., 2005)
(Page et al., 1999)
(Kleinberg, 1999)

Slides
Video
Homework 12
13 11.2.2009 What Remains to be Done   Slides
Video

Books

(Manning et al., 2008)
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ http ]
(Belew, 2000)
Richard K. Belew. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 2000. [ http ]
(Baeza-Yates and Ribeiro-Neto, 1999)
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. [ http ]
(van Rijsbergen, 1979)
Cornelis Joost van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. [ .html ]

Further Readings

(Bush, 1945)
Vannevar Bush. As we may think. The Atlantic Monthly, 176(1):101-108, 1945.
(Zadeh, 1965)
Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8(3):338-353, 1965. [ DOI ]
(Zadeh, 1978)
Lotfi A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3-28, 1978. [ DOI ]
(Ogawa et al., 1991)
Yasushi Ogawa, Tetsuya Morita, and Kiyohiko Kobayashi. A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39(2):163-179, 1991. [ DOI ]
(Salton et al., 1975)
Gerard Salton, Chung-Shu Yang, and Anita Wong. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975. [ DOI ]
(Luhn, 1961)
Hans Peter Luhn. The automatic derivation of information retrieval encodements from machine-readable texts. Volume 3, Part 2 of Information Retrieval and Machine Translation, pages 1021-1028. Interscience Publishers, 1961.
(Spärck Jones, 1972)
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972. [ DOI
(Robertson and Spärck Jones, 1976)
Stephen E. Robertson and Karen Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
(Salton et al., 1983)
Gerard Salton, Edward A. Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022-1036, 1983. [ DOI ]
(Robertson, 1977)
Stephen E. Robertson. The probabilistic ranking principle in IR. The Journal of Documentation, 33(4):294-304, 1977.
(Maron and Kuhns, 1960)
Melvin E. Maron and John L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216-244, 1960. [ DOI ]
(van Rijsbergen, 1977)
Cornelis Joost van Rijsbergen. A theoretical basis for the use of co-occurrence data in information retrieval. The Journal of Documentation, 33(2):106-119, 1977. [ http ]
(Croft and Harper, 1979)
W. Bruce Croft und David J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, 1979.
(Greiff, 1998)
Warren R. Greiff. A theory of term weighting based on exploratory data analysis. In W. Bruce Croft, Alistair Moffat, Cornelis Joost van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), pages 11-19. ACM Press, 1998. [ DOI ]
(Salton and Buckley, 1988)
Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988. [ DOI ]
(Deerwester et al., 1990)
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990. [ DOI ]
(Berry et al., 1995)
Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995. [ DOI ]
(Landauer and Dumais, 1997)
Thomas K. Landauer and Susan T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997. [ DOI ]
(Saracevic, 2007)
Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13):1915-1933, 2007. [ DOI ]
(Rocchio, 1971)
Joseph J. Rocchio, Jr. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice Hall, 1971.
(Burges, 1998)
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [ DOI ]
(Ivanciuc, 2007)
Ovidiu Ivanciuc. Applications of support vector machines in chemistry. In Kenneth B. Lipkowitz, Thomas R. Cundari, and Donald B. Boyd, editors, Reviews in Computational Chemistry, volume 23, pages 291-400. Wiley, 2007. [ .pdf ]
(Joachims, 1998)
Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML 1998), volume 1398 of Lecture Notes in Artificial Intelligence, pages 137-142. Springer, 1998. [ DOI ]
(Joachims, 2002)
Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002), pages 133-142. ACM Press, 2002. [ DOI ]
(Porter, 1980)
Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
(Zobel and Moffat, 2006)
Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006. [ DOI ]
(Fetterly et al., 2004)
Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of Web pages. Software: Practice and Experience, 34(2):231-237, 2004. [ DOI ]
(Boyack et al., 2005)
Kevin W. Boyack, Richard Klavans, and Katy Börner. Mapping the backbone of science Scientometrics, 64(3):351-374, 2005. [ DOI ]
(Page et al., 1999)
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.
(Kleinberg, 1999)
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. [ DOI ]
AttachmentDateSize
File irws-ws0809-v1-2x3.pdf17/11/08 3:09 pm2.4 MB
File irws-ws0809-u1.pdf07/11/08 5:44 pm114.94 KB
File irws-ws0809-u2.pdf13/11/08 2:52 pm119.73 KB
File irws-ws0809-v2-2x3.pdf17/11/08 3:10 pm1.41 MB
File irws-ws0809-v3-2x3.pdf19/11/08 3:05 pm1.76 MB
File irws-ws0809-u3.pdf19/11/08 3:05 pm131.28 KB
File irws-ws0809-v4-2x3.pdf27/11/08 2:37 pm2.09 MB
File irws-ws0809-u4.pdf27/11/08 2:23 pm119.61 KB
File irws-ws0809-u5.pdf03/12/08 9:42 am121.95 KB
File irws-ws0809-v5-2x3.pdf03/12/08 3:53 pm1.88 MB
File irws-ws0809-v6-2x3.pdf19/12/08 10:28 am1.41 MB
File irws-ws0809-u6.pdf12/12/08 9:38 am103.97 KB
File irws-ws0809-u7.pdf18/12/08 10:29 am151.65 KB
File irws-ws0809-v7-2x3.pdf19/12/08 10:28 am1.15 MB
File irws-ws0809-v8-2x3.pdf13/01/09 10:04 am1.23 MB
File irws-ws0809-u8.pdf09/01/09 9:50 am164.04 KB
File irws-ws0809-v9-2x3.pdf22/01/09 3:00 pm1.42 MB
File irws-ws0809-ex18.m14/01/09 4:18 pm4.86 KB
File irws-ws0809-u9.pdf15/01/09 4:14 pm25.35 KB
File irws-ws0809-u10.pdf22/01/09 3:00 pm29.61 KB
File irws-ws0809-v10-2x3.pdf22/01/09 3:00 pm1.92 MB
File irws-ws0809-v11-2x3.pdf29/01/09 1:46 pm1.82 MB
File irws-ws0809-u11.pdf29/01/09 3:27 pm21.13 KB
File irws-ws0809-u12.pdf05/02/09 2:06 pm26.34 KB
File irws-ws0809-v12-2x3.pdf05/02/09 2:06 pm2.24 MB
File irws-ws0809-v13-2x3.pdf11/02/09 4:42 pm1.02 MB