Lecture “Information Retrieval and Web Search Engines”

Information
Classification: 
Master
Credits: 
4 or 5 (depending on examination rules)
Exam: 
Oral
Regular Dates: 
First lecture starts on the 22nd Of October 2013
Every Tuesday, from 14:00 till 16:30.
Lecture takes place at Informatikzentrum, Mühlenpfordtstraße 23, room 161
Contents
Contents: 

This lecture provides an introduction to the fields of information retrieval and web search. We will discuss how relevant information can be found in very large and mostly unstructured data collections; this is particularly interesting in cases where users cannot provide a clear formulation of their current information need. Web search engines like Google are a typical application of the techniques covered by this course.

 This lecture will be held in English.  The lecture will be video recorded and streamed live. If you choose to attend the lecture via live streaming, please get in touch with the lecture's assistant to get further details at: elmaarryatifis [dot] cs [dot] tu-bs [dot] de

The oral exams will take place on the following days. If you wish to sit the exam, please contact our secretary Regine Dalkiran to set an appointment.

Thursday, February 27th
Friday, February 28th
Monday-Wednesday, March 3rd, 4th and 5th
Monday-Friday, March 24th, 25th, 26th, 27th, and 28th

Materials

Lectures

Please note the following:

  •   Lectures and exercises are not separated. Instead, exercise parts will be interlaced into the lecture whereever adequate.
  •   Lectures' slides will always be uploaded on Mondays, so that you can print them and bring them along to the lecture if you wish to write down your notes directly there.
  •  The lectures will be video recorded, and will be generally uploaded along with the homework assignement every Wednesday.
  Date Contents References Materials
1 22.10.2013 Introduction and fundamental notions (Bush, 1945)
(Kunze, 1999)
Slides
Print Slides
Homework 1
Video
2 29.10.2013 Fuzzy retrieval model
Coordination level matching
Vector space retrieval model
(Zadeh, 1965)
(Zadeh, 1978)
(Ogawa et al., 1991)
(Salton et al., 1975)
(Luhn, 1961)
(Spärck Jones, 1972)
(Robertson and Spärck Jones, 1976)
(Salton et al., 1983)
Slides
Print Slides
Homework 2
Video
Due to technical fault, the slides stream is unavailable
3 05.11.2013 Probabilistic retrieval models (Robertson, 1977)
(Maron and Kuhns, 1960)
(van Rijsbergen, 1977)
(Croft and Harper, 1979)
(Greiff, 1998)
Slides
Print Slides
Video
No Homework
4 12.11.2013 Indexing  (Porter, 1980)
(Zobel and Moffat, 2006)
Slides
Print Slides
Homework 3
Video
- 19.11.2013 NO LECTURE    
5 26.11.2013 Latent Semantic Indexing (Salton and Buckley, 1988)
(Deerwester et al., 1990)
(Berry et al., 1995)
(Landauer and Dumais, 1997)
Slides
Print Slides
Homework 4
Video
6 03.12.2013 Language models
Retrieval evaluation
(Saracevic, 2007) Slides
Print Slides
Homework 5
Video
7 10.12.2013 Document Clustering   Slides
Print Slides
Homework 6
Matlab Detours:
1. K-Means
2. Agglomerative clustering
Video
8 17.12.2013 Relevance Feedback
Classification
(Rocchio, 1971) Slides
Print Slides
Video
No Homework
9 07.01.2014 Support Vector Machines (Burges, 1998)
(Ivanciuc, 2007)
(Joachims, 1998)
(Joachims, 2002)
Slides
Print Slides
Homework 7
Video
Due to technical fault, the video is at times damaged
10 14.01.2014 Introduction to Web Retrieval (Fetterly et al., 2004) Slides
Print Slides
Homework 8
Video
11 21.01.2014 Web Crawling   Slides
Print Slides
Homework 9
Video
12 28.01.2014
"Tag der Informatik", accordingly, the lecture will be delayed and will start at 16 pm
Link Analysis (Boyack et al., 2005)
(Page et al., 1999)
(Kleinberg, 1999)
Slides
Print Slides
Homework 10
Video
13 04.02.2014 Miscellanious   Slides
Print Slides
Video

Books

(Manning et al., 2008)
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ http ]
(Belew, 2000)
Richard K. Belew. Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW. Cambridge University Press, 2000. [ http ]
(Baeza-Yates and Ribeiro-Neto, 1999)
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999. [ http ]
(van Rijsbergen, 1979)
Cornelis Joost van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979. [ .html ]

Further Readings

(Bush, 1945)
Vannevar Bush. As we may think. The Atlantic Monthly, 176(1):101-108, 1945.
(Kunze, 1999)
John A. Kunze. Encoding Dublin Core Metadata in HTML, The Internet Society, 1999 [ DOI ]
(Zadeh, 1965)
Lotfi A. Zadeh. Fuzzy sets. Information and Control, 8(3):338-353, 1965. [ DOI ]
(Zadeh, 1978)
Lotfi A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1(1):3-28, 1978. [ DOI ]
(Ogawa et al., 1991)
Yasushi Ogawa, Tetsuya Morita, and Kiyohiko Kobayashi. A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39(2):163-179, 1991. [ DOI ]
(Salton et al., 1975)
Gerard Salton, Chung-Shu Yang, and Anita Wong. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, 1975. [ DOI ]
(Luhn, 1961)
Hans Peter Luhn. The automatic derivation of information retrieval encodements from machine-readable texts. Volume 3, Part 2 of Information Retrieval and Machine Translation, pages 1021-1028. Interscience Publishers, 1961.
(Spärck Jones, 1972)
Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11-21, 1972. [ DOI
(Robertson and Spärck Jones, 1976)
Stephen E. Robertson and Karen Spärck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129-146, 1976.
(Salton et al., 1983)
Gerard Salton, Edward A. Fox, and Harry Wu. Extended boolean information retrieval. Communications of the ACM, 26(11):1022-1036, 1983. [ DOI ]
(Robertson, 1977)
Stephen E. Robertson. The probabilistic ranking principle in IR. The Journal of Documentation, 33(4):294-304, 1977.
(Maron and Kuhns, 1960)
Melvin E. Maron and John L. Kuhns. On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3):216-244, 1960. [ DOI ]
(van Rijsbergen, 1977)
Cornelis Joost van Rijsbergen. A theoretical basis for the use of co-occurrence data in information retrieval. The Journal of Documentation, 33(2):106-119, 1977. [ http ]
(Croft and Harper, 1979)
W. Bruce Croft und David J. Harper. Using probabilistic models of document retrieval without relevance information. Journal of Documentation, 35(4):285-295, 1979.
(Greiff, 1998)
Warren R. Greiff. A theory of term weighting based on exploratory data analysis. In W. Bruce Croft, Alistair Moffat, Cornelis Joost van Rijsbergen, Ross Wilkinson, and Justin Zobel, editors, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1998), pages 11-19. ACM Press, 1998. [ DOI ]
(Salton and Buckley, 1988)
Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513-523, 1988. [ DOI ]
(Deerwester et al., 1990)
Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391-407, 1990. [ DOI ]
(Berry et al., 1995)
Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37(4):573-595, 1995. [ DOI ]
(Landauer and Dumais, 1997)
Thomas K. Landauer and Susan T. Dumais. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104(2):211-240, 1997. [ DOI ]
(Saracevic, 2007)
Tefko Saracevic. Relevance: A review of the literature and a framework for thinking on the notion in information science. Part II: Nature and manifestations of relevance. Journal of the American Society for Information Science and Technology, 58(13):1915-1933, 2007. [ DOI ]
(Rocchio, 1971)
Joseph J. Rocchio, Jr. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 313-323. Prentice Hall, 1971.
(Burges, 1998)
Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998. [ DOI ]
(Ivanciuc, 2007)
Ovidiu Ivanciuc. Applications of support vector machines in chemistry. In Kenneth B. Lipkowitz, Thomas R. Cundari, and Donald B. Boyd, editors, Reviews in Computational Chemistry, volume 23, pages 291-400. Wiley, 2007. [ .pdf ]
(Joachims, 1998)
Thorsten Joachims. Text categorization with support vector machines: Learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of the 10th European Conference on Machine Learning (ECML 1998), volume 1398 of Lecture Notes in Artificial Intelligence, pages 137-142. Springer, 1998. [ DOI ]
(Joachims, 2002)
Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD 2002), pages 133-142. ACM Press, 2002. [ DOI ]
(Porter, 1980)
Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, 1980.
(Zobel and Moffat, 2006)
Justin Zobel and Alistair Moffat. Inverted files for text search engines. ACM Computing Surveys, 38(2), 2006. [ DOI ]
(Fetterly et al., 2004)
Dennis Fetterly, Mark Manasse, Marc Najork, and Janet L. Wiener. A large-scale study of Web pages. Software: Practice and Experience, 34(2):231-237, 2004. [ DOI ]
(Boyack et al., 2005)
Kevin W. Boyack, Richard Klavans, and Katy Börner. Mapping the backbone of science Scientometrics, 64(3):351-374, 2005. [ DOI ]
(Page et al., 1999)
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab, 1999.
(Kleinberg, 1999)
Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999. [ DOI ]