Information Retrieval

Big Data Sensing & Procurement
Academic Year: 

The course introduces the design, implementation and analysis of Information Retrieval systems that efficiently and effectively manage and search information stored as collections of texts, possibly unstructured (e.g. the Web), and as labeled graphs (e.g. knowledge graphs). The theoretical lessons describe the main components of a modern Information Retrieval system, and more precisely of a search engine: crawler, text analyzer, storage and compressed index, query solver, text annotator (based on knowledge graphs and entity linkers), and rankers. The laboratory lessons put the theory into practice with the help of well-known software tools: ElasticSearch (an open-source search engine), Neo4J (a graph database), and TagMe and Swat (two entity annotators). The exam consists of a written test, which evaluates the knowledge acquired in both the theoretical and the laboratory lessons (weight 60%), and of a software project shared with several other courses (weight 40%), which evaluates the technical skills in the use of the aforementioned tools.

Prerequisites: Basic notions of algorithms, programming and use of Python programming environments.

  • Lesson 1
    • Introduction to IR and history of search engines
    • The structure of a search engine
    • Inverted lists and query resolution for AND, soft-AND, phrase, proximity, and zone queries
  • Lesson 2 
    • The Web graph: its structure (bow tie), its properties, its in-memory representation, and examples of traversal algorithms (BFS and DFS).
  • Lesson 3
    • The crawling module, the parsing module, and keyword extraction (with PoS tagging, RAKE, and statistics).
  • Lesson 4 
    • Creation in Python of a text parser (tokenization, stopword removal, normalization and stemming) and of a word cloud.
  • Lesson 5 
    • The first generation of search engines (AltaVista, Lycos, …)
    • The laws of Zipf, Heaps and Luhn
    • Textual ranking: Jaccard and TF-IDF
    • The vector space model and cosine similarity
    • Text spam.
  • Lesson 6
    • ElasticSearch.
  • Lesson 7
    • The second generation of search engines (Google et al.)
    • Ranking based on the Web graph, random walk and PageRank, Topic-based and Personalized PageRank.
    • Evaluation of a search engine: precision, recall and F1.
  • Lesson 8
    • Knowledge graphs and latest-generation search engines
    • Entity linkers and semantic text annotation with TagMe
    • Entity linker applications: keyword extraction, representation and comparison of texts using labeled and weighted graphs, reasoning.
    • Use of the TagMe and Swat libraries.
  • Lesson 9
    • Definition, properties and functionality of GraphDBs using the Neo4J library.
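
As an illustration of the Lesson 1 topic "inverted lists and query resolution for AND", the following is a minimal Python sketch and not course material. The function names (`build_inverted_index`, `intersect`, `and_query`) and the toy documents are assumptions made for this example; tokenization is a plain whitespace split, with no stopword removal or stemming.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of IDs of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def intersect(p1, p2):
    """Intersect two sorted posting lists with a linear merge-style scan."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def and_query(index, terms):
    """Return the IDs of the documents that contain every query term."""
    postings = [index.get(term.lower(), []) for term in terms]
    if not postings:
        return []
    # Start from the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = postings[0]
    for plist in postings[1:]:
        result = intersect(result, plist)
    return result


# Hypothetical toy collection, used only for illustration.
DOCS = [
    "the web graph has a bow tie structure",
    "a crawler visits the web graph",
    "an inverted index maps terms to documents",
]
INDEX = build_inverted_index(DOCS)
```

With this toy collection, `and_query(INDEX, ["web", "crawler"])` intersects the posting lists of the two terms and returns only the IDs of documents containing both.
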
Techniques and tools:
  • requests
  • elasticsearch
  • nltk
  • wordcloud
  • matplotlib
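
As an illustration of the Lesson 7 topic "random walk and PageRank", the following is a minimal sketch of PageRank computed by power iteration in plain Python. It is not course material. The function name `pagerank`, the damping factor 0.85, the fixed number of iterations, and the toy graph are assumptions made for this example. The graph is an adjacency dictionary in which every link target must also appear as a key.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration on an adjacency dictionary."""
    nodes = list(graph)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Teleportation: each node receives the same base share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            out_links = graph[u]
            if out_links:
                # Split the damped rank of u evenly among its out-links.
                share = damping * rank[u] / len(out_links)
                for v in out_links:
                    new_rank[v] += share
            else:
                # Dangling node: spread its damped rank over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy Web graph: node -> list of nodes it links to.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
RANKS = pagerank(GRAPH)
```

In this toy graph, node C receives links from A, B and D and therefore ends up with the highest score, while D, which has no incoming links, keeps only the teleportation share.
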