The module presents the defining characteristics of "big data" and uses specific use cases to highlight the growing importance of extracting meaningful information and valuable insights from this enormous volume of heterogeneous data (for example sensor data, purchase and consumption data, data from social media and social networks, open data, etc.). Participatory data collection through crowdsourcing and crowdsensing systems is also discussed, with popular examples of how these concepts are applied. The practical part focuses on data ingestion, presenting data crawling and scraping methodologies with concrete examples on social media and the Web, as well as the use of pre-compiled, publicly available datasets.
Prerequisites: Python
- Lesson 1
- Introduction to big data and the various data sources that characterize it
- Open data and linked open data, crowdsourcing and crowdsensing
- Big data analytics: interesting use cases
- Lesson 2
- Social media crawling: REST architecture and OAuth authentication framework, Twitter and Reddit overview
- Introduction to using the PRAW library for data access to Reddit + exercises with PRAW
- Lesson 3
- Exercises with PRAW
- Introduction to HTML/CSS technologies
- Lesson 4
- HTML/CSS exercises
- Introduction to Web scraping in Python: Selenium and Beautiful Soup
- Lesson 5
- Exercises on Selenium
- Lesson 6
- Exercises on Beautiful Soup
- CSV/JSON data parsing
- Exam
- Selenium
- Beautiful Soup
- PRAW
- Theoretical knowledge:
- Characterization of "big data" and of the knowledge that can be gained from its analysis
- Data characterization: open sources, closed sources, open data and linked open data. Data collection or development of specific services that rely on groups of users (crowdsensing, crowdsourcing).
- HTML/CSS technologies underlying the functioning of the Web
- REST architectures
- Social media with focus on Twitter and Reddit: analysis of the main characteristics of social networks and high-level overview of the available APIs.
- Practical knowledge:
- Use of HTML tags and CSS selectors for creating web pages.
- Website scraping with concrete examples using the Selenium and Beautiful Soup libraries
- Social media crawling with concrete examples using the Reddit API through the PRAW library.
- Parsing of data in CSV/JSON format
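
As an illustration of the last practical skill, the following is a minimal sketch of CSV/JSON parsing using only the Python standard library. The sample records, field names (`id`, `title`, `score`) and the helper names `parse_csv` and `parse_json` are assumptions made for this example and are not taken from the course material.

```python
import csv
import io
import json

# Hypothetical sample data standing in for a scraped or downloaded dataset.
CSV_TEXT = "id,title,score\n1,First post,10\n2,Second post,25\n"
JSON_TEXT = (
    '[{"id": 1, "title": "First post", "score": 10},'
    ' {"id": 2, "title": "Second post", "score": 25}]'
)


def parse_csv(text):
    """Return one dict per CSV row, converting numeric fields to int."""
    rows = []
    # DictReader uses the first line as the header and yields string values.
    for row in csv.DictReader(io.StringIO(text)):
        rows.append(
            {"id": int(row["id"]), "title": row["title"], "score": int(row["score"])}
        )
    return rows


def parse_json(text):
    """Return the decoded JSON document (here, a list of dicts)."""
    return json.loads(text)


csv_rows = parse_csv(CSV_TEXT)
json_rows = parse_json(JSON_TEXT)

# Both formats now share the same in-memory representation.
total_score = sum(row["score"] for row in csv_rows)
```

CSV values are always read as strings, so the explicit `int` conversion is what makes the two representations comparable; JSON keeps numeric types on its own.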