Getting your data from the Internet

This will be a series of workshops covering self-contained material that will help participants set up pipelines for extracting data from the internet. The pace of the workshops will be adapted to the participants' familiarity with R and with coding in general.

Why is this skill important?

If you work with a lot of data, or are training to, you will often need to gather data at scale from the internet. Interesting questions need interesting datasets to answer them, and not all of those datasets will be served to you as tidy Excel sheets. Copying and pasting by hand stops working once a dataset exceeds a certain size.

For professionals - you could use this set of tools to automate your data extraction and update workflows.

For researchers - whether you are planning to embark on your own research project or assisting someone with theirs, these tools can save you much of the money otherwise needed to procure data.

For students - whether you need new data for your thesis or want these skills on your CV when applying for data science and research positions, these workshops will help you tool up.

Prerequisites

Basic familiarity with coding/logic will be helpful.

Workshop Outline

The bare essentials of R

(Session 1)

  • Set up R, RStudio and the packages relevant to the course (dplyr, rvest, httr)
  • Basic syntax, and how it compares to Python
  • R script vs RMarkdown
Working with PDFs in R

(Session 1)

  • Downloading files
  • Parsing PDFs
  • Cleaning using stringr, intro to regexes
  • OCR
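As a small taste of the parse-then-clean step above, here is a minimal sketch. It assumes the pdftools package for extraction (pdftools is not in the setup list above, so treat it as one possible choice); the stringr cleaning below runs on a sample line standing in for text pulled out of a PDF.

```r
library(stringr)

# txt <- pdftools::pdf_text("report.pdf")   # would give one string per page

# A made-up line of the kind PDF extraction often produces:
sample_line <- "  Maharashtra    12,345   2021  "

clean <- str_squish(sample_line)              # collapse repeated whitespace
state <- str_extract(clean, "^[A-Za-z]+")     # leading word = state name
value <- str_extract(clean, "[0-9,]+")        # first numeric field
value <- as.numeric(str_remove_all(value, ","))
```

Regexes like `"[0-9,]+"` do most of the heavy lifting when turning raw PDF text back into columns.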
Web-scraping with R

(Session 2)

  • What can you do?
  • Why are these skills essential?
  • The types of web-pages/sources you can scrape from
  • The tools available in R
Using rvest to scrape complex webpages

(Session 2)

  • Introduction to the package
  • Scraping an example web-page
  • Using sessions and forms in rvest
  • Bonus: Preparing auto-updating reports
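To give a flavour of rvest, here is a toy sketch. The HTML and the `#prices` selector are invented for illustration; on a real page you would call `read_html()` on a URL instead of an inline string.

```r
library(rvest)

# Parse a small inline HTML fragment (a stand-in for a fetched page)
page <- read_html('
  <table id="prices">
    <tr><th>Item</th><th>Price</th></tr>
    <tr><td>Tea</td><td>10</td></tr>
    <tr><td>Coffee</td><td>25</td></tr>
  </table>')

# Select the table by its CSS id and convert it to a data frame
prices <- page |>
  html_element("#prices") |>
  html_table()
```

The same two calls - a CSS selector plus `html_table()` or `html_text2()` - carry you a surprisingly long way on real pages.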
Using RSelenium to scrape a web-page

(Session 3)

  • What is Selenium, and how to set it up
  • HTML and selectors for dummies
  • Setting up the scraper
  • Pros and cons of RSelenium
Using HTTP Requests

(Session 4)

  • What kind of pages can you scrape with these?
  • Why use this technique?
  • GET vs POST requests
  • Constructing requests
  • Chaining it all together
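As a preview of constructing requests with httr, here is a minimal sketch. The endpoint URL and query parameters are placeholders, not a real service.

```r
library(httr)

# Build the request URL from a base URL plus query parameters
url <- modify_url("https://example.com/api",
                  query = list(state = "MH", year = 2021))

# Sending it would look like this (not run here, since the URL is fake):
# resp <- GET(url)                       # GET: parameters live in the URL
# data <- content(resp, as = "parsed")   # parse the JSON/HTML response
# POST(url, body = list(...))            # POST: parameters go in the body
```

Separating "build the URL" from "send the request" makes it easy to chain many such calls in a loop.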
Using an Application Programming Interface (API)

(Session 4)

  • Introduction to APIs
  • Querying an API on the web
  • Constructing queries and wrappers in R
  • Getting data from data.gov.in
  • Bonus: the datagovindia package for R and Python
Project

(Session 5-6)

  • The class will choose an example web-page/portal and we will build a scraper live to see how the thinking process works
  • Assigning web-pages/data sources and building the scrapers
  • We will meet again to discuss feedback and possible improvements

Pricing

Sessions will be conducted once a week and should run around 1-2 hours each; a session may run longer or shorter depending on its goal. All topics mentioned above will be covered.

Five sessions will cover the topics above, and one session will be devoted to discussing the project.

I will keep compiling a list of references and will try to share a handout at the end of each session.

  • For professionals: INR 1,000 per class, INR 4,500 for the entire series
  • For students: INR 600 per class, INR 2,500 for the entire series

Special discounts will be considered for students based on need and the availability of spots in the classes. Please contact me for more information about this. An ID proof will be required to get the student discount.

These sessions will be conducted online and digital certificates will be provided to those who need them.

Watch this space for sign-up information. Find out more about me here.