datagovindia : A single window to all APIs of data.gov.in

We wrote the datagovindia packages for R and Python so that users of these environments can access all APIs on data.gov.in. India's Open Government Data (OGD) platform hosts more than 130,000 APIs. APIs can be hard to tackle for those unfamiliar with HTTP requests. Even for those who are familiar, finding the right API and its ID and then writing an ad-hoc wrapper can be time-consuming. Our packages let the user do it all within their preferred coding environment!

This blog is a tutorial for the R package.

The functionality is centered on three aspects:

API discovery - Finding the right API from all the available APIs
API information - Getting information about a particular API
Querying the API - Getting a tidy data set from the chosen API

Installation

The package is on CRAN; install it using:

install.packages("datagovindia")

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("econabhishek/datagovindia")

Prerequisites

An account on data.gov.in
An API key from the My Account page (instructions are in the official guide)

Setup

library(datagovindia)
library(magrittr)  # provides %>% in case it is not already attached

API Discovery

The APIs on the portal are scraped every week to update a list of all APIs and the information attached to them, such as sector, source and field names. The data.gov.in website provides search through string matching and drop-down menus; we have given that a boost. The functions in this package allow more robust string-based searches.
A user can search by API title, description, organization type, organization (ministry), sector and source. Briefly, there are two types of functions here: the first returns a list of all available unique organization types, organizations (ministries), sectors and sources, and the second lets one “search” by these criteria and more.

Here is a demonstration of the former (showing only the first few values):

###List of organizations (or ministries)
get_list_of_organizations() %>% 
  head
#> [1] "Ministry of Health and Family Welfare"  
#> [2] "Department of Health and Family Welfare"
#> [3] ""                                       
#> [4] "Ministry of Home Affairs"               
#> [5] "Department of States"                   
#> [6] "National Crime Records Bureau (NCRB)"

###List of sectors 
get_list_of_sectors() %>% 
  head
#> [1] "Health and Family welfare"    "Family Welfare"              
#> [3] "Health"                       ""                            
#> [5] "Home Affairs and Enforcement" "Police"

Searching for the right API

Once you have an idea of what you are looking for, search queries can be constructed using titles and descriptions as well as the categories explored earlier. Each search returns a data.frame with information on the APIs matching the search keywords. Because the results are data.frames, multiple search functions can be combined.

##Single Criteria
search_api_by_title(title_contains = "pollution") %>% head(2)
#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

index_name	title	description	org_type	org	sector	source	created_date	updated_date
e374f644-b9d4-4e2a-b55f-f3888859abd6	State-wise Discharge of Waste Water with Pollution Load in Terms of Biochemical Oxygen Demand (BOD) into River Ganga and its Tributaries during 2020-21	State-wise Discharge of Waste Water with Pollution Load in Terms of Biochemical Oxygen Demand (BOD) into River Ganga and its Tributaries during 2020-21	Central	Rajya Sabha	All	data.gov.in	2023-01-12 23:58:35	2023-01-13 04:27:44
66ffc876-6ae5-4a3c-9d02-f9be908cb3e9	State/UTs-wise Identified Polluted Rivers and the Status of Action Plans approved by Central Pollution Control Board (CPCB) during 2018	State/UTs-wise Identified Polluted Rivers and the Status of Action Plans approved by Central Pollution Control Board (CPCB) during 2018	Central	Rajya Sabha	All	data.gov.in	2022-12-25 09:17:57	2022-12-25 12:21:30

##Multiple Criteria
dplyr::intersect(search_api_by_title(title_contains = "pollution"),
                 search_api_by_organization(organization_name_contains = "pollution"))
#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

index_name	title	description	org_type	org	sector	source	created_date	updated_date
0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08	Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India	NA	Central	Ministry of Environment, Forest and Climate Change|Central Pollution Control Board	Industrial Air Pollution|Water Quality|Natural Resources|Environment and Forest	data.gov.in	2017-06-08 16:36:24	2018-11-29 21:05:16
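
Because each search returns an ordinary data.frame, the same kind of narrowing can also be done with plain subsetting. The sketch below is a minimal illustration in base R on a hand-made stand-in table: the `catalogue` object and its three placeholder rows are assumptions made up for this example and only mimic a few of the columns shown above. It does not call the package.

```r
# Stand-in for a search result; the rows are placeholders, not real portal entries.
catalogue <- data.frame(
  index_name = c("id-1", "id-2", "id-3"),
  title      = c("Air pollution by city", "River pollution load", "Crop yields"),
  org        = c("Central Pollution Control Board", "Rajya Sabha",
                 "Ministry of Agriculture"),
  stringsAsFactors = FALSE
)

# Keep rows whose title AND organization both mention "pollution".
matches_title <- grepl("pollution", catalogue$title, ignore.case = TRUE)
matches_org   <- grepl("pollution", catalogue$org, ignore.case = TRUE)
both <- catalogue[matches_title & matches_org, ]

# The index_name values of the surviving rows are what later calls need.
both$index_name
```

This is the same idea as intersecting two search results: a row survives only if it satisfies both criteria.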

Once you have found the right API, take note of its “index_name”. For example, “0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08” corresponds to the API for “Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India”. The index_name is needed both to learn more about the API and to get data from it.

Getting more information about a chosen API

There are two functions in this section: one gets API information, and the other gets the available “field” names and types of the chosen API (using its index_name obtained above).

API information

get_api_info(api_index = "0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08")
#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

#> Warning in as.POSIXlt.POSIXct(x, tz): unknown timezone 'Asia/Caclutta'

index_name	title	description	org_type	org	sector	source	created_date	updated_date
0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08	Details of Comprehensive Environmental Pollution Index (CEPI) Scores and Status of Moratorium in Critically Polluted Areas (CPAs) in India	NA	Central	Ministry of Environment, Forest and Climate Change|Central Pollution Control Board	Industrial Air Pollution|Water Quality|Natural Resources|Environment and Forest	data.gov.in	2017-06-08 16:36:24	2018-11-29 21:05:16

API Fields

Fields are essentially the variables in the dataset obtained from the API. Knowing the fields before querying for the data is essential to perform tasks such as filtering, sorting and subsetting the data obtained from the API’s server.

get_api_fields(api_index = "0579cf1f-7e3b-4b15-b29a-87cf7b7c7a08")

id	name	type
document_id	document_id	double
status_of_moratorium	Status of Moratorium	keyword
industrial_cluster_area	Industrial Cluster / Area	keyword
state	State	keyword
cepi_score_2009	CEPI SCORE-2009	double
cepi_score_2011	CEPI SCORE-2011	double
cepi_score_2013	CEPI SCORE-2013	double
resource_uuid	resource_uuid	keyword

The id of these fields is going to be useful while querying the data.

Querying the chosen API

The function get_api_data is really the powerhouse of this package. By utilizing the data.frame structure of the underlying data, it lets one do more than a manually constructed API query can. It allows the user to filter, sort, select variables and decide how much of the data to extract. The website itself can filter on only one field with one value at a time, but a single command through the wrapper can make multiple requests and append their results.

But before we dive into data extraction, we first need to validate the API key received from data.gov.in. To get the key, you need to register first and then get the key from your “My Account” page after logging in. More instructions can be found in the official guide. Once you have your API key, you can validate it as follows (you only need to do this once per session):

##Using a sample key
register_api_key("579b464db66ec23bdd000001cdd3946e44ce4aad7209ff7b23ac571b")
#> Connected to the internet
#> The server is online
#> The API key is valid and you won't have to set it again
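
When using your own key rather than a public sample key, you may prefer not to paste it into a script. One possible approach, sketched below, is to read it from an environment variable. The helper `read_api_key` and the variable name `DATA_GOV_IN_API_KEY` are hypothetical names chosen for this example; neither is part of the package.

```r
# Hypothetical helper, not part of datagovindia: read the key from an
# environment variable instead of hard-coding it in the script.
read_api_key <- function(var = "DATA_GOV_IN_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Environment variable ", var, " is not set", call. = FALSE)
  }
  key
}

# Usage, assuming the variable was set beforehand (for example in ~/.Renviron):
# register_api_key(read_api_key())
```

Failing early with a clear message makes a missing key easier to diagnose than a rejected request later on.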

Once you have your key registered, you are ready to extract data from a chosen API. Here is what each argument means :

api_index : index_name of the chosen API (found by using search functions)
results_per_req : Results per request sent to the server ; can take integer values or the string “all” to get all of the available data
filter_by : A named character vector of field id (not the name) - value(s) pairs ; can take multiple fields as well as multiple comma separated values
field_select : A character vector of fields to select only a subset of variables in the final data.frame
sort_by : Sort by one or multiple fields

To recap: first find the API you want using the search functions, get the index_name of the API from the results, optionally take a look at the fields present in the API's data, and then use the get_api_data function to extract the data. Suppose we choose the API “Real time Air Quality Index from various location” with index_name 3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69. First we will look at which fields are available to construct the right query.
Suppose we want data for only two cities, Chandigarh and Gurugram, and two pollutants, PM10 and NO2. We will let all fields (dataset columns) be returned.

We will use a sample key from the website for this demonstration.

register_api_key("579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234")
#> Connected to the internet
#> The server is online
#> The API key is valid and you won't have to set it again

We now look at the fields available to play with.

get_api_fields("3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69")
#> Warning in stri_split_regex(string, pattern, n = n, simplify = simplify, :
#> argument is not an atomic vector; coercing

#> Warning in stri_split_regex(string, pattern, n = n, simplify = simplify, :
#> argument is not an atomic vector; coercing

id	name	type
character(0)	character(0)	character(0)

We accordingly select the city and pollutant ID fields for constructing our query. Note that only the field id (not the field name) is used to query the data.


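# Filter on two cities and two pollutants; the filter keys are field ids
get_api_data(api_index = "3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69",
             results_per_req = 10,  # results per request sent to the server
             filter_by = c(city = "Gurugram,Chandigarh",
                           polutant_id = "PM10,NO2"),
             field_select = c(),  # empty: return all fields
             sort_by = c("state", "city"))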
#> Connected to the internet
#> The server is online
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[polutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[polutant_id]=PM10
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Gurugram&filters[polutant_id]=NO2
#> gave the API a rest
#> url-https://api.data.gov.in/resource/3b01bcb8-0b14-4abf-b6f2-c1bfd384ba69?api-key=579b464db66ec23bdd0000019fc84f43ca52437351b43702f5998234&format=json&offset=0&limit=10&filters[city]=Chandigarh&filters[polutant_id]=NO2
#> gave the API a rest
#> No results returned - check your api_index

id	country	state	city	station	last_update	pollutant_id	pollutant_min	pollutant_max	pollutant_avg
839	India	Haryana	Gurugram	Sector-51, Gurugram - HSPCB	22-07-2023 06:00:00	PM10	20	174	115
846	India	Haryana	Gurugram	Teri Gram, Gurugram - HSPCB	22-07-2023 06:00:00	PM10	98	162	122
834	India	Haryana	Gurugram	NISE Gwal Pahari, Gurugram - IMD	22-07-2023 06:00:00	PM10	NA	NA	NA
339	India	Chandigarh	Chandigarh	Sector 22, Chandigarh - CPCC	22-07-2023 06:00:00	PM10	13	138	73
346	India	Chandigarh	Chandigarh	Sector-25, Chandigarh - CPCC	22-07-2023 06:00:00	PM10	35	90	64
353	India	Chandigarh	Chandigarh	Sector-53, Chandigarh - CPCC	22-07-2023 06:00:00	PM10	16	94	58
835	India	Haryana	Gurugram	NISE Gwal Pahari, Gurugram - IMD	22-07-2023 06:00:00	NO2	NA	NA	NA
847	India	Haryana	Gurugram	Teri Gram, Gurugram - HSPCB	22-07-2023 06:00:00	NO2	5	17	10
853	India	Haryana	Gurugram	Vikas Sadan, Gurugram - HSPCB	22-07-2023 06:00:00	NO2	34	49	41
840	India	Haryana	Gurugram	Sector-51, Gurugram - HSPCB	22-07-2023 06:00:00	NO2	6	13	9
347	India	Chandigarh	Chandigarh	Sector-25, Chandigarh - CPCC	22-07-2023 06:00:00	NO2	1	46	14
340	India	Chandigarh	Chandigarh	Sector 22, Chandigarh - CPCC	22-07-2023 06:00:00	NO2	6	90	27
354	India	Chandigarh	Chandigarh	Sector-53, Chandigarh - CPCC	22-07-2023 06:00:00	NO2	8	93	29
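
The four request URLs logged above match every combination of the two city values and the two pollutant values. A minimal sketch of that expansion in base R follows. `expand_filters` is a hypothetical helper written for this illustration and is not the package's internal code; it only rebuilds the filter part of each query string, using the filter keys exactly as they appear in the logged URLs.

```r
# Hypothetical helper, not part of datagovindia: expand a named vector of
# comma-separated filter values into one row per combination of values.
expand_filters <- function(filter_by) {
  values <- lapply(filter_by, function(x) {
    trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  })
  expand.grid(values, stringsAsFactors = FALSE, KEEP.OUT.ATTRS = FALSE)
}

combos <- expand_filters(c(city = "Gurugram,Chandigarh",
                           polutant_id = "PM10,NO2"))

# Build the "filters[field]=value" part of the query string for each combination.
parts <- Map(function(field, value) paste0("filters[", field, "]=", value),
             names(combos), as.list(combos))
query_strings <- do.call(paste, c(unname(parts), sep = "&"))
query_strings
```

Two fields with two values each give 2 x 2 = 4 combinations, which is why a single call results in four requests whose results are then appended.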

We will soon release a tutorial for the Python package as well. Apart from the functions already in this implementation, the Python package also supports multi-threading! We are actively maintaining these packages and would be happy to engage with users of the OGD platform. If you face any issues with the R package, hit us up!

The maintainers :

Abhishek Arora Twitter : @96abhishekarora Email: abhishek.arora1996@gmail.com

Aditya K Chhabra Twitter : @AdityaKChhabra Email: aditya0chhabra@gmail.com