06 Dec 2024
Advantages:
Pure data collection: avoid malformed HTML, fewer legal issues, clear data structures
Standardised data access procedures: transparency, replicability
Robustness: benefits from wisdom of the crowds
Disadvantages:
They're not always available
Dependency on API providers
Lack of natural connection to R
RESTful APIs: queries for static information at current moment (e.g. user profiles, posts, etc.)
Streaming APIs: real time data (e.g. new tweets, weather alerts)
APIs often have extensive documentation:
written for developers; the key things to look for are endpoints and parameters: API Documentation
most APIs are rate-limited: restrictions on number of API calls by user/IP address and period of time
commercial APIs may impose a monthly fee
List of APIs in case you need inspiration
Rate-limit your requests (Sys.sleep() in a loop)
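The rate-limiting advice above can be sketched as a loop that pauses between calls. This is only a minimal sketch: fetch_one() is a hypothetical stand-in for a real API call, and the pause length is an assumed value; the provider's documentation states the actual quota.

```r
# Hypothetical stand-in for a real API call (e.g. an httr::GET() request)
fetch_one <- function(id) {
  paste("result for", id)
}

# Call fetch_one() for every id, pausing between consecutive requests
fetch_all <- function(ids, pause = 1) {
  results <- vector("list", length(ids))
  for (i in seq_along(ids)) {
    results[[i]] <- fetch_one(ids[i])
    if (i < length(ids)) Sys.sleep(pause)  # wait before the next call
  }
  results
}

res <- fetch_all(c("a", "b", "c"), pause = 0.1)
length(res)
```

Setting pause to the length of the rate-limit window divided by the number of calls allowed in it keeps the loop under the limit.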
JSON vs XML:
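As a small illustration of the two formats, the sketch below parses the same record from JSON and from XML. It assumes the jsonlite and xml2 packages are installed; the record itself is invented for the example.

```r
library(jsonlite)
library(xml2)

# The same made-up record in both formats
json_txt <- '{"user": {"name": "Ada", "followers": 42}}'
xml_txt  <- '<user><name>Ada</name><followers>42</followers></user>'

# JSON maps directly onto nested R lists
from_json <- fromJSON(json_txt)
json_name <- from_json$user$name

# XML is navigated with XPath expressions; values come back as text
doc <- read_xml(xml_txt)
xml_name <- xml_text(xml_find_first(doc, "/user/name"))
xml_followers <- as.integer(xml_text(xml_find_first(doc, "/user/followers")))
```

JSON keeps the number as a number, while the XML value has to be converted from text explicitly.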
Most APIs require a key or other user credentials before you can query their database
Getting credentialised with an API requires that you register with the organization
Most APIs are set up for developers, so you will likely be asked to register an application
Once you have successfully registered, you will be assigned one or more keys, tokens, or other credentials that must be supplied to the server as part of any API call you make
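One common way to supply a key with each call is sketched below. The environment variable name, the endpoint, and the parameter names (q, api_key) are all hypothetical; each API documents its own.

```r
library(httr)

# Read the key from an environment variable (e.g. set in ~/.Renviron)
# rather than hard-coding it in a script. The variable name is hypothetical.
get_api_key <- function(var = "EXAMPLE_API_KEY") {
  key <- Sys.getenv(var)
  if (identical(key, "")) stop("No API key found in ", var)
  key
}

# Build the request URL; the endpoint and parameter names are placeholders
build_url <- function(term, key) {
  modify_url("https://api.example.com/v1/search",
             query = list(q = term, api_key = key))
}

Sys.setenv(EXAMPLE_API_KEY = "demo-key")  # for demonstration only
url <- build_url("charity", get_api_key())
url
```

Some APIs expect the key in a request header instead of the query string; httr::add_headers() covers that case.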
There are two ways to collect data through APIs in R:
Many common APIs are available through user-written R packages. These packages offer functions that "wrap" API queries and format the response. These packages are usually much more convenient than writing our own query
If no wrapper function is available, we have to write our own API request and format the response ourselves using R. This is trickier, but definitely doable
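The second route, writing the request ourselves, might look roughly like the sketch below. It assumes the API returns JSON and that the httr and jsonlite packages are installed; get_json() is a helper name invented here, and the endpoint in the usage comment is a placeholder.

```r
library(httr)
library(jsonlite)

# Send a GET request, fail loudly on HTTP errors, and parse a JSON body
get_json <- function(url, query = list()) {
  resp <- GET(url, query = query)
  stop_for_status(resp)                      # error on 4xx/5xx responses
  body <- content(resp, as = "text", encoding = "UTF-8")
  fromJSON(body, flatten = TRUE)             # nested JSON -> lists/data frames
}

# Usage (hypothetical endpoint, requires internet access):
# posts <- get_json("https://api.example.com/v1/posts", query = list(user = "ada"))
```

Checking the status before parsing means a failed request stops with a clear error instead of producing a confusing parse failure later.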
Setup:
The results you get are a standardised measure of search volume for single search terms, a combination of search terms using operators (see below), or comparisons (one input in relation to the other inputs) over a selected time period
Google calculates how much search volume in each region a search term or query had, relative to all searches in that region. Using this information, Google assigns a measure of popularity to search terms (scale of 0-100), leaving out repeated searches from the same person over a short period of time and searches with apostrophes and other special characters
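The idea of a relative 0-100 measure can be illustrated with a small worked example. The counts below are made up, and this is a simplified sketch of the normalisation described above, not Google's exact procedure (which also involves the filtering just mentioned).

```r
# Made-up counts: searches for one term and all searches, per region
term_searches  <- c(England = 3000, Scotland = 500, Wales = 200)
total_searches <- c(England = 200000, Scotland = 20000, Wales = 20000)

# Step 1: share of all searches in each region that went to the term
share <- term_searches / total_searches

# Step 2: rescale so the region with the highest share gets 100
index <- round(100 * share / max(share))
index
# England Scotland    Wales
#      60      100       40
```

England has by far the most raw searches for the term, but Scotland scores 100 because the term makes up a larger share of all searches there.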
No quotation marks (e.g. Humanitarian crisis) | You get results for each word in your query |
Quotation marks (e.g. "Humanitarian crisis") | You get results for the coherent search phrase |
Plus sign (e.g. humanitarian +crisis) | Functions as an OR operator |
Minus sign (e.g. humanitarian -crisis) | Excludes the word after the operator |
Example using the httr package:
library(httr)
GET("https://trends.google.com/trends/explore",
query=list(q = "Humanitarian",geo = "GB"))
Example using the gtrendsR package:
library(gtrendsR)
library(dplyr)
library(ggplot2)
data("countries") # get abbreviations of all countries to filter data
data("categories") # get numbers of all categories to filter data
# Combination using dplyr and ggplot
trend <- gtrends(keyword = "vaccine", geo = c("GB"), time = "2021-01-01 2021-12-30", gprop = "web")
trend_df <- trend$interest_over_time
# hits can contain non-numeric entries such as "<1"; these become NA and are then set to 0
trend_df <- trend_df %>%
  mutate(hits = as.numeric(hits), date = as.Date(date)) %>%
  replace(is.na(.), 0)
# linewidth requires ggplot2 >= 3.4.0; use size in older versions
ggplot(trend_df, aes(x = date, y = hits, group = geo, col = geo)) + geom_line(linewidth = 2) +
  scale_x_date(date_breaks = "2 months", date_labels = "%b-%y") +
  labs(color = "Countries") +
  ggtitle("Frequencies for the query -vaccine- in the period: 2021-01-01 - 2021-12-30")
#if gtrendsR package doesn't work, try trendecon package
library(trendecon)
x <- ts_gtrends(keyword = c("vaccine fraud", "vaccine danger"), geo = c("GB"), time = "today+5-y",
retry = 5,
wait = 10)
tsbox::ts_plot(x)
ggmap uses Google Maps behind the scenes, so you'll need an active Google Cloud Platform account (see here if you cannot figure it out)
ggmap with Google Maps API
OpenStreetMap (OSM) is a free and open map of the world created largely by the voluntary contributions of millions of people around the world
OSM serves two APIs: the Main API for editing OSM, and the Overpass API for providing OSM data
OSM data is stored as a list of attributes tagged in key-value pairs of geospatial objects (points, lines or polygons)
For example, for charities, the key is "office", and the value is "charity"
The first step is to define a bounding box of the geographical area we are interested in. It is defined by four geographic coordinates, representing the minimum and maximum latitude and longitude of the area
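A bounding box and the charity tag from above could be combined into an Overpass query roughly as follows. This sketch assumes the osmdata package is installed; the coordinates are approximate values for Greater London chosen for illustration, and the name london_rst in the commented-out line is assumed so that it matches the object used later in these notes. osmdata::getbb() can look up a box by place name instead, but it needs internet access.

```r
library(osmdata)

# Approximate bounding box for Greater London (assumed values), in the order
# c(xmin, ymin, xmax, ymax) = c(min lon, min lat, max lon, max lat)
london_bb <- c(-0.51, 51.28, 0.33, 51.69)

# Start an Overpass query for that area
base_query <- opq(bbox = london_bb)

# Restrict the query to objects tagged office = charity
charity_query <- add_osm_feature(base_query, key = "office", value = "charity")

# Sending the query needs internet access, so it is left commented out:
# london_rst <- osmdata_sf(charity_query)
```

The returned object would hold separate sf data frames for points, lines and polygons (for example osm_points and osm_polygons).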
osmdata object
head(london_rst$osm_polygons)$name
#select name and geometry for charities
rst_osm_points <- london_rst$osm_points %>% #select point data from downloaded OSM data
select(name, geometry) #for now just selecting the name and geometry to plot
rst_osm_polygons <- london_rst$osm_polygons %>%
select(name, geometry)
london_charities <- rbind(rst_osm_points, rst_osm_polygons)
Further texts:
Bonus Tutorial: Spotify API:
SOCS0100 - Computational Social Science