22 Nov 2024
Semi-structured data does not conform to a fixed data model but has some structure
It is not stored in rows and columns
This type of data contains tags and elements (metadata), which are used to group data and describe how it is stored
Similar entities are grouped together and organised in a hierarchy
Examples: HTML (e.g. websites), XML (e.g. government data), JSON (e.g. social media API)
Extracting data from webpages:
An increasing amount of data is available on websites:
Speeches, sentences, biographical information
Social media data, newspaper articles, forums
Geographic information, conflict data, climate data
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Any content that can be viewed on a webpage can be scraped
No API needed
No rate-limiting or authentication (usually)
Rarely tailored for researchers
Messy, unstructured, inconsistent
Entirely site-dependent
Sites change their layout all the time
Check a site’s terms and conditions before scraping. Some websites disallow scrapers in their robots.txt file
Consider non-intrusive ways to gather data. Don’t exhaust the site’s server
Data protection: scraped data often contains traces of individuals
Secure storage vs. deletion of data
Anonymisation of users
The core of a website is HTML (Hyper Text Markup Language)
HTML defines the structure of a webpage using a series of elements. HTML elements tell the browser how to display the page by labeling pieces of content: “This is a heading,” “this is a paragraph,” “this is a link,” etc.
An HTML element is defined by a start tag, some content, and an end tag
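For example (a generic illustration, not from the original notes):
`<p>This is a paragraph.</p>`
Here `<p>` is the start tag, “This is a paragraph.” is the content, and `</p>` is the end tag.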
| Tag | Meaning |
| --- | --- |
| `<head>` | page header (metadata, etc.) |
| `<body>` | holds all of the content |
| `<p>` | regular text (paragraph) |
| `<h1>`, `<h2>`, `<h3>` | header text, levels 1, 2, 3 |
| `<ol>`, `<ul>`, `<li>` | ordered list, unordered list, list item |
| `<a href="page.html">` | link to "page.html" |
| `<table>`, `<tr>`, `<td>` | table, table row, table item |
| `<div>`, `<span>` | general containers |
All HTML elements can have attributes
Attributes provide additional information about an element
They are included inside the start tag
Examples:
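For instance (examples drawn from the markup used elsewhere in these notes, not an exhaustive list):
href in `<a href="page.html">` gives the target of the link
class in `<td class="firstname">` labels the element so it can be styled or selected
id in `<table id="content">` uniquely identifies a single element on the page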
CSS stands for Cascading Style Sheet. CSS defines how HTML elements are to be displayed
HTML came first, but it was only meant to define content, not format it. HTML does include presentational markup such as the `<font>` tag and color attributes, but styling pages that way is very inefficient: some websites contain 100+ individual pages, each with its own HTML code
CSS was created specifically to control how content is displayed on a webpage. Now, one can change the look of an entire website just by changing one file
Most web designers litter the HTML markup with tons of classes and ids to provide “hooks” for their CSS
You can piggyback on these to jump to the parts of the markup that contain the data you need
Element selector: p (matches all `<p>` elements)
Class selector: .blue (matches elements such as `<p class="blue">`)
ID selector: #blue (matches the element `<p id="blue">`)
A CSS rule combines a selector with property and value declarations, for example (see the rule sketched below):
Selector: p
Property: background-color
Value: yellow
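Put together, these parts form a rule such as the following (a generic illustration, not from the original notes):
/* give every paragraph a yellow background */
p {
  background-color: yellow;
}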
<body>
  <table id="content">
    <tr class="name">
      <td class="firstname">Kurtis</td>
      <td class="lastname">McCoy</td>
    </tr>
    <tr class="name">
      <td class="firstname">Leah</td>
      <td class="lastname">Guerrero</td>
    </tr>
  </table>
</body>
We can use CSS selectors (see example)
SelectorGadget (a Chrome extension)
The Inspect option in Chrome
XPath
Example: //*[@id="mw-content-text"]/div[1]/table[1]
The package rvest allows us to:
Collect the HTML source code of a webpage
Read the HTML of the page
Select and keep certain elements of the page that are of interest
Relatively simple: it cannot handle dynamic webpages
Main uses: Tables, texts, extracting links (downloading files)
First step in webscraping: read HTML code in R and parse it (understanding structure)
The xml2 package:
read_html: parse HTML code into R (also re-exported by rvest)
The rvest package:
html_text: extract text from HTML code
html_table: extract tables from HTML code
html_nodes: extract components with a CSS selector
html_attrs: extract attributes of nodes
A short usage sketch of these functions follows below.
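A minimal usage sketch (not lecture code) applying these functions to the small name table from the CSS example above; the selectors .firstname and #content are the class and id used in that snippet:
library(rvest)
# read_html also accepts a literal HTML string, so we can parse the example table directly
page <- read_html('<table id="content">
    <tr class="name"> <td class="firstname">Kurtis</td> <td class="lastname">McCoy</td> </tr>
    <tr class="name"> <td class="firstname">Leah</td> <td class="lastname">Guerrero</td> </tr>
  </table>')
page %>% html_nodes(".firstname") %>% html_text()  # first names as a character vector
page %>% html_nodes("#content") %>% html_table()   # the table as a data frame (inside a list)
page %>% html_nodes("td") %>% html_attrs()         # the class attribute of every cell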
library(rvest)
library(xml2)
url <- "https://en.wikipedia.org/wiki/University_College_London"
parsed <- read_html(url)
This returns an XML document object (class xml_document) that contains all the information of the webpage
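As a quick sanity check (a sketch, not lecture code), pieces can already be pulled out of this object, e.g. the page title or every link target:
parsed %>% html_nodes("title") %>% html_text()   # text of the <title> element
parsed %>% html_nodes("a") %>% html_attr("href") # href attribute of every link (useful for downloading files)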
The task is to scrape the Wikipedia infoboxes of three universities (UCL, Oxford, Cambridge)
library(robotstxt)
# check whether the path is allowed to be scraped (paths_allowed comes from the robotstxt package)
paths_allowed(paths = "https://en.wikipedia.org/wiki/University_College_London")
#creating url list for the websites to be scraped
url_list <- c(
"https://en.wikipedia.org/wiki/University_College_London",
"https://en.wikipedia.org/wiki/University_of_Cambridge",
"https://en.wikipedia.org/wiki/University_of_Oxford"
)
url <- "https://en.wikipedia.org/wiki/University_College_London"
download.file(url, destfile = "scraped_page.html", quiet = TRUE)
target <- read_html("scraped_page.html")
# If you want character vector output
target1 <- target %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
html_text()
# If you want table output
target2 <- target %>%
html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
html_table()
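The code above extracts the infobox for UCL only; below is a sketch (not lecture code) of repeating the same extraction for every URL in url_list. It assumes the same infobox XPath works on each page, which should be verified:
# loop over the three Wikipedia pages and keep each infobox table
infoboxes <- lapply(url_list, function(u) {
  Sys.sleep(1)  # pause between requests so we do not exhaust the site's server
  read_html(u) %>%
    html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/table[1]') %>%
    html_table()
})
names(infoboxes) <- c("UCL", "Cambridge", "Oxford")  # same order as url_list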
Webpages are complex, with many different elements
Scrapers must deal with large amounts of unstructured data
APIs provided at least two advantages over scraping:
The first is that they made it easier to get at the data. Rather than effectively ‘unbaking’ the HTML-formatted data and user interface, the API allowed for an ordered and predictable transmission of information
The second is that the API often provided information that was not visible to the public through the web
Further texts:
Tutorials:
Load required libraries
Check whether scraping is permitted on Yahoo Finance (https://finance.yahoo.com/world-indices/)
Identify XPath for the table, read the path (read_html)
Keep only the columns: (Name, Last Price, % Change)
Save this information as a new data frame (yahoo_data)
Use plotly to create a bar plot to visualise stock index prices and changes (a possible solution sketch follows below)
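A possible solution sketch (untested against the live site: the robots.txt permissions, the table's position on the page, and the exact column names should all be checked first; the column names below are taken from the task description):
library(rvest)
library(robotstxt)
library(dplyr)
library(plotly)
# 1. check whether scraping is permitted
paths_allowed(paths = "https://finance.yahoo.com/world-indices/")
# 2. read the page and extract the indices table (assumed here to be the first <table>;
#    an XPath copied from the browser's Inspect tool could be used instead)
yahoo_data <- read_html("https://finance.yahoo.com/world-indices/") %>%
  html_node("table") %>%
  html_table() %>%
  # 3. keep only the required columns
  select(Name, `Last Price`, `% Change`)
# 4. bar plot of the last prices (the % Change column is likely text such as "+0.43%"
#    and may need cleaning before it can be plotted)
plot_ly(yahoo_data, x = ~Name, y = ~`Last Price`, type = "bar")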
SOCS0100 – Computational Social Science