🗓️ Week 08
Automated Data Collection II

29 Nov 2024

Scraping Dynamic Webpages

What is a dynamic webpage

  • Dynamic pages display custom content

    • different visitors may see different content on the same page, at the same URL
    • depending on their own input (e.g. clicks, scrolls)
  • Difficult to scrape, as the page content changes without the URL changing!

  • Dynamic pages are typically scraped in three steps

    • using an additional package, RSelenium

Three steps to dynamically scrape

  • Create the desired instance of the dynamic page with the RSelenium package
    • clicking, scrolling, filling in forms, from within R
  • Get the source code into R
    • RSelenium returns the rendered page source (HTML)
    • rvest parses it so that the content can be queried
  • Extract the exact information needed from the source code with the rvest package

RSelenium

  • Use R to control a browser and simulate user behavior
    • scrape dynamically rendered web pages
  • Originally developed for web testing purposes
    • automates browsing across platforms

It allows interacting with two things:

  • with browsers on your computer (e.g. opening a browser and navigating to a page)
  • with elements on a webpage (e.g. opening and clicking on a drop-down menu)

How RSelenium works

  • Starting server and browser session

  • Navigating to page

  • Finding elements

  • Sending events to elements

  • Getting the source code and extracting information
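
Put together, the workflow looks roughly like the sketch below (assuming Firefox is installed; the page and the "Contributors" link are purely illustrative, echoing the examples used later on):

library(RSelenium)
library(rvest)

# 1. start the Selenium server and a browser session
driver <- rsDriver(port = 4567L, browser = "firefox")
remDr  <- driver$client

# 2. navigate to a page
remDr$navigate("https://www.r-project.org")

# 3. find an element (here, by its link text)
webElem <- remDr$findElement(using = "link text", value = "Contributors")

# 4. send an event to the element (a click)
webElem$clickElement()

# 5. get the source code and extract information with rvest
page <- read_html(remDr$getPageSource()[[1]])
page %>% html_nodes("a") %>% html_text()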

Key examples of what you can do with Selenium

Action                                      Code
Open a browser                              open() / navigate()
Click on something                          clickElement()
Enter values                                sendKeysToElement()
Go to previous/next page                    goBack() / goForward()
Refresh the page                            refresh()
Get all the HTML currently displayed        getPageSource()

Installation issues

  1. Java not installed
  • If you have a message saying that “Java is not found” (or similar), you need to install Java
  2. Firefox not installed or not found
  • If you have a message saying “Could not open firefox browser”, there are two possible explanations:
    • if Firefox is not installed, install it in the same way as Java above
    • if Firefox is installed but not found, it probably means that it wasn’t installed with admin rights, so you need to manually specify the location of the executable:
driver <- rsDriver(
  browser = "firefox", 
  extraCapabilities = list(
    `moz:firefoxOptions` = list(
      binary = "C:\\Users\\<USERNAME>\\AppData\\Local\\Mozilla Firefox\\firefox.exe"
    )
  )
)

Stopping Selenium

  • The clean way to stop Selenium is to run driver$server$stop()

  • If you close the browser by hand and try to re-run the script, you may receive the following error:

"Error in wdman::selenium(port = port, verbose = verbose, version = version,: 
Selenium server signals port = 4567 is already in use."
  • To clear this error, stop the old server with driver$server$stop() before re-running the script
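
For instance, a clean end-of-script shutdown could look like this (a sketch using the driver object and its client as created on the next slide):

remDr$close()          # close the browser window
driver$server$stop()   # stop the Selenium server, freeing the port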

Browsers (starting a server)

  • Use the rsDriver function to start a server, so that you can control a web browser from within R

  • Note that the defaults can cause errors, such as trying to start two servers on the same port

  • Note that rsDriver() creates both a client and a server
    • the code below singles out the client, with which our code will interact
    • the client is best thought of as the browser itself
    • it has the class remoteDriver

Connecting to browser

driver <- rsDriver(port = 4567L, browser = "firefox")
remDr <- driver$client

Browser closing and opening

  • Close the browser
    • which won’t close the session on the server
    • recall that we have singled the client out
remDr$close()
  • Open a new browser
    • which doesn’t require the rsDriver function because the server is still running
remDr$open()

Finding elements

  • If you want to use elements (e.g. click on them), you need to find them & assign them to an object
    • all commands to this element will then be sent via that object, not via the remoteDriver object
    • you should name it well! – webElem is a common name
  • Note that
    • the default selector is xpath
    • so you only need to supply the xpath value
remDr$findElement(using = "xpath", 
                  value = ...
                  )
  • Elements can be found by CSS selector, XPath, id, class, name, or by (partial) link text (anchor elements / links)

Find selectors

  • If there were a button created by the following code,
<button class="big-button" id="only-button" name="clickable">Click Me</button>
  • any of those lines below would find it
remDr$findElement(using = "xpath", value = '//*[(@id = "only-button")]')
remDr$findElement(using = "css selector", value = ".big-button")
remDr$findElement(using = "css", value = "#only-button")
remDr$findElement(using = "id", value = "only-button")
remDr$findElement(using = "name", value = "clickable")
  • Save elements as R objects to interact with them later on
webElem <- remDr$findElement(using = ..., value = ...)

Highlighting elements you find

Highlight the element found in the previous step, with the highlightElement method

# navigate to a page
remDr$navigate("https://r-project.org")
# find the element
Contributors <- remDr$findElement(using = "link text", value = "Contributors")
# highlight it to see if we found the correct element
Contributors$highlightElement()


Clicking on the element

  • Click on the element found in the previous step, with the clickElement method
# navigate to a page
remDr$navigate("http://example.com")
# find an element
search_icon <- remDr$findElement(using = "css", value = ".fa-search")
# click on it
search_icon$clickElement()

Providing input(s) to element(s)

  • You can provide inputs to elements: text, via the value argument; keyboard presses or mouse gestures, via the key argument

  • Note that the user provides the values, while the Selenium keys are pre-defined

sendKeysToElement(list(value = ..., key = ...))
# find the page body first
webElem <- remDr$findElement("css", "body")
# scrolling a bit
webElem$sendKeysToElement(list(key = "down_arrow"))
# scrolling to the end of the page
webElem$sendKeysToElement(list(key = "end"))

Selenium keys

Here is the list of Selenium keys:


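For reference, these key names can be listed from R; a sketch, assuming RSelenium's built-in selKeys object:

# the pre-defined Selenium keys are stored in RSelenium's selKeys list
names(RSelenium::selKeys)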
 [1] "null"         "cancel"       "help"         "backspace"    "tab"         
 [6] "clear"        "return"       "enter"        "shift"        "control"     
[11] "alt"          "pause"        "escape"       "space"        "page_up"     
[16] "page_down"    "end"          "home"         "left_arrow"   "up_arrow"    
[21] "right_arrow"  "down_arrow"   "insert"       "delete"       "semicolon"   
[26] "equals"       "numpad_0"     "numpad_1"     "numpad_2"     "numpad_3"    
[31] "numpad_4"     "numpad_5"     "numpad_6"     "numpad_7"     "numpad_8"    
[36] "numpad_9"     "multiply"     "add"          "separator"    "subtract"    
[41] "decimal"      "divide"       "f1"           "f2"           "f3"          
[46] "f4"           "f5"           "f6"           "f7"           "f8"          
[51] "f9"           "f10"          "f11"          "f12"          "command_meta"

Searching a keyword (example)

# navigate to the home page
remDr$navigate("http://example.com/")

# find the search icon and click on it
search_icon <- remDr$findElement(using = "css", value = ".fa-search")
search_icon$clickElement()

# find the search bar on the new page and click on it
search_bar <- remDr$findElement(using = "css", value = "#search-query")
search_bar$clickElement()

# search for the keyword "R Package" and click enter
search_bar$sendKeysToElement(list(value = "R Package", key = "enter"))

Source code and extracting information

  • Screenshots
    • central to working with headless browsers
  • Getting source code
    • preferable option: download HTML page source and save it for extraction
  • Directly extracting elements from the RSelenium session
    • findElements() / findElement() for selection of nodes
    • getElementText() for extracting text from individual nodes
# screen shots
remDr$screenshot(display = TRUE)
# getting source code
remDr$getPageSource()
# directly extracting elements
webElem <- remDr$findElements(using = "class", value="results")
values <- webElem[[1]]$getElementText()
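
For instance, the “preferable option” above can be sketched as follows (assuming the same "results" class as in the snippet above):

# download the page source and hand it over to rvest for extraction
page <- read_html(remDr$getPageSource()[[1]])
page %>% html_nodes(".results") %>% html_text()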

Lab Example

Scraping the real-time list of billionaires


Setup

if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")

pacman::p_load(
  RSelenium,
  rvest,
  purrr,
  dplyr,
  ggplot2,
  plotly,
  countrycode
)

Scraping step by step

  • Using the element inspector, we see that the table is a dynamic table
    • the data of interest are in table rows <tr> of class base ng-scope
  • Connecting to browser
driver <- rsDriver(browser = "firefox")
remote_driver <- driver$client
  • Navigating to the page
url <- "https://www.forbes.com/real-time-billionaires"
remote_driver$navigate(url)

Clicking on the “accept cookies” element


webElems <- remote_driver$findElements(using = "xpath", '/html/body/div/div/div/div/div[2]/button[2]')

webElems[[1]]$clickElement()

Sys.sleep(5) # wait for page loading
  • Slow down the code where necessary, with Sys.sleep()
    • for ethical reasons
    • because R might be faster than the browser

Let’s get the table from the browser

main <- remote_driver$findElements(using = "css", value = ".fbs-table")
table <- read_html(main[[1]]$getElementAttribute("outerHTML")[[1]]) # get html
  • And use rvest to extract the lines of the table:
table  %>% html_nodes(xpath = "//tr[@class='base ng-scope']")
  • As this is a dynamic table, we need to scroll it down to get more results

Scrolling all the way down

  • Use sendKeysToElement
webElem <- remote_driver$findElement("css", ".scrolly-table")
webElem$sendKeysToElement(list(key = "end"))

main <- remote_driver$findElements(using = "css", value = ".fbs-table")
Sys.sleep(1)
table <- read_html(main[[1]]$getElementAttribute("outerHTML")[[1]]) # get html
table  %>% html_nodes(xpath = "//tr[@class='base ng-scope']")

Automating the scrolling task

  • We get 50 more lines each time we scroll down. Let’s scroll down 50 times:
webElem <- remote_driver$findElement("css", ".scrolly-table")
for (i in 1:50) {
  cat("Scroll", i, "\n")
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(3)
}
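
As a variation, the number of scrolls does not have to be hard-coded; a sketch that keeps scrolling until no new rows appear (same selectors as above):

webElem <- remote_driver$findElement("css", ".scrolly-table")
n_prev <- -1
repeat {
  webElem$sendKeysToElement(list(key = "end"))
  Sys.sleep(3)
  rows <- read_html(remote_driver$getPageSource()[[1]]) %>%
    html_nodes(xpath = "//tr[@class='base ng-scope']")
  cat("Rows loaded:", length(rows), "\n")
  if (length(rows) == n_prev) break   # stop once the row count stops growing
  n_prev <- length(rows)
}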

Parsing the html of the table to get all the lines:

main <- remote_driver$findElements(using = "css", value = ".fbs-table")
table <- main[[1]]$getElementAttribute("outerHTML")[[1]] # get the html of the table

# parse the html content
html <- read_html(table)

# select the rows with class 'base ng-scope'
data <- html %>% html_nodes(xpath = "//tr[@class='base ng-scope']")

Extracting data from the table

  • Forming the data frame:
# Extract data from each column and create a data frame with all columns
forbes2023 <- tibble(
  name    = data %>% html_nodes(xpath = "//td[@class='name']") %>% html_text(),
  rank    = data %>% html_nodes(xpath = "//td[@class='rank']") %>% html_text(),
  money   = data %>% html_nodes(xpath = "//td[@class='Net Worth']") %>% html_text(),
  age     = data %>% html_nodes(xpath = "//td[@class='age']") %>% html_text(),
  source  = data %>% html_nodes(xpath = "//td[@class='source']") %>% html_text(),
  country = data %>% html_nodes(xpath = "//td[@class='Country/Territory']") %>% html_text()
)
  • Tidying the data frame
# Replace empty cells with NA for the entire data frame
forbes2023 <- forbes2023 %>% mutate_all(~ ifelse(. == "", NA, .))

# Clean and convert 'money' to numeric
forbes2023$money <- as.numeric(gsub("\\$|B", "", forbes2023$money))

Bonus material (DataViz)

  • Let’s create a choropleth map to visualise billionaires’ wealth distribution by country
# Aggregate money by country
total_money_by_country <- aggregate(money ~ country, data = forbes2023, sum)

# Convert country names to ISO codes for mapping
iso_country <- countrycode(total_money_by_country$country, "country.name", "iso3c")

# Create an interactive choropleth map
plot_ly(
  locations = iso_country,
  z = total_money_by_country$money,
  text = paste("Country: ", total_money_by_country$country, "<br>Net Worth: $", total_money_by_country$money, "B"),
  type = "choropleth",
  colorscale = "Viridis"
) %>%
  layout(
    title = "Billionaires' Wealth Distribution by Country",
    geo = list(
      showframe = FALSE,
      projection = list(type = 'mercator')
    )
  )

Billionaires’ Wealth Distribution by Country


Further materials

Further texts:

Tutorials: