Scrape values from HTML select/option tags in R

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to scrape the HTML and parse it, but I'm now a little unsure how to extract the bits I actually need!

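Using the XML library (with httr's GET and content for the request) I scrape my data using this code:

    library(httr)  # GET(), content()
    library(XML)   # htmlTreeParse()

    majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
    majidata_html <- htmlTreeParse(content(majidata_get, as="text"))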

This leaves me with a (Large) XMLDocumentContent object. There is a drop-down list on the webpage and I want to scrape the values from it (they are the names and ID numbers of different towns). The bits I want to extract are the number in each <option value="XXX"> tag and the name in capital letters that follows it.

    <div class="regiondata">
    <div id="town_data">
    <select id="town" name="town" onchange="town_data(this.value);">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
    <option value="628">AWENDO</option>
    <option value="749">BAHATI</option>
    <option value="327">BANGALE</option>

Ideally, I'd like to have these in a data.frame where the first column is the number and the second column is the name, e.g.

    ID   Name
    611  AHERO
    635  AKALA
    625  AWASI

etc.

I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better/more efficient to use XPath. I'm not really sure where to start with this, though, other than thinking I need to use xpathApply somehow.

Answer1:

The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.

UPDATED: Incorporates the second request (see comments below)

    library(rvest)
    library(dplyr)

    # note: read_html() is used here in place of the original html(),
    # which newer rvest releases no longer provide

    # gets data from the second popup
    # returns a data frame of town_id, town_name, area_id, area_name
    addArea <- function(town_id, town_name) {

      # make the AJAX URL and grab the data
      url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s", town_id)
      subunits <- read_html(url)

      # reformat into a data frame with the town data
      # ([-1,] drops the first row, the placeholder option)
      data.frame(town_id=town_id,
                 town_name=town_name,
                 area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
                 area_name=subunits %>% html_nodes("option") %>% html_text(),
                 stringsAsFactors=FALSE)[-1,]
    }

    # get data from the first popup and put it into a data frame
    majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")

    maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                       town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                       stringsAsFactors=FALSE)[-1,]

    # pass in the name and id to our addArea function and make the result into
    # a data frame with all the data (town and area)
    combined <- do.call("rbind.data.frame",
                        mapply(addArea, maji$town_id, maji$town_name,
                               SIMPLIFY=FALSE, USE.NAMES=FALSE))

    # row names aren't super-important, but let's keep them tidy
    rownames(combined) <- NULL

    str(combined)
    ## 'data.frame': 1964 obs. of 4 variables:
    ## $ town_id  : chr "611" "635" "625" "628" ...
    ## $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
    ## $ area_id  : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
    ## $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...

    head(combined)
    ##   town_id town_name     area_id area_name
    ## 1     611     AHERO 60603030101     AHERO
    ## 2     635     AKALA 60107050201     AKALA
    ## 3     625     AWASI 60603020101     AWASI
    ## 4     628    AWENDO 61103040101    ANINDO
    ## 5     628    AWENDO 61103050401      SARE
    ## 6     749    BAHATI 73101010101    BAHATI
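For just the town dropdown that the question asks about, the CSS-selector approach reduces to a few lines. The following is a minimal sketch, not the code from the answer: it assumes a current rvest release (where `read_html()` takes the place of `html()`), and it parses an HTML fragment copied from the question instead of fetching the live page, so the names `html_snippet`, `opts` and `towns` are purely illustrative.

```r
library(rvest)

# HTML fragment copied from the question; the live page may differ
html_snippet <- '<select id="town" name="town">
  <option value="0" selected="selected">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
  <option value="625">AWASI</option>
</select>'

# parse the fragment and select every <option> inside the element with id "town"
page <- read_html(html_snippet)
opts <- html_nodes(page, "#town option")

# one row per option: the value attribute as ID, the text as Name
towns <- data.frame(ID   = html_attr(opts, "value"),
                    Name = html_text(opts),
                    stringsAsFactors = FALSE)

# drop the "[SELECT TOWN]" placeholder by its value instead of by position
towns <- towns[towns$ID != "0", ]
rownames(towns) <- NULL

towns
```

Filtering on the value "0" rather than using `[-1,]` is a design choice made here: it keeps working if the placeholder is not the first option, at the cost of assuming the placeholder always carries that value.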

Answer2:

Using xpath expressions with HTML is almost always a better choice than regex. Given this data you can extract what you're after with

    options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
    ids <- sapply(options, xmlGetAttr, "value")
    names <- sapply(options, xmlValue)
    data.frame(ID=ids, Name=names)

which returns

       ID          Name
    1   0 [SELECT TOWN]
    2 611         AHERO
    3 635         AKALA
    4 625         AWASI
    5 628        AWENDO
    6 749        BAHATI
    ...
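The first row of that result is the "[SELECT TOWN]" placeholder, which the question does not want. One way to leave it out is to do the filtering in the XPath expression itself. This is a hedged sketch under stated assumptions: it parses an HTML fragment copied from the question with `htmlParse()` rather than using the live page, and the variable names `snippet`, `doc`, `xp` and `towns` are illustrative only.

```r
library(XML)

# HTML fragment copied from the question; the live page may differ
snippet <- '<select id="town" name="town">
  <option value="0" selected="selected">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
  <option value="625">AWASI</option>
</select>'

# htmlParse() builds an internal document that the XPath functions query directly
doc <- htmlParse(snippet, asText = TRUE)

# position() > 1 skips the first <option>, i.e. the placeholder entry
xp <- "//select[@id='town']/option[position() > 1]"

towns <- data.frame(ID   = xpathSApply(doc, xp, xmlGetAttr, "value"),
                    Name = xpathSApply(doc, xp, xmlValue),
                    stringsAsFactors = FALSE)

towns
```

`xpathSApply()` combines the node lookup and the `sapply()` step from the code above into one call; setting `stringsAsFactors = FALSE` keeps both columns as character vectors on older R versions.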
