Scrape values from HTML select/option tags in R

I'm trying (fairly unsuccessfully) to scrape some data from a website (www.majidata.go.ke) using R. I've managed to fetch the HTML and parse it, but I'm now unsure how to extract the bits I actually need.

Using the httr and XML libraries, I fetch and parse the page with this code:

    library(httr)  # GET(), content()
    library(XML)   # htmlTreeParse()

    majidata_get <- GET("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
    majidata_html <- htmlTreeParse(content(majidata_get, as="text"))

This leaves me with a (Large) XMLDocumentContent object. There is a drop-down list on the webpage and I want to scrape the values from it (the names and ID numbers of different towns). The bits I want to extract are the number in each <option value="XXX"> tag and the name following it in capital letters.

    <div class="regiondata">
      <div id="town_data">
        <select id="town" name="town" onchange="town_data(this.value);">
          <option value="0" selected="selected">[SELECT TOWN]</option>
          <option value="611">AHERO</option>
          <option value="635">AKALA</option>
          <option value="625">AWASI</option>
          <option value="628">AWENDO</option>
          <option value="749">BAHATI</option>
          <option value="327">BANGALE</option>

Ideally, I'd like to have these in a data.frame where the first column is the number and second column is the name e.g.

    ID   Name
    611  AHERO
    635  AKALA
    625  AWASI

etc.

I'm not really sure where to go from here. I had thought to use regex and match the pattern within the text, though I've read on a number of forums that this is a bad idea and that it's better and more efficient to use XPath. I'm not sure where to start with this, though, other than thinking I need to use xpathApply somehow.

Answer1:

The very new rvest package makes quick work of this and lets you use sane CSS selectors, too.

UPDATED: Incorporates the second request (see comments below)

    library(rvest)
    library(dplyr)

    # gets data from the second popup
    # returns a data frame of town_id, town_name, area_id, area_name
    addArea <- function(town_id, town_name) {

      # make the AJAX URL and grab the data
      # (read_html() replaces html(), which early rvest releases used and later deprecated)
      url <- sprintf("http://www.majidata.go.ke/ajax-list-area.php?reg=towns&type=projects&id=%s",
                     town_id)
      subunits <- read_html(url)

      # reformat into a data frame with the town data
      data.frame(town_id=town_id,
                 town_name=town_name,
                 area_id=subunits %>% html_nodes("option") %>% html_attr("value"),
                 area_name=subunits %>% html_nodes("option") %>% html_text(),
                 stringsAsFactors=FALSE)[-1,]
    }

    # get data from the first popup and put it into a data frame
    majidata <- read_html("http://www.majidata.go.ke/town.php?MID=MTE=&SMID=MTM=")
    maji <- data.frame(town_id=majidata %>% html_nodes("#town option") %>% html_attr("value"),
                       town_name=majidata %>% html_nodes("#town option") %>% html_text(),
                       stringsAsFactors=FALSE)[-1,]

    # pass in the name and id to our addArea function and make the result into
    # a data frame with all the data (town and area)
    combined <- do.call("rbind.data.frame",
                        mapply(addArea, maji$town_id, maji$town_name,
                               SIMPLIFY=FALSE, USE.NAMES=FALSE))

    # row names aren't super-important, but let's keep them tidy
    rownames(combined) <- NULL

    str(combined)
    ## 'data.frame': 1964 obs. of 4 variables:
    ##  $ town_id  : chr "611" "635" "625" "628" ...
    ##  $ town_name: chr "AHERO" "AKALA" "AWASI" "AWENDO" ...
    ##  $ area_id  : chr "60603030101" "60107050201" "60603020101" "61103040101" ...
    ##  $ area_name: chr "AHERO" "AKALA" "AWASI" "ANINDO" ...

    head(combined)
    ##   town_id town_name     area_id area_name
    ## 1     611     AHERO 60603030101     AHERO
    ## 2     635     AKALA 60107050201     AKALA
    ## 3     625     AWASI 60603020101     AWASI
    ## 4     628    AWENDO 61103040101    ANINDO
    ## 5     628    AWENDO 61103050401      SARE
    ## 6     749    BAHATI 73101010101    BAHATI
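To try the selector logic without hitting the live site, here is a minimal sketch that runs the same kind of rvest calls on an inline copy of the markup from the question. The closing tags, the shortened option list, and the names `page`, `opts`, and `towns` are assumptions added for illustration. The string is parsed with `read_html()`, the current name for the `html()` function in rvest.

```r
library(rvest)

# Inline copy of the drop-down from the question (closing tags added so the
# fragment is complete); no network access is needed.
page <- read_html('<div id="town_data">
  <select id="town" name="town">
    <option value="0" selected="selected">[SELECT TOWN]</option>
    <option value="611">AHERO</option>
    <option value="635">AKALA</option>
    <option value="625">AWASI</option>
  </select>
</div>')

# CSS selector: every <option> inside the element with id "town"
opts <- html_nodes(page, "#town option")

towns <- data.frame(ID   = html_attr(opts, "value"),
                    Name = html_text(opts),
                    stringsAsFactors = FALSE)

# Drop the "[SELECT TOWN]" placeholder by its value rather than by position
towns <- towns[towns$ID != "0", ]
rownames(towns) <- NULL
```

Filtering on the placeholder's value is a small design choice that differs from the `[-1,]` indexing used in the answer: it should still work if the placeholder is not the first option, but it assumes no real town has the ID "0".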

Answer2:

Using XPath expressions with HTML is almost always a better choice than regex. Given the majidata_html object from the question, you can extract what you're after with

    options <- getNodeSet(xmlRoot(majidata_html), "//select[@id='town']/option")
    ids <- sapply(options, xmlGetAttr, "value")
    names <- sapply(options, xmlValue)
    data.frame(ID=ids, Name=names)

which returns

       ID          Name
    1   0 [SELECT TOWN]
    2 611         AHERO
    3 635         AKALA
    4 625         AWASI
    5 628        AWENDO
    6 749        BAHATI
    ...
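The result above still contains the "[SELECT TOWN]" placeholder row. One possible way to leave it out is to filter inside the XPath expression itself. The sketch below is a self-contained illustration on an inline copy of the question's markup, not the live page. The closing tag, the shortened option list, and the names `html_snippet`, `doc`, `town_xpath`, and `towns_xml` are assumptions. It parses with `htmlParse()` and queries with `xpathSApply()`, both from the XML package.

```r
library(XML)

# Inline copy of the drop-down from the question (closing tag added)
html_snippet <- '<select id="town" name="town">
  <option value="0" selected="selected">[SELECT TOWN]</option>
  <option value="611">AHERO</option>
  <option value="635">AKALA</option>
  <option value="625">AWASI</option>
</select>'

# Parse the string into an internal document that XPath can query directly
doc <- htmlParse(html_snippet, asText = TRUE)

# The predicate [@value != '0'] skips the placeholder option
town_xpath <- "//select[@id='town']/option[@value != '0']"

towns_xml <- data.frame(ID   = xpathSApply(doc, town_xpath, xmlGetAttr, "value"),
                        Name = xpathSApply(doc, town_xpath, xmlValue),
                        stringsAsFactors = FALSE)
```

This assumes the placeholder is the only option with value "0". The IDs come back as character strings, so wrap them in `as.integer()` if numeric IDs are needed.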
