Collecting table data from an .aspx webpage with a for loop using RSelenium

I am trying to collect Indian census data at the village-level from http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx

Using RSelenium, I can navigate and select different values in the four dropdown menus with the following code:

    library(RSelenium)
    library(selectr)
    library(XML)  # provides readHTMLTable()

    # Setting up the proxy server
    RSelenium::checkForServer()
    RSelenium::startServer()  # if needed
    remDr <- remoteDriver$new()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)
    remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")

    # Finding and changing the menus
    stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState")
    stateElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
    districtElem$sendKeysToElement(list(key = "enter"))
    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
    districtElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

    subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
    subdistrictElem$sendKeysToElement(list(key = "enter"))
    subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
    subdistrictElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

    villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage")
    villageElem$sendKeysToElement(list(key = "enter"))
    villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage")
    villageElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

    # Submit the form and read the result table from the page source
    submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit")
    remDr$executeScript("arguments[0].click();", list(submitElem))
    table <- readHTMLTable(remDr$getPageSource()[[1]], which = 8)

The bigger problem: I need to run this code for all villages in India (well, villages in select states). Computation time is not an issue. I have a designated computer bank and plan to split the work across a number of machines.

However, I need to figure out how many districts are in each state, how many subdistricts are in each district, and how many villages are in each subdistrict, so that I can run this through one nested for loop.

The framework I have in mind looks something like this:

    tables <- NULL
    num_states <- "code grabbing number of states from the options list"
    for (r in seq_len(num_states)) {
        stateElem_code_block[r]
        num_dist <- "code grabbing number of districts from the options list"
        for (k in seq_len(num_dist)) {
            districtElem_code_block[k]
            num_subdist <- "code grabbing number of subdistricts from the options list"
            for (m in seq_len(num_subdist)) {
                subdistrictElem_code_block[m]
                num_vill <- "code grabbing number of villages from the options list"
                for (i in seq_len(num_vill)) {
                    villageElem_code_block[i]
                    submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit")
                    remDr$executeScript("arguments[0].click();", list(submitElem))
                    table <- readHTMLTable(remDr$getPageSource()[[1]], which = 8)
                    # Append inside the innermost loop so no village table is lost
                    tables <- rbind(tables, table)
                }
            }
        }
    }
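
On the accumulation step: the framework above grows tables with rbind inside the loops, which copies the whole table each time it is extended. A minimal sketch of a common alternative is to store each result in a list and bind once at the end. Here scrapeVillage is a hypothetical stand-in for the Selenium steps and returns placeholder values, not census data.

```r
# Hypothetical stand-in for the Selenium steps: returns a one-row table
# for a village code. The values are placeholders, not census data.
scrapeVillage <- function(villageCode) {
  data.frame(villageCode = villageCode,
             scraped = TRUE,
             stringsAsFactors = FALSE)
}

villageCodes <- c("00036000", "00037000", "00036800")

# Pre-allocate a list and fill one slot per village
results <- vector("list", length(villageCodes))
for (i in seq_along(villageCodes)) {
  results[[i]] <- scrapeVillage(villageCodes[i])
}

# Bind all rows in a single call after the loop
tables <- do.call(rbind, results)
print(tables)
```

Using seq_along (or seq_len) instead of 1:length(x) also means the loop body is skipped when a dropdown turns out to be empty.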

Sorry for the novel... I hope this makes sense. Any help is greatly appreciated.

EDIT: I was able to solve the first question myself....

Answer1:

First I would define a function that changes the dropdown list:

    changeFun <- function(value, elementName, targetName){
        # Set the dropdown's value and fire its onchange handler
        changeElem <- remDr$findElement(using = "name", elementName)
        script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
        remDr$executeScript(script, list(changeElem))
        # Read the names and codes of the dependent dropdown, dropping the first option
        targetElem <- remDr$findElement(using = "name", targetName)
        target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
        targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
        target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
        list(target, targetCodes)
    }

This script sets the value of a dropdown list and fires its onchange event using JavaScript, so interaction with the site is kept to a minimum. You may also want to run a headless browser like PhantomJS rather than Firefox; see "RSelenium: Driving OS/Browsers local and remote" for details on how to run PhantomJS.
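
To make the JavaScript step concrete, here is a small sketch that only builds the script string, without a browser. buildChangeScript is a hypothetical helper name, and the digits-only check is an assumption based on the codes seen on this site (such as "35" or "0001").

```r
# Build the JavaScript snippet that sets a <select> value and fires its
# onchange handler. buildChangeScript is a hypothetical helper name.
buildChangeScript <- function(value) {
  # Assumption: codes on this site are digit strings such as "35" or "0001",
  # so anything else is rejected before it is pasted into JavaScript
  stopifnot(is.character(value), length(value) == 1,
            grepl("^[0-9]+$", value))
  paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
}

script <- buildChangeScript("35")
print(script)
```

With a live session, the resulting string would be passed to remDr$executeScript(script, list(changeElem)), as changeFun does.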

    remDr <- remoteDriver$new()
    remDr$open()
    remDr$setImplicitWaitTimeout(3000)
    remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")

    # STATES
    stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState")
    states <- stateElem$getElementAttribute("outerHTML")[[1]]
    stateCodes <- sapply(querySelectorAll(xmlParse(states), "option"), xmlGetAttr, "value")[-1]
    states <- sapply(querySelectorAll(xmlParse(states), "option"), xmlValue)[-1]

    state <- list()
    for(x in seq_along(stateCodes)){
        district <- changeFun(stateCodes[[x]], "ctl00$Body_Content$drpState", "ctl00$Body_Content$drpDistrict")
        subdistrict <- lapply(district[[2]], function(y){
            subdistrict <- changeFun(y, "ctl00$Body_Content$drpDistrict", "ctl00$Body_Content$drpSubDistrict")
            village <- lapply(subdistrict[[2]], function(z){
                village <- changeFun(z, "ctl00$Body_Content$drpSubDistrict", "ctl00$Body_Content$drpVillage")
                village
            })
            list(subdistrict, village)
        })
        state[[x]] <- list(district, subdistrict)
    }

state now contains all the states, districts, subdistricts and villages together with their codes. I only ran it for x = 1, which is the state of Andaman and Nicobar Islands. As an example, here is the data for the Nicobars district:

    > state[[1]][[2]][[2]]
    [[1]]
    [[1]][[1]]
    [1] "Car Nicobar" "Nancowry"

    [[1]][[2]]
    [1] "0001" "0002"


    [[2]]
    [[2]][[1]]
    [[2]][[1]][[1]]
     [1] "Arong" "Big Lapati" "Chuckchucha" "IAF Camp" "Kakana"
     [6] "Kimois" "Kinmai" "Kinyuka" "Malacca" "Mus"
    [11] "Perka" "Sawai" "Small Lapati" "Tamaloo" "Tapoiming"
    [16] "Teetop"

    [[2]][[1]][[2]]
     [1] "00036000" "00037000" "00036800" "00036300" "00036200" "00036100"
     [7] "00037200" "00036700" "00036400" "00035700" "00036500" "00035900"
    [13] "00037100" "00036600" "00036900" "00035800"


    [[2]][[2]]
    [[2]][[2]][[1]]
      [1] "7 km Farm" "Akupa"
      [3] "Al-Hit-Touch/Balu Basti" "Alexandera River"
      [5] "Alhiat" "Alhitoth/Alhiloth"
      [7] "Alipa/Alips" "Alkaipoh/Alkripoh"
      [9] "Aloora" "Aloorang"
     [11] "Alreak" "Alsama"
     [13] "Altaful" "Altheak"
     [15] "Alukian/Alhukheck" "Anul/Anula"
     [17] "Atkuna/Alkun" "Bahua"
     [19] "Banderkari/Pulu" "Bengali"
     [21] "Berainak/Badnak" "Bompoka Island"
     [23] "Bumpal" "Campbell Bay"
     [25] "Champin" "Chanel/Chanol"
     [27] "Changua/Changup" "Chaw Nallaha"
     [29] "Chingen" "Chonghipoh"
     [31] "Chongkamong" "Chota Inak"
     [33] "Chukmachi" "Dairkurat"
     [35] "Dakhiyon (FC)" "Danlet"
     [37] "Daring" "Dogmar River"
     [39] "Elahi/Ilhoya" "Enam"
     [41] "Galathia River (FC)" "Gandhi Nagar"
     [43] "Govinda Nagar" "Hakonhala"
     [45] "Halnatai/Hoinatai" "Hin-Pou-Chi"
     [47] "Hindra" "Hinnunga"
     [49] "Hintona" "Hitlat"
     [51] "Hockook" "Hoin incl. Ikuia"
     [53] "Hoipoh" "Hontona"
     [55] "Hutnyak" "In-Hig-Loi"
     [57] "Indira Point" "Inlock/Infock"
     [59] "Inod" "Inroak/Chinlak"
     [61] "Itoi" "Jansin"
     [63] "Jhoola" "Joginder Nagar"
     [65] "Kakana" "Kalara"
     [67] "Kalasi" "Kamorta/Kalatapu"
     [69] "Kamriak" "Kanahinot"
     [71] "Kapanga" "Kasintung"
     [73] "Katahu" "Katahuwa"
     [75] "Kavatinpeu/Karahinpoh" "Kiyang"
     [77] "Knot" "Koe"
     [79] "Kokeon" "Kondul"
     [81] "Kopenheat" "Kuikua"
     [83] "Kuitasuk" "Kulatapangia"
     [85] "Kumikia" "Kupinga"
     [87] "Lanuanga" "Lapat"
     [89] "Lawful" "Laxmi Nagar"
     [91] "Luxi" "Makhahu/Makachua"
     [93] "Malacca" "Mapayala"
     [95] "Maru" "Masala Tapu"
     [97] "Mavatapis/Maratapia" "Mildera"
     [99] "Minlana/Minlan" "Minyuk"
    [101] "Mohreak/Kohreakap" "Munak incl. Ponioo/Moul"
    [103] "Mus" "Navy Dera"
    [105] "Neang" "Neeche Tapu"
    [107] "Not yet named (at 27.9 km)-A" "Nyicalang"
    [109] "Olinchi/Bombay" "Olinpon/Alhinpon"
    [111] "Ongulongho" "Patatiya"
    [113] "Payak" "Payuha"
    [115] "Pehayo" "Pilpilow"
    [117] "Pulloullo/Puloulo" "Pulobaha"
    [119] "Pulobaha/Pathathifen" "Pulobed"
    [121] "Pulobed/Lababu" "Pulobha/Pulobahan"
    [123] "Pulobhabi" "Pulokunji"
    [125] "Pulomilo" "Pulopanja"
    [127] "Pulopucca" "Pulotalia/Pulotohio"
    [129] "Raihion" "Ramzoo"
    [131] "Ranganathan Bay" "Reakomlong"
    [133] "Renguang" "Safedbalu"
    [135] "Safedbalu" "Sanaya"
    [137] "Sastri Nagar" "Shompen hut"
    [139] "Shompen Village-A" "Shompen Village-B"
    [141] "Sonomkuwa" "Tahaila"
    [143] "Tani" "Tapani/Tapainy"
    [145] "Tapiang" "Tapong incl. Kabila"
    [147] "Tavinkin/Tavakin" "Tillang Chong Island"
    [149] "Tomae/Inmae" "Trinket"
    [151] "Vijoy Nagar" "Vikas Nagar"
    [153] "Vyavtapu" "W.B.Katchal/Hindra"

    [[2]][[2]][[2]]
      [1] "00053600" "00048500" "00043500" "00050800" "00037500" "00039800"
      [7] "00042600" "00039700" "00038000" "00037900" "00044200" "00042100"
     [13] "00041900" "00043400" "00046200" "00048300" "00041100" "00050300"
     [19] "00045700" "00038900" "00047000" "00039000" "00045000" "00054000"
     [25] "00043700" "00045400" "00046000" "00054400" "00052900" "00039500"
     [31] "00037400" "00046900" "00038400" "00050700" "00052400" "00051900"
     [37] "00045200" "00051400" "00050000" "00038100" "00053000" "00053200"
     [43] "00053900" "00041400" "00041800" "00052200" "00042800" "00043800"
     [49] "00044300" "00039300" "00047700" "00049200" "00040800" "00040500"
     [55] "00040200" "00052600" "00052800" "00048600" "00050100" "00044000"
     [61] "00044100" "00039200" "00039100" "00053500" "00047200" "00038300"
     [67] "00038800" "00046800" "00040100" "00038700" "00042200" "00051700"
     [73] "00050600" "00039900" "00042000" "00049000" "00046300" "00051800"
     [79] "00052300" "00050400" "00051500" "00047500" "00037600" "00040600"
     [85] "00040000" "00042300" "00043300" "00042700" "00054600" "00053300"
     [91] "00038200" "00048400" "00043600" "00040900" "00045300" "00046100"
     [97] "00039400" "00042400" "00048200" "00038600" "00047400" "00046600"
    [103] "00042900" "00054500" "00043100" "00044600" "00053700" "00047300"
    [109] "00049700" "00044900" "00040300" "00052100" "00043000" "00046500"
    [115] "00050200" "00044500" "00049100" "00052700" "00048800" "00051000"
    [121] "00050500" "00049400" "00052000" "00051100" "00048100" "00049900"
    [127] "00052500" "00048700" "00037700" "00046700" "00054100" "00041500"
    [133] "00051300" "00047600" "00038500" "00039600" "00053100" "00053800"
    [139] "00051200" "00051600" "00041600" "00037300" "00041200" "00043900"
    [145] "00047900" "00043200" "00041700" "00037800" "00045900" "00047800"
    [151] "00053400" "00047100" "00040700" "00042500"

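The nested list is awkward to loop over directly, so it may help to flatten it into one row per village. The following is a rough sketch under the assumption that each element of state has the shape built by the loop above: list(district, subdistrict), where every dropdown result is list(names, codes). flattenState is a hypothetical helper, and the toy input holds only the Nicobars district with the first two villages of each sub-district from the output above.

```r
# Toy object with the nesting the loop above builds for one state.
# Only the Nicobars district is included, with the first two villages of
# each sub-district copied from the printed output.
stateEntry <- list(
  list("Nicobars", "02"),                 # district names, district codes
  list(                                   # one element per district
    list(
      list(c("Car Nicobar", "Nancowry"), c("0001", "0002")),
      list(
        list(c("Arong", "Big Lapati"), c("00036000", "00037000")),
        list(c("7 km Farm", "Akupa"), c("00053600", "00048500"))
      )
    )
  )
)

# Hypothetical helper: one row per village with all the codes needed later
flattenState <- function(stateEntry) {
  district <- stateEntry[[1]]
  perDistrict <- stateEntry[[2]]
  rows <- list()
  for (d in seq_along(perDistrict)) {
    subdistrict <- perDistrict[[d]][[1]]
    villages <- perDistrict[[d]][[2]]
    for (s in seq_along(villages)) {
      # Skip sub-districts whose village dropdown came back empty
      if (length(villages[[s]][[2]]) == 0) next
      rows[[length(rows) + 1]] <- data.frame(
        district = district[[1]][d],
        districtCode = district[[2]][d],
        subdistrict = subdistrict[[1]][s],
        subdistrictCode = subdistrict[[2]][s],
        village = villages[[s]][[1]],
        villageCode = villages[[s]][[2]],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

villages <- flattenState(stateEntry)
print(villages)
```

Each row then carries the district, sub-district and village codes; the state code would have to be added from stateCodes.
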
There are 600,000 villages in India :O so it's best to keep the state as a for loop. Once you have the four necessary codes, you can get the village data by submitting a form separately. For example, part of the form posted to get the village detail for

State: Andaman and Nicobar Islands

District: Nicobars

Sub-district: Car Nicobar

Village: Arong

is

    ctl00$Body_Content$btnSub...    Submit
    ctl00$Body_Content$drpDis...    02
    ctl00$Body_Content$drpSta...    35
    ctl00$Body_Content$drpSub...    0001
    ctl00$Body_Content$drpVil...    00036000
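
As a sketch of how those fields could be assembled from the four codes, the snippet below builds and URL-encodes them without sending anything. Several points are assumptions: the truncated names in the listing are taken to be the full dropdown and button names used in the Selenium code, villageFormFields and encodeForm are hypothetical helpers, and only the five fields shown are included. The listing is only part of the form, and ASP.NET pages of this kind typically also post hidden fields such as __VIEWSTATE, which this sketch leaves out.

```r
# Build the form fields for one village. Field names are the full element
# names from the Selenium code; mapping them onto the truncated names in
# the listing is an assumption.
villageFormFields <- function(state, district, subdistrict, village) {
  c("ctl00$Body_Content$drpState"       = state,
    "ctl00$Body_Content$drpDistrict"    = district,
    "ctl00$Body_Content$drpSubDistrict" = subdistrict,
    "ctl00$Body_Content$drpVillage"     = village,
    "ctl00$Body_Content$btnSubmit"      = "Submit")
}

# URL-encode names and values into a form-urlencoded request body
encodeForm <- function(fields) {
  encoded <- paste0(
    vapply(names(fields), utils::URLencode, character(1), reserved = TRUE),
    "=",
    vapply(fields, utils::URLencode, character(1), reserved = TRUE)
  )
  paste(encoded, collapse = "&")
}

# Codes for Andaman and Nicobar Islands / Nicobars / Car Nicobar / Arong
fields <- villageFormFields("35", "02", "0001", "00036000")
body <- encodeForm(fields)
cat(body, "\n")
```

An HTTP client would still be needed to post the body, together with whatever hidden fields the page requires.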

UPDATE:

For interest, I ran it with PhantomJS on x = 1 (the state of Andaman and Nicobar Islands) with a slightly modified changeFun:

    changeFun <- function(value, elementName, targetName){
        changeElem <- remDr$findElement(using = "name", elementName)
        script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
        remDr$executeScript(script, list(changeElem))
        targetCodes <- c()
        # Keep polling until the dependent dropdown has been populated
        while(length(targetCodes) == 0){
            targetElem <- remDr$findElement(using = "name", targetName)
            target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
            targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
            target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
            if(length(targetCodes) == 0){
                Sys.sleep(0.5)
            }else{
                out <- list(target, targetCodes)
            }
        }
        return(out)
    }

It took 3 seconds to get the data with PhantomJS, versus 43 seconds for Firefox to get the same data.
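
One caveat with the polling loop in the modified changeFun: if a dropdown never fills (for example, a sub-district with no villages), the while loop never ends. A minimal sketch of a bounded version follows. pollUntilNonEmpty is a hypothetical helper, and mockFetch stands in for reading the option values of the target dropdown.

```r
# Poll fetch() until it returns a non-empty result, or give up after
# maxTries attempts. pollUntilNonEmpty is a hypothetical helper name.
pollUntilNonEmpty <- function(fetch, maxTries = 20, pause = 0.5) {
  for (attempt in seq_len(maxTries)) {
    result <- fetch()
    if (length(result) > 0) {
      return(result)
    }
    Sys.sleep(pause)
  }
  stop("no options appeared after ", maxTries, " attempts")
}

# Mock fetch that comes back empty twice before returning two codes,
# standing in for reading the option values of the target dropdown
calls <- 0
mockFetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else c("0001", "0002")
}

codes <- pollUntilNonEmpty(mockFetch, pause = 0)
print(codes)
```

Whether to stop with an error or return an empty result on timeout is a design choice; an error makes a stalled page visible instead of silently recording zero villages.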

Answer2:

I was able to find the number of districts in each state using the following code:

    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
    districtElem$sendKeysToElement(list(key = 'enter'))
    districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
    # The element text holds one dropdown entry per line
    stuff <- districtElem$describeElement()$text
    # Subtract 1 for the first (placeholder) entry
    dist_num <- length(unlist(strsplit(stuff, "\\n"))) - 1
    dist_num

The length for the other nested loops can be derived similarly. While it is certainly inefficient, it is a solution nevertheless.
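
The counting step can be tried on its own with a plain string. In this sketch countOptions is a hypothetical wrapper around the same strsplit logic, and the text is made up in the shape the code above expects: the first line ("Select Sub-District") is an invented placeholder entry, followed by the two Car Nicobar and Nancowry entries.

```r
# Count dropdown entries from the newline-separated text of a <select>,
# dropping the first line (assumed to be a placeholder entry).
countOptions <- function(text) {
  lines <- unlist(strsplit(text, "\n", fixed = TRUE))
  length(lines) - 1
}

# Made-up element text; the placeholder label is an assumption
stuff <- "Select Sub-District\nCar Nicobar\nNancowry"
subdist_num <- countOptions(stuff)
print(subdist_num)
```

This relies on no entry containing a line break of its own, which is one reason reading the option elements directly is less fragile.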

Still looking to learn a more efficient method for this type of project....
