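I am trying to collect village-level Indian census data from http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx
using RSelenium. I can navigate the page and select different values in the four drop-down menus with the following code: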
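require(RSelenium)
require(XML)     # provides readHTMLTable, xmlParse, xmlGetAttr, xmlValue
require(selectr) # provides querySelectorAll
# Setting up the Selenium server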
RSelenium::checkForServer()
RSelenium::startServer() # if needed
remDr <- remoteDriver$new()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")
#Finding and changing the menus
stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState")
stateElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
districtElem$sendKeysToElement(list(key = "enter"))
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
districtElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))
subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
subdistrictElem$sendKeysToElement(list(key = "enter"))
subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
subdistrictElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))
villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage")
villageElem$sendKeysToElement(list(key = "enter"))
villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage")
villageElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))
submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit")
remDr$executeScript("arguments[0].click();", list(submitElem))
table <- readHTMLTable(remDr$getPageSource()[[1]], which=8)
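Now the bigger problem. I need to run this code for every village in India (or for the villages of selected states). Computation time is not an issue: I have a dedicated bank of computers and plan to split the job across many machines.
However, I need to work out how many districts each state has, how many subdistricts each district has, and how many villages each subdistrict has, so that I can run it through a nested for loop.
The framework I have in mind looks something like this: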
tables <- NULL  # accumulator for the scraped village tables
num_states <- "code grabbing this from the options list"
for (r in seq_len(num_states)) {
  stateElem_code_block[r]
  num_dist <- "code grabbing number of districts from the options list"
  for (k in seq_len(num_dist)) {
    districtElem_code_block[k]
    num_subdist <- "code grabbing number of subdistricts from the options list"
    for (m in seq_len(num_subdist)) {
      subdistrictElem_code_block[m]
      num_vill <- "code grabbing number of villages from the options list"
      for (i in seq_len(num_vill)) {
        villageElem_code_block[i]
        submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit")
        remDr$executeScript("arguments[0].click();", list(submitElem))
        table <- readHTMLTable(remDr$getPageSource()[[1]], which=8)
        # append inside the innermost loop so every village's table is kept
        tables <- rbind(tables, table)
      }
    }
  }
}
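Sorry for the novel... I hope this makes sense. Any help is greatly appreciated.
Edit: I was able to solve the first problem myself...
Answer 0: (score: 1)
I was able to find the number of districts in each state with the following code: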
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
districtElem$sendKeysToElement(list(key = 'enter'))
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict")
stuff <- districtElem$describeElement()$text
dist_num <- length(unlist(strsplit(stuff, "\\n")))-1
dist_num
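The lengths of the other nested loops can be derived in the same way. It is certainly inefficient, but it is still a solution.
I would still like to learn a more efficient approach for this kind of project...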
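One possibly lighter way to get the same count is to parse the drop-down's HTML and count its option elements instead of splitting the element's text. The sketch below assumes the XML package; `select_html` is a made-up snippet standing in for the string that `districtElem$getElementAttribute("outerHTML")[[1]]` would return on the live page, and `count_options` is a hypothetical helper name. It also assumes the first option is a placeholder entry.

```r
library(XML)

# Hypothetical stand-in for districtElem$getElementAttribute("outerHTML")[[1]]
select_html <- paste0(
  '<select name="drpDistrict">',
  '<option value="">Select District</option>',
  '<option value="01">Andamans</option>',
  '<option value="02">Nicobars</option>',
  '</select>'
)

# Count the <option> entries, optionally skipping the leading placeholder
count_options <- function(html, skip_first = TRUE) {
  doc <- htmlParse(html, asText = TRUE)
  values <- xpathSApply(doc, "//option", xmlGetAttr, "value")
  if (skip_first) values <- values[-1]
  length(values)
}

dist_num <- count_options(select_html)
```

The same helper could be pointed at the subdistrict and village drop-downs once their parent selection has been made.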
Answer 1: (score: 1)
First, I would define a function that changes a drop-down:
changeFun <- function(value, elementName, targetName){
  changeElem <- remDr$findElement(using = "name", elementName)
  script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
  remDr$executeScript(script, list(changeElem))
  targetElem <- remDr$findElement(using = "name", targetName)
  target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
  targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
  target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
  list(target, targetCodes)
}
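This function sets the value of the drop-down and uses JavaScript to fire its onchange event, which keeps interaction with the website to a minimum. You may also want to run a headless browser such as phantomJS rather than Firefox; for details on how to run phantomjs, see RSelenium: Driving OS/Browsers local and remote.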
remDr <- remoteDriver$new()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")
#STATES
stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState")
states <- stateElem$getElementAttribute("outerHTML")[[1]]
stateCodes <- sapply(querySelectorAll(xmlParse(states), "option"), xmlGetAttr, "value")[-1]
states <- sapply(querySelectorAll(xmlParse(states), "option"), xmlValue)[-1]
state <- list()
for(x in seq_along(stateCodes)){
  district <- changeFun(stateCodes[[x]], "ctl00$Body_Content$drpState", "ctl00$Body_Content$drpDistrict")
  subdistrict <- lapply(district[[2]], function(y){
    subdistrict <- changeFun(y, "ctl00$Body_Content$drpDistrict", "ctl00$Body_Content$drpSubDistrict")
    village <- lapply(subdistrict[[2]], function(z){
      village <- changeFun(z, "ctl00$Body_Content$drpSubDistrict", "ctl00$Body_Content$drpVillage")
      village
    })
    list(subdistrict, village)
  })
  state[[x]] <- list(district, subdistrict)
}
#
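state will now contain all the states, districts, subdistricts, and villages along with their codes.
I only ran it for x = 1, which is Andaman and Nicobar Islands. As an example, here is the data for the Nicobars district: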
> state[[1]][[2]][[2]]
[[1]]
[[1]][[1]]
[1] "Car Nicobar" "Nancowry"
[[1]][[2]]
[1] "0001" "0002"
[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
[1] "Arong" "Big Lapati" "Chuckchucha" "IAF Camp" "Kakana"
[6] "Kimois" "Kinmai" "Kinyuka" "Malacca" "Mus"
[11] "Perka" "Sawai" "Small Lapati" "Tamaloo" "Tapoiming"
[16] "Teetop"
[[2]][[1]][[2]]
[1] "00036000" "00037000" "00036800" "00036300" "00036200" "00036100"
[7] "00037200" "00036700" "00036400" "00035700" "00036500" "00035900"
[13] "00037100" "00036600" "00036900" "00035800"
[[2]][[2]]
[[2]][[2]][[1]]
[1] "7 km Farm" "Akupa"
[3] "Al-Hit-Touch/Balu Basti" "Alexandera River"
[5] "Alhiat" "Alhitoth/Alhiloth"
[7] "Alipa/Alips" "Alkaipoh/Alkripoh"
[9] "Aloora" "Aloorang"
[11] "Alreak" "Alsama"
[13] "Altaful" "Altheak"
[15] "Alukian/Alhukheck" "Anul/Anula"
[17] "Atkuna/Alkun" "Bahua"
[19] "Banderkari/Pulu" "Bengali"
[21] "Berainak/Badnak" "Bompoka Island"
[23] "Bumpal" "Campbell Bay"
[25] "Champin" "Chanel/Chanol"
[27] "Changua/Changup" "Chaw Nallaha"
[29] "Chingen" "Chonghipoh"
[31] "Chongkamong" "Chota Inak"
[33] "Chukmachi" "Dairkurat"
[35] "Dakhiyon (FC)" "Danlet"
[37] "Daring" "Dogmar River"
[39] "Elahi/Ilhoya" "Enam"
[41] "Galathia River (FC)" "Gandhi Nagar"
[43] "Govinda Nagar" "Hakonhala"
[45] "Halnatai/Hoinatai" "Hin-Pou-Chi"
[47] "Hindra" "Hinnunga"
[49] "Hintona" "Hitlat"
[51] "Hockook" "Hoin incl. Ikuia"
[53] "Hoipoh" "Hontona"
[55] "Hutnyak" "In-Hig-Loi"
[57] "Indira Point" "Inlock/Infock"
[59] "Inod" "Inroak/Chinlak"
[61] "Itoi" "Jansin"
[63] "Jhoola" "Joginder Nagar"
[65] "Kakana" "Kalara"
[67] "Kalasi" "Kamorta/Kalatapu"
[69] "Kamriak" "Kanahinot"
[71] "Kapanga" "Kasintung"
[73] "Katahu" "Katahuwa"
[75] "Kavatinpeu/Karahinpoh" "Kiyang"
[77] "Knot" "Koe"
[79] "Kokeon" "Kondul"
[81] "Kopenheat" "Kuikua"
[83] "Kuitasuk" "Kulatapangia"
[85] "Kumikia" "Kupinga"
[87] "Lanuanga" "Lapat"
[89] "Lawful" "Laxmi Nagar"
[91] "Luxi" "Makhahu/Makachua"
[93] "Malacca" "Mapayala"
[95] "Maru" "Masala Tapu"
[97] "Mavatapis/Maratapia" "Mildera"
[99] "Minlana/Minlan" "Minyuk"
[101] "Mohreak/Kohreakap" "Munak incl. Ponioo/Moul"
[103] "Mus" "Navy Dera"
[105] "Neang" "Neeche Tapu"
[107] "Not yet named (at 27.9 km)-A" "Nyicalang"
[109] "Olinchi/Bombay" "Olinpon/Alhinpon"
[111] "Ongulongho" "Patatiya"
[113] "Payak" "Payuha"
[115] "Pehayo" "Pilpilow"
[117] "Pulloullo/Puloulo" "Pulobaha"
[119] "Pulobaha/Pathathifen" "Pulobed"
[121] "Pulobed/Lababu" "Pulobha/Pulobahan"
[123] "Pulobhabi" "Pulokunji"
[125] "Pulomilo" "Pulopanja"
[127] "Pulopucca" "Pulotalia/Pulotohio"
[129] "Raihion" "Ramzoo"
[131] "Ranganathan Bay" "Reakomlong"
[133] "Renguang" "Safedbalu"
[135] "Safedbalu" "Sanaya"
[137] "Sastri Nagar" "Shompen hut"
[139] "Shompen Village-A" "Shompen Village-B"
[141] "Sonomkuwa" "Tahaila"
[143] "Tani" "Tapani/Tapainy"
[145] "Tapiang" "Tapong incl. Kabila"
[147] "Tavinkin/Tavakin" "Tillang Chong Island"
[149] "Tomae/Inmae" "Trinket"
[151] "Vijoy Nagar" "Vikas Nagar"
[153] "Vyavtapu" "W.B.Katchal/Hindra"
[[2]][[2]][[2]]
[1] "00053600" "00048500" "00043500" "00050800" "00037500" "00039800"
[7] "00042600" "00039700" "00038000" "00037900" "00044200" "00042100"
[13] "00041900" "00043400" "00046200" "00048300" "00041100" "00050300"
[19] "00045700" "00038900" "00047000" "00039000" "00045000" "00054000"
[25] "00043700" "00045400" "00046000" "00054400" "00052900" "00039500"
[31] "00037400" "00046900" "00038400" "00050700" "00052400" "00051900"
[37] "00045200" "00051400" "00050000" "00038100" "00053000" "00053200"
[43] "00053900" "00041400" "00041800" "00052200" "00042800" "00043800"
[49] "00044300" "00039300" "00047700" "00049200" "00040800" "00040500"
[55] "00040200" "00052600" "00052800" "00048600" "00050100" "00044000"
[61] "00044100" "00039200" "00039100" "00053500" "00047200" "00038300"
[67] "00038800" "00046800" "00040100" "00038700" "00042200" "00051700"
[73] "00050600" "00039900" "00042000" "00049000" "00046300" "00051800"
[79] "00052300" "00050400" "00051500" "00047500" "00037600" "00040600"
[85] "00040000" "00042300" "00043300" "00042700" "00054600" "00053300"
[91] "00038200" "00048400" "00043600" "00040900" "00045300" "00046100"
[97] "00039400" "00042400" "00048200" "00038600" "00047400" "00046600"
[103] "00042900" "00054500" "00043100" "00044600" "00053700" "00047300"
[109] "00049700" "00044900" "00040300" "00052100" "00043000" "00046500"
[115] "00050200" "00044500" "00049100" "00052700" "00048800" "00051000"
[121] "00050500" "00049400" "00052000" "00051100" "00048100" "00049900"
[127] "00052500" "00048700" "00037700" "00046700" "00054100" "00041500"
[133] "00051300" "00047600" "00038500" "00039600" "00053100" "00053800"
[139] "00051200" "00051600" "00041600" "00037300" "00041200" "00043900"
[145] "00047900" "00043200" "00041700" "00037800" "00045900" "00047800"
[151] "00053400" "00047100" "00040700" "00042500"
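The nested list is awkward to split across machines, so it may help to flatten one state's entry into a table with one row per village. The sketch below is only an illustration: `flatten_state` is a hypothetical helper, and `toy_state` is a hand-built, truncated copy of the structure printed above (one district, two subdistricts, two villages each), not real scraped output.

```r
# Flatten one element of `state` (list(district, subdistrict)) into a data frame.
flatten_state <- function(st) {
  district <- st[[1]]      # list(names, codes)
  per_district <- st[[2]]  # one entry per district
  rows <- list()
  for (d in seq_along(per_district)) {
    subdistrict <- per_district[[d]][[1]]  # list(names, codes)
    villages <- per_district[[d]][[2]]     # one entry per subdistrict
    for (s in seq_along(villages)) {
      if (length(villages[[s]][[2]]) == 0) next  # skip empty subdistricts
      rows[[length(rows) + 1]] <- data.frame(
        district_code = district[[2]][[d]],
        subdistrict_code = subdistrict[[2]][[s]],
        village = villages[[s]][[1]],
        village_code = villages[[s]][[2]],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

# Truncated toy data mirroring the printed structure (assumption, not a full scrape)
toy_state <- list(
  list(c("Nicobars"), c("02")),
  list(
    list(
      list(c("Car Nicobar", "Nancowry"), c("0001", "0002")),
      list(
        list(c("Arong", "Big Lapati"), c("00036000", "00037000")),
        list(c("7 km Farm", "Akupa"), c("00053600", "00048500"))
      )
    )
  )
)

villages <- flatten_state(toy_state)
```

Each row then carries the district, subdistrict, and village codes needed for one form submission, which makes it straightforward to hand out chunks of rows to different machines.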
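India has 600,000 villages :O so it is probably best to run each state as its own loop. Once you have the four necessary codes, you can get a village's data by submitting the form directly. For example, the part of the form post that fetches the village details for
State: Andaman and Nicobar Islands
District: Nicobars
Subdistrict: Car Nicobar
Village: Arong
is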
ctl00$Body_Content$btnSub... Submit
ctl00$Body_Content$drpDis... 02
ctl00$Body_Content$drpSta... 35
ctl00$Body_Content$drpSub... 0001
ctl00$Body_Content$drpVil... 00036000
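As a rough illustration of assembling those fields, the sketch below builds a named list from the four codes. `build_village_form` is a hypothetical helper, and it assumes the truncated field names in the listing correspond to the full element names used in the Selenium code earlier. An ASP.NET page like this one will probably also expect hidden fields (for example the view state) copied from a fresh page load; the sketch does not attempt to handle those.

```r
# Build the visible part of the form post from the four codes.
# Field names are assumed to match the element names used with findElement above.
build_village_form <- function(state, district, subdistrict, village) {
  fields <- list(state, district, subdistrict, village, "Submit")
  names(fields) <- c(
    "ctl00$Body_Content$drpState",
    "ctl00$Body_Content$drpDistrict",
    "ctl00$Body_Content$drpSubDistrict",
    "ctl00$Body_Content$drpVillage",
    "ctl00$Body_Content$btnSubmit"
  )
  fields
}

# Codes for Andaman and Nicobar Islands / Nicobars / Car Nicobar / Arong
form <- build_village_form("35", "02", "0001", "00036000")
```

A list like this could then be passed as the body of a form-encoded POST with an HTTP client such as httr, together with whatever hidden fields the page requires.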
Update:
Out of interest, I ran x = 1 (the state of Andaman and Nicobar Islands) using phantomJS, with a slightly modified changeFun:
changeFun <- function(value, elementName, targetName){
  changeElem <- remDr$findElement(using = "name", elementName)
  script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
  remDr$executeScript(script, list(changeElem))
  targetCodes <- c()
  while(length(targetCodes) == 0){
    targetElem <- remDr$findElement(using = "name", targetName)
    target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
    targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
    target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
    if(length(targetCodes) == 0){
      Sys.sleep(0.5)
    }else{
      out <- list(target, targetCodes)
    }
  }
  return(out)
}
Fetching the data took 3 seconds, whereas Firefox took 43 seconds to get the same data.
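The while loop in the modified changeFun polls until the target drop-down has options, but it would spin forever if a drop-down legitimately stayed empty. A minimal sketch of the same polling idea with a timeout is shown below; `wait_for` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates a drop-down that fills in on the third check.

```r
# Poll `fetch` until it returns something non-empty, or give up after `timeout` seconds.
wait_for <- function(fetch, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    if (length(result) > 0) return(result)
    if (Sys.time() >= deadline) stop("timed out waiting for a non-empty result")
    Sys.sleep(interval)
  }
}

# Stand-in fetcher: empty on the first two calls, then returns codes
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else c("0001", "0002")
}

codes <- wait_for(fake_fetch, timeout = 5, interval = 0.1)
```

In the scraper, `fetch` would be a function that re-reads the target drop-down and returns its option codes, so a stalled page raises an error instead of hanging a whole machine.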