Collecting table data from an .asp web page with RSelenium using a for loop

Date: 2014-03-11 00:54:03

Tags: asp.net r selenium

I am trying to collect village-level Indian census data from http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx

Using RSelenium, I can navigate the page and select different values in the four drop-down menus with the following code:

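require(RSelenium)
require(selectr)
require(XML) # readHTMLTable() comes from XML; load it explicitly in case RSelenium does not attach it

# Start the Selenium server and open a browser session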
RSelenium::checkForServer()
RSelenium::startServer() # if needed
remDr <- remoteDriver$new()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")

#Finding and changing the menus
stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState") 
stateElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
districtElem$sendKeysToElement(list(key = "enter"))
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
districtElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict") 
subdistrictElem$sendKeysToElement(list(key = "enter"))
subdistrictElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpSubDistrict")
subdistrictElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage") 
villageElem$sendKeysToElement(list(key = "enter"))
villageElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpVillage") 
villageElem$sendKeysToElement(list(key = "down_arrow", key = "enter"))

submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit") 
remDr$executeScript("arguments[0].click();", list(submitElem))

table <- readHTMLTable(remDr$getPageSource()[[1]], which=8)

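Now for the bigger problem. I need to run this code for every village in India (or rather, every village in selected states). Computing time is not an issue: I have a dedicated bank of computers and plan to split the job across many machines.

However, I need to work out how many districts each state has, how many subdistricts each district has, and how many villages each subdistrict has, so that I can run this through nested for loops.

The framework I have in mind looks like this: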

    tables <- NULL  # accumulates one table per village
    num_states <- "code grabbing this from the options list"
    for (r in seq_len(num_states)) {
      stateElem_code_block[r]  # select the state first, then count its districts
      num_dist <- "code grabbing number of districts from the options list"

      for (k in seq_len(num_dist)) {
        districtElem_code_block[k]
        num_subdist <- "code grabbing number of subdistricts from the options list"

        for (m in seq_len(num_subdist)) {
          subdistrictElem_code_block[m]
          num_vill <- "code grabbing number of villages from the options list"

          for (i in seq_len(num_vill)) {
            villageElem_code_block[i]
            submitElem <- remDr$findElement(using = "name", "ctl00$Body_Content$btnSubmit")
            remDr$executeScript("arguments[0].click();", list(submitElem))
            table <- readHTMLTable(remDr$getPageSource()[[1]], which = 8)
            tables <- rbind(tables, table)  # append inside the loop so every village is kept
          }
        }
      }
    }

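Sorry for the novel... I hope this makes sense. Any help is greatly appreciated.

Edit: I was able to solve the first problem myself...

2 answers:

Answer 0: (score: 1)

I was able to find the number of districts in each state with the following code: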

districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
districtElem$sendKeysToElement(list(key = 'enter')) 
districtElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpDistrict") 
stuff <- districtElem$describeElement()$text
dist_num <- length(unlist(strsplit(stuff, "\\n")))-1 
dist_num

The lengths of the other nested loops can be derived in the same way. It is certainly inefficient, but it is a solution nonetheless.
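The counting step above can be tried without a browser. This is only a sketch: the string below is a made-up stand-in for what `describeElement()$text` returns, assuming the element text lists one option per line with a placeholder entry first (the placeholder wording is hypothetical).

```r
# Hypothetical text of a <select> element: a placeholder line, then one line per option
stuff <- "Select District\nAndamans\nNicobars"

# Split on newlines; every line is one <option>
option_labels <- unlist(strsplit(stuff, "\n", fixed = TRUE))

# Drop the placeholder to get the number of real choices
dist_num <- length(option_labels) - 1
print(dist_num)
```

The same subtraction applies to the subdistrict and village menus, provided each of them also starts with a single placeholder entry.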

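I am still hoping to learn a more efficient approach for this kind of project...

Answer 1: (score: 1)

First, I would define a function that changes a drop-down list: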

changeFun <- function(value, elementName, targetName){
  changeElem <- remDr$findElement(using = "name", elementName)
  script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
  remDr$executeScript(script, list(changeElem))
  targetElem <- remDr$findElement(using = "name", targetName) 
  target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
  targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
  target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
  list(target, targetCodes)
}

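This function sets the value of a drop-down list and triggers its onchange event using JavaScript. That way, interaction with the website is kept to a minimum. In addition, you may want to run a headless browser such as phantomJS rather than Firefox. For details on how to run phantomJS, see "RSelenium: Driving OS/Browsers local and remote".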
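To make the injected JavaScript easier to see, here is a small sketch that builds the same script string outside the browser. The helper name `build_change_script` is hypothetical, and the digits-only check assumes that all option codes on this site are numeric strings such as "35" or "0001", as in the examples here.

```r
# Hypothetical helper: build the JavaScript that changeFun sends to the browser
build_change_script <- function(value) {
  # Assumption: option codes are digit strings, so nothing needs escaping
  stopifnot(is.character(value), length(value) == 1, grepl("^[0-9]+$", value))
  paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
}

script <- build_change_script("0001")
print(script)
```

In the browser, `arguments[0]` is the element passed as the second argument to `remDr$executeScript()`.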

remDr <- remoteDriver$new()
remDr$open()
remDr$setImplicitWaitTimeout(3000)
remDr$navigate("http://www.censusindia.gov.in/Census_Data_2001/Village_Directory/View_data/Village_Profile.aspx")

#STATES
stateElem <- remDr$findElement(using = "name", "ctl00$Body_Content$drpState") 
states <- stateElem$getElementAttribute("outerHTML")[[1]]
stateCodes <- sapply(querySelectorAll(xmlParse(states), "option"), xmlGetAttr, "value")[-1]
states <- sapply(querySelectorAll(xmlParse(states), "option"), xmlValue)[-1]

state <- list()
for(x in seq_along(stateCodes)){
  district <- changeFun(stateCodes[[x]], "ctl00$Body_Content$drpState", "ctl00$Body_Content$drpDistrict")
  subdistrict <- lapply(district[[2]], function(y){
    subdistrict <- changeFun(y, "ctl00$Body_Content$drpDistrict", "ctl00$Body_Content$drpSubDistrict")
    village <- lapply(subdistrict[[2]], function(z){
      village <- changeFun(z, "ctl00$Body_Content$drpSubDistrict", "ctl00$Body_Content$drpVillage")
      village}
    )
    list(subdistrict, village)}
  ) 
  state[[x]] <- list(district, subdistrict)
}



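state will now contain all the states, districts, subdistricts and villages together with their codes. I only ran it for x = 1, the Andaman and Nicobar Islands. As an example, here is the data for the Nicobars district.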

> state[[1]][[2]][[2]]
[[1]]
[[1]][[1]]
[1] "Car Nicobar" "Nancowry"   

[[1]][[2]]
[1] "0001" "0002"


[[2]]
[[2]][[1]]
[[2]][[1]][[1]]
 [1] "Arong"        "Big Lapati"   "Chuckchucha"  "IAF Camp"     "Kakana"      
 [6] "Kimois"       "Kinmai"       "Kinyuka"      "Malacca"      "Mus"         
[11] "Perka"        "Sawai"        "Small Lapati" "Tamaloo"      "Tapoiming"   
[16] "Teetop"      

[[2]][[1]][[2]]
 [1] "00036000" "00037000" "00036800" "00036300" "00036200" "00036100"
 [7] "00037200" "00036700" "00036400" "00035700" "00036500" "00035900"
[13] "00037100" "00036600" "00036900" "00035800"


[[2]][[2]]
[[2]][[2]][[1]]
  [1] "7 km Farm"                    "Akupa"                       
  [3] "Al-Hit-Touch/Balu Basti"      "Alexandera River"            
  [5] "Alhiat"                       "Alhitoth/Alhiloth"           
  [7] "Alipa/Alips"                  "Alkaipoh/Alkripoh"           
  [9] "Aloora"                       "Aloorang"                    
 [11] "Alreak"                       "Alsama"                      
 [13] "Altaful"                      "Altheak"                     
 [15] "Alukian/Alhukheck"            "Anul/Anula"                  
 [17] "Atkuna/Alkun"                 "Bahua"                       
 [19] "Banderkari/Pulu"              "Bengali"                     
 [21] "Berainak/Badnak"              "Bompoka Island"              
 [23] "Bumpal"                       "Campbell Bay"                
 [25] "Champin"                      "Chanel/Chanol"               
 [27] "Changua/Changup"              "Chaw Nallaha"                
 [29] "Chingen"                      "Chonghipoh"                  
 [31] "Chongkamong"                  "Chota Inak"                  
 [33] "Chukmachi"                    "Dairkurat"                   
 [35] "Dakhiyon (FC)"                "Danlet"                      
 [37] "Daring"                       "Dogmar River"                
 [39] "Elahi/Ilhoya"                 "Enam"                        
 [41] "Galathia River (FC)"          "Gandhi Nagar"                
 [43] "Govinda Nagar"                "Hakonhala"                   
 [45] "Halnatai/Hoinatai"            "Hin-Pou-Chi"                 
 [47] "Hindra"                       "Hinnunga"                    
 [49] "Hintona"                      "Hitlat"                      
 [51] "Hockook"                      "Hoin incl. Ikuia"            
 [53] "Hoipoh"                       "Hontona"                     
 [55] "Hutnyak"                      "In-Hig-Loi"                  
 [57] "Indira Point"                 "Inlock/Infock"               
 [59] "Inod"                         "Inroak/Chinlak"              
 [61] "Itoi"                         "Jansin"                      
 [63] "Jhoola"                       "Joginder Nagar"              
 [65] "Kakana"                       "Kalara"                      
 [67] "Kalasi"                       "Kamorta/Kalatapu"            
 [69] "Kamriak"                      "Kanahinot"                   
 [71] "Kapanga"                      "Kasintung"                   
 [73] "Katahu"                       "Katahuwa"                    
 [75] "Kavatinpeu/Karahinpoh"        "Kiyang"                      
 [77] "Knot"                         "Koe"                         
 [79] "Kokeon"                       "Kondul"                      
 [81] "Kopenheat"                    "Kuikua"                      
 [83] "Kuitasuk"                     "Kulatapangia"                
 [85] "Kumikia"                      "Kupinga"                     
 [87] "Lanuanga"                     "Lapat"                       
 [89] "Lawful"                       "Laxmi Nagar"                 
 [91] "Luxi"                         "Makhahu/Makachua"            
 [93] "Malacca"                      "Mapayala"                    
 [95] "Maru"                         "Masala Tapu"                 
 [97] "Mavatapis/Maratapia"          "Mildera"                     
 [99] "Minlana/Minlan"               "Minyuk"                      
[101] "Mohreak/Kohreakap"            "Munak incl. Ponioo/Moul"     
[103] "Mus"                          "Navy Dera"                   
[105] "Neang"                        "Neeche Tapu"                 
[107] "Not yet named (at 27.9 km)-A" "Nyicalang"                   
[109] "Olinchi/Bombay"               "Olinpon/Alhinpon"            
[111] "Ongulongho"                   "Patatiya"                    
[113] "Payak"                        "Payuha"                      
[115] "Pehayo"                       "Pilpilow"                    
[117] "Pulloullo/Puloulo"            "Pulobaha"                    
[119] "Pulobaha/Pathathifen"         "Pulobed"                     
[121] "Pulobed/Lababu"               "Pulobha/Pulobahan"           
[123] "Pulobhabi"                    "Pulokunji"                   
[125] "Pulomilo"                     "Pulopanja"                   
[127] "Pulopucca"                    "Pulotalia/Pulotohio"         
[129] "Raihion"                      "Ramzoo"                      
[131] "Ranganathan Bay"              "Reakomlong"                  
[133] "Renguang"                     "Safedbalu"                   
[135] "Safedbalu"                    "Sanaya"                      
[137] "Sastri Nagar"                 "Shompen hut"                 
[139] "Shompen Village-A"            "Shompen Village-B"           
[141] "Sonomkuwa"                    "Tahaila"                     
[143] "Tani"                         "Tapani/Tapainy"              
[145] "Tapiang"                      "Tapong incl. Kabila"         
[147] "Tavinkin/Tavakin"             "Tillang Chong Island"        
[149] "Tomae/Inmae"                  "Trinket"                     
[151] "Vijoy Nagar"                  "Vikas Nagar"                 
[153] "Vyavtapu"                     "W.B.Katchal/Hindra"          

[[2]][[2]][[2]]
  [1] "00053600" "00048500" "00043500" "00050800" "00037500" "00039800"
  [7] "00042600" "00039700" "00038000" "00037900" "00044200" "00042100"
 [13] "00041900" "00043400" "00046200" "00048300" "00041100" "00050300"
 [19] "00045700" "00038900" "00047000" "00039000" "00045000" "00054000"
 [25] "00043700" "00045400" "00046000" "00054400" "00052900" "00039500"
 [31] "00037400" "00046900" "00038400" "00050700" "00052400" "00051900"
 [37] "00045200" "00051400" "00050000" "00038100" "00053000" "00053200"
 [43] "00053900" "00041400" "00041800" "00052200" "00042800" "00043800"
 [49] "00044300" "00039300" "00047700" "00049200" "00040800" "00040500"
 [55] "00040200" "00052600" "00052800" "00048600" "00050100" "00044000"
 [61] "00044100" "00039200" "00039100" "00053500" "00047200" "00038300"
 [67] "00038800" "00046800" "00040100" "00038700" "00042200" "00051700"
 [73] "00050600" "00039900" "00042000" "00049000" "00046300" "00051800"
 [79] "00052300" "00050400" "00051500" "00047500" "00037600" "00040600"
 [85] "00040000" "00042300" "00043300" "00042700" "00054600" "00053300"
 [91] "00038200" "00048400" "00043600" "00040900" "00045300" "00046100"
 [97] "00039400" "00042400" "00048200" "00038600" "00047400" "00046600"
[103] "00042900" "00054500" "00043100" "00044600" "00053700" "00047300"
[109] "00049700" "00044900" "00040300" "00052100" "00043000" "00046500"
[115] "00050200" "00044500" "00049100" "00052700" "00048800" "00051000"
[121] "00050500" "00049400" "00052000" "00051100" "00048100" "00049900"
[127] "00052500" "00048700" "00037700" "00046700" "00054100" "00041500"
[133] "00051300" "00047600" "00038500" "00039600" "00053100" "00053800"
[139] "00051200" "00051600" "00041600" "00037300" "00041200" "00043900"
[145] "00047900" "00043200" "00041700" "00037800" "00045900" "00047800"
[151] "00053400" "00047100" "00040700" "00042500"
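A nested list like this is awkward to hand out to several machines, so it may help to flatten it into one row per village. The following is a sketch only: `flatten_state` is a hypothetical helper, and `toy_state` is a hand-built miniature with the same shape as one element of `state`, holding a single district and three villages taken from the output above.

```r
# Toy stand-in for state[[1]]: list(districts, per-district details)
toy_state <- list(
  list("Nicobars", "02"),                      # district names, district codes
  list(                                        # one entry per district
    list(
      list(c("Car Nicobar", "Nancowry"), c("0001", "0002")),  # subdistricts
      list(                                                   # villages per subdistrict
        list(c("Arong", "Big Lapati"), c("00036000", "00037000")),
        list("7 km Farm", "00053600")
      )
    )
  )
)

# Hypothetical helper: one data frame row per village, with all four codes
flatten_state <- function(state_entry, state_code) {
  district <- state_entry[[1]]
  rows <- list()
  for (d in seq_along(district[[2]])) {
    subdistrict <- state_entry[[2]][[d]][[1]]
    villages <- state_entry[[2]][[d]][[2]]
    for (s in seq_along(subdistrict[[2]])) {
      village <- villages[[s]]
      if (length(village[[2]]) == 0) next      # skip subdistricts without villages
      rows[[length(rows) + 1]] <- data.frame(
        state_code = state_code,
        district_code = district[[2]][d],
        subdistrict_code = subdistrict[[2]][s],
        village_code = village[[2]],
        village = village[[1]],
        stringsAsFactors = FALSE
      )
    }
  }
  do.call(rbind, rows)
}

village_index <- flatten_state(toy_state, "35")
print(village_index)
```

A table in this form can be split by state or district code and each piece given to a different machine.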

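India has 600,000 villages :O so it is probably best to work through the states in a loop, one at a time. Once you have the four necessary codes, you can fetch the data for a village by submitting the form directly. For example, this is part of the form that is posted to get the village details for:

State: Andaman and Nicobar Islands

District: Nicobars

Subdistrict: Car Nicobar

Village: Arong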

ctl00$Body_Content$btnSub...    Submit
ctl00$Body_Content$drpDis...    02
ctl00$Body_Content$drpSta...    35
ctl00$Body_Content$drpSub...    0001
ctl00$Body_Content$drpVil...    00036000
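As a rough sketch of what submitting the form directly could look like, the snippet below builds a URL-encoded request body from the four codes. It assumes that the truncated field names above correspond to the full element names used earlier in the code (drpState, drpDistrict, drpSubDistrict, drpVillage, btnSubmit). An ASP.NET page will most likely also expect hidden fields such as `__VIEWSTATE`, copied from a freshly loaded copy of the page; those are not shown here.

```r
# Assumed full field names, taken from the element names used with findElement()
form_fields <- c(
  "ctl00$Body_Content$drpState"       = "35",
  "ctl00$Body_Content$drpDistrict"    = "02",
  "ctl00$Body_Content$drpSubDistrict" = "0001",
  "ctl00$Body_Content$drpVillage"     = "00036000",
  "ctl00$Body_Content$btnSubmit"      = "Submit"
)

# Percent-encode each string, including reserved characters such as "$"
encode <- function(x) {
  vapply(x, utils::URLencode, character(1), reserved = TRUE, USE.NAMES = FALSE)
}

# Join as name=value pairs separated by "&"
body <- paste(encode(names(form_fields)), encode(form_fields),
              sep = "=", collapse = "&")
print(body)
```

The resulting string could then be sent as an `application/x-www-form-urlencoded` POST body with whichever HTTP client you prefer.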

Update:

Out of interest, I ran x = 1, the state of Andaman and Nicobar Islands, with phantomJS, using a slightly modified changeFun:

changeFun <- function(value, elementName, targetName){
  changeElem <- remDr$findElement(using = "name", elementName)
  script <- paste0("arguments[0].value = '", value, "'; arguments[0].onchange();")
  remDr$executeScript(script, list(changeElem))
  targetCodes <- c()
  while(length(targetCodes) == 0){
    targetElem <- remDr$findElement(using = "name", targetName) 
    target <- xmlParse(targetElem$getElementAttribute("outerHTML")[[1]])
    targetCodes <- sapply(querySelectorAll(target, "option"), xmlGetAttr, "value")[-1]
    target <- sapply(querySelectorAll(target, "option"), xmlValue)[-1]
    if(length(targetCodes) == 0){
      Sys.sleep(0.5)
    }else{
      out <- list(target, targetCodes)
    }
  }
  return(out)
}

With phantomJS it took 3 seconds to get the data, whereas Firefox took 43 seconds to get the same data.
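One caveat with the polling loop in the modified changeFun: if a menu never fills (for example, a subdistrict that has no villages), the `while` loop never ends. Below is a sketch of the same idea with a time limit. `wait_for` and `fake_fetch` are hypothetical names, and `fake_fetch` merely simulates a menu that is still empty on the first two reads.

```r
# Hypothetical helper: call fetch() until it returns something or time runs out
wait_for <- function(fetch, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    result <- fetch()
    # Stop on the first non-empty result, or give up once the deadline has passed
    if (length(result) > 0 || Sys.time() >= deadline) return(result)
    Sys.sleep(interval)
  }
}

# Stand-in for reading the option codes from the browser
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  if (calls < 3) character(0) else c("0001", "0002")
}

codes <- wait_for(fake_fetch, timeout = 5, interval = 0.1)
print(codes)

# A menu that never fills returns an empty vector after the timeout
empty <- wait_for(function() character(0), timeout = 0.3, interval = 0.1)
print(length(empty))
```

Inside changeFun, `fetch` would be a function that re-reads the target element and returns its option codes.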