R - 带输入的网络抓取动态形式

时间:2015-01-21 20:51:12

标签: html asp.net r dynamic web-scraping

我试图在R中完成以下任务 该网站提供印度农业数据的分区统计数据(表格):http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx 我的理解是,这被称为动态形式,因为选项根据所做的条目而改变。具体来说,我想下载表格:

  1. State = Andhra Pradesh
  2. 区= Adilabad,Anantapur,Kadapa,...(共8个)
  3. Tahsil = Mancherial,Kasipet(这只是区= Adilabad的2个例子;总共158个)
  4. 然后我希望"平均控股的大小"适用于所有社会群体,所有性别和总数。

    基于这篇文章What if I want to web scrape with R for a page with parameters?我认为要走的路是使用getHTMLFormDescription()。但是,由于我的表格是动态的,我不能按照其他帖子中建议的路线行事。 createFunction()行返回错误:  " * writeFunction中的错误(formDescription,character(),url,con,verbose = verbose,:你应该在这里提供一个表单描述。参见getFormDescription()。"

    在可以从omegahat网页下载的RHTMLForms包中, 有这个功能(顾名思义)应该做我需要的:

    function function(desc, omit = character(), drop = TRUE, ..., verbose = FALSE) {
    # Discard the elements that we are omitting.
     if(length(omit)) {
       idx = match(omit, names(desc$elements), 0)
      k = class(desc$elements)
     desc$elements <- desc$elements[-idx]
    class(desc$elements) = k }
    
     # If no more elements left as a result of omitting them, just return the description
     # as there are definitely no more dynamic components left.
    if(length(desc$elements) == 0)
     return(desc)
    
       # Now find the dynamic components.
     dyn = sapply(desc$elements, inherits, "DynamicHTMLFormElement")
    if(!any(dyn)) 
    return(desc)
    
    pivot = desc$elements[[min(which(dyn))]]
    
     # We will need to submit the form for each value of this dynamic element, so
      # get the URI.  If the URI changes depending on the value, we are out of     luck!!
    url = mergeURI(URI(desc$formAttributes["action"]), URI(desc$url)) 
    
     # Prepare the return value with the pivot information and we will build up
     # the branches by looping over the possible values.
     descriptions = list(elementName = pivot$name,
                     description = pivot,
                     values = list())   
    
     omit = c(omit, pivot$name)
    
     for(i in names(pivot$options)) {
       # Create the arguments for the submission. We may need to include them all.
     args = list(i)
     names(args)[1] = pivot$name
    
     if(verbose) 
       cat("Checking ", pivot$name, " - option", i, "\n")
    
      #XX we may need to provide all the arguments rather than just this one.
      # or perhaps cumulate them for the elements we have already deal with.
      # We have the defaults and the possible values from the original description.
     page = formQuery(args, toString(url), desc, .checkArgs = FALSE, ...)
      # Make certain that we turn the checkDynamic off here to avoid recursively.
     tmp = getHTMLFormDescription(page, asText = TRUE, handlers = multiFormElementHandlers(url, checkDynamic = FALSE))
     tmp = getDynamicHTMLFormDescription(tmp, omit = omit)
    
       # Now remove the elements that we are omitting. This leaves a subset of the form.
     if(drop) {
        idx = match(omit, names(tmp$elements), 0)
    
        if(any(is.na(idx))) {
          k = class(tmp$elements)
          tmp$elements = tmp$elements[is.na(idx)]
          class(tmp$elements) = k
        }
    
        class(tmp) <- c("HTMLFormSubset", class(tmp))        
     }
    
     descriptions$values[[i]] = tmp
     }
    
    class(descriptions) <- c("DynamicFormElementPath")
    
    descriptions
    }
    

    然而,我无法使其工作 - 调用getDynamicHTMLFormDescription(&#34; http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx&#34;)在desc $元素中给出&#34;错误:$运算符无效原子载体&#34;。

    有人建议如何解决这个问题吗?一旦我有办法填写表格并访问每个分区(tahsil)的表格,我就知道如何获得数据的形状。它真的只是让R填写这个(特定)形式。

    欢迎任何帮助! 迈克尔凯撒 (加州大学圣地亚哥分校的职员研究员)

1 个答案:

答案 0 :(得分:5)

以下是使用RSelenium下载

数据的解决方案
  1. State = Andhra Pradesh
  2. 区= Adilabadt
  3. Tahsil = Mancherial
  4. 表格=按规模组划分的营运持有的平均规模
  5. 其余字段使用默认输入参数。

    library(RSelenium)
    library(XML)
    library(magrittr)
    
    # Start Selenium Server --------------------------------------------------------
    
    checkForServer()
    startServer()
    remDrv <- remoteDriver()
    remDrv$open()
    
    
    # Simulate browser session and fill out form -----------------------------------
    
    remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
    remDrv$findElement(using = "xpath", 
                       "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '1a']")$clickElement()
    remDrv$findElement(using = "xpath",
                       "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = '19']")$clickElement()
    remDrv$findElement(using = "xpath",
                       "//option[@value = '33']")$clickElement()
    remDrv$findElement(using = "xpath",
                       "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
    # Click submit
    remDrv$findElement(using = "xpath",
                       "//input[@value = 'Submit']")$clickElement()
    
    
    # Retrieve and download results ------------------------------------------------
    
    table <- remDrv$getPageSource()[[1]] %>% 
      htmlParse %>% 
      readHTMLTable %>% 
      extract2(4)
    
    remDrv$quit()
    remDrv$closeServer()
    
    head(table)
    
    #      V1                      V2                     V3
    # 1 SI No. Size of Holding(in ha.) Institutional Holdings
    # 2    (1)                     (2)                    (3)
    # 3      1                MARGINAL                      0
    # 4      2                   SMALL                      0
    # 5      3              SEMIMEDIUM                      0
    # 6      4                  MEDIUM                      0
    

    但是,上面的静态解决方案只回答了部分问题,即如何使用R填写网络表单。

    您网页上的棘手问题是不同下拉菜单中的值相互依赖。

    下面,您将找到一个解决方案,该解决方案考虑了这些依赖关系,而无需事先了解相应的分区和tehsils ID。

    以下代码下载

    的数据
    1. 州= GOA
    2. 表格=按规模组划分的营运持有的平均规模
    3. 包括所有地区和所有地区。我使用GOA作为主要锚点,但您可以轻松选择您选择的其他状态。

      library(RSelenium)
      library(XML)
      library(dplyr)
      library(magrittr)
      
      # Start Selenium Server --------------------------------------------------------
      
      checkForServer()
      startServer()
      remDrv <- remoteDriver()
      remDrv$open()
      
      
      # Simulate browser session and fill out form -----------------------------------
      
      remDrv$navigate('http://agcensus.dacnet.nic.in/tehsilsummarytype.aspx')
      
      # Select 27a == GOA as the anchor
      remDrv$findElement(using = "xpath", 
                         "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList1']/option[@value = '27a']")$clickElement()
      # Select 4 == Average Size of Operational Holding by Size Group 
      remDrv$findElement(using = "xpath",
                         "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList3']/option[@value = '4']")$clickElement()
      
      # Get all district IDs and the respective names belonging to GOA
      district_IDs <- remDrv$findElements(using = "xpath",
                                      "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
        lapply(function(x){x$getElementAttribute('value')}) %>% 
        unlist
      
      district_names <- remDrv$findElements(using = "xpath",
                                          "//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option") %>%
        lapply(function(x){x$getElementText()}) %>% 
        unlist
      
      
      # Retrieve and download results ------------------------------------------------
      
      result <- data.frame(district = character(), teshil = character(), 
                           V1 = character(), V2 = character(), V3 = character())
      
      for (i in seq_along(district_IDs)) {
      
        remDrv$findElement(using = "xpath",
                           paste0("//select[@name = '_ctl0:ContentPlaceHolder1:DropDownList9']/option[@value = ", 
                                "'", district_IDs[i], "']"))$clickElement()
        Sys.sleep(2)
      
        # Get all tehsil IDs and names from the currently selected district
        tehsil_IDs <- remDrv$findElements(using = "xpath",
                           "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
          lapply(function(x){x$getElementAttribute('value')}) %>%
          unlist
      
        tehsil_names <- remDrv$findElements(using = "xpath",
                                                          "//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option") %>%
          lapply(function(x){x$getElementText()}) %>%
          unlist
      
        for (j in seq_along(tehsil_IDs)) {
      
          remDrv$findElement(using = "xpath",
                              paste0("//div[@id = '_ctl0_ContentPlaceHolder1_Panel4']/select/option[@value = ",
                                     "'", tehsil_IDs[j], "']"))$clickElement()
          Sys.sleep(2)
      
          # Click submit and download data of the selected tehsil
          remDrv$findElement(using = "xpath",
                             "//input[@value = 'Submit']")$clickElement()
          Sys.sleep(2)
      
          # Download data for current tehsil
          tehsil_data <- remDrv$getPageSource()[[1]] %>% 
            htmlParse %>% 
            readHTMLTable %>% 
            extract2(4) %>% 
            extract(c(-1, -2), )
      
          result <- data.frame(district = district_names[i], tehsil = tehsil_names[j],
                               tehsil_data) %>% rbind(result, .)
      
          remDrv$goBack()
          Sys.sleep(2)
        }   
      }
      
      remDrv$quit()
      remDrv$closeServer()
      
      result %<>% as_data_frame %>%
        rename(
          si_no = V1,
          holding_size = V2, 
          inst_holdings = V3
          ) %>% 
        mutate(
          si_no = as.numeric(as.character(si_no)),
          inst_holdings = as.numeric(as.character(inst_holdings))
          )
      
      dim(result)
      # [1] 66  5
      
      head(result)
      # district tehsil si_no   holding_size inst_holdings
      # 1 NORTH GOA  ponda     1       MARGINAL          0.34
      # 2 NORTH GOA  ponda     2          SMALL          0.00
      # 3 NORTH GOA  ponda     3     SEMIMEDIUM          2.50
      # 4 NORTH GOA  ponda     4         MEDIUM          0.00
      # 5 NORTH GOA  ponda     5          LARGE        182.64
      # 6 NORTH GOA  ponda     6 ALL SIZE CLASS         41.09
      
      tail(result)
      # district tehsil si_no   holding_size inst_holdings
      # 1 SOUTH GOA quepem     1       MARGINAL          0.30
      # 2 SOUTH GOA quepem     2          SMALL          0.00
      # 3 SOUTH GOA quepem     3     SEMIMEDIUM          0.00
      # 4 SOUTH GOA quepem     4         MEDIUM          0.00
      # 5 SOUTH GOA quepem     5          LARGE         23.50
      # 6 SOUTH GOA quepem     6 ALL SIZE CLASS         15.77
      

      RSelenium甚至支持利用PhantomJS进行无头浏览,如vignette所述。