使用多个下拉选项从.aspx网页刮取表格

时间:2017-06-07 08:22:06

标签: r

我想从此页面http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx中删除表中的数据。 其中要求选择“商品”,“州”,“年”和“月”等多个选项。然后需要按提交按钮才能获得该表。

我的尝试是刮掉与“Commodity”=“Tomato”,“state”=“Karnataka”,“year”=“2016”和“month”=所有月份数据相关联的表格。我正在使用R

中的以下代码
url<-"http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"
pgsession <- html_session(url)
pgform <-html_form(pgsession)[[1]]
filled_form <-set_values(pgform,
                     "ctl00$cphBody$Commodit_list"= "Tomato",
                     "ctl00$cphBody$State_list" = "Karnataka",
                     "ctl00$cphBody$Yea_list"  = "2016",
                     "ctl00$cphBody$Mont_list" = "January"             
)
d <- submit_form(session=pgsession, form=filled_form)
y <- d %>%
html_nodes("table") %>%.[[2]] %>%
html_table(header=TRUE)
dim(y)

但我收到一条错误消息:

Submitting with 'ctl00$ddlDistrict'
Warning message:
In request_POST(session, url = url, body = request$values, encode = 
request$encode,  :
Internal Server Error (HTTP 500).

我无法从网页上删除所需的表格,请帮我从页面中提取所需的选项。

1 个答案:

答案 0 :(得分:0)

这是一种使用RSelenium包来抓取2016年所有月份数据的方法。

library(RSelenium)
library(rvest)
library(tidyverse)

url <- "http://agmarknet.gov.in/PriceTrends/SA_Month_PriMar.aspx"

rD <- rsDriver()
remDr <- rD$client

lst <- lapply(seq(2,13), function(x) {
  remDr$navigate(url)

  webElem_commodity <- remDr$findElement(using = "css", "#cphBody_Commodit_list")
  opts_commodity <- webElem_commodity$selectTag() # get all the associated tags
  commodity_num <- which(opts_commodity$text=="Tomato") # find the required option
  opts_commodity$elements[[commodity_num]]$clickElement() # select the required option

  Sys.sleep(10) # for state names to load

  webElem_state <- remDr$findElement(using = "css", "#cphBody_State_list")
  opts_state <- webElem_state$selectTag() 
  state_num <- which(opts_state$text=="Karnataka")
  opts_state$elements[[state_num]]$clickElement()

  Sys.sleep(10) # for years to load

  webElem_yr <- remDr$findElement(using = "css", "#cphBody_Yea_list")
  opts_yr <- webElem_yr$selectTag() 
  yr_num <- which(opts_yr$text=="2016")
  opts_yr$elements[[yr_num]]$clickElement()

  Sys.sleep(10) # for months to load

  webElem_month <- remDr$findElement(using = "css", "#cphBody_Mont_list")
  opts_month <- webElem_month$selectTag() 
  opts_month$elements[[x]]$clickElement() # select a different month in each lapply iteration

  Sys.sleep(10) # for submit button to become active

  webElem_submit <- remDr$findElement(using = "css", "#cphBody_But_Submit")
  webElem_submit$clickElement()

  page_source <- remDr$getPageSource()

  tdf <- read_html(page_source[[1]]) %>%     # read table 
    html_nodes("table") %>% .[[5]] %>%
    html_table(header=T,fill=T, trim=T) %>%
    head(-1) # remove the last row which contains average at the bottom of the scraped table
})

remDr$close()
rD$server$stop()
# lst is a list, with 12 elements. Each element corresponds to data for one month of 2016