Web Scrape: Select Fields from Drop Downs, Extract Resulting Data

Time: 2016-02-25 16:55:42

Tags: html asp.net r rvest

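I'm trying to do some web scraping in R and could use some help.

I'd like to extract the data in the table on this page: http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx

But first I'd like to select County from the drop-down on the left, then select Alameda County (CA) from the next drop-down, and then scrape the data in the table.

Here is what I have so far, though I think I know why it isn't working: the rvest form functions are suited to filling in a basic form rather than selecting from drop-downs on an .aspx(?) page. I searched for examples of what I'm trying to do but came up empty.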

library(rvest)
url       <-"http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx"       
pgsession <-html_session(url)               
pgform    <-html_form(pgsession)[[1]]       

filled_form <- set_values(pgform,
                      `#atype_chosen span` = "County", 
                      `#asel_chosen span` = "Alameda County (CA)") 
submit_form(pgsession,filled_form)

Anyway, this gives me the error "Error: Unknown field names: #atype_chosen span, #asel_chosen span". I kind of get it... I'm asking R to enter a value without opening up that drop-down, which can't work.
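For context, a sketch of where that error comes from: as far as I can tell, the rvest form setters match on the names of the form's fields (the `name` attributes of its `<input>` and `<select>` elements), so CSS selectors such as `#atype_chosen span` are rejected as unknown. The snippet below parses a small made-up form and lists the names rvest sees. The markup and the field names `atype`, `asel` and `go` are assumptions for illustration, not the real page's form.

```r
library(rvest)

# Hypothetical stand-in for the page's form; the real markup may differ.
html <- '<html><body><form action="/search" method="post">
  <select name="atype">
    <option value="state">State</option>
    <option value="county">County</option>
  </select>
  <select name="asel">
    <option value="06001">Alameda County (CA)</option>
  </select>
  <input type="submit" name="go" value="Go">
</form></body></html>'

form <- html_form(read_html(html))[[1]]

# Each parsed field carries the `name` attribute of its element;
# these names, not CSS selectors, are what the form setters accept.
field_names <- unname(vapply(form$fields, function(f) f$name, character(1)))
print(field_names)
```

Running the same `vapply()` line on the form taken from the real page shows which names, if any, could be set this way.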

If anyone could point me in the right direction, I'd appreciate it.

1 Answer:

Answer 0 (score: 4)

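I monitored the requests the browser makes when I select your county and used that information to build this request. It gives you the data in a different form... the area parameter in the payload is what selects the county.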
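Since only the `area` value changes between counties, the payload can be built from a county code rather than typed by hand. A minimal sketch: `county_payload` is a made-up helper name, and the single-quoted body simply copies the format of the request used in this answer.

```r
# Build the request body for one county.
# `county_payload` is a hypothetical helper, not part of httr.
county_payload <- function(area_code) {
  # Reject anything that is not a single string of digits,
  # so the value cannot break the quoting of the body.
  stopifnot(is.character(area_code), length(area_code) == 1,
            grepl("^[0-9]+$", area_code))
  sprintf("{'area':'%s', 'type':'county', 'statstype':'1'}", area_code)
}

payload <- county_payload("06001")
print(payload)
print(nchar(payload))
```

If a hard-coded Content-Length header is sent along with the request, it has to equal `nchar(payload)`.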

Update: I've added code to get the list of county names and codes, so you can pick whichever county you want to get data from...
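Picking a county then comes down to looking up its code by name. A minimal sketch, assuming the county list ends up in a data frame with `text` (name) and `value` (code) columns like the `county_df` built in the code below. The rows here are illustrative stand-ins, the name format is assumed to match the drop-down labels, and `county_code` is a made-up helper name.

```r
# Stand-in for the county list; illustrative rows only.
county_df <- data.frame(
  text  = c("Alameda County (CA)", "Alpine County (CA)", "Amador County (CA)"),
  value = c("06001", "06003", "06005"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the code for an exact county name,
# and fail loudly when the name is missing or ambiguous.
county_code <- function(df, name) {
  hit <- df$value[df$text == name]
  if (length(hit) != 1) {
    stop("expected exactly one match for '", name, "', found ", length(hit))
  }
  hit
}

code <- county_code(county_df, "Alameda County (CA)")
print(code)
```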

library("httr")

# start by getting the counties and their codes...
url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnAOI"
headers <- add_headers(
  "Accept" = "application/json, text/javascript, */*; q=0.01",
  "Accept-Encoding" = "gzip, deflate",
  "Accept-Language" = "en-US,en;q=0.8",
  "Content-Length" = "16",
  "Content-Type" = "application/json; charset=UTF-8",
  "Host" = "droughtmonitor.unl.edu",
  "Origin" = "http://droughtmonitor.unl.edu",
  "Proxy-Connection" = "keep-alive",
  "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
  "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
  "X-Requested-With" = "XMLHttpRequest"
)
a <- POST(url, body="{'aoi':'county'}", headers, encode="json")
tmp <- content(a)[[1]]
county_df <- data.frame(text=unname(unlist(sapply(tmp, "[", "Text"))),
                  value=unname(unlist(sapply(tmp, "[", "Value"))),
                  stringsAsFactors=FALSE)

# use the code for whatever county you want in the payload below...

url <- "http://droughtmonitor.unl.edu/Ajax.aspx/ReturnTabularDM"
payload <- "{'area':'06001', 'type':'county', 'statstype':'1'}"
headers <- add_headers(
                "Host" = "droughtmonitor.unl.edu",
                "Proxy-Connection" = "keep-alive",
                # 50 is nchar(payload) for a 5-character county code
                "Content-Length" = "50",
                "Accept" = "application/json, text/javascript, */*; q=0.01",
                "Origin" = "http://droughtmonitor.unl.edu",
                "X-Requested-With" = "XMLHttpRequest",
                "User-Agent" = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36",
                "Content-Type" = "application/json; charset=UTF-8",
                "Referer" = "http://droughtmonitor.unl.edu/MapsAndData/DataTables.aspx",
                "Accept-Encoding" = "gzip, deflate",
                "Accept-Language" = "en-US,en;q=0.8"
)
a <- POST(url, body=payload, headers, encode="json")
tmp <- content(a)[[1]]
df <- data.frame(date=unname(unlist(sapply(tmp, "[", "Date"))),
                 d0=unname(unlist(sapply(tmp, "[", "D0"))),
                 d1=unname(unlist(sapply(tmp, "[", "D1"))),
                 d2=unname(unlist(sapply(tmp, "[", "D2"))),
                 d3=unname(unlist(sapply(tmp, "[", "D3"))),
                 d4=unname(unlist(sapply(tmp, "[", "D4"))),
                 stringsAsFactors=FALSE)
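
The column-by-column `sapply()` calls above can also be written as one loop over the field names. A sketch under stated assumptions: `tmp` here is a hand-made stand-in with the same field names as the response (the values are invented, not real drought figures), and the D0 to D4 values are assumed to arrive as numeric text.

```r
# Hypothetical stand-in for `tmp`: a list of records, one per date.
tmp <- list(
  list(Date = "2016-02-23", D0 = "100.00", D1 = "100.00",
       D2 = "50.00", D3 = "10.00", D4 = "0.00"),
  list(Date = "2016-02-16", D0 = "100.00", D1 = "90.00",
       D2 = "40.00", D3 = "5.00", D4 = "0.00")
)

fields <- c("Date", "D0", "D1", "D2", "D3", "D4")

# Pull each field out of every record; one character vector per field.
# vapply() stops with an error if a record lacks a field.
cols <- lapply(fields, function(f) {
  vapply(tmp, function(rec) as.character(rec[[f]]), character(1))
})
df <- as.data.frame(setNames(cols, tolower(fields)), stringsAsFactors = FALSE)

# Convert the drought-category columns to numbers
# (assumes they hold numeric text).
num_cols <- setdiff(names(df), "date")
df[num_cols] <- lapply(df[num_cols], as.numeric)

print(df)
```

Keeping the field names in one vector means a new column only has to be added in one place.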