R: Web scraping an .aspx form with hidden fields, "Unknown field names" error

Time: 2020-03-22 01:12:19

Tags: r web-scraping rvest hidden-field

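For two days I have been trying to figure out how to fill in a form and submit it to download a .csv file from https://www.igb.illinois.gov/VideoReports.aspx. Unfortunately, I can't seem to crack it. Full disclosure: I'm new to web scraping. I can do basic scraping, but this is new territory for me. Ultimately I want to write a program that pulls the monthly revenue reports for all establishments going back to September 2009.

The main problem seems to be the way the form is laid out. I can't work out how to specify the fields to fill in to request the .csv file. I have been using rvest and RHTMLForms. I found the form in Chrome DevTools and can see everything I need there, but I can't drill down to the point where I can submit a query.

Here is what I have so far: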

library('rvest')
library('RHTMLForms')

igb <- "https://www.igb.illinois.gov/VideoReports.aspx"
igb_html <- read_html(igb)

igbForm <- html_form(igb_html)
igbForm

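This is where the problems seem to start. The "form" list has only one element, and it contains hidden inputs. The fields I want to query are near the end. It looks like this...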

[[1]]
<form> 'aspnetForm' (POST VideoReports.aspx)
  <input hidden> '__VIEWSTATE': /wEPDwUKMTU1MTExNzA3NQ9kFgJmD2QWAgIDD2QWAgIBD2QWBAIBD2QWEgIDDw8WAh4EVGV4dAUOU2VwdGVtYmVyIDIwMTJkZAIFDw8WAh8ABQ1GZWJydWFyeSAyMDIwZGQCFQ9kFgICAw8QZBAVAg5TdW1tYXJ5IHJlcG9ydA1EZXRhaWwgcmVwb3J0FQIOU3VtbWFyeSByZXBvcnQNRGV0YWlsIHJlcG9ydBQrAwJnZ2RkAhcPZBYCAgMPEA8WBh4ORGF0YVZhbHVlRmllbGQFA0tleR4NRGF0YVRleHRGaWVsZAUFVmFsdWUeC18 ....[TRUNCATE]

And, at the end, the fields I want to query...

  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeStatewide
  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeMuni
  <input radio> 'ctl00$MainPlaceHolder$SearchType': TypeEst
  <select> 'ctl00$MainPlaceHolder$SearchStateType' [1/2]
  <select> 'ctl00$MainPlaceHolder$SearchMunicipality' [0/1069]
  <select> 'ctl00$MainPlaceHolder$SearchEstablishment' [0/10182]
  <input text> 'ctl00$MainPlaceHolder$SearchLicenseNumber': 
  <select> 'ctl00$MainPlaceHolder$SearchStartMonth' [1/12]
  <select> 'ctl00$MainPlaceHolder$SearchStartYear' [1/9]
  <select> 'ctl00$MainPlaceHolder$SearchEndMonth' [1/12]
  <select> 'ctl00$MainPlaceHolder$SearchEndYear' [1/9]
  <input radio> 'ctl00$MainPlaceHolder$ViewType': ViewPDF
  <input radio> 'ctl00$MainPlaceHolder$ViewType': ViewCSV

I used the following to drill down to what I need...

igb_form <- getHTMLFormDescription(igb_html)
igb_form[[1]]

...and this code to look up each field and its values. For example...

igb_form_att <- igb_form[[1]]
igb_form_att$elements[[9]]

...shows me the start month field and the values in its drop-down menu...

ctl00$MainPlaceHolder$SearchStartMonth: [ February ]  January, February, March, April, May, June, July, August, September, October, November, December

I thought that would do it, so I ran the following...

igb_fill <- set_values(igb_html,
                      'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
                      'ctl00$MainPlaceHolder$SearchEstablishment'='All Establishments',
                      'ctl00$MainPlaceHolder$SearchEstablishment' ='',
                      'ctl00$MainPlaceHolder$SearchStartMonth'='September',
                      'ctl00$MainPlaceHolder$SearchStartYear'='2009',
                      'ctl00$MainPlaceHolder$SearchEndMonth' ='February',
                      'ctl00$MainPlaceHolder$SearchEndYear'='2020',
                      'ctl00$MainPlaceHolder$ViewType'='ViewCSV')

submit_form(session=igb_html, form=igb_fill, POST(igb))

But I get this error...

Error: Unknown field names: ctl00$MainPlaceHolder$SearchType, ctl00$MainPlaceHolder$SearchEstablishment, ctl00$MainPlaceHolder$SearchStartMonth, ctl00$MainPlaceHolder$SearchStartYear, ctl00$MainPlaceHolder$SearchEndMonth, ctl00$MainPlaceHolder$SearchEndYear, ctl00$MainPlaceHolder$ViewType
Traceback:

1. set_values(igb_form, `ctl00$MainPlaceHolder$SearchType` = "TypeEst", 
 .     `ctl00$MainPlaceHolder$SearchEstablishment` = "All Establishments", 
 .     `ctl00$MainPlaceHolder$SearchEstablishment` = "", `ctl00$MainPlaceHolder$SearchStartMonth` = "September", 
 .     `ctl00$MainPlaceHolder$SearchStartYear` = "2009", `ctl00$MainPlaceHolder$SearchEndMonth` = "February", 
 .     `ctl00$MainPlaceHolder$SearchEndYear` = "2020", `ctl00$MainPlaceHolder$ViewType` = "ViewCSV")
2. stop("Unknown field names: ", paste(no_match, collapse = ", "), 
 .     call. = FALSE)

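Apologies for the long-winded question, but I have put a lot of hours into this and can't seem to find an answer that gets me where I need to go. Maybe I'm in over my head, but I would appreciate any help! (I'm also fairly sure the submit code is wrong, but I can tackle that afterwards.)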

1 Answer:

Answer 0: (score: 0)

There are a few problems with your code:

  • The set_values(...) function takes a single form, not the whole HTML document, so I replaced igb_html with igb_form here
  • The submit_form(...) function takes an html_session, so I replaced read_html(igb) with html_session(igb)
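
To illustrate the first point, here is a minimal offline sketch. The HTML snippet is a made-up stand-in for the real page (only the field-name style is borrowed from the question), so it says nothing about what the live site returns. It shows that html_form() returns a list of forms even when the page has only one, and that the field names are stored on the individual form. The "Unknown field names" error is consistent with set_values() looking the names up among the fields of the object it is given, which is why it needs the single form.

```r
library(rvest)

# Hypothetical stand-in page with one form; not the real VideoReports.aspx markup.
page <- read_html('<html><body>
  <form id="aspnetForm" method="post" action="VideoReports.aspx">
    <input type="hidden" name="__VIEWSTATE" value="abc">
    <select name="ctl00$MainPlaceHolder$SearchStartMonth">
      <option value="January" selected>January</option>
      <option value="September">September</option>
    </select>
  </form>
</body></html>')

forms <- html_form(page)  # a list of forms, even for a single-form page
form  <- forms[[1]]       # the form object itself

length(forms)             # number of forms found on the page
names(form$fields)        # the field names live on the form object
is.null(forms$fields)     # the list of forms has no `fields` element of its own
```

Note that rvest 1.0.0 and later renamed the functions used in this answer: html_session() became session(), set_values() became html_form_set(), and submit_form() became session_submit().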

The following code should work:

library(rvest)

igb <- "https://www.igb.illinois.gov/VideoReports.aspx"
# Open a session rather than a parsed page, so the form can be submitted later
igb_html <- html_session(igb)

# html_form() returns a list of forms; take the first (and only) one
igb_form <- html_form(igb_html)[[1]]

# Set the query fields on the form object
igb_fill <- set_values(igb_form,
                       'ctl00$MainPlaceHolder$SearchType' = 'TypeEst',
                       'ctl00$MainPlaceHolder$SearchEstablishment' = 'All Establishments',
                       'ctl00$MainPlaceHolder$SearchStartMonth' = 'September',
                       'ctl00$MainPlaceHolder$SearchStartYear' = '2009',
                       'ctl00$MainPlaceHolder$SearchEndMonth' = 'February',
                       'ctl00$MainPlaceHolder$SearchEndYear' = '2020',
                       'ctl00$MainPlaceHolder$ViewType' = 'ViewCSV')

# Submit through the search button and keep the resulting session
igb_html <- submit_form(igb_html, igb_fill, submit = "ctl00$MainPlaceHolder$ButtonSearch")

igb_html
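
For the stated goal of pulling every monthly report back to September 2009, one possible starting point is to enumerate the month/year pairs first and then feed each pair into the start and end month/year fields. The sketch below uses base R only and makes no request. The name `periods` is just illustrative, and whether the form accepts every one of these values is an assumption: the start-year drop-down in the question's output lists only nine options, so the earliest years may not be available.

```r
# First day of every month from September 2009 through February 2020.
month_starts <- seq(as.Date("2009-09-01"), as.Date("2020-02-01"), by = "month")

# Month names come from base R's month.name, so they are always English
# regardless of the session locale (unlike format(x, "%B")).
periods <- data.frame(
  month = month.name[as.integer(format(month_starts, "%m"))],
  year  = format(month_starts, "%Y"),
  stringsAsFactors = FALSE
)

head(periods, 3)   # first few month/year pairs
nrow(periods)      # number of monthly reports to request
```

Each row could then supply both the start and the end month/year in a loop around the set_values() and submit_form() calls above, ideally with a pause between requests.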