Problem activating checkboxes with submit_form when web scraping in R

Date: 2019-06-19 23:56:53

Tags: r web-scraping

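I am trying to scrape archived files from a website that requires me to submit a search form with checkboxes before it displays the list of files. Here is an example link; note that the .pdf files only appear after clicking "Hae". When submitting the form, I want the results for all four checkboxes, so that the .pdf files containing "Liite" and "PTK" are displayed as well.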

#Loads rvest, which provides html_session(), html_form(), set_values() and submit_form()
library(rvest)

#Creates a dataset of URLs for download
url_root <- "http://avoindata.eduskunta.fi/digitoidut/download/"

year_set <- 1939:1951

#Builds one download URL per year of interest (paste0() is vectorized, so no loop is needed)
download_urls <- paste0(url_root, year_set)

#Scraping data

#Creates vector for input of download URLs
finnish_urls <- vector(mode="character")

for(i in seq_along(download_urls)) {
  url <- download_urls[i]
  #These pages only list files after a "form" is submitted via a button click,
  #so I create a new HTML session and submit the filled-in form to the server
  test.session <- html_session(url)
  form <- html_form(test.session)[[1]]
  form$fields[6]$type$checked <- TRUE #Sets "Asiakirjat" checkbox to true
  form$fields[3]$type$checked <- TRUE #Sets "Pöytäkirjat" checkbox to true
  form$fields[4]$type$checked <- TRUE #Sets "Liite" checkbox to true
  form$fields[5]$type$checked <- TRUE #Sets "Hakemisto" checkbox to true
  form$url <- url
  filled.form <- set_values(form)
  test.session$url <- url
  filled.form$url <- url

  test.session2 <- submit_form(test.session, filled.form, submit = filled.form$name)

  #Pulls data content from new submitted page
  updated.session <- html_nodes(test.session2, ".main-content a") %>%
    html_attr("href") #Extracts links

  #Removes non-Finnish links (the PDFs are duplicated in Swedish)
  finnish_urls <- c(finnish_urls, updated.session[grepl("suomi", updated.session)])
}

Currently, finnish_urls contains all the desired years, but only the results for "Asiakirjat" (i.e., it only shows URLs containing ASK). Any idea what I'm doing wrong?
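
One thing that may be worth checking is the indexing in the checkbox lines. In R, single brackets on a list (`form$fields[6]`) return a one-element sub-list rather than the field itself, so `form$fields[6]$type$checked <- TRUE` may never reach the field's own `checked` entry. The sketch below reproduces this in base R only. `mock_form` and its field layout are a hypothetical stand-in, not the actual object returned by `html_form()`, whose structure should be inspected with `str()` first.

```r
# Hypothetical stand-in for a parsed form: a named list of fields,
# each field being a list with a `type` string and a `checked` flag.
# (Assumption: the real rvest object may be laid out differently.)
mock_form <- list(fields = list(
  ask = list(name = "ask", type = "checkbox", checked = TRUE),
  ptk = list(name = "ptk", type = "checkbox", checked = FALSE)
))

# Single brackets: `fields[2]` is a one-element list wrapping the field,
# so `$type` on it is NULL and the assignment never reaches the field.
# R emits a "number of items to replace" warning here; it is suppressed
# only to keep the output short.
with_single <- mock_form
suppressWarnings(with_single$fields[2]$type$checked <- TRUE)
print(with_single$fields[[2]]$checked)  # still FALSE

# Double brackets: `fields[[2]]` is the field itself, so its
# `checked` flag is actually updated.
with_double <- mock_form
with_double$fields[[2]]$checked <- TRUE
print(with_double$fields[[2]]$checked)  # TRUE
```

If the real fields follow a similar layout, the corresponding change would be along the lines of `form$fields[[6]]$checked <- TRUE`, though that remains an assumption until `str(form$fields[[6]])` confirms where the flag lives.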

This unresolved issue seems closest to what I'm dealing with.

Thanks for your help!

0 answers:

No answers