I am trying to scrape archive files from a website that requires me to submit a search form with checkboxes before it will display the file list. Here is an example link; note that the .pdfs only appear after clicking "Hae". When I submit the form, I want results for all four checkboxes, so that the .pdfs containing "Liite" and "PTK" are shown as well.
library(rvest)

#Creates a dataset of URLs for download
url_root <- "http://avoindata.eduskunta.fi/digitoidut/download/"
year_set <- 1939:1951
#Builds the download URLs for the years of interest
download_urls <- paste0(url_root, year_set)
#Scraping data
#Creates a vector to collect the Finnish-language PDF links
finnish_urls <- character(0)
for(i in seq_along(download_urls)) {
  url <- download_urls[i]
  #These webpages require submitting a search form, so I create a new HTML session and submit the filled form to the server
  test.session <- html_session(url)
  form <- html_form(test.session)[[1]]
  form$fields[[6]]$checked <- TRUE #Sets "Asiakirjat" checkbox to true
  form$fields[[3]]$checked <- TRUE #Sets "Pöytäkirjat" checkbox to true
  form$fields[[4]]$checked <- TRUE #Sets "Liite" checkbox to true
  form$fields[[5]]$checked <- TRUE #Sets "Hakemisto" checkbox to true
  form$url <- url
  filled.form <- set_values(form)
  test.session2 <- submit_form(test.session, filled.form, submit = filled.form$name)
  #Pulls the link hrefs from the newly submitted page
  updated.session <- html_nodes(test.session2, ".main-content a") %>%
    html_attr("href") #Extracts links
  #Keeps only the Finnish links (the PDFs are duplicated in Swedish)
  finnish_urls <- c(finnish_urls, updated.session[grepl("suomi", updated.session)])
}
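For reference, here is a stripped-down version of just the form step, for a single year's page. I am not sure whether checkboxes should be toggled through their `checked` attribute or by giving the field a value; my (possibly wrong) reading of how `submit_form()` builds the request is that fields with a `NULL` value are dropped, so this sketch tries the value route instead. The `"on"` value is my assumption from inspecting typical checkbox markup, not something I have confirmed for this page.

```r
library(rvest)

#Sketch: open a session for one year and fill the form by value
session <- html_session("http://avoindata.eduskunta.fi/digitoidut/download/1939")
form <- html_form(session)[[1]]

#Assumption: an unchecked checkbox has a NULL value and is omitted from the
#request, so assigning a value ("on" is a guess) should make it submit
for(i in 3:6) {
  form$fields[[i]]$value <- "on"
}
form$url <- session$url

result <- submit_form(session, form, submit = form$name)
pdf_links <- html_attr(html_nodes(result, ".main-content a"), "href")
```

If the value route is right, I would expect `pdf_links` to include the "Liite" and "PTK" .pdfs as well, but I have not been able to verify this.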
Currently, finnish_urls contains all the expected years, but only shows results for "Asiakirjat" (i.e. only URLs containing ASK appear). Any idea what I'm doing wrong?
This unresolved issue seems closest to what I'm dealing with.
Thanks for your help!