I am writing a program to collect all of the daily .csv files from this page. However, for some files I get the following error message:
Error in open.connection(file, "rt") : cannot open the connection
In addition: Warning message:
In open.connection(file, "rt") :
cannot open URL 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05042016_DailyAbsenceData.csv': HTTP status was '404 Not Found'
Here is an example using the file for May 12, 2016:
read.csv(url("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/05122016_DailyAbsenceData.csv"))
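Strangely, if you go to the website, find the link to that file and click it, R no longer raises the error and reads the file correctly. What is going on here, and how can I read these files without having to click each of them by hand? (Note that only the first of you will be able to reproduce the problem, because clicking the file fixes it for everyone who tries afterwards.)
Ultimately, I want to use the following loop to collect all of the files: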
# Create a vector of dates covering the interval the data is collected from.
dates = seq(as.Date("2016-05-01"), as.Date("2016-05-30"), by = "days")
# Format the dates to match the filename prefixes
dates = strftime(dates, '%m%d%Y')
# Create the vector of file names I want to read.
file.names = paste(dates, "_DailyAbsenceData.csv", sep = "")
# A loop that reads the .csv files into a list of data frames
daily.truancy = list()
for (i in seq_along(dates)) {
  # tryCatch prevents the loop from stopping when read.csv cannot access a file
  tryCatch({
    daily.truancy[[i]] = read.csv(url(paste("https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/", file.names[i], sep = "")), sep = ",")
    message("School day") # reached only if the file was successfully read into the list
  }, error = function(e) {
    cat("ERROR :", conditionMessage(e), "\n")
  })
}
# Combine the daily data frames into one large panel
daily.truancy.2016 <- do.call("rbind", daily.truancy)
Note that the same error message is shown when there really is no file (weekends). That is not a problem.
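Since weekend dates never have a file, they could also be dropped before any request is made. The following is only a minimal sketch: `build_absence_urls` is a hypothetical helper name, and the URL pattern is simply the one used in the loop above. Holidays that fall on a weekday would still return a 404 and still need the `tryCatch`.

```r
# Hypothetical helper: build the daily file URLs for weekdays only.
build_absence_urls <- function(from, to,
                               base_url = "https://www.eride.ri.gov/eride2K5/AggregateAttendance/Data/") {
  days <- seq(as.Date(from), as.Date(to), by = "day")
  # "%u" is the weekday number as text: "1" = Monday ... "7" = Sunday
  school_days <- days[as.integer(format(days, "%u")) <= 5]
  urls <- paste0(base_url, format(school_days, "%m%d%Y"), "_DailyAbsenceData.csv")
  # Name each URL by its ISO date so results can be matched back to a day
  names(urls) <- format(school_days, "%Y-%m-%d")
  urls
}

urls <- build_absence_urls("2016-05-01", "2016-05-30")
print(head(urls, 3))
```

Looping over `urls` instead of `file.names` would keep the rest of the code unchanged while skipping requests that are known to fail.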
Answer 0 (score: 1)
Because the page is generated dynamically, the url function does not work, but RSelenium is designed for exactly this kind of task.
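Credit goes to @jdharrison for this superb package and for his answers to challenging questions; see his answers page for more examples.
The basic setup steps are described here: RSelenium Setup
To find the element IDs we are interested in, the easiest way is to right-click the element and choose "Inspect" in Chrome. I am not sure about other browsers, but they should have a similar feature, possibly under a different name.
This opens a side panel showing the HTML tags of the selected element.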
library(RSelenium)
# Note: startServer() is defunct in recent RSelenium releases, where rsDriver()
# or a separately started Selenium server takes its place.
RSelenium:::startServer()
# You can replace the browser name with the one you use, e.g. "firefox"
remDr <- remoteDriver(browserName = "chrome")
remDr$open(silent = TRUE)
appURL <- 'https://www.eride.ri.gov/eride2K5/AggregateAttendance/AttendanceReports.aspx'
# Total number of months to download
totalMonths = 2
remDr$navigate(appURL)
for (monthYearCounter in 1:totalMonths) {
  # Active month and year shown on the page, e.g. April 2017
  monthYearElem = remDr$findElement("xpath", "//td[contains(@style,'width:70%')]")
  # Highlight the element in yellow for visual feedback
  monthYearElem$highlightElement()
  # Extract its text
  monthYearText = unlist(monthYearElem$getElementAttribute("innerHTML"))
  cat(paste0("Processing month year=", monthYearText, "\n"))
  # For a given month, all the CSV files are listed in a table.
  # Find the elements for all CSV files using the id pattern "imgBtnXls"
  csvFilesElemList = remDr$findElements("xpath", "//input[contains(@id,'imgBtnXls')]")
  # Click every element; each file is saved to the default download location.
  # Pause between consecutive requests to avoid burdening the server.
  lapply(csvFilesElemList, function(x) {
    x$clickElement()
    # Be nice, do not overload servers with rapid requests!
    Sys.sleep(60)
  })
  # Go to the previous month
  remDr$findElement("xpath", "//a[contains(@title,'Go to the previous month')]")$clickElement()
}
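Once the browser has saved the files, they still need to be read back into R and stacked. The following is a rough sketch rather than part of the tested answer: `combine_daily_files` is a hypothetical helper, it assumes the downloads keep the `MMDDYYYY_DailyAbsenceData.csv` naming and share the same columns, and the `School` and `Absent` columns in the demo are made up purely so the example runs without touching the website.

```r
# Hypothetical helper: read every downloaded daily file and stack them.
combine_daily_files <- function(download_dir) {
  files <- list.files(download_dir,
                      pattern = "^[0-9]{8}_DailyAbsenceData\\.csv$",
                      full.names = TRUE)
  frames <- lapply(files, function(path) {
    df <- read.csv(path, stringsAsFactors = FALSE)
    # Recover the date from the MMDDYYYY prefix of the file name
    file_date <- as.Date(substr(basename(path), 1, 8), format = "%m%d%Y")
    df$file_date <- rep(file_date, nrow(df))
    df
  })
  # Returns NULL when no matching file is found
  do.call(rbind, frames)
}

# Demo with two tiny made-up files in a temporary directory
demo_dir <- file.path(tempdir(), "absence_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(School = c("A", "B"), Absent = c(3, 5)),
          file.path(demo_dir, "05122016_DailyAbsenceData.csv"), row.names = FALSE)
write.csv(data.frame(School = "A", Absent = 4),
          file.path(demo_dir, "05132016_DailyAbsenceData.csv"), row.names = FALSE)
combined <- combine_daily_files(demo_dir)
print(combined)
```

For the real data, `download_dir` would be whatever folder the browser saves to, which depends on the browser profile in use.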