I'm working on a project that collects data from https://www.hockey-reference.com/boxscores/. Specifically, I'm trying to get every table for a season. I've generated a list of URLs by appending each date of the calendar and each team name to https://www.hockey-reference.com/boxscores/ (for example https://www.hockey-reference.com/boxscores/20171005WSH.html).
I've stored every URL in a list, but some of them lead to a 404 error. I'm trying to use the url.exists function from the RCurl package to find out whether a URL returns a 404 and, if so, remove it from the list. The problem is that inside my for loop url.exists returns FALSE for every URL in the list, including the ones that really exist. I also tried calling url.exists(url_list[i]) directly in the console, but it still returns FALSE.
Here is my code:
library(rvest)
library(RCurl)
##### Variables ####
team_names = c("ANA","ARI","BOS","BUF","CAR","CGY","CHI","CBJ","COL","DAL","DET","EDM","FLA","LAK","MIN","MTL","NSH","NJD","NYI","NYR","OTT","PHI","PHX","PIT","SJS","STL","TBL","TOR","VAN","VGK","WPG","WSH")
S2017 = read.table(file = "2018_season", header = TRUE, sep = ",")
dates = as.character(S2017[,1])
#### formatting the dates ####
for (i in 1:length(dates)) {
  dates[i] = gsub("-", "", dates[i])
}
dates = unique(dates)

##### generating the URLs ####
url_list = c()
for (j in 1:2) { # dates
  for (k in 1:length(team_names)) {
    print(k)
    url_site = paste("https://www.hockey-reference.com/boxscores/", dates[j], team_names[k], ".html", sep = "")
    url_list = rbind(url_site, url_list)
  }
}

url_list_raffined = c()
for (l in 1:40) {
  print(l)
  if (url.exists(url_list[l], .header = TRUE) == TRUE) {
    url_list_raffined = c(url_list_raffined, url_list[l])
  }
}
Any idea what is going wrong here?
Thanks
Answer 0 (score: 0)
You can use the httr package instead of RCurl:
library(httr)
library(rvest)
library(xml2)
resp <- httr::GET(url_address, httr::timeout(60))
if (resp$status_code == 200) {
  html <- xml2::read_html(resp)
  # html_nodes() needs a CSS selector; "table" selects all <table> elements
  txt <- rvest::html_text(rvest::html_nodes(html, "table")) # or a similar selector
  # save the results somewhere or do your operations..
}
Here url_address is the address you want to download. You will probably need to put this inside a function or a loop to iterate over all your addresses.
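For example, here is a minimal sketch of such a loop, assuming url_list has been built as in the question. httr::GET, httr::status_code and rvest::html_table are standard httr/rvest functions; the one-second pause is just a guess at a polite request rate, not something the site documents:

library(httr)
library(rvest)
library(xml2)

all_tables <- list()
for (l in seq_along(url_list)) {
  resp <- httr::GET(url_list[l], httr::timeout(60))
  # keep only the pages that actually exist, instead of testing with url.exists
  if (httr::status_code(resp) == 200) {
    html <- xml2::read_html(resp)
    # html_table() converts every <table> on the page into a data frame
    all_tables[[url_list[l]]] <- rvest::html_table(html)
  }
  Sys.sleep(1) # pause between requests to be polite to the server
}

Storing the results in a list named by URL also makes it easy to tell later which box score each set of tables came from.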