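I am trying to scrape baseball data from Baseball Reference (for example, https://www.baseball-reference.com/teams/NYY/2017.shtml). I have a huge vector of URLs, which I built with a for loop because the links follow a specific pattern. However, I run into trouble when running my code, probably because I have to create too many connections in R. My vector has over 17,000 elements, and my code stops working once it reaches around 16,000. Is there an easier, and perhaps more efficient, way to replicate my code?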
library(Lahman)  # provides the Teams data frame
library(XML)     # provides htmlParse() and readHTMLTable()
teams <- unique(Teams$franchID)
years <- 1871:2017
urls <- matrix(0, length(teams), length(years))
for (i in seq_along(teams)) {
  for (j in seq_along(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/",
                         teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)
list_of_batting <- list()
list_of_pitching <- list()
for (i in seq_along(url_vector)) {
  url <- url_vector[i]
  # Fetch the page once; a failed read is treated as a page that does not exist
  res <- try(readLines(url), silent = TRUE)
  if (inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  } else {
    # Reuse the lines already downloaded instead of requesting the page again,
    # and strip HTML comment markers so commented-out tables are parsed too
    urltxt <- gsub("-->", "", gsub("<!--", "", res))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    list_of_batting[[i]] <- tables_full$players_value_batting
    list_of_pitching[[i]] <- tables_full$players_value_pitching
  }
  print(i)
  closeAllConnections()
}
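
For comparison, here is a minimal sketch of how the URL building and the page fetch might be written so that each connection is created and closed explicitly. It is only an illustration under stated assumptions: the three team codes and the year range are sample values standing in for `unique(Teams$franchID)` and `1871:2017`, `fetch_lines()` is a hypothetical helper and not part of any package, and the demonstration reads a local temporary file rather than the real site. Whether connections are really what stops the loop at around 16,000 is not confirmed here.

```r
# Build every team/year URL in one vectorized call instead of nested loops.
# The team codes are hypothetical samples standing in for unique(Teams$franchID).
teams <- c("NYY", "BOS", "CHC")
years <- 2015:2017
url_vector <- as.vector(outer(teams, years, function(team, year) {
  paste0("https://www.baseball-reference.com/teams/", team, "/", year, ".shtml")
}))

# Hypothetical helper: read one URL and always release its connection.
fetch_lines <- function(u) {
  con <- url(u)                    # creates the connection without opening it
  on.exit(close(con), add = TRUE)  # runs on success and on failure alike
  tryCatch(
    suppressWarnings(readLines(con, warn = FALSE)),
    error = function(e) NULL       # NULL marks a page that could not be read
  )
}

# Demonstration on local file URLs, so the sketch runs without network access.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html>", "<!-- <table id='demo'></table> -->", "</html>"), tmp)
page <- fetch_lines(paste0("file://", tmp))
absent <- fetch_lines(paste0("file://", tmp, ".does-not-exist"))
```

With a helper like this, `lapply(url_vector, fetch_lines)` would return one element per URL, with `NULL` for pages that failed, and the parsing step could then run only on the non-`NULL` elements.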