Scraping data in R when there are many links

Asked: 2017-12-04 13:57:47

Tags: r web-scraping

I am trying to scrape baseball data from Baseball-Reference (e.g., https://www.baseball-reference.com/teams/NYY/2017.shtml). I have a huge vector of URLs that I built with a for loop, since the links follow a specific pattern. However, the code runs into trouble partway through, probably because I end up creating too many connections in R. The vector has over 17,000 elements, and the code stops working once it gets to around 16,000 of them. Is there an easier, perhaps more efficient, way to do what my code does?

require(Lahman)
require(XML)  # provides htmlParse() and readHTMLTable() used below

teams <- unique(Teams$franchID)
years <- 1871:2017

## build one URL per team/year combination; the links follow a fixed pattern
urls <- matrix("", length(teams), length(years))
for(i in 1:length(teams)) {
  for(j in 1:length(years)) {
    urls[i, j] <- paste0("https://www.baseball-reference.com/teams/",
                         teams[i], "/", years[j], ".shtml")
  }
}
url_vector <- as.vector(urls)
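
As an aside, paste0() is vectorized, so the matrix and nested loop are not strictly needed: expand.grid() can enumerate every team/year pair and a single paste0() call builds all the URLs at once. A minimal equivalent sketch:

combos <- expand.grid(team = teams, year = years)
url_vector <- paste0("https://www.baseball-reference.com/teams/",
                     combos$team, "/", combos$year, ".shtml")

Because expand.grid() varies its first argument fastest, this yields the URLs in the same order as as.vector(urls) above.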

list_of_batting <- list()
list_of_pitching <- list()
for(i in 1:length(url_vector)) {
  url <- url_vector[i]

  res <- try(readLines(url), silent = TRUE)

  ## check if the page exists
  if(inherits(res, "try-error")) {
    list_of_batting[[i]] <- NA
    list_of_pitching[[i]] <- NA
  }
  else {
    ## reuse the lines already read; calling readLines() a second
    ## time would open another connection to the same page
    urltxt <- res
    ## the stats tables are wrapped in HTML comments, so strip the markers
    urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))
    doc <- htmlParse(urltxt)
    tables_full <- readHTMLTable(doc)
    list_of_batting[[i]] <- tables_full$players_value_batting
    list_of_pitching[[i]] <- tables_full$players_value_pitching
  }
  print(i)               # progress indicator
  closeAllConnections()  # force-release any connections left open
}
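
A failure around 16,000 iterations is consistent with connection objects accumulating: closeAllConnections() runs only at the end of each iteration, so anything that errors before reaching it can leave connections behind (it may also simply be the site throttling rapid-fire requests). Below is a minimal sketch of a more defensive loop body, assuming the connection limit is indeed the culprit; scrape_page is a hypothetical helper, not part of any package:

scrape_page <- function(u) {
  ## open the connection explicitly so it can be destroyed no matter
  ## how the read ends (success, 404, timeout, ...)
  con <- url(u)
  on.exit(close(con), add = TRUE)
  txt <- tryCatch(readLines(con, warn = FALSE), error = function(e) NULL)
  if (is.null(txt)) return(list(batting = NA, pitching = NA))
  ## same comment-stripping and parsing as the original loop
  txt <- gsub("-->", "", gsub("<!--", "", txt))
  tables_full <- readHTMLTable(htmlParse(txt))
  list(batting  = tables_full$players_value_batting,
       pitching = tables_full$players_value_pitching)
}

results <- lapply(url_vector, scrape_page)
list_of_batting  <- lapply(results, `[[`, "batting")
list_of_pitching <- lapply(results, `[[`, "pitching")

An alternative with the same effect is to switch to xml2::read_html() plus rvest::html_table(), which manage their own connections; either way, a short Sys.sleep() between requests is advisable when hitting one site 17,000+ times.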

0 Answers:

No answers