Web scraping in R: looping over a list of player IDs and years

Asked: 2017-08-06 17:39:00

Tags: r xml loops web-scraping plyr

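I am trying to use R to scrape the game logs of every MLB player going back to 2000 from baseball-reference.com. I have read a lot of useful material, but none of it is general enough for my purposes. For example, the URL for Curtis Granderson's 2016 game log is https://www.baseball-reference.com/players/gl.fcgi?id=grandcu01&t=b&year=2016

If I have a list of player IDs and years, I know I should be able to loop over them somehow, with a function similar to this one, which fetches the attendance figures for a given year: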

library(XML)  # provides readHTMLTable()

fetch_attendance <- function(year) {
  url <- paste0("http://www.baseball-reference.com/leagues/MLB/", year,
                "-misc.shtml")
  # Read every HTML table on the page and keep the first one
  data <- readHTMLTable(url, stringsAsFactors = FALSE)
  data <- data[[1]]
  data$year <- year
  data
}

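However, I am struggling to write a more general function that does the same job for player game logs. Any help is greatly appreciated. Thanks!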

1 answer:

Answer 0 (score: 2)

To generate a list of player IDs, you can do something like this:

library(rvest)

# Parse the player index page
scraping_MLB <- read_html("https://www.baseball-reference.com/players/")

# Link text of every player link in the content list
player_name1 <- scraping_MLB %>%
  html_nodes(xpath = '//*[@id="content"]/ul') %>%
  html_nodes("div") %>%
  html_nodes("a") %>%
  html_text()
player_name2 <- lapply(player_name1, function(x) strsplit(x, split = ","))
player_name <- setNames(do.call(rbind.data.frame, player_name2), "Players_Name")

# href of the same links, reduced to the player ID that sits between the last "/" and the extension
player_id1 <- scraping_MLB %>%
  html_nodes(xpath = '//*[@id="content"]/ul') %>%
  html_nodes("div") %>%
  html_nodes("a") %>%
  html_attr("href")
player_id <- setNames(as.data.frame(player_id1), "Players_ID")
player_id$Players_ID <- sub("(\\/.*\\/.*\\/)(\\w+)(..*)", "\\2", player_id$Players_ID)

player_df <- cbind(player_name, player_id)
head(player_df)
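
The `sub()` call above depends on the shape of the link paths. As a small offline illustration, assuming the hrefs look like `/players/g/grandcu01.shtml` (an assumed format; `extract_player_id` is a hypothetical helper name and the inputs are sample strings), the ID could be pulled out with a stricter pattern:

```r
# Hypothetical helper: extract the player ID from a link path.
# Assumes paths of the form "/players/<letter>/<id>.shtml".
extract_player_id <- function(href) {
  sub("^/players/[a-z]/([^./]+)\\.shtml$", "\\1", href)
}

# Sample inputs for illustration only
hrefs <- c("/players/g/grandcu01.shtml", "/players/a/aardsda01.shtml")
ids <- extract_player_id(hrefs)
print(ids)
```

Anchoring the pattern at both ends means a path that does not match is returned unchanged, which makes unexpected links easy to spot.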

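Once you have the list of all player IDs, you can fetch each game log by generalizing this URL, substituting the player ID and the year in its query string: https://www.baseball-reference.com/players/gl.fcgi?id=grandcu01&t=b&year=2016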
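
A minimal sketch of that generalization, runnable without network access. `build_gamelog_url` is a hypothetical helper name, and the second player ID is only a sample value:

```r
# Hypothetical helper: build the game-log URL for one player and one year.
build_gamelog_url <- function(player_id, year) {
  paste0("https://www.baseball-reference.com/players/gl.fcgi?id=",
         player_id, "&t=b&year=", year)
}

# Every (player, year) combination; the IDs here are sample values
grid <- expand.grid(player_id = c("grandcu01", "aardsda01"),
                    year = 2015:2016,
                    stringsAsFactors = FALSE)
grid$url <- build_gamelog_url(grid$player_id, grid$year)
print(grid$url)
```

Because `paste0()` is vectorized, the whole column of URLs is built in one call.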


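(Edit note: this snippet was added after the OP's clarification.)
You can start from the sample code below and then optimize it with mapply or something similar: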

# Fetches the game logs of the first four players in player_df for 2000-2016
library(rvest)

players_stat <- list()
j <- 1

for (i in seq_len(min(4, nrow(player_df)))) {
  for (year in 2000:2016) {
    scrapped_page <- read_html(paste0("https://www.baseball-reference.com/players/gl.fcgi?id=",
                                      as.character(player_df$Players_ID[i]), "&t=b&year=", year))
    tables <- html_nodes(scrapped_page, "table")
    # Find the batting game-log table by its id attribute, not by attribute position
    batting_gamelogs <- which(html_attr(tables, "id") == "batting_gamelogs")
    # Skip player-year pages that have no such table
    if (length(batting_gamelogs) >= 1) {
      scrapped_data <- html_table(tables[[batting_gamelogs[1]]], fill = TRUE)
      scrapped_data$Year <- year
      scrapped_data$Players_Name <- player_df$Players_Name[i]

      players_stat[[j]] <- scrapped_data
      names(players_stat)[j] <- paste0(player_df$Players_ID[i], "_", year)
      j <- j + 1
    }
  }
}
players_stat
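
The `mapply` idea can be sketched roughly as follows. This is an outline under stated assumptions, not the answer's code: `scrape_all`, `fetch_fun`, and `stub_fetch` are hypothetical names, and the fetcher is passed in as an argument so the looping logic can be tried offline with a stub before it is pointed at the real site. `Map()` is base R's wrapper around `mapply()` that always returns a list.

```r
# Hypothetical sketch: apply a fetch function to every (player, year) pair.
# fetch_fun(player_id, year) is assumed to return a data frame or raise an error.
scrape_all <- function(player_ids, years, fetch_fun) {
  grid <- expand.grid(player_id = player_ids, year = years,
                      stringsAsFactors = FALSE)
  results <- Map(function(id, yr) {
    # A failed request yields NULL instead of aborting the whole run
    tryCatch(fetch_fun(id, yr), error = function(e) NULL)
  }, grid$player_id, grid$year)
  names(results) <- paste0(grid$player_id, "_", grid$year)
  # Drop the pairs that produced no table
  Filter(Negate(is.null), results)
}

# Offline stub standing in for the real scraper (it fails for the year 2001)
stub_fetch <- function(id, yr) {
  if (yr == 2001) stop("no page")
  data.frame(Players_ID = id, Year = yr, stringsAsFactors = FALSE)
}

demo <- scrape_all(c("grandcu01", "aardsda01"), 2000:2002, stub_fetch)
print(names(demo))
```

In real use, `fetch_fun` would wrap the `read_html()` and `html_table()` steps from the loop, and a short `Sys.sleep()` between requests is probably a sensible courtesy to the site.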

Hope this helps!