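Trying to use rvest to loop a command and scrape tables from multiple pages

Time: 2017-02-21 00:17:07

Tags: r web-scraping rvest

I am trying to scrape HTML tables for different football teams. This is the table I want to scrape, but I want to scrape the same table for every team and ultimately create a CSV file containing the player names and their data.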

http://www.pro-football-reference.com/teams/tam/2016_draft.htm

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

# loop
for(i in teams) {
  url <-paste0("http://www.pro-football-reference.com/teams/", i,"/2016-snap-counts.htm#snap_counts::none", sep="")
  webpage <- read_html(url)

  # grab table
  sb_table <- html_nodes(webpage, 'table')
  html_table(sb_table)
  head(sb_table)
  # bind to dataframe
  df <- rbind(df, sb_table)
}

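I get an error saying I should supply either CSS or XPath, not both, but I can't figure out exactly where the problem is (I suspect the html_nodes command). Can anyone help me solve this?

2 Answers:

Answer 0: (score: 1)

I think your URL is badly constructed, and the team names are case-sensitive. Could you try something like this?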

library(rvest)
library(magrittr)

# teams
teams <- c("ATL", "TAM", "NOR", "CAR", "GNB", "DET", "MIN", "CHI", "SEA", "CRD", "RAM", "NWE", "MIA", "BUF", "NYJ", "KAN", "RAI", "DEN", "SDG", "PIT", "RAV", "SFO", "CIN", "CLE", "HTX", "OTI", "CLT", "JAX", "DAL", "NYG", "WAS", "PHI")

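# collect one data frame per team
tables <- list()
index <- 1
for (i in teams) {
  # try() lets the loop continue if one page fails to load or parse
  try({
    # the team code in the URL is lowercase
    url <- paste0("http://www.pro-football-reference.com/teams/", tolower(i), "/2016_draft.htm")
    # html_table() on a whole document returns a list, one data frame per <table>
    table <- url %>%
      read_html() %>%
      html_table(fill = TRUE)

    # keep the first table found on the page
    tables[[index]] <- table[[1]]

    index <- index + 1
  })
}

# stack the per-team data frames into a single data frame
df <- do.call("rbind", tables)

PS: I don't understand why this question was downvoted. It looks fine to me...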
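
To illustrate the URL point on its own, here is a minimal sketch that builds the URLs for several teams in one vectorized step, before any page is fetched. The helper name build_draft_url is made up for this example; the URL pattern is the one from the question, where the team code appears in lowercase.

```r
# Hypothetical helper (not part of rvest): build the 2016 draft page URL
# for one or more team codes.
build_draft_url <- function(team) {
  # the example URL in the question uses a lowercase team code
  paste0("http://www.pro-football-reference.com/teams/", tolower(team), "/2016_draft.htm")
}

teams <- c("ATL", "TAM", "NOR")
urls <- build_draft_url(teams)
print(urls)
```

Building the URLs up front makes it easy to print and check them before starting the scraping loop.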

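Answer 1: (score: 0)

I think the appropriate CSS selector in this case is #snap_counts. Also, if there is only one table per page, you can use html_node() (singular, not html_nodes()):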

url %>% 
  read_html() %>% 
  html_node("#snap_counts") %>% 
  html_table(header = FALSE)

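Because the table has two header rows and some header cells span several columns, it is best to use header = FALSE. The first two rows of the data frame will then contain the headers, which you can clean up manually (creating your own column names).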
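
The manual cleanup could look roughly like the sketch below. It does not touch the network: the data frame raw is a hand-made stand-in for what html_table(header = FALSE) might return, and its column labels and values are invented for illustration, so the real table will need its own names.

```r
# Stand-in for the result of html_table(header = FALSE):
# rows 1 and 2 hold the two header rows, the remaining rows are data.
# All values here are made up for illustration.
raw <- data.frame(
  X1 = c("", "Player", "Player A", "Player B"),
  X2 = c("", "Pos", "QB", "WR"),
  X3 = c("Off.", "Num", "100", "80"),
  X4 = c("Off.", "Pct", "95%", "76%"),
  stringsAsFactors = FALSE
)

# Combine the two header rows into one name per column, e.g. "Off._Num";
# columns with an empty top header keep only the bottom label
top <- unlist(raw[1, ], use.names = FALSE)
bottom <- unlist(raw[2, ], use.names = FALSE)
new_names <- ifelse(top == "", bottom, paste(top, bottom, sep = "_"))

# Drop the two header rows and apply the new column names
clean <- raw[-(1:2), ]
names(clean) <- new_names
rownames(clean) <- NULL

# Everything was read as text, so convert the count column to integer
clean[["Off._Num"]] <- as.integer(clean[["Off._Num"]])
print(clean)
```

Once every team's table has been cleaned this way and combined, the result could be saved with something like write.csv(clean, "snap_counts.csv", row.names = FALSE), where the file name is just an example.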