Scraping lineup data from Pro Football Reference using R

Date: 2017-11-21 14:23:59

Tags: r xpath rvest

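I always seem to run into problems when scraping the Reference sites with either Python or R. Whenever I use my normal xpath approach (Python) or the rvest approach in R, the table I want never seems to be picked up by the scraper.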

library(rvest)

url = 'https://www.pro-football-reference.com/years/2016/games.htm'

webpage = read_html(url)

table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)

for (x in boxscore_links) {
  # substr() coerces the <a> node to its tag text; characters 10-36 are the href path
  keep = substr(x, 10, 36)
  url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
  webpage2 = read_html(url2)
  home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
  away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
  home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
  home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
  #code that will bind lineup tables with some master table -- code to be written later 
}

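I'm trying to grab the starting lineup tables. The first part of the code pulls the URLs of all the 2016 boxscores, and the for loop visits each boxscore page, hoping to extract the tables headed "Insert Team Here" Starters.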

Here is an example link: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'

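When I run the code above, the home_starters and home_starters2 objects contain zero elements (when ideally they should contain the table, or the elements of the table, that I'm trying to bring in).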

I'd appreciate the help!

1 Answer:

Answer 0: (score: 1)

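I spent the last three hours trying to figure this out. This is how it should be done: the page keeps the table inside HTML comments, so you select the comments, re-read their text as HTML, and then pick out the table. This is my own example, but I'm sure you can apply it to yours.

library(rvest)

"https://www.pro-football-reference.com/years/2017/" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%    # select comments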
  html_text() %>%    # extract comment text
  paste(collapse = '') %>%    # collapse to single string
  read_html() %>%    # reread as HTML
  html_node('table#returns') %>%    # select desired node
  html_table()
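
To apply the same idea to the starters tables in the question, the pipeline can be wrapped in a small helper. This is only a minimal sketch under stated assumptions: `read_commented_table` is a hypothetical name, the HTML string is a made-up stand-in for a boxscore page (so the example runs offline), and the `home_starters` table id is a guess based on the `div_home_starters` id used in the question.

```r
library(rvest)

# Hypothetical helper: re-parse every HTML comment on the page and
# return the one table matched by the CSS selector `css`.
read_commented_table <- function(page, css) {
  page %>%
    html_nodes(xpath = "//comment()") %>%  # select comment nodes
    html_text() %>%                        # extract the comment text
    paste(collapse = "") %>%               # collapse to a single string
    read_html() %>%                        # re-read that string as HTML
    html_node(css) %>%                     # select the desired table
    html_table()
}

# Made-up stand-in for a boxscore page: the table exists only inside a comment.
# The ids and player rows here are illustrative, not real site data.
demo_html <- '<html><body>
<div id="all_home_starters">
<!--
<div id="div_home_starters">
<table id="home_starters">
<tr><th>Player</th><th>Pos</th></tr>
<tr><td>Player A</td><td>QB</td></tr>
<tr><td>Player B</td><td>RB</td></tr>
</table>
</div>
-->
</div>
</body></html>'

demo_page <- read_html(demo_html)

# Selecting the div directly finds nothing, as in the question.
direct_hits <- demo_page %>% html_nodes(xpath = '//*[(@id="div_home_starters")]')

# Going through the comments recovers the table.
home_starters <- read_commented_table(demo_page, "table#home_starters")
```

Inside the loop from the question, the call would presumably look like `read_commented_table(webpage2, "table#home_starters")`, but I have not checked that selector against the live page, so inspect the comment text first if it returns nothing.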