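I always seem to run into problems scraping the Reference sites with either Python or R. Whenever I use my usual XPath approach (Python) or my rvest approach in R, the table I want never seems to be picked up by the scraper.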
library(rvest)
url = 'https://www.pro-football-reference.com/years/2016/games.htm'
webpage = read_html(url)
table_links = webpage %>% html_node("table") %>% html_nodes("a")
boxscore_links = subset(table_links, table_links %>% html_text() %in% "boxscore")
boxscore_links = as.list(boxscore_links)
for (x in boxscore_links) {
keep = substr(x, 10, 36)
url2 = paste('https://www.pro-football-reference.com', keep, sep = "")
webpage2 = read_html(url2)
home_team = webpage2 %>% html_nodes(xpath='//*[@id="all_home_starters"]') %>% html_text()
away_team = webpage2 %>% html_nodes(xpath='//*[@id="all_vis_starters"]') %>% html_text()
home_starters = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_text()
home_starters2 = webpage2 %>% html_nodes(xpath='//*[(@id="div_home_starters")]') %>% html_table()
#code that will bind lineup tables with some master table -- code to be written later
}
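I am trying to grab the starting lineup tables. The first part of the code pulls the URLs of all boxscores from 2016, and the for loop visits each boxscore page in the hope of extracting the tables headed by "Insert Team Here" Starters.
Here is one link as an example: 'https://www.pro-football-reference.com/boxscores/201609110rav.htm'
When I run the code above, the home_starters and home_starters2 objects contain zero elements (ideally they should contain the table, or the elements of the table, that I am trying to bring in).
I'd appreciate any help!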
Answer 0 (score: 1)
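I spent the last three hours figuring this out. The table you want sits inside an HTML comment in the page source, so ordinary selectors never see it. The fix is to select the comment nodes, extract their text, and parse that text as HTML again. This is my own example, but I'm sure you can apply it to yours.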
library(rvest)

"https://www.pro-football-reference.com/years/2017/" %>%
  read_html() %>%
  html_nodes(xpath = '//comment()') %>%  # select comments
  html_text() %>%                        # extract comment text
  paste(collapse = '') %>%               # collapse to single string
  read_html() %>%                        # reread as HTML
  html_node('table#returns') %>%         # select desired node
  html_table()
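
To show why the direct XPath lookup comes back empty and how the comment approach recovers the table, here is a minimal offline sketch in R. The inline HTML, the placeholder player rows, and the table id home_starters are assumptions made for illustration; the real boxscore page may use different markup, so check the page source before relying on the selector.

```r
library(rvest)

# Hypothetical page source: the table is wrapped in an HTML comment,
# imitating how the boxscore page appears to hide its starters table.
# The ids and rows below are illustrative assumptions, not real data.
page_source <- '
<html><body>
<div id="all_home_starters">
<!--
<div id="div_home_starters">
<table id="home_starters">
  <tr><th>Player</th><th>Pos</th></tr>
  <tr><td>Player A</td><td>QB</td></tr>
  <tr><td>Player B</td><td>RB</td></tr>
</table>
</div>
-->
</div>
</body></html>'

webpage <- read_html(page_source)

# A direct lookup finds nothing, because commented-out markup is not
# part of the parsed element tree.
direct_hits <- webpage %>%
  html_nodes(xpath = '//*[@id="div_home_starters"]')

# Pull the comment text, parse it as HTML again, then select the table.
home_starters <- webpage %>%
  html_nodes(xpath = '//comment()') %>%  # select comment nodes
  html_text() %>%                        # extract the commented-out markup
  paste(collapse = '') %>%               # collapse to a single string
  read_html() %>%                        # reparse as HTML
  html_node('table#home_starters') %>%   # assumed table id
  html_table()

print(length(direct_hits))
print(home_starters)
```

Inside the question's for loop, the same pipeline could start from webpage2 instead of webpage, assuming the live page uses a comparable table id.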