Scraping HTML data with rvest

Date: 2017-08-30 16:58:12

Tags: html, r, rvest

I am trying to scrape Hockey-Reference for a Data Science 101 project, and I am having trouble with one particular table. The page is: https://www.hockey-reference.com/boxscores/201611090BUF.html. The table I want is under "Advanced Stats Report (All Situations)". I have tried the following code:

library(rvest)

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
  html_text()

This code scrapes all of the data from the tables above it, but stops before the advanced table. I also tried to be more granular with:

url <- "https://www.hockey-reference.com/boxscores/201611090BUF.html"
ret <- url %>%
  read_html() %>%
  html_nodes(xpath = '//*[(@id = "OTT_adv")]//*[contains(concat( " ", @class, " " ), concat( " ", "right", " " ))]') %>%
  html_text()

which yields a "character(0)" result. Any and all help would be appreciated. If it isn't clear already, I'm fairly new to R. Thanks!

1 Answer:

Answer 0 (score: 2)

The information you are trying to scrape is hidden inside HTML comments on the page. Here is a solution, though it takes some work to clean up the final result:

library(rvest)
url="https://www.hockey-reference.com/boxscores/201611090BUF.html"

page <- read_html(url)  # parse html

commentedNodes <- page %>%
  html_nodes('div.section_wrapper') %>%  # select node with comment
  html_nodes(xpath = 'comment()')    # select comments within node

# there are multiple (3) nodes containing comments
# chose the 2nd via trial and error
output <- commentedNodes[2] %>%
  html_text() %>%             # return contents as text
  read_html() %>%             # parse text as html
  html_nodes('table') %>%     # select table node
  html_table()                # parse table and return data.frame

The output will be a list of 2 elements, one per table. Player names and stats are repeated multiple times with every option available, so you will need to clean up this data for your final purpose.
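As a minimal sketch of that cleanup step, using mock data rather than a live scrape: `html_table()` on these tables tends to return the real column names in the first data row, with header rows repeated mid-table. The column names here ("Player", "CF", "CA") and the player names are invented for illustration and may not match the actual site layout.

```r
# Mock of what one element of `output` might look like after html_table():
# the first row holds the real column names, and header rows repeat mid-table.
raw <- data.frame(
  X1 = c("Player", "Kyle Okposo", "Player", "Zach Bogosian"),
  X2 = c("CF", "12", "CF", "8"),
  X3 = c("CA", "9", "CA", "11"),
  stringsAsFactors = FALSE
)

names(raw) <- as.character(raw[1, ])   # promote the first row to column names
adv <- raw[raw$Player != "Player", ]   # drop the repeated header rows
num_cols <- setdiff(names(adv), "Player")
adv[num_cols] <- lapply(adv[num_cols], as.numeric)  # stat columns to numeric
```

With real data you would start from `raw <- output[[1]]` (or `output[[2]]` for the other team) instead of the mock data.frame.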