Question

I'm trying to scrape the team stats page from basketball-reference.com, but when I use readHTMLTable it only brings back the first two tables.

My R code looks like this:

library(XML)
url = "http://www.basketball-reference.com/leagues/NBA_2015.html"
teamPageTables = readHTMLTable(url)

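This returns a list of length 2: just the first two tables on the page. I want the list to contain every table on the page.

I also tried rvest with the XPath of the table I want (the Miscellaneous Stats table), but had no luck there either.

BBR seems to have changed something that blocks scraping. I even found another post about scraping the team pages where the poster says the table he wanted was at index 16... I copied his code and still got nothing.

Any help is much appreciated. Thanks,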

Answer 1

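Because the other tables sit inside HTML comments, readHTMLTable() does not pick them up. Instead, consider reading the URL text with readLines, stripping the comment markers, and parsing the document from the result. It turns out there are 85 tables on the page! The code below extracts the 10 tables that are immediately visible on screen: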

library(XML)

# READ URL TEXT
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
urltxt <- readLines(url)
# REMOVE COMMENT TAGS
urltxt <- gsub("-->", "", gsub("<!--", "", urltxt))

# PARSE UNCOMMENTED TEXT
doc <- htmlParse(urltxt)

# RETRIEVE ALL <table> TAGS
tables <- xpathApply(doc, "//table")

# LIST OF DATAFRAMES
teamPageTables <- lapply(tables[c(1:2,19:26)], function(i) readHTMLTable(i))
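
Positional indices such as 19:26 depend on the page layout at the time of writing and may shift. As a hedged alternative, the sketch below selects a table by its id attribute after the comment markers are stripped. The inline HTML is a made-up stand-in for the real page (only the misc_stats id is taken from the page source), so the cell values are assumptions for illustration only.

```r
library(XML)

# Hypothetical miniature page: one table hidden inside an HTML comment
txt <- c(
  "<html><body>",
  "<!--",
  "<table id=\"misc_stats\">",
  "<thead><tr><th>Team</th><th>Pace</th></tr></thead>",
  "<tbody><tr><td>Example Team</td><td>95.1</td></tr></tbody>",
  "</table>",
  "-->",
  "</body></html>"
)

# Strip the comment markers so the table becomes ordinary markup
txt <- gsub("-->", "", gsub("<!--", "", txt, fixed = TRUE), fixed = TRUE)

# Parse the uncommented text
doc <- htmlParse(txt, asText = TRUE)

# Pick the table by id rather than by its position in the document
node <- getNodeSet(doc, "//table[@id='misc_stats']")[[1]]
misc <- readHTMLTable(node, stringsAsFactors = FALSE)
```

On the real page, the same XPath could be applied to the doc object built above, assuming the table still carries that id.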

Answer 2

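This page contains only two valid HTML tables. The other tables are embedded in the page as HTML comments, probably to be parsed by some JavaScript. You could try parsing those comments.

The code below fetches the two valid tables and writes the raw HTML to a file. Open bb.html in a text editor and you will see that many tables are in there.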

library(rvest)
url <- "http://www.basketball-reference.com/leagues/NBA_2015.html"
page <- read_html(url)

# there are two valid tables - select them by their CSS ids
team_stats_per_game <- html_node(page, "#team-stats-per_game")
divs_standings_E <- html_nodes(page, "#divs_standings_E")

# look at the actual page text - open bb.html in a text editor
text <- readLines(url)
writeLines(text, "bb.html")

The commented-out tables look like this:

<div class="placeholder"></div>
<!--  
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_misc_stats">
  <table class="sortable stats_table" id="misc_stats" data-cols-to-freeze=2><caption>Miscellaneous Stats Table</caption>
etc.
-->
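
One possible way to parse those comments with rvest is to select the comment nodes, take their text, and parse that text again as HTML. This is only a sketch: the inline HTML is a small hypothetical stand-in modeled on the snippet above, and the team name and numbers in it are invented.

```r
library(rvest)

# Hypothetical miniature page: one visible table, one table inside a comment
html <- '<html><body>
<table id="visible"><tr><th>Team</th><th>W</th></tr><tr><td>Example Team</td><td>1</td></tr></table>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<table id="misc_stats"><caption>Miscellaneous Stats Table</caption>
<tr><th>Team</th><th>Pace</th></tr>
<tr><td>Example Team</td><td>95.1</td></tr>
</table>
</div>
-->
</body></html>'

page <- read_html(html)

# Collect the text content of every comment node in the document
comment_text <- page %>%
  html_nodes(xpath = "//comment()") %>%
  html_text()

# Re-parse the comment text as HTML, then pull the table by its id
hidden <- read_html(paste(comment_text, collapse = "\n"))
misc_stats <- hidden %>%
  html_node("#misc_stats") %>%
  html_table()
```

For the live page, read_html(url) would replace the inline string; whether the comments there parse cleanly is an assumption that would need checking.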

readHTMLTable in R returns only the first two tables from a Basketball Reference page