Question

I am trying to learn web scraping with rvest and am trying to reproduce the example given here:

https://www.r-bloggers.com/using-rvest-to-scrape-an-html-table/

Having installed rvest, I simply copy-pasted the code provided in the article:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/table[1]') %>%
  html_table()
population <- population[[1]]

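The only difference is that I use read_html() rather than html(), since the latter is deprecated.

Rather than the output reported in the article, this code yields the familiar: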

Error in population[[1]] : subscript out of bounds

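The root cause is that running the code without the last two lines leaves population with the value {xml_nodeset (0)}, so there is nothing to index.

All the previous questions on this topic suggest that it is caused by the table being formatted dynamically with JavaScript. But that is not the case here (unless Wikipedia has changed its formatting since the R-bloggers article was published in 2015).

I'm at a loss, so any insight would be much appreciated!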

Answer 1

The HTML has changed, so that xpath no longer matches anything. You could do the following:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(xpath='//table') %>%
  html_table()

Because I switched to html_node, which returns only the first match, I no longer need the index [1].
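To make the html_node versus html_nodes difference concrete, here is a minimal sketch that runs offline. The two-table page is a made-up stand-in, not the Wikipedia article, and the variable names are only for illustration.

```r
library(rvest)

# Hypothetical two-table page, used only for illustration.
page <- read_html('<html><body>
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table><tr><th>y</th></tr><tr><td>2</td></tr></table>
</body></html>')

# html_nodes() keeps every match, so html_table() returns a list of tables
# and you still have to pick one with [[1]].
all_tables <- page %>% html_nodes(xpath = "//table") %>% html_table()

# html_node() keeps only the first match, so html_table() returns that
# single table directly.
first_table <- page %>% html_node(xpath = "//table") %>% html_table()

length(all_tables)   # one entry per matched table
names(first_table)   # column names of the first table only
```

With html_nodes you would write all_tables[[1]] to get the same table that first_table already holds.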

The longer xpath now has a div in it that your original path lacks:

//*[@id="mw-content-text"]/div/table[1]

This is the path you get if you right-click the table in the browser's inspector and copy its xpath.

You want to avoid long xpaths, because they are fragile and break easily when the page's HTML changes.

You could also use CSS and select by class, for example:

library("rvest")
url <- "http://en.wikipedia.org/wiki/List_of_U.S._states_and_territories_by_population"
population <- url %>%
  read_html() %>%
  html_node(css='.wikitable') %>%
  html_table()
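If you want to see why the original path comes back empty without hitting the network, the sketch below may help. It assumes a tiny hand-written page that only mimics the relevant nesting (an extra div between #mw-content-text and the table); the real Wikipedia markup is far larger and may differ again in the future.

```r
library(rvest)

# Hypothetical miniature page: the table sits inside an extra div,
# which is the layout change described above.
page <- read_html('<html><body>
  <div id="mw-content-text">
    <div>
      <table class="wikitable">
        <tr><th>State</th><th>Population</th></tr>
        <tr><td>A</td><td>100</td></tr>
      </table>
    </div>
  </div>
</body></html>')

old      <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/table[1]')
new      <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/div/table[1]')
by_class <- html_nodes(page, css = ".wikitable")

length(old)       # 0: the table is no longer a direct child, hence {xml_nodeset (0)}
length(new)       # 1
length(by_class)  # 1

# Checking the length first avoids "subscript out of bounds" on an empty result.
if (length(by_class) > 0) {
  population <- html_table(by_class[[1]])
}
```

The class selector keeps working if the nesting depth changes again, which is the reason to prefer it over the long path.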
