Question

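All, I am trying to parse one table located at https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population, and I would like to use the htmltab package to do it. My current code is shown below, but I am getting the error that follows it. I also tried passing "Rank" and "% of world population" in the function, but still received the error. I am not sure what the problem could be.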

Please note: I am new to R and web scraping, so an explanation of the code would be a great help.

library(htmltab)

url3 <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population#Sovereign_states_and_dependencies_by_population"
list_of_countries <- htmltab(doc = url3, which = "//th[text() = 'Country(or dependent territory)']/ancestor::table")

Error: Couldn't find the table. Try passing (a different) information to the which argument.

Answer 1

This is an XPath problem, not an R problem. If you inspect the HTML of that table, the relevant header is

<th class="headerSort" tabindex="0" role="columnheader button" title="Sort ascending">
  Country<br><small>(or dependent territory)</small>
</th>

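So text() is just "Country". The "(or dependent territory)" part sits inside the nested <small> element, not in the <th>'s own text node, which is why the original predicate never matches.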
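To see this without hitting Wikipedia, here is a minimal sketch that runs the two predicates against a simplified copy of the header cell. It assumes the xml2 package (not used in the question) purely as a way to evaluate XPath from R; the inline markup is a cut-down stand-in, not the real page.

```r
library(xml2)

# Simplified stand-in for the Wikipedia header cell (assumed markup, not the live page)
html <- '<table><tr>
  <th>Country<br><small>(or dependent territory)</small></th>
  <th>Population</th>
</tr></table>'
doc <- read_html(html)

# The full visible label is never a single direct text node of <th>, so nothing matches
n_full <- length(xml_find_all(doc, "//th[text() = 'Country(or dependent territory)']"))

# The direct text node of the <th> is exactly 'Country', so this matches one cell
n_short <- length(xml_find_all(doc, "//th[text() = 'Country']"))

# The remaining text belongs to the nested <small> element
n_small <- length(xml_find_all(doc, "//th[small[text() = '(or dependent territory)']]"))

print(c(full = n_full, short = n_short, small = n_small))
```

Only the second and third expressions should find the header cell, which mirrors the "Couldn't find the table" error from the original selector.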

For example, this might work (it is not the only option; you just have to try various XPath selectors and see which one matches).

htmltab(doc = url3, which = "//th[text() = 'Country']/ancestor::table")
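As an illustration of "trying various selectors", the sketch below compares the text() predicate with one possible alternative that tests the cell's whole string value. It again assumes xml2 for evaluating XPath locally, and the two-table markup and the ids "other" and "pop" are invented for the example.

```r
library(xml2)

# Two small tables; only the second has the 'Country' header (made-up markup)
html <- '<body>
<table id="other"><tr><th>Year</th></tr><tr><td>2019</td></tr></table>
<table id="pop"><tr>
  <th>Rank</th>
  <th>Country<br><small>(or dependent territory)</small></th>
</tr><tr><td>1</td><td>China</td></tr></table>
</body>'
doc <- read_html(html)

# Same pattern as above: find the header cell, then climb to its enclosing table
by_text <- xml_find_all(doc, "//th[text() = 'Country']/ancestor::table")

# Alternative: test the cell's full string value, which includes the <small> text
by_prefix <- xml_find_all(
  doc, "//th[starts-with(normalize-space(.), 'Country')]/ancestor::table"
)

id_by_text <- xml_attr(by_text, "id")
id_by_prefix <- xml_attr(by_prefix, "id")
print(c(id_by_text, id_by_prefix))
```

Either XPath string could then be passed as the which argument of htmltab. The starts-with variant is less sensitive to how the header text is split across child elements, though it may also match other headers that begin with "Country".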

Alternatively, it is the first table on the page, so you could try which = 1.

(NB: in Chrome you can run things like $x("//th[text() = 'Country']") in the developer console to try these selectors out; other browsers no doubt offer something similar.)
