Question

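For example, for New York City, I want to extract the website from the infobox (the table on the right-hand side of the page).

This is what I am using: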

library(rvest)

url <- "https://en.wikipedia.org/wiki/New_York_City"
page <- read_html(url)

# Selects every link in every table row on the page, not only the infobox
links <- page %>%
  html_nodes("table tr a")

But that is wrong.

Answer 1

Using XPath, you can first select the infobox by its class name, infobox, and then get all of the links inside it by the tag name a.

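library("rvest")

url <- "https://en.wikipedia.org/wiki/New_York_City"

# Match any table whose class attribute contains "infobox",
# then collect every <a> element nested inside it
infobox <- url %>%
  read_html() %>%
  html_nodes(xpath='//table[contains(@class, "infobox")]//a')

print(infobox)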

Output

{xml_nodeset (81)}
 [1] <a href="/wiki/City_(New_York)" class="mw-redirect" title="City (New York)">City</a>
 [2] <a href="/wiki/File:NYC_Montage_2014_4_-_Jleon.jpg" class="image" title="Clockwise, from top: Midtow ...
 [3] <a href="/wiki/Midtown_Manhattan" title="Midtown Manhattan">Midtown Manhattan</a>
 [4] <a href="/wiki/Times_Square" title="Times Square">Times Square</a>
 [5] <a href="/wiki/Unisphere" title="Unisphere">Unisphere</a>
...
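
The XPath above returns every link in the infobox, while the question asks only for the website. One way to narrow it down is to restrict the match to the row whose header cell contains "Website" and then read the href attribute. The following is only a sketch: the HTML fragment is a hypothetical, simplified stand-in for the real infobox so that the example runs without network access, and the assumption that the live page labels that row with a th containing "Website" should be checked against the actual markup.

```r
library(rvest)

# Hypothetical, simplified stand-in for a Wikipedia infobox (not the real markup)
html <- '<table class="infobox vcard">
  <tr><th>Country</th><td><a href="/wiki/United_States">United States</a></td></tr>
  <tr><th>Website</th><td><a href="https://www.nyc.gov/">nyc.gov</a></td></tr>
</table>'

page <- read_html(html)

# Keep only rows whose header cell mentions "Website",
# then take the href of the links inside those rows
website <- page %>%
  html_nodes(xpath = '//table[contains(@class, "infobox")]//tr[th[contains(., "Website")]]//a') %>%
  html_attr("href")

print(website)
```

To try this on the live page, replace the html string with the article URL from the question. If the result is empty, inspect the row's markup and adjust the th condition accordingly.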

How to extract a specific element of a Wikipedia table in R using rvest?
