Question

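I would like to get all the text between two h2 headings. I am able to get the two headings I want, but I am stuck on how to select the content between them.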

library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")

headlines <- html_nodes(page, "h2")
x <- grep(pattern= "Contents", x=as.character(headlines))
headlines <- headlines[x:(x+1)]

I am not sure whether I am missing the point of rvest, but there must be a way to do this in two steps (get the headings I want, then get the 'li' entries below them).

Answer 1

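If I understand correctly, you want the text that follows each heading. The result should therefore be a character vector with one element per h2 heading.

E.g. the second element is the text after "Location", so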

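As its name suggests, Midway is roughly equidistant between North America and Asia, and almost halfway around the world longitudinally from Greenwich, UK. It is near the northwestern end of the Hawaiian archipelago, about one-third of the way from Honolulu, Hawaii, to Tokyo, Japan.

Midway Atoll is less than 140 nautical miles (259 km; 161 mi) east of the International Date Line, about 2,800 nautical miles (5,200 km; 3,200 mi) west of San Francisco, and 2,200 nautical miles (4,100 km; 2,500 mi) east of Tokyo.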

This can be done with the XPath preceding-sibling axis, as follows:

library(rvest)
library(purrr)
page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")

# Make sure you scope on the content of the page
content <- html_node(page, "#mw-content-text")
# Select the h2 headings; we need to know how many there are
headlines <- html_nodes(content, "h2")

# The following XPath looks at the p nodes that are direct children of
# content and counts how many of their preceding siblings are h2 tags.
# For the Location text we want all nodes that have exactly 1 preceding
# h2 tag, namely "Location" itself. "Contents" (the heading of the TOC)
# does not count because it is nested, so it is not a direct child of content.
# This XPath only selects p tags; see the P.S. for selecting all tags
# within a section.

xpath <- sprintf("./p[count(preceding-sibling::h2)=%d]", seq_along(headlines)-1)

map(xpath, ~html_nodes(x = content, xpath = .x)) %>% # Select the nodes belonging to each section
  map(html_text, trim = TRUE) %>% # Extract the text of each node
  map_chr(paste, collapse = "\n") %>% # Collapse each section's text into one string
  set_names(headlines %>% html_node("span") %>% html_text()) # Name each element after its heading

The result looks like this:

                                      <NA> 
"Midway Atoll (/ˈmɪdweɪ/; also called Mid" 
                                  Location 
"As its name suggests, Midway is roughly " 
                     Geography and geology 
"Midway Atoll is part of a chain of volca"
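To see how the count(preceding-sibling::h2) predicate groups paragraphs without depending on the live Wikipedia page, here is a minimal offline sketch. The HTML snippet, the "content" id, and the "(intro)" label are made up for illustration. Because no heading is nested in this toy page, the counts run from 0 up to the number of headings.

```r
library(rvest)

# Hypothetical miniature page, for illustration only
html <- '<div id="content">
  <p>Intro paragraph.</p>
  <h2>Location</h2>
  <p>First location paragraph.</p>
  <p>Second location paragraph.</p>
  <h2>Geography</h2>
  <p>Geography paragraph.</p>
</div>'

content <- html_node(read_html(html), "#content")
headlines <- html_nodes(content, "h2")

# One XPath per section: 0 preceding h2 tags = intro, 1 = after the first h2, ...
xpaths <- sprintf("./p[count(preceding-sibling::h2)=%d]", 0:length(headlines))

# For each XPath, collect the matching paragraphs and join their text
sections <- vapply(
  xpaths,
  function(xp) {
    paste(html_text(html_nodes(content, xpath = xp), trim = TRUE), collapse = "\n")
  },
  character(1),
  USE.NAMES = FALSE
)
names(sections) <- c("(intro)", html_text(headlines))
sections
```

Each element of sections holds the paragraphs that share the same number of preceding h2 siblings, so sections[["Location"]] contains both location paragraphs joined by a newline.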

P.S.: An alternative

# The not(local-name() = 'h2') makes sure that we only get "non h2" nodes

xpath <- sprintf("./*[count(preceding-sibling::h2)=%d and not(local-name() = 'h2')]", 
                     seq_along(headlines)-1)
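As a rough illustration of why the wildcard version matters for the 'li' entries mentioned in the question, the sketch below runs it on a made-up snippet (the HTML and the "content" id are assumptions, not the Wikipedia markup). With ./* the list between the two headings is selected along with the paragraph, while the next h2 itself is excluded.

```r
library(rvest)

# Hypothetical miniature page, for illustration only
html <- '<div id="content">
  <h2>Location</h2>
  <p>A paragraph.</p>
  <ul><li>First item</li><li>Second item</li></ul>
  <h2>History</h2>
  <p>Another paragraph.</p>
</div>'

content <- html_node(read_html(html), "#content")

# All direct children with exactly one preceding h2, except h2 tags themselves
xp <- "./*[count(preceding-sibling::h2)=1 and not(local-name() = 'h2')]"
between <- html_nodes(content, xpath = xp)

# Tag names of the selected nodes: the paragraph and the list
tags <- html_name(between)

# The li entries that sit between the two headings
items <- html_text(html_nodes(between, "li"), trim = TRUE)
```

Here tags is c("p", "ul") and items holds the two list entries; the p-only XPath would have skipped the list entirely.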

rvest: get all content between two headings
