I want to get all the text between two h2 headings. I can get the two headings I want, but I am not sure how to select the content between them.
library(rvest)
page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")
headlines <- html_nodes(page, "h2")
x <- grep(pattern= "Contents", x=as.character(headlines))
headlines <- headlines[x:(x+1)]
I am not sure whether I am missing the point of rvest, but it seems this has to take two steps (get the headings I want, then get the 'li' entries beneath them).
Answer 0 (score: 3)
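If I understand correctly, you want the text that follows each heading. The result should therefore be a character vector with one element per h2 heading.
For example, the second element is the text after Location, so:
As its name suggests, Midway is roughly equidistant between North America and Asia, and lies almost halfway around the world longitudinally from Greenwich, UK. It is near the northwestern end of the Hawaiian archipelago, about one-third of the way from Honolulu, Hawaii, to Tokyo, Japan.
Midway Atoll is less than 140 nautical miles (259 km; 161 mi) east of the International Date Line, about 2,800 nautical miles (5,200 km; 3,200 mi) west of San Francisco, and 2,200 nautical miles (4,100 km; 2,500 mi) east of Tokyo.
This can be done with the XPath preceding-sibling axis, as follows: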
require(rvest)
require(purrr)
page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")
# Make sure you scope on the content of the website
content <- html_node(page, "#mw-content-text")
# Select the h2 headings; we need to know how many there are
headlines <- html_nodes(content, "h2")
# The following xpath looks at the direct children of the content node
# and counts how many of their preceding siblings are h2 tags.
# For the Location text we want all nodes that have exactly 1 preceding
# h2 tag, namely "Location" itself. "Contents" (the heading of the TOC)
# does not count because it is nested, so it is not a direct child of content.
# This xpath only selects p tags; see the P.S. for a variant that selects
# every tag within a section.
xpath <- sprintf("./p[count(preceding-sibling::h2)=%d]", seq_along(headlines)-1)
map(xpath, ~html_nodes(x = content, xpath = .x)) %>% # Select the nodes of each section
  map(html_text, trim = TRUE) %>% # Get the text of each node in the section
  map_chr(paste, collapse = "\n") %>% # Collapse each section into one string
  set_names(headlines %>% html_node("span") %>% html_text()) # Name each element after its heading
The result looks like this:
<NA>
"Midway Atoll (/ˈmɪdweɪ/; also called Mid"
Location
"As its name suggests, Midway is roughly "
Geography and geology
"Midway Atoll is part of a chain of volca"
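To see the counting trick in isolation, here is a minimal offline sketch. The HTML string, the `#content` id and the `"(intro)"` label are made up for illustration and are not taken from the Wikipedia page; only the xpath pattern is the same as above.

```r
library(rvest)
library(purrr)

# Hypothetical miniature page standing in for the real article
html <- '<div id="content">
  <p>Intro paragraph.</p>
  <h2><span>Location</span></h2>
  <p>First location paragraph.</p>
  <p>Second location paragraph.</p>
  <h2><span>Geography</span></h2>
  <p>Geography paragraph.</p>
</div>'

content <- html_node(read_html(html), "#content")
headlines <- html_nodes(content, "h2")

# One xpath per section: 0 preceding h2 = intro, 1 = first section, ...
xpath <- sprintf("./p[count(preceding-sibling::h2)=%d]",
                 c(0, seq_along(headlines)))

sections <- map(xpath, ~html_nodes(content, xpath = .x)) %>%
  map(html_text, trim = TRUE) %>%
  map_chr(paste, collapse = "\n") %>%
  set_names(c("(intro)", html_text(html_node(headlines, "span"))))

sections[["Location"]]
# "First location paragraph.\nSecond location paragraph."
```

Unlike the Wikipedia page, this toy page has no nested TOC heading, so the intro section is given an explicit label instead of relying on an `NA` name.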
P.S.: As an alternative, to select all nodes of a section rather than only the p tags:
# The not(local-name() = 'h2') makes sure that we only get "non h2" nodes
xpath <- sprintf("./*[count(preceding-sibling::h2)=%d and not(local-name() = 'h2')]",
seq_along(headlines)-1)
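A small sketch of the difference between the two xpaths, again on a made-up HTML string (the markup and the `#content` id are assumptions for illustration). It suggests that the `./*` variant would also pick up list nodes, which may be what the question's 'li' entries need.

```r
library(rvest)

# Hypothetical section containing both a paragraph and a list
html <- '<div id="content">
  <h2>Location</h2>
  <p>A paragraph.</p>
  <ul><li>First item</li><li>Second item</li></ul>
  <h2>Geography</h2>
  <p>Another paragraph.</p>
</div>'

content <- html_node(read_html(html), "#content")

# Only p tags that follow exactly one h2
p_only <- html_nodes(content, xpath = "./p[count(preceding-sibling::h2)=1]")

# Every direct child that follows exactly one h2, except h2 tags themselves
all_tags <- html_nodes(
  content,
  xpath = "./*[count(preceding-sibling::h2)=1 and not(local-name() = 'h2')]"
)

html_name(p_only)    # "p"
html_name(all_tags)  # "p" "ul"
```

Without the `not(local-name() = 'h2')` condition, the "Geography" heading would be returned too, since it also has exactly one preceding h2 sibling.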
Answer 1 (score: 0)
{{1}}