rvest: get all content between two headings

Time: 2017-03-27 11:15:00

Tags: r rvest

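I would like to get all the text between two h2 headings. I am able to get the two headings I want, but I don't know how to select the content between them.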

library(rvest)

page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")

headlines <- html_nodes(page, "h2")
x <- grep(pattern= "Contents", x=as.character(headlines))
headlines <- headlines[x:(x+1)]

I'm not sure whether I'm missing the point of rvest, but there must be a way to do this in two steps (get the headings I want, then get the 'li' entries below them).

2 answers:

Answer 0 (score: 3)

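If I understand correctly, you want the text that follows each heading. The result should therefore be a character vector with one element per h2 heading.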

For example, the second element is the text after "Location", so:

  

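As its name suggests, Midway is roughly equidistant between North America and Asia, and lies almost halfway around the world longitudinally from Greenwich, UK. It is near the northwestern end of the Hawaiian archipelago, about one-third of the way from Honolulu, Hawaii, to Tokyo, Japan.

Midway Atoll is less than 140 nautical miles (259 km; 161 mi) east of the International Date Line, about 2,800 nautical miles (5,200 km; 3,200 mi) west of San Francisco, and 2,200 nautical miles (4,100 km; 2,500 mi) east of Tokyo.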

This can be done with the XPath preceding-sibling axis, as follows:

require(rvest)
require(purrr)
page <- read_html("https://en.wikipedia.org/wiki/Midway_Atoll")

# Make sure you scope to the content of the page
content <- html_node(page, "#mw-content-text")
# Select the h2 headings; we need to know how many there are
headlines <- html_nodes(content, "h2")

# The following XPath looks at all nodes within the content and
# counts how many of the preceding siblings are h2 tags.
# For the Location text we want all nodes that have exactly 1 preceding h2 tag,
# namely "Location" itself. "Contents" (the heading of the TOC) does
# not count because it is nested, so it is not a direct child of content.
# This XPath only selects p tags; see the P.S. for selecting all tags
# between the headings.

xpath <- sprintf("./p[count(preceding-sibling::h2)=%d]", seq_along(headlines)-1)

map(xpath, ~html_nodes(x = content, xpath = .x)) %>% # Get the nodes belonging to each heading
  map(html_text, trim = TRUE) %>% # Get the text of each node
  map_chr(paste, collapse = "\n") %>% # Collapse the text within each section
  set_names(headlines %>% html_node("span") %>% html_text())

The result looks like this:

                                      <NA> 
"Midway Atoll (/ˈmɪdweɪ/; also called Mid" 
                                  Location 
"As its name suggests, Midway is roughly " 
                     Geography and geology 
"Midway Atoll is part of a chain of volca" 
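To make the counting idea easier to check without fetching the live page, here is a minimal sketch on a small inline HTML string. The markup and the "#content" id are made up for illustration and only mimic the flat layout the answer relies on (h2 tags as direct children of the content node). Unlike the code above, this sketch has no nested "Contents" heading, so it assumes one extra section for the text before the first h2 and labels it "(intro)".

```r
library(rvest)
library(purrr)

# Hypothetical stand-in for the Wikipedia content node
html <- '<div id="content">
  <p>Intro paragraph.</p>
  <h2><span>Location</span></h2>
  <p>First location paragraph.</p>
  <p>Second location paragraph.</p>
  <h2><span>Geography</span></h2>
  <p>Geography paragraph.</p>
</div>'

content <- html_node(read_html(html), "#content")
headlines <- html_nodes(content, "h2")

# One XPath per section: 0 preceding h2 tags is the intro,
# 1 is the text under the first heading, and so on
xpaths <- sprintf("./p[count(preceding-sibling::h2)=%d]", 0:length(headlines))

sections <- map(xpaths, ~ html_nodes(content, xpath = .x)) %>%
  map(html_text, trim = TRUE) %>%       # text of each p node
  map_chr(paste, collapse = "\n") %>%   # one string per section
  set_names(c("(intro)", html_text(headlines, trim = TRUE)))

sections[["Location"]]
```

Each p node lands in exactly one section because its number of preceding h2 siblings is fixed, which is why no explicit "start" and "end" heading has to be passed around.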

P.S.: An alternative

# The not(local-name() = 'h2') makes sure that we only get "non h2" nodes

xpath <- sprintf("./*[count(preceding-sibling::h2)=%d and not(local-name() = 'h2')]", 
                     seq_along(headlines)-1)
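Since the question also asks about the 'li' entries below a heading, the following sketch shows how the alternative XPath picks up non-paragraph nodes as well. The HTML string and the "#content" id are again hypothetical, and the section index is hard-coded to 1 (the nodes after the first h2) to keep the example short.

```r
library(rvest)

# Hypothetical markup: a paragraph and a list under the first heading
html <- '<div id="content"><p>Intro.</p><h2>Location</h2><p>A paragraph.</p><ul><li>First item</li><li>Second item</li></ul><h2>History</h2><p>Later text.</p></div>'

content <- html_node(read_html(html), "#content")

# Every direct child with exactly one preceding h2 sibling,
# excluding the next h2 heading itself
xpath <- "./*[count(preceding-sibling::h2)=1 and not(local-name() = 'h2')]"
nodes <- html_nodes(content, xpath = xpath)

tags  <- html_name(nodes)                    # tag names of the selected nodes
items <- html_text(html_nodes(nodes, "li"))  # list entries inside them
```

Without the not(local-name() = 'h2') condition, the "History" heading would be included too, because it also has exactly one preceding h2 sibling.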

Answer 1 (score: 0)

{{1}}