Question

For example, given this HTML:

<p>hello world</p>

<p>the weather is fine today</p>

<p>it is fine in a lot of places in the world<p>

For the keyword "world", the result would be:

hello world

it is fine in a lot of places in the world

Answer 1

Oh, so we are a code-writing service now. Heh. Perhaps we can do everything with XPath instead of spinning unnecessary loops in R:

library(xml2)
library(rvest)

doc_txt <- "<p>hello world</p>
<p>the weather is fine today</p>
<p>it is fine in a lot of places in the world<p>"

# Parse the HTML string into a document
doc <- read_html(doc_txt)

# Select every <p> that has a text node containing 'world', then extract its text
xml_text(xml_nodes(doc, xpath="//p[text()[contains(.,'world')]]"))
## [1] "hello world"
## [2] "it is fine in a lot of places in the world"

If you can't upgrade to the Hadleyverse, a similar idiom works in the XML package.
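The code for the XML-package variant did not survive on this page, so the following is only a minimal sketch of how the same XPath idiom might look there, not necessarily what the answerer wrote. It assumes the XML package is installed; the input string is a well-formed copy of the question's HTML, and the variable name `hits` is made up for illustration.

```r
library(XML)  # assumption: the XML package is installed

doc_txt <- "<p>hello world</p>
<p>the weather is fine today</p>
<p>it is fine in a lot of places in the world</p>"

# Parse the string as HTML; asText = TRUE marks the input as content, not a file path
doc <- htmlParse(doc_txt, asText = TRUE)

# Same XPath as above: keep <p> nodes that have a text node containing 'world',
# and apply xmlValue to each matching node to get its text
hits <- xpathSApply(doc, "//p[text()[contains(., 'world')]]", xmlValue)
print(hits)
```

Because the filtering happens inside the XPath expression, no R-level loop or grep() pass over the paragraphs is needed.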

Answer 2

Here are two options:

1) XML. Using the XML package:

Lines <- "<p>hello world</p>

<p>the weather is fine today</p>

<p>it is fine in a lot of places in the world<p>"

library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
grep("hello", xpathSApply(xmlRoot(doc), "//p", xmlValue), value = TRUE)

giving:

[1] "hello world"

2) Regular expression. If <p> and </p> always appear on the same line, as in the example, then this would also work:

L <- readLines(textConnection(Lines))
gsub(".*<p>|</p>.*", "", grep("<p>.*hello", L, value = TRUE))

giving:

[1] "hello world"
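The regular-expression option depends on each paragraph sitting on a single line. As a rough sketch of how that restriction could be relaxed in base R, the hypothetical helper below (`paragraphs_with` is a made-up name, not part of any package) collapses the input into one string and matches each <p>...</p> pair lazily. It assumes every paragraph has a closing </p> and that the tags carry no attributes; for anything messier, an HTML parser is the safer route.

```r
# Hypothetical helper, not part of any package: returns the text of every
# <p>...</p> pair whose content contains `keyword` (matched literally).
paragraphs_with <- function(html, keyword) {
  # Collapse to one string so paragraphs spanning several lines still match
  txt <- paste(html, collapse = "\n")
  # Lazy match of each <p>...</p> pair; (?s) lets . match newlines
  m <- regmatches(txt, gregexpr("(?s)<p>.*?</p>", txt, perl = TRUE))[[1]]
  # Strip the surrounding tags, then keep paragraphs containing the keyword
  body <- gsub("^<p>|</p>$", "", m)
  body[grepl(keyword, body, fixed = TRUE)]
}

html <- c("<p>hello world</p>",
          "<p>the weather is",
          "fine today</p>",
          "<p>it is fine in a lot of places in the world</p>")
res <- paragraphs_with(html, "world")
print(res)
```

Using fixed = TRUE treats the keyword as a literal string, so characters such as "." or "(" in a search term are not interpreted as regex syntax.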

How can I use R to extract the paragraphs (the parts between <p> and </p>) of an HTML file that contain a specific word?
