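How can I use R to extract the paragraphs (the parts between <p> and </p>) of an HTML file that contain a specific word?

Asked: 2016-03-26 14:11:48

Tags: r

For example, given this HTML: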

<p>hello world</p>

<p>the weather is fine today</p>

<p>it is fine in a lot of places in the world<p>

For the keyword "world", the result would be:

hello world

it is fine in a lot of places in the world

2 answers:

Answer 0 (score: 1)

Oh, so we're a code-writing service now. Heh. Perhaps we can do it all with XPath rather than spinning unnecessary loops in R:

library(xml2)
library(rvest)

doc_txt <- "<p>hello world</p>
<p>the weather is fine today</p>
<p>it is fine in a lot of places in the world<p>"

doc <- read_html(doc_txt)

xml_text(xml_nodes(doc, xpath="//p[text()[contains(.,'world')]]"))
## [1] "hello world"
## [2] "it is fine in a lot of places in the world"

If you can't upgrade to the Hadleyverse, a similar idiom will work in the XML package.
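The answer does not show the XML-package version, so the following is only a sketch of what it might look like, not the answerer's own code. It assumes the XML package is installed and reuses the same XPath filter; the variable name `hits` is made up for illustration.

```r
library(XML)

# Same sample input as in the question (the last paragraph is left unclosed there).
doc_txt <- "<p>hello world</p>
<p>the weather is fine today</p>
<p>it is fine in a lot of places in the world<p>"

# Parse the string as HTML; asText = TRUE means doc_txt is content, not a file name.
doc <- htmlParse(doc_txt, asText = TRUE)

# Select every <p> that has a text node containing 'world', then take its text.
hits <- xpathSApply(doc, "//p[text()[contains(., 'world')]]", xmlValue)
print(hits)
```

Letting XPath do the filtering keeps the keyword test inside the query, so no separate grep() step or loop is needed on the R side.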

Answer 1 (score: 0)

Here are two alternatives:

1) XML. Using the XML package:

Lines <- "<p>hello world</p>

<p>the weather is fine today</p>

<p>it is fine in a lot of places in the world<p>"

library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
grep("hello", xpathSApply(xmlRoot(doc), "//p", xmlValue), value = TRUE)

giving:

[1] "hello world"

2) Regular expression. If <p> and </p> always appear on the same line, as in the example, then this would also work:

L <- readLines(textConnection(Lines))
gsub(".*<p>|</p>.*", "", grep("<p>.*hello", L, value = TRUE))

giving:

[1] "hello world"
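The regular-expression idea can be wrapped in a small function that takes the keyword as an argument and also copes with paragraphs that span several lines. This is a sketch in base R only. `extract_paragraphs` is a hypothetical helper, not something from either answer, and it assumes every paragraph is written as a bare `<p>` with a matching `</p>` (no attributes, no nesting), so the last line of the sample is closed properly here.

```r
# Hypothetical helper: return the text of every <p>...</p> element
# whose content contains `keyword` (matched literally, not as a regex).
extract_paragraphs <- function(html, keyword) {
  # Collapse the input into one string so paragraphs may span several lines.
  txt <- paste(html, collapse = "\n")
  # Non-greedy match of each complete <p>...</p> pair; (?s) lets '.' match newlines.
  m <- regmatches(txt, gregexpr("(?s)<p>.*?</p>", txt, perl = TRUE))[[1]]
  # Strip the surrounding tags and any leading or trailing whitespace.
  paras <- trimws(gsub("^<p>|</p>$", "", m))
  # Keep only the paragraphs that contain the keyword.
  paras[grepl(keyword, paras, fixed = TRUE)]
}

# Assumption: sample input with the final paragraph correctly closed.
Lines <- "<p>hello world</p>

<p>the weather is fine today</p>

<p>it is fine in a lot of places in the world</p>"

result <- extract_paragraphs(Lines, "world")
print(result)
```

For real-world HTML, where `<p>` tags carry attributes or are left unclosed, a parser-based approach such as option 1 is likely the safer choice.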