Question

I have a series of expressions such as:

"<i>the text I need to extract</i></b></a></div>"

I need to extract the text between the "<i>" and "</i>" symbols, so that the result is:

"the text I need to extract"

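Currently I am using gsub in R to manually remove all the symbols that are not text. However, I would like to use a regular expression to do the job. Does anyone know a regular expression that extracts the text between <i> and </i>?

Thanks.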

Answer 1

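If there is only one <i>...</i> pair, as in the example, then match everything up to and including <i>, and everything from </i> onward, and replace both with the empty string: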

x <- "<i>the text I need to extract</i></b></a></div>"
gsub(".*<i>|</i>.*", "", x)

giving:

[1] "the text I need to extract"
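As a small illustration of why this only suits the single-pair case: in the sketch below, the string `y` is a made-up input with two <i>...</i> pairs (it is not from the question). Because `.*<i>` is greedy, it consumes everything up to the last <i>, so only the final piece of text survives.

```r
# Hypothetical input with two <i>...</i> pairs (an assumption for illustration)
y <- "abc <i>another text</i> def <i>and another text</i> ghi"

# ".*<i>" is greedy: it matches everything up to the LAST "<i>",
# then "</i>.*" removes the closing tag and whatever follows it
last_only <- gsub(".*<i>|</i>.*", "", y)
print(last_only)
```

Here the first piece, "another text", is silently dropped.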

If there could be multiple occurrences in the same string, try:

library(gsubfn)
strapplyc(x, "<i>(.*?)</i>", simplify = c)

which gives the same result for this example.

Answer 2

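This approach uses a package I maintain, qdapRegex. It is not a regex itself, but it may be useful to you or to future searchers. The function rm_between lets the user extract the text between a left and a right boundary, optionally including the boundaries. The approach is easy because you don't have to think of a specific regex, only the exact left and right boundaries: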

library(qdapRegex)

x <- "<i>the text I need to extract</i></b></a></div>"

rm_between(x, "<i>", "</i>", extract=TRUE)

## [[1]]
## [1] "the text I need to extract"

I would point out that it may be more reliable to use an html parser for this job.

Answer 3

If this is html (which it looks like it is), you should probably use an html parser. The XML package can do this:

library(XML)
x <- "<i>the text I need to extract</i></b></a></div>"
xmlValue(getNodeSet(htmlParse(x), "//i")[[1]])
# [1] "the text I need to extract"

On an entire html document, you can use:

doc <- htmlParse(x)
sapply(getNodeSet(doc, "//i"), xmlValue)

Answer 4

If you don't know the number of matches in a string, you can use the following approach with gregexpr and regmatches.

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")

regmatches(vec, gregexpr("(?<=<i>).*?(?=</i>)", vec, perl = TRUE))
# [[1]]
# [1] "the text I need to extract"
# 
# [[2]]
# [1] "another text"     "and another text"
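The same lookaround idea can be wrapped in a small function. This is only a sketch: `extract_between` is a hypothetical helper name, not part of base R or any package mentioned here. It assumes the boundaries are literal strings that do not themselves contain `\E`, since it relies on PCRE's `\Q...\E` quoting to treat them literally.

```r
# Hypothetical helper: extract every substring found between two literal boundaries
extract_between <- function(x, left, right) {
  # \\Q...\\E tells PCRE to treat the boundary text literally;
  # the lookbehind and lookahead keep the boundaries out of the match
  pattern <- paste0("(?<=\\Q", left, "\\E).*?(?=\\Q", right, "\\E)")
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

vec <- c("<i>the text I need to extract</i></b></a></div>",
         "abc <i>another text</i> def <i>and another text</i> ghi")

res <- extract_between(vec, "<i>", "</i>")
print(res)
```

The result is a list with one character vector per input string, the same shape that regmatches returns above.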

Answer 5

<i>((?:(?!<\/i>).)*)<\/i>

This should do it for you.
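The pattern above is given without R code, so here is one possible way to apply it, as a sketch rather than the answerer's own usage. It assumes base R's regexec with perl = TRUE (needed for the negative lookahead). The `\/` escapes are written as plain `/` because a forward slash needs no escaping in an R regex; the variable names are illustrative.

```r
x <- "<i>the text I need to extract</i></b></a></div>"

# Same pattern as above, with "/" left unescaped for R
pattern <- "<i>((?:(?!</i>).)*)</i>"

# regexec returns the full match followed by each capture group
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Element 1 is the whole match; element 2 is the first capture group
extracted <- m[[1]][2]
print(extracted)
```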

Extract text between certain symbols using regular expressions in R

5 answers: