Regular expression to extract inner text

Time: 2015-01-29 07:13:14

Tags: regex r

I have received thousands of messages with structures like the following:

Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text

text I want to extract

-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------

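I want to extract the "text I want to extract" part from each message and discard everything else. Right now I can do this in a few lines of R code, starting with something like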

str_locate(messages[i],"-{5,}")

but that adds up to a lot of code. Is there a way to extract the text in a single line?

3 answers:

Answer 0: (score: 3)

You can use strsplit().

Try something like this:
x <- c("Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text", "text I want to extract",
"-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------")

sapply(
    strsplit(x, "\n?-+\n?"),
    function(x) if(length(x) == 1) x else x[nzchar(x)][2]
)
# [1] "text I want to extract" "text I want to extract"
# [3] "text I want to extract"

Technically, that is a one-liner :-)
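
One caveat with the pattern above: -+ also matches a single hyphen, so a body such as "well-known text" would be split too. A minimal sketch of a stricter variant follows, borrowing the -{5,} quantifier from the question. The helper name extract_body and the sample vector msgs are made up for illustration and are not part of the answer.

```r
# Hypothetical helper (not from the answer): same strsplit() idea,
# but a separator must be a run of at least five hyphens.
extract_body <- function(msgs, sep = "\n?-{5,}\n?") {
  parts <- strsplit(msgs, sep)
  vapply(parts, function(p) {
    p <- p[nzchar(p)]                 # drop the empty piece left by a leading separator
    if (length(p) == 1) p else p[2]   # bare message, or header / body / footer
  }, character(1))
}

# Illustrative messages whose body contains a single hyphen.
msgs <- c(
  "Some header text\n----------\nwell-known text\n----------\nSome footer text",
  "well-known text",
  "----------\nSome header text\n----------\nwell-known text\n----------\nSome footer text\n----------"
)

print(extract_body(msgs))
```

Like the original, this assumes the body is always the second non-empty piece whenever separators are present.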

Answer 1: (score: 3)

You can do this with a single gsub command.

gsub("^(?:[^\n]*\n){1,2}(?:-+\n)?|(?:\n[^\n]*){2,3}$", "", vec)
# [1] "text I want to extract" "text I want to extract" "text I want to extract"

where vec is this vector:

vec <- c("Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text", 
"text I want to extract",
"-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------")
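
To make the pattern easier to read, it can be assembled from its two alternatives: one that strips the header lines at the start of the string and one that strips the footer lines at the end. This is only a sketch. The names header, footer and msgs are illustrative, and perl = TRUE is an addition here (it ensures the (?:...) groups are handled by PCRE); the call in the answer does not set it.

```r
# Illustrative data with the same three shapes as vec, using shorter separators.
msgs <- c(
  "Some header text\n-----\ntext I want to extract\n-----\nSome footer text",
  "text I want to extract",
  "-----\nSome header text\n-----\ntext I want to extract\n-----\nSome footer text\n-----"
)

# One or two leading lines, plus an optional separator line right after them.
header <- "^(?:[^\n]*\n){1,2}(?:-+\n)?"
# The last two or three lines, each taken together with the newline before it.
footer <- "(?:\n[^\n]*){2,3}$"

# perl = TRUE is an assumption of this sketch, not part of the original call.
result <- gsub(paste(header, footer, sep = "|"), "", msgs, perl = TRUE)
print(result)
```

A message without any newline matches neither alternative, so it passes through unchanged.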

Answer 2: (score: 1)

Based on @Richard Scriven's data,

sub('\n.*', '', sub('^-*\n[A-Za-z ]+\n-+\n|^[A-Za-z ]+\n-*\n', '', x))
# [1] "text I want to extract" "text I want to extract" "text I want to extract"
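
The nested call can also be read as two steps: first drop the header, then drop everything after the body line. The sketch below spells that out. The intermediate names are illustrative, and it adds perl = TRUE with the (?s) flag so that the dot is guaranteed to match newlines, something the one-liner above leaves to R's default regex engine. Note that [A-Za-z ]+ assumes the header contains only letters and spaces.

```r
# Illustrative data with the same three shapes as x, using shorter separators.
x <- c(
  "Some header text\n-----\ntext I want to extract\n-----\nSome footer text",
  "text I want to extract",
  "-----\nSome header text\n-----\ntext I want to extract\n-----\nSome footer text\n-----"
)

# Step 1: remove the header, with or without a separator line before it.
no_header <- sub("^-*\n[A-Za-z ]+\n-+\n|^[A-Za-z ]+\n-*\n", "", x, perl = TRUE)

# Step 2: remove everything from the first remaining newline onward.
# (?s) lets "." match newlines under PCRE; this flag is an addition of the sketch.
extracted <- sub("(?s)\n.*", "", no_header, perl = TRUE)
print(extracted)
```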