I have received thousands of messages with one of the following structures:
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
or
text I want to extract
or
-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------
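I want to extract the "text I want to extract" part from each message and discard everything else.
Right now I can do this with several lines of R code, for example starting from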
str_locate(messages[i],"-{5,}")
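but that adds up to a lot of code. Is there a way to extract the text in a single line?
Answer 0: (score: 3)
You can use strsplit() to split each message on the dashed separator lines and keep the second non-empty piece: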
x <- c("Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text", "text I want to extract",
"-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------")
sapply(
  strsplit(x, "\n?-+\n?"),
  function(x) if(length(x) == 1) x else x[nzchar(x)][2]
)
# [1] "text I want to extract" "text I want to extract"
# [3] "text I want to extract"
Technically, this is a one-liner :-)
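One caveat with the pattern above: "-+" also matches a single hyphen, so a body such as "well-known text" would be split in the middle. A minimal sketch of a stricter variant follows, assuming separator lines always contain at least five hyphens (the "-{5,}" threshold comes from the question's own str_locate() call). The helper name extract_body and the sample vector msgs are hypothetical and only serve the illustration.

```r
# Hypothetical helper; assumes every separator line has at least five hyphens.
extract_body <- function(msgs, sep = "\n?-{5,}\n?") {
  vapply(strsplit(msgs, sep), function(parts) {
    # Drop the empty piece produced when a message starts with a separator.
    parts <- parts[nzchar(parts)]
    # A message without separators is the body itself; otherwise the body
    # is the second non-empty piece (header, body, footer).
    if (length(parts) == 1) parts else parts[2]
  }, character(1))
}

# Made-up sample messages whose body contains a hyphen.
msgs <- c(
  "Header\n----------\nwell-known text\n----------\nFooter",
  "well-known text",
  "----------\nHeader\n----------\nwell-known text\n----------\nFooter\n----------"
)

extract_body(msgs)
# [1] "well-known text" "well-known text" "well-known text"
```

vapply() is used instead of sapply() here so that the result is always a character vector, even for unexpected input.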
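Answer 1: (score: 3)
You can do this with a single gsub() call that deletes the header lines at the start and the footer lines at the end of each message: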
gsub("^(?:[^\n]*\n){1,2}(?:-+\n)?|(?:\n[^\n]*){2,3}$", "", vec)
# [1] "text I want to extract" "text I want to extract" "text I want to extract"
where vec is this vector:
vec <- c("Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text",
"text I want to extract",
"-------------------------------------------
Some header text
-------------------------------------------
text I want to extract
-------------------------------------------
Some footer text
-------------------------------------------")
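As an alternative to deleting the surrounding lines, the body can also be captured directly with one sub() call. The following is only a sketch, assuming the body is always a single line and that any header is a single line followed by a dashed line; it uses perl = TRUE so that the (?s) flag lets "." match newlines. The names msgs and body_pattern are hypothetical.

```r
# Made-up sample messages covering the three layouts from the question.
msgs <- c(
  "Some header text\n----------\ntext I want to extract\n----------\nSome footer text",
  "text I want to extract",
  "----------\nSome header text\n----------\ntext I want to extract\n----------\nSome footer text\n----------"
)

# (?s)                 let "." match newlines (PCRE flag)
# ^(?:-+\n)?           optional leading dashed line
# (?:[^\n]*\n-+\n)?    optional header line followed by a dashed line
# ([^\n]*)             the body: assumed to be exactly one line
# .*$                  everything after the body (separator, footer, ...)
body_pattern <- "(?s)^(?:-+\n)?(?:[^\n]*\n-+\n)?([^\n]*).*$"

# The pattern matches the whole string, so replacing it with the first
# capture group leaves only the body.
sub(body_pattern, "\\1", msgs, perl = TRUE)
# [1] "text I want to extract" "text I want to extract" "text I want to extract"
```

Because the whole message is matched and replaced by the captured group, a message with a multi-line body would be truncated to its first line under these assumptions.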
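Answer 2: (score: 1)
Using the data from @Richard Scriven's answer (the vector x), first strip the header with the inner sub(), then drop everything after the first remaining newline with the outer sub():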
sub('\n.*', '', sub('^-*\n[A-Za-z ]+\n-+\n|^[A-Za-z ]+\n-*\n', '', x))
# [1] "text I want to extract" "text I want to extract" "text I want to extract"