Return the text between a start and an end regular expression

Time: 2018-10-01 18:05:30

Tags: r regex grep stringr

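I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly formatted as [...]. The full text of each article begins with a well-defined phrase, ^Full text:, but the end of the full text is not marked. The best I can come up with is that the full text is followed by various metadata tags, such as: Subject: , CREDIT:, Credit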

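So I can certainly find the start of the article. What I am struggling with is a way to select the text between the start and the end of the full text.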

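Two things complicate this. First, the ending string obviously varies, although I feel I could use something like `^[:alnum:]{5,}:` to capture the end. Second, similar tags also appear before the full text begins. How can I get R to return only the text between the full-text regex and the ending regex?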

test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')

test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')

My current attempt is here:

test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]

Thanks.

1 Answer:

Answer 0: (score: 1)

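This simply finds the element matching 'Full text:', then the next element after it that contains ':', and returns everything from the start element up to the element just before that one. Note that it treats any later line containing a colon as the end, so it assumes the article text itself contains no ':'.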

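get_text <- function(x){
  # Index of the element where the article body starts
  start <- grep('Full text:', x)
  # Indices of all elements containing a colon (candidate metadata tags)
  end <- grep(':', x)
  # Keep the first such element after the start, then step back one element
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}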

get_text(test)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
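
If the article text may itself contain colons, a stricter end pattern along the lines suggested in the question might be safer. The following is only a sketch under some assumptions: `get_full_text`, its `end_pattern` argument, and the `doc` vector are hypothetical names introduced here for illustration, and the end tag is assumed to be a run of at least five alphanumeric characters at the start of a line followed by a colon. In R, a POSIX class such as `[:alnum:]` has to be written inside a bracket expression, i.e. `[[:alnum:]]`. The sketch also strips the 'Full text:' label and falls back to the end of the vector when no end tag is found.

```r
# Hypothetical helper (not part of the original answer): return the lines
# between 'Full text:' and the first later line that looks like a metadata tag.
get_full_text <- function(x, end_pattern = "^[[:alnum:]]{5,}:") {
  # First element that starts the article body
  start <- grep("^Full text:", x)[1]
  if (is.na(start)) return(character(0))

  # Only consider end tags that appear after the start element
  after <- seq_along(x) > start
  end <- which(after & grepl(end_pattern, x))[1]

  # Assumption: if no end tag follows, the text runs to the last element
  end <- if (is.na(end)) length(x) else end - 1

  # Drop the leading 'Full text:' label from the first line
  sub("^Full text:[[:space:]]*", "", x[start:end])
}

# Example input mirroring the structure of `test` in the question
doc <- c("Document 1", "Article title", "Author: Author Name",
         "https://a/url", "Abstract: none",
         "Full text: some article text that I need to capture",
         "the second line of the article that I need to capture",
         "Subject: A subject", "Publication: Publication",
         "Location: A country")

get_full_text(doc)
# [1] "some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
```

Because the end search is restricted to elements after the start, tags such as 'Author:' or 'Abstract:' that appear before the full text do not interfere.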