Question

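I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted. The full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not marked. The best I can figure is that the full text ends with one of a variety of metadata tags that look like this: Subject: , CREDIT:, Credit.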

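So I can certainly find the start of the article. But I am having a hard time finding a way to select the text between the start and the end of the full text.

There are two complications. First, the ending string obviously varies, although I feel I could settle on something like `^[:alnum:]{5,}:` and that would capture the ending. The other complication is that similar tags also appear before the start of the full text. How can I get R to return only the text between the full-text regex and the ending regex?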

test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')

test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')

My current attempt is here:

test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]

Thanks.

Answer 1

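This simply finds the element matching 'Full text:' and then the next element after it that contains a ':'. Everything from the first up to, but not including, the second is returned.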

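get_text <- function(x){
  # Index of the element where the full text starts
  start <- grep('Full text:', x)
  # Indices of all elements containing a colon (candidate metadata tags)
  end <- grep(':', x)
  # First colon-bearing element after the start; the text ends one element earlier
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}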

get_text(test)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
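One caveat with matching a bare ':' is that any colon inside the article text itself would end the extraction early. A slightly stricter variant might anchor both patterns, along the lines of the regex proposed in the question. Note that in R a POSIX class such as `[:alnum:]` only works inside a bracket expression, so it has to be written `[[:alnum:]]`. The sketch below is only an illustration under a few assumptions: the function name `get_full_text`, the `end_pattern` argument and the sample vector `doc` are hypothetical, and it assumes every metadata tag is a single word of at least five alphanumeric characters followed by a colon and a space.

```r
# Hypothetical helper (not part of the original answer): return the lines from
# "Full text:" up to, but not including, the next metadata tag.
get_full_text <- function(x, end_pattern = "^[[:alnum:]]{5,}: ") {
  # First element that starts with "Full text:" (NA if there is none)
  start <- grep("^Full text:", x)[1]
  if (is.na(start)) return(character(0))
  # Elements that look like "Tag: value" (assumed tag format, see above)
  tags <- grep(end_pattern, x)
  # Keep only the first tag that comes after the start element
  end <- tags[tags > start][1]
  # If no tag follows, assume the text runs to the last element
  if (is.na(end)) end <- length(x) + 1
  out <- x[start:(end - 1)]
  # Drop the "Full text: " prefix from the first line
  out[1] <- sub("^Full text:[[:space:]]*", "", out[1])
  out
}

# Sample input with the same shape as the vectors in the question
doc <- c("Document 1", "Article title", "Author: Author Name", "https://a/url",
         "Abstract: none", "Full text: some article text that I need to capture",
         "the second line of the article that I need to capture",
         "Subject: A subject", "Publication: Publication", "Location: A country")

get_full_text(doc)
# [1] "some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
```

Because only tags located after the start element are considered, the similar-looking tags before the full text (Author:, Abstract:) are ignored. A line of article text that itself begins with a long word followed by ': ' would still be mistaken for a tag, so the pattern may need adjusting to the real data.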
