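I'm working on a regular expression to extract some text from files downloaded from a newspaper database. The full text of each article starts with a well-defined phrase, `^Full text:`. The end of the full text, however, is not marked. As far as I can tell, the full text is followed by various metadata tags that look like `Subject:`, `CREDIT:`, or `Credit:`.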
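So I can certainly find where the full text begins. What I'm struggling with is selecting the text between the start and the end of the full text.
Two things complicate this. First, the ending string obviously varies, although I feel I could settle on something like `^[:alnum:]{5,}:` and that would catch the ending. The second complication is that similar tags also appear before the full text starts. How can I get R to return only the text between the full-text regex and the ending regex?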
test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')
test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')
Here is my current attempt:
test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]
Thanks.
Answer 0 (score: 1)
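This simply searches for the element matching 'Full text:', then for the next element after it that matches ':', and returns everything from the first up to the line before the second.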
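get_text <- function(x){
  # index of the element where the full text starts
  start <- grep('Full text:', x)
  # indices of every element that contains a colon
  end <- grep(':', x)
  # the first colon element after the start, minus one, is the last text line
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}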
get_text(test)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
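
One caveat with matching a bare ':' is that the article text itself may contain a colon, which would cut the result short. A possible variant, sketched below, only treats a line as an end tag when it begins with a run of at least five alphanumeric characters followed by a colon, which is close to the pattern suggested in the question. Note that in R the POSIX class has to be written inside a bracket expression, `[[:alnum:]]`, not `[:alnum:]`. The function name `get_full_text`, its `end_pattern` argument, and the sample vector `doc` are made up for illustration and are not part of the original answer.

```r
# Hypothetical variant of get_text(): stop at the first line after
# 'Full text:' that looks like a metadata tag (5+ alphanumerics, then ':').
get_full_text <- function(x, end_pattern = "^[[:alnum:]]{5,}:") {
  # first element that starts the full text; NA if there is none
  start <- grep("^Full text:", x)[1]
  if (is.na(start)) return(character(0))
  # only consider tag-like lines that come after the start line
  after_start <- seq_along(x) > start
  tags <- which(after_start & grepl(end_pattern, x))
  # if no tag follows, assume the text runs to the end of the input
  end <- if (length(tags) > 0) tags[1] - 1 else length(x)
  x[start:end]
}

# Illustrative input (assumed, not from the question): the second article
# line contains a colon but is not a metadata tag.
doc <- c('Document 3', 'Author: Author Name', 'https://a/url',
         'Full text: first line of the article',
         'He said: this line has a colon',
         'CREDIT: someone', 'Publication: Publication')

get_full_text(doc)
# [1] "Full text: first line of the article"
# [2] "He said: this line has a colon"
```

This is still a heuristic: an article line that happens to start with a single long word followed by a colon (for example 'Update: ...') would be mistaken for a tag, so the pattern may need tuning against the real files.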