How to use gsub to remove everything except a specified phrase

Asked: 2017-02-04 10:52:31

Tags: r regex

One column of my data frame is full of text (of varying length), for example:

'Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n'

I want to extract only the matches of Nature of specimen.*?\n, so that I am left with:

Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n

I think I need to gsub everything that is not Nature of specimen.*?\n, but I don't know how to negate a whole regular expression. So far I have tried

`df$Text<-gsub("[^(Nature of specimen.*?\n)]","",df$Text)`

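but this only removes individual characters (those not listed inside the brackets) rather than producing the expected output.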
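For context, square brackets in a regular expression form a character class: `[^...]` matches any single character that is not listed, and the parentheses inside it are treated as literal characters rather than as a group. The minimal sketch below uses a made-up pattern and input (not the data from the question) to show the effect:

```r
# Inside [...], every symbol is a single literal character, so "[^(abc)]"
# means "any one character other than (, a, b, c or )".
# The input string is a made-up example for illustration only.
res <- gsub("[^(abc)]", "", "abc xyz (cab)")

# Only the listed characters survive; spaces and x, y, z are deleted.
print(res)
```

So a negated bracket expression cannot express "everything except this phrase"; extracting the wanted matches is usually the simpler route.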

3 answers:

Answer 0: (score: 2)

Not a regex solution (I'm bad at regex), but one using strsplit.

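Basically, I split the string on "\n", keep the pieces that start with "Nature of specimen", and paste them back together. Filtering by content rather than by position is needed because the unwanted lines do not strictly alternate with the wanted ones in the sample text.

parts <- unlist(strsplit(x, "\n"))
paste0(parts[grepl("^Nature of specimen", parts)], collapse = "\n")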
[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1"


# Alternatively, extract the matching lines directly with stringr
library(stringr)
paste0(unlist(str_extract_all(x, pattern = "Nature of specimen=.*\n")), collapse = "")

Answer 1: (score: 1)

We can also use the stri_extract functions from stringi, which is more efficient:
library(stringi)
paste(stri_extract_all_regex(str1, "Nature of specimen=.*\n")[[1]], collapse="")
#[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n"

Answer 2: (score: 0)

This should also work:

library(stringr)
str_match_all(text, ".*(Nature\\s+of\\s+specimen[^\\n]+)\\n")[[1]][,2]
# [1] "Nature of specimen= D2x4, stomach biopsies" "Nature of specimen= Colonx2, polypx1"       "Nature of specimen= TIx2, polypx1"
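
The answers above work on a single string, while the question asks about a data frame column. As a rough sketch of how the same idea could be applied to a whole column using only base R, the example below uses `gregexpr` and `regmatches`. The data frame `df` and its contents are hypothetical stand-ins for the asker's data, and the pattern uses `[^\n]*` in place of `.*?` to stop at the end of each line:

```r
# Hypothetical data frame standing in for the asker's df
df <- data.frame(
  Text = c(
    "Nature of specimen= D2x4, stomach biopsies\nSomeRandomText\nNature of specimen= Colonx2, polypx1\n",
    "No specimen line here\n"
  ),
  stringsAsFactors = FALSE
)

# For each element, collect every "Nature of specimen ... newline" run
matches <- regmatches(df$Text, gregexpr("Nature of specimen[^\n]*\n", df$Text))

# Glue the matches of each row back into a single string;
# rows without any match become the empty string
df$Text <- vapply(matches, paste, character(1), collapse = "")

print(df$Text)
```

This keeps one string per row, so the column length is unchanged even when a row has no match.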