I have a column in my data frame full of text (of varying length), e.g.
'Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n'
I'd like to extract only the parts matching `Nature of specimen.*?\n`, so that I am left with:
Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n
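I think I need to gsub everything that is not `Nature of specimen.*?\n`, but I don't know how to negate a whole regular expression. So far I have tried
`df$Text <- gsub("[^(Nature of specimen.*?\n)]", "", df$Text)`
but that just removes characters one at a time based on the characters in the regex, rather than giving the intended output.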
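Answer 0 (score: 2)
Not a regex solution (which is a shortcoming), but one using strsplit.
Basically I split the string on "\n", then select every alternate value and paste the pieces back together. Note that this relies on the wanted lines sitting at every other position after the split; for input where they do not (such as the sample in the question, where `abla` lands at an odd position), filtering the split lines by content is safer.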
paste0(unlist(strsplit(x, "\n"))[c(TRUE,FALSE)], collapse = "\n")
[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1"
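As a variation on the strsplit idea above, the split lines can be filtered by content instead of by position. This is only a minimal sketch in base R, assuming every wanted line starts with the literal text "Nature of specimen"; the variable names `parts`, `kept` and `result` are made up for illustration.

```r
# Sample text from the question
x <- "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n"

# Split into individual lines (fixed = TRUE treats "\n" as a literal, not a regex)
parts <- unlist(strsplit(x, "\n", fixed = TRUE))

# Keep only the lines that begin with the marker text
kept <- parts[startsWith(parts, "Nature of specimen")]

# Re-join, putting the newline back at the end of each kept line
result <- paste0(kept, "\n", collapse = "")
```

Because the selection depends on what each line contains, it does not matter how many unrelated lines sit between two matches.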
Alternatively, a regex solution using stringr:
library(stringr)
paste0(unlist(str_extract_all(x, pattern = "Nature of specimen=.*\n")), collapse = "")
Answer 1 (score: 1)
We can also use stri_extract_all_regex from stringi
library(stringi)
paste(stri_extract_all_regex(str1, "Nature of specimen=.*\n")[[1]], collapse="")
#[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n"
Answer 2 (score: 0)
This should also work:
library(stringr)
str_match_all(text, ".*(Nature\\s+of\\s+specimen[^\\n]+)\\n")[[1]][,2]
# [1] "Nature of specimen= D2x4, stomach biopsies" "Nature of specimen= Colonx2, polypx1" "Nature of specimen= TIx2, polypx1"
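The answers above all work on a single string, while the question asks about a whole data frame column. The same extraction can probably be done without extra packages using `gregexpr()` and `regmatches()`. The sketch below assumes a character column named `Text`, as in the question; the example rows are invented for illustration. It also sidesteps the original attempt's problem: inside `[^...]` the pattern is read as a set of single characters, so it cannot negate a whole expression, whereas extracting the matches directly needs no negation at all.

```r
# Hypothetical data frame with one text column, as described in the question
df <- data.frame(
  Text = c(
    "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nNature of specimen= Colonx2, polypx1\n",
    "No matching line here\n"
  ),
  stringsAsFactors = FALSE
)

# Match the marker, then everything up to and including the next newline
pattern <- "Nature of specimen[^\n]*\n"

# gregexpr() locates every match in each row; regmatches() extracts them as a list
matches <- regmatches(df$Text, gregexpr(pattern, df$Text))

# Collapse each row's matches into one string (rows with no match become "")
df$Text <- vapply(matches, paste, character(1), collapse = "")
```

Rows without any matching line end up as empty strings rather than `NA`; whether that is the desired behaviour depends on how the column is used afterwards.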