Question

I have a column in my data frame that is full of text (of varying lengths), for example:

'Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n'

I want to extract only Nature of specimen.*?\n, so that I am left with:

Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n

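I think I need to gsub everything that is not Nature of specimen.*?\n, but I don't know how to negate a whole regular expression. So far I have tried

`df$Text<-gsub("[^(Nature of specimen.*?\n)]","",df$Text)`

but this just removes every character in the regex rather than producing the expected output.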

Answer 1

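Not a regex solution (I'm terrible at those), but a workaround using strsplit.

Basically, I split the string on "\n", then select every alternate value and paste the pieces back together. This relies on the wanted lines sitting at every other position.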

paste0(unlist(strsplit(x, "\n"))[c(TRUE,FALSE)], collapse = "\n")
[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1"
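Selecting by position with c(TRUE,FALSE) only works when the wanted lines strictly alternate with the unwanted ones. If x is the sample string from the question, the third line ("abla") would be picked up as well. A minimal sketch of the same split-and-paste idea that filters on content instead is shown below; it is a variation on the answer, not the answerer's own code, and the names lines, kept and result are made up for illustration.

```r
# Sample input modelled on the question (x is the name used in the answer above)
x <- "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n"

# Split into lines; strsplit drops the empty piece after the final "\n"
lines <- unlist(strsplit(x, "\n", fixed = TRUE))

# Keep only the lines that begin with the marker text
kept <- lines[startsWith(lines, "Nature of specimen")]

# Glue the surviving lines back together
result <- paste0(kept, collapse = "\n")
```

Like the original snippet, this leaves no trailing "\n" on the result.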


A regex alternative using str_extract_all from stringr:

library(stringr)
paste0(unlist(str_extract_all(x, pattern = "Nature of specimen=.*\n")), collapse = "")

Answer 2

We can also use stri_extract_all_regex from stringi, which is more efficient:

library(stringi)
paste(stri_extract_all_regex(str1, "Nature of specimen=.*\n")[[1]], collapse="")
#[1] "Nature of specimen= D2x4, stomach biopsies\nNature of specimen= Colonx2, polypx1\nNature of specimen= TIx2, polypx1\n"
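The [[1]] above handles a single string, while the question is about a whole data frame column. A minimal base R sketch of the same extract-then-paste idea applied to every row follows; it swaps in regmatches and gregexpr so that no extra package is needed, and the two-row df and the name hits are made up for illustration.

```r
# Hypothetical two-row data frame; the second row has no matching line
df <- data.frame(
  Text = c("Nature of specimen= A\nnoise\nNature of specimen= B\n",
           "no marker here\n"),
  stringsAsFactors = FALSE
)

# One character vector of matches per row (empty when nothing matches)
hits <- regmatches(df$Text, gregexpr("Nature of specimen=[^\n]*\n", df$Text))

# Collapse each row's matches back into a single string
df$Text <- vapply(hits, paste, character(1), collapse = "")
```

Rows without any match end up as empty strings rather than NA.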

Answer 3

This should also work:

library(stringr)
str_match_all(text, ".*(Nature\\s+of\\s+specimen[^\\n]+)\\n")[[1]][,2]
# [1] "Nature of specimen= D2x4, stomach biopsies" "Nature of specimen= Colonx2, polypx1"       "Nature of specimen= TIx2, polypx1"
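For completeness, the question asked how to do this with gsub itself. A character class such as [^...] negates a set of single characters, not a pattern, which is why the original attempt did not behave as hoped. One possible sketch, assuming perl = TRUE is acceptable, deletes every line that does not start with the marker by using a negative lookahead; the variable name cleaned is made up for illustration.

```r
# Sample input modelled on the question
text <- "Nature of specimen= D2x4, stomach biopsies\nbalblablablabl\nabla\nSomeRandomText\nNature of specimen= Colonx2, polypx1\nMore Random Text\nNature of specimen= TIx2, polypx1\n"

# (?m)  lets ^ match at the start of every line
# (?!Nature of specimen)  rejects lines that begin with the marker
# .*\n?  consumes the rest of an unwanted line and its newline
cleaned <- gsub("(?m)^(?!Nature of specimen).*\\n?", "", text, perl = TRUE)
```

Because gsub is vectorised, the same call can be applied directly to a column such as df$Text.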

How to remove phrases not specified in gsub
