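How to extract only the sentences that match a word condition while keeping the data frame

Time: 2019-05-01 20:48:34

Tags: r

The code below is a good representation of the dataset I am working with.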

x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x,y,a))
z$date <- c("2011","2012","2015")
z$amount <- c(20000, 300, 5600)
z$text <- as.character(z$text)

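What I want to do is essentially extract only the sentences that contain the word test1 and put them into a new column (z$sentences) for further processing.

I tried the following: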

z$sentences <- grep("test1", unlist(strsplit(z$text, '(?<=\\.)\\s+', 
                              perl=TRUE)), value=TRUE)

But it returns an error, because the replacement has 4 rows while the data has 3.
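The mismatch can be reproduced on its own. The sketch below rebuilds only the `text` column from the example data and counts the matches; `all_sentences` and `matches` are illustrative names, not part of the original code.

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# Split every row into sentences, then flatten everything into one vector.
# After unlist() there is no record of which row a sentence came from.
all_sentences <- unlist(strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE))

# Keep only the sentences that mention test1
matches <- grep("test1", all_sentences, value = TRUE)

length(matches)  # number of matching sentences across all rows
nrow(z)          # number of rows in the data frame
```

Because the two counts differ, the vector cannot be assigned directly as a column of `z`.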

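I also tried using unlist, but the information in the other columns is lost in the process.

Either of two results would be satisfactory:

An extra column containing only the sentences with "test1", or a long format in which each row holds one sentence together with its data (date, amount).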

Expected output:

With all matching sentences in a column

With a new row for each sentence matching the condition (the last row is not required)

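Any help is welcome.

2 answers:

Answer 0: (score: 2)

The problem is that grep returns only the matching elements, so its result can be shorter than the original vector. Apply it to each row's sentences separately and paste the matches back into one string per row: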

lst1 <- strsplit(z$text, '(?<=\\.)\\s+', perl = TRUE)
z$sentences <- sapply(lst1, function(x) paste(grep("test1", x, 
        value = TRUE), collapse=" "))

Another option, without splitting, is gsub:

trimws(gsub("(([A-Za-z, ]*)test1[A-Za-z, ]+\\.?)(*SKIP)(*F)|.",
             "", z$text, perl = TRUE))
#[1] "test1 is good. Yet test1 is fake"   "test1 is bad. but test1 is better."
#[3] "" 
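
The long format also asked for in the question (one row per matching sentence) is not covered by the two snippets above. A minimal sketch in base R, assuming sentences are separated by a period followed by whitespace; the names `matched`, `n` and `long` are illustrative and not from the original answer:

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)
z$date <- c("2011", "2012", "2015")
z$amount <- c(20000, 300, 5600)

# One character vector of sentences per row
sentences <- strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE)

# Keep only the sentences containing test1, still grouped by row
matched <- lapply(sentences, function(s) grep("test1", s, value = TRUE))

# Repeat each row once per matching sentence; rows with no match are dropped
n <- lengths(matched)
long <- z[rep(seq_len(nrow(z)), n), c("date", "amount")]
long$sentence <- unlist(matched)
rownames(long) <- NULL

long
```

Rows without any match (the third row here) disappear from `long`. If they should be kept, one option would be to replace empty matches with `NA_character_` before computing `n`.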

Answer 1: (score: -1)

You can use str_extract from the stringr package

library(stringr)

z$sentences <- str_extract(z$text,'.*test1.*')

z
                                                               text date amount                                                         sentences
1 test is bad. test1 is good. but test is better. Yet test1 is fake 2011  20000 test is bad. test1 is good. but test is better. Yet test1 is fake
2 test1 is bad. test is good. but test1 is better. Yet test is fake 2012    300 test1 is bad. test is good. but test1 is better. Yet test is fake
3                              this sentence is for trying purposes 2015   5600                                                              <NA>
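
In the output above, `sentences` repeats the whole text, because `.*` is not limited to a single sentence. A hedged sketch that bounds the match with `[^.]*` instead; it is written in base R (`gregexpr` and `regmatches`) rather than stringr, and it assumes a sentence contains no period other than its final one. The name `pieces` is illustrative.

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# [^.]* cannot cross a period, so each match stays within one sentence
m <- gregexpr("[^.]*test1[^.]*\\.?", z$text)
pieces <- regmatches(z$text, m)

# Trim the leading space left over from the sentence separator and
# join the matches of each row into a single string ("" when none match)
z$sentences <- vapply(
  pieces,
  function(p) paste(trimws(p), collapse = " "),
  character(1)
)

z$sentences
```

Abbreviations or decimal numbers inside a sentence would break the no-inner-period assumption, in which case splitting the text into sentences first is the safer route.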