Question

The code below is a good representation of the dataset I am working with.

x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x,y,a))
z$date <- c("2011","2012","2015")
z$amount <- c(20000, 300, 5600)
z$text <- as.character(z$text)

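What I am essentially trying to do is extract only the sentences that contain the word test1 and store them in a new column (z$sentences) for further processing.

I tried the following: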

z$sentences <- grep("test1", unlist(strsplit(z$text, '(?<=\\.)\\s+', 
                              perl=TRUE)), value=TRUE)

But it returns an error, because the replacement has 4 rows while the data has 3.

I also tried using unlist, but the information in the other columns was lost in the process.

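Either of two results would be satisfactory:

An extra column containing only the sentences with "test1", or a long format in which each row still carries the data (date, amount) together with the sentence.

Expected output:

All matching sentences in a column

A new row for each sentence that matches the condition, although the latter is not required.

Any help is welcome.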

Answer 1

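The problem is that grep returns only the matching elements, so its result can be shorter than the original vector. Splitting each row separately and pasting the matches back together keeps one value per row: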
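To see the length mismatch concretely, here is a minimal sketch that rebuilds the question's text column and compares lengths. The names `sentences` and `hits` are illustrative and do not appear in the original code.

```r
# Rebuild the text column from the question
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# Split every text into sentences and flatten them into one vector,
# which discards the link between a sentence and its source row
sentences <- unlist(strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE))

# Keep only the sentences that mention "test1"
hits <- grep("test1", sentences, value = TRUE)

length(hits)  # 4 matching sentences
nrow(z)       # 3 rows, so hits cannot be assigned to z$sentences
```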

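# Split each text into sentences, keeping one list element per row of z
lst1 <- strsplit(z$text, '(?<=\\.)\\s+', perl = TRUE)
# Within each row, keep the sentences containing "test1" and paste them together
z$sentences <- sapply(lst1, function(x) paste(grep("test1", x, 
        value = TRUE), collapse=" "))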

Another option, without splitting, is gsub:

trimws(gsub("(([A-Za-z, ]*)test1[A-Za-z, ]+\\.?)(*SKIP)(*F)|.",
             "", z$text, perl = TRUE))
#[1] "test1 is good. Yet test1 is fake"   "test1 is bad. but test1 is better."
#[3] ""
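The question also mentions a long format with one row per matching sentence. A possible sketch of that variant in base R, building on the same per-row split, is shown here. The names `hits`, `n`, and `long` are hypothetical, and this is only one way to do it, not necessarily what the answerer had in mind.

```r
# Rebuild the data frame from the question
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)
z$date <- c("2011", "2012", "2015")
z$amount <- c(20000, 300, 5600)

# One list element per row, each holding that row's sentences
lst1 <- strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE)

# Matching sentences per row (character(0) when nothing matches)
hits <- lapply(lst1, function(s) grep("test1", s, value = TRUE))

# Number of matches per row, used to repeat date and amount accordingly
n <- lengths(hits)

# Repeat each row once per matching sentence; rows without a match drop out
long <- cbind(z[rep(seq_len(nrow(z)), n), c("date", "amount")],
              sentence = unlist(hits))
rownames(long) <- NULL
long
```

Rows with no matching sentence (the third row here) are dropped in this sketch; keeping them with an NA sentence would need an extra step.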

Answer 2

You can use str_extract from the stringr package.

library(stringr)

z$sentences <- str_extract(z$text,'.*test1.*')

z
                                                               text date amount                                                         sentences
1 test is bad. test1 is good. but test is better. Yet test1 is fake 2011  20000 test is bad. test1 is good. but test is better. Yet test1 is fake
2 test1 is bad. test is good. but test1 is better. Yet test is fake 2012    300 test1 is bad. test is good. but test1 is better. Yet test is fake
3                              this sentence is for trying purposes 2015   5600                                                              <NA>
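As the output above shows, the pattern '.*test1.*' matches the entire string, so the new column repeats the full text rather than only the relevant sentences. A possible sentence-level variant with stringr is sketched below. The pattern and the helper name `matches` are assumptions for illustration, and the sketch treats a period as the only sentence delimiter.

```r
library(stringr)

# Rebuild the text column from the question
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# Extract every run of non-period characters that contains "test1",
# plus the closing period if there is one; returns one vector per row
matches <- str_extract_all(z$text, "[^.]*test1[^.]*\\.?")

# Trim the leading space left over from the previous sentence and
# collapse each row's matches into a single string
z$sentences <- vapply(matches,
                      function(m) paste(str_trim(m), collapse = " "),
                      character(1))
z$sentences
```

Rows without a match end up as an empty string here, unlike str_extract, which returns NA.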

How to extract only the sentences that match a word condition while keeping the data frame
