The code below is a good representation of the dataset I am working with.
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x,y,a))
z$date <- c("2011","2012","2015")
z$amount <- c(20000, 300, 5600)
z$text <- as.character(z$text)
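What I want to do is essentially extract only the sentences that contain the word test1 and put them in a new column (z$sentences) so that I can run further operations on them.
I tried the following: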
z$sentences <- grep("test1", unlist(strsplit(z$text, '(?<=\\.)\\s+',
perl=TRUE)), value=TRUE)
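But it returns an error, because the replacement has 4 rows while the data has 3.
I also tried using unlist, but the information in the other columns gets lost in the process.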
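The length mismatch can be checked directly. The following is a minimal sketch in base R that rebuilds the example data and compares the number of matching sentences with the number of rows; the names `parts` and `hits` are introduced here for illustration only.

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# strsplit() returns a list with one character vector of sentences per row.
parts <- strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE)
lengths(parts)  # number of sentences in each row

# unlist() flattens that list, so the link between a sentence and its row is lost.
hits <- grep("test1", unlist(parts), value = TRUE)
length(hits)    # number of matching sentences across all rows
nrow(z)         # number of rows that z$sentences would need to fill
```

Because `length(hits)` and `nrow(z)` differ, the assignment to `z$sentences` cannot succeed; the matching has to be done per row instead.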
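Either of 2 results would be satisfactory:
an extra column containing only the "test1" sentences, or a long format in which each row holds one sentence and still carries the data (date, amount).
Expected output, either:
all the matching sentences in one column
a new row for each sentence that matches the condition, although the last row is not necessarily needed.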
Any help is welcome.
Answer 0 (score: 2)
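The problem is that grep returns only the elements that match, and there can be fewer of those than the original length. Split the text first, then apply grep to each row's sentences and paste the matches back together: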
lst1 <- strsplit(z$text, '(?<=\\.)\\s+', perl = TRUE)
z$sentences <- sapply(lst1, function(x) paste(grep("test1", x,
value = TRUE), collapse=" "))
Another option, without splitting, is gsub:
trimws(gsub("(([A-Za-z, ]*)test1[A-Za-z, ]+\\.?)(*SKIP)(*F)|.",
"", z$text, perl = TRUE))
#[1] "test1 is good. Yet test1 is fake" "test1 is bad. but test1 is better."
#[3] ""
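The question also asks about a long format with one row per matching sentence. The following is a minimal sketch of one way to build it in base R, reusing the same split as above; `matched`, `n`, and `long` are illustrative names, and rows without any match (the third row here) are dropped.

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)
z$date <- c("2011", "2012", "2015")
z$amount <- c(20000, 300, 5600)

# Split into sentences and keep only those containing "test1", row by row.
lst1 <- strsplit(z$text, "(?<=\\.)\\s+", perl = TRUE)
matched <- lapply(lst1, function(s) grep("test1", s, value = TRUE))

# Repeat date and amount once per matching sentence of the same row.
n <- lengths(matched)
long <- data.frame(
  date = rep(z$date, n),
  amount = rep(z$amount, n),
  sentence = unlist(matched),
  stringsAsFactors = FALSE
)
long
```

Repeating the other columns by `lengths(matched)` keeps each sentence aligned with the date and amount of the row it came from, which is the information that a plain `unlist` discards.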
Answer 1 (score: -1)
You can use str_extract from the stringr package.
library(stringr)
z$sentences <- str_extract(z$text,'.*test1.*')
z
text date amount sentences
1 test is bad. test1 is good. but test is better. Yet test1 is fake 2011 20000 test is bad. test1 is good. but test is better. Yet test1 is fake
2 test1 is bad. test is good. but test1 is better. Yet test is fake 2012 300 test1 is bad. test is good. but test1 is better. Yet test is fake
3 this sentence is for trying purposes 2015 5600 <NA>
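Note that in the output above the sentences column repeats the whole text for the first two rows: `.` also matches periods, so `.*test1.*` spans the entire string. The following is a sketch of a sentence-level alternative in base R (not the answer's stringr approach), under the assumption that sentences end with a period and contain no other periods; the pattern would also match longer words such as "test10".

```r
x <- "test is bad. test1 is good. but test is better. Yet test1 is fake"
y <- "test1 is bad. test is good. but test1 is better. Yet test is fake"
a <- "this sentence is for trying purposes"
z <- data.frame(text = c(x, y, a), stringsAsFactors = FALSE)

# [^.]* cannot cross a period, so each match stays inside one sentence.
m <- regmatches(z$text, gregexpr("[^.]*test1[^.]*\\.?", z$text))

# Join the matches of each row; rows without a match become NA.
z$sentences <- vapply(m, function(s) {
  if (length(s) == 0) NA_character_ else paste(trimws(s), collapse = " ")
}, character(1))
z$sentences
```

Returning `NA` for rows without a match mirrors what `str_extract` does for the third row.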