Question

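I have only found solutions to this problem for Python and Java.

I have a data.frame with news articles and their corresponding dates. I also have a list of keywords that I want to check each article against.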

df <- data.frame(c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"), 
                 c("Articel does not contain a key word", "Articel does contain the key word revenue", "Articel does contain two keywords revenue and margin","Articel does not contain the key word margin"))
colnames(df) <- c("date","article")

key.words <- c("revenue", "margin", "among others")

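I came up with a solution that works well if I only want to check whether an article contains at least one of the keywords:

library(dplyr)
article.containing.keyword <- filter(df, grepl(paste(key.words, collapse="|"), df$article))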

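This works well, but what I am actually looking for is a solution where I can set a threshold of the form "an article must contain at least n keywords to be selected". For example, an article must contain at least n = 2 keywords to pass the filter. The desired output is as follows: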

  date       article
3 2015-05-08 Articel does contain two keywords revenue and margin

Answer 1

You can use stringr::str_count, which counts the number of matches of a pattern in each string:

stringr::str_count(df$article, paste(key.words, collapse="|"))
[1] 0 1 2 1

This can be turned into a filter like so:

article.containing.keyword <- dplyr::filter(df, stringr::str_count(df$article, paste(key.words, collapse="|")) >= 2)
        date                                              article
1 2015-05-08 Articel does contain two keywords revenue and margin
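
One caveat worth noting: str_count counts every match of the combined pattern, so an article that mentions "revenue" twice would also reach a count of 2. If the threshold should apply to the number of distinct keywords, a base R variant along the following lines may help. This is only a sketch: the helper name count.distinct.keywords and the variables n, keyword.counts and result are made up for illustration, and fixed = TRUE is an assumption that the keywords should be matched literally rather than as regular expressions.

```r
# Sample data from the question
df <- data.frame(
  date = c("2015-05-06", "2015-05-07", "2015-05-08", "2015-05-09"),
  article = c("Articel does not contain a key word",
              "Articel does contain the key word revenue",
              "Articel does contain two keywords revenue and margin",
              "Articel does not contain the key word margin"),
  stringsAsFactors = FALSE
)
key.words <- c("revenue", "margin", "among others")

# Hypothetical helper: number of distinct keywords found in each text
count.distinct.keywords <- function(text, keywords) {
  # One logical vector per keyword: TRUE where that keyword occurs in the text
  hits <- vapply(keywords,
                 function(k) grepl(k, text, fixed = TRUE),
                 logical(length(text)))
  # vapply returns a plain vector for a single text, so force a matrix
  # with one row per text and one column per keyword
  hits <- matrix(hits, nrow = length(text))
  rowSums(hits)
}

n <- 2  # threshold: minimum number of distinct keywords
keyword.counts <- count.distinct.keywords(df$article, key.words)
result <- df[keyword.counts >= n, ]
print(result)
```

Because each keyword contributes at most 1 to the count, repeated mentions of the same keyword do not inflate the total. Unlike dplyr::filter, base subsetting keeps the original row name, so the selected article is printed as row 3.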

Check whether a string contains at least n words from a list of words in R