Question

I have a given list of words, for example:

words <- c("breast","cancer","chemotherapy")

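I have a very large data frame with 1 variable and more than 10,000 entries (rows).

I want to select all rows that contain any of the words in "words". It is not about one particular word: any word in "words" matters. Rows that contain more than one of the words in "words" matter too.

If I knew what "words" contained, I could run a string extraction several times. However, "words" changes every time and cannot be inspected. Is there a direct way to do this?

In addition, can I select all rows that contain 2 or more of the words in "words"? For example, a row containing only "cancer" does not count, but a row containing both "breast" and "cancer" does. Again, "words" changes every time and cannot be inspected. Is there a direct way?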

Answer 1

Some fake data:

words <- c("breast","cancer","chemotherapy")
df <- data.frame(v1 = c("there was nothing found","the chemotherapy is effective","no cancer no chemotherapy","the breast looked normal","something"))

You can use a combination of grepl, sapply and rowSums:

df[rowSums(sapply(words, grepl, df$v1)) > 0, , drop = FALSE]

This results in:

                             v1
2 the chemotherapy is effective
3     no cancer no chemotherapy
4      the breast looked normal

If you want to select only the rows that contain at least two of the words, then:

df[rowSums(sapply(words, grepl, df$v1)) > 1, , drop = FALSE]

Result:

                         v1
3 no cancer no chemotherapy

Note: you need to use drop = FALSE because your data frame has only one variable (column); without it, the subset is simplified to a plain vector. If your data frame has more than one variable (column), you don't need drop = FALSE.
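
Since "words" changes on every run, it may be convenient to wrap the same idea in a small function. The following is only a sketch, not part of the answer above: the name filter_by_words and its arguments col and min_hits are made up for illustration. It assumes each word should be matched literally as a substring (fixed = TRUE), so "cancer" would also match "cancerous", and it counts distinct words per row, not repeated occurrences.

```r
# Hypothetical helper (name and arguments are assumptions for illustration):
# keep the rows whose text in column `col` contains at least `min_hits`
# distinct entries of `words`.
filter_by_words <- function(df, col, words, min_hits = 1) {
  # For each word, grepl() gives one TRUE/FALSE per row; summing those
  # vectors yields the number of distinct words found in each row.
  # fixed = TRUE treats each word as a literal string, not a regex.
  counts <- Reduce(
    `+`,
    lapply(words, function(w) grepl(w, df[[col]], fixed = TRUE)),
    integer(nrow(df))  # start from zero so an empty `words` also works
  )
  # drop = FALSE keeps the result a data frame even with a single column.
  df[counts >= min_hits, , drop = FALSE]
}

words <- c("breast", "cancer", "chemotherapy")
df <- data.frame(v1 = c("there was nothing found",
                        "the chemotherapy is effective",
                        "no cancer no chemotherapy",
                        "the breast looked normal",
                        "something"))

any_word  <- filter_by_words(df, "v1", words)                # 1 or more words
two_words <- filter_by_words(df, "v1", words, min_hits = 2)  # 2 or more words
```

With the fake data above, any_word should hold rows 2, 3 and 4, and two_words only row 3. If whole-word matching is needed, one option is to drop fixed = TRUE and wrap each word in \\b markers, at the cost of having to escape any regex metacharacters in the words.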
