Question

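I have a data frame with 3 columns, where the third (last) column contains a body of text, something like a sentence.

In addition, I have a vector of words.
How can I compute the following in an elegant way:

Find the 15 most frequent words (and their number of occurrences) in the third column, counting only the words that appear in the vector above?

A sentence might look like this: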
I like dogs and my father like cats
vector=["dogs", "like"]
Here, the most frequent words are dogs and like.

Answer 1

You can try the following:

library(tidytext)
library(tidyverse)

df %>%                               # your data
  unnest_tokens(word, text) %>%      # lowercase, strip punctuation, split the text into words
  group_by(word) %>%                 # group by word
  summarise(Freq = n()) %>%          # count occurrences of each word
  arrange(-Freq) %>%                 # sort by decreasing frequency
  top_n(2, Freq)                     # keep the top 2 here; use 15 for your case

Result:

# A tibble: 2 x 2
  word   Freq
  <chr> <int>
1 dogs      3
2 i         2

If your words are already split, one per row, you can skip the second line (the unnest_tokens step).
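
Note that the pipeline above counts every word in the text column, while the question asks only about words from a given vector. A minimal sketch of one way to add that restriction follows. The name target_words is an assumption made for illustration, and the sample data frame repeats the one used in this answer. Because unnest_tokens lowercases its output by default, the words in the vector should be lowercase too.

```r
library(dplyr)
library(tidytext)

# Sample data in the same shape as the answer's df (text in the third column)
df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE
)

# Hypothetical name for the question's word vector (lowercase on purpose)
target_words <- c("dogs", "like")

word_counts <- df %>%
  unnest_tokens(word, text) %>%                  # one lowercased word per row
  filter(word %in% target_words) %>%             # keep only words from the vector
  count(word, name = "Freq", sort = TRUE) %>%    # count and sort, most frequent first
  slice_head(n = 15)                             # at most the 15 most frequent

word_counts
```

With this sample data, only dogs and like survive the filter, so the result has two rows even though up to 15 are allowed.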

With the data:

df <- data.frame(
  id = c(1, 2, 3),
  group = c(1, 1, 1),
  text = c("I like dogs", "I don't hate dogs", "dogs are the best"),
  stringsAsFactors = FALSE)
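
For comparison, the same kind of count can be sketched in base R without extra packages. This is an alternative to the tidytext approach, not part of the original answer. It uses the question's example sentence; sentences and target_words are names assumed for illustration, and the split pattern is a simple assumption that treats anything other than a letter or apostrophe as a separator.

```r
# The question's example sentence and word vector (names are hypothetical)
sentences <- c("I like dogs and my father like cats")
target_words <- c("dogs", "like")

# Lowercase, then split on anything that is not a letter or an apostrophe
tokens <- unlist(strsplit(tolower(sentences), "[^a-z']+"))

# Keep only the words of interest and tabulate them
hits <- tokens[tokens %in% target_words]
counts <- sort(table(hits), decreasing = TRUE)

# At most the 15 most frequent words
head(counts, 15)
```

For a data frame, sentences would be the third column, for example df[[3]].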

Most frequent words in a data frame column