Creating a frequency table using R and a term-document matrix

Time: 2018-02-16 06:42:53

Tags: r frequency text-mining grepl term-document-matrix

I have created the following data frame, which contains some email subject lines.

 df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
                            'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'))

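I have created a list of frequent words derived from the data frame above. I added these keywords to the data frame as columns and dummy-coded them as 0.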

 most_freq_words <- c('Free', 'New', 'Limited', 'Offer')



 Subject                                                           Free New Limited Offer
 'Free ! Free! Free ! Clear Cover with New Phone'                     0   0       0     0
 'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0   0       0     0

I want to get the frequency count of each word in each email subject. The output should look like this:

 Subject                                                           Free New Limited Offer
 'Free ! Free! Free ! Clear Cover with New Phone'                     3   1       0     0
 'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'     0   1       1     2

I have tried the following code:

for (i in 1:length(most_freq_words)){
df[[most_freq_words[i]]] <- as.numeric(grepl(tolower(most_freq_words[i]), 
tolower(df$subject)))}

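However, this only tells me whether the word is present in the sentence, not how many times it occurs. I need the output shown above. Any help would be appreciated.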

3 answers:

Answer 0: (score: 3)

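I handled this task with the tidytext package. First, I added a grouping variable to the data set so that each subject line keeps its own id. Then I split the text into words with unnest_tokens(). I removed every word that is not in most_freq_words. Next, I counted how many times each word appears in each sentence (group). Finally, I converted the long-format data to wide format. If you still want the original sentences, you can easily add them to the output, for example by adding cbind(subject = df$subject) after the line with spread().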

library(dplyr)
library(tidyr)     # provides spread(), used below
library(tidytext)

df <- data.frame(subject=c('Free ! Free! Free ! Clear Cover with New Phone',
                           'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)

most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

mutate(df, group = 1:n()) %>%
  unnest_tokens(input = subject, output = word, token = "words", to_lower = FALSE) %>%
  filter(word %in% most_freq_words) %>%
  count(group, word) %>%
  spread(key = word, value = n, fill = 0)

  group  Free Limited   New Offer
  <int> <dbl>   <dbl> <dbl> <dbl>
1     1  3.00    0     1.00  0   
2     2  0       1.00  1.00  2.00
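
The cbind() step mentioned above could look roughly like the following. This is only a sketch: the names counts and result are introduced here for illustration, and it assumes every subject line contains at least one of the keywords. A line with no keyword would be dropped by filter(), and cbind() would then fail because the row counts no longer match.

```r
library(dplyr)
library(tidyr)
library(tidytext)

df <- data.frame(subject = c('Free ! Free! Free ! Clear Cover with New Phone',
                             'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)

most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

# Same pipeline as in the answer, stored in a variable (hypothetical name)
counts <- df %>%
  mutate(group = 1:n()) %>%
  unnest_tokens(input = subject, output = word, token = "words", to_lower = FALSE) %>%
  filter(word %in% most_freq_words) %>%
  count(group, word) %>%
  spread(key = word, value = n, fill = 0)

# Put the original sentences back in front of the counts.
# Assumption: counts has one row per row of df, in the same order.
result <- cbind(subject = df$subject, counts)
print(result)
```

If some subject lines may contain none of the keywords, joining counts back onto the grouped data by group (for example with left_join()) would be a safer choice than cbind().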

Answer 1: (score: 3)

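Here is another option with the tidyverse. We loop through 'most_freq_words' using map, get the count of each word in the 'subject' column of 'df' with str_count, convert the result to a tibble, set the column name from 'most_freq_words', and bind the columns to the original data set 'df'.
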
library(tidyverse)
most_freq_words %>% 
      map(~ str_count(df$subject, .x) %>%
                    as_tibble %>% 
                    set_names(.x)) %>% 
      bind_cols(df, .)
#                                                         subject Free New Limited Offer
#1                 Free ! Free! Free ! Clear Cover with New Phone    3   1       0     0
#2 Offer ! Buy New phone and get earphone at 1000. Limited Offer!    0   1       1     2
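
To see what a single step of that loop does, the sketch below calls str_count on its own. Note that str_count counts pattern matches, not whole words, and is case-sensitive by default. The word "phone" and the variable names are used here only for illustration; they are not part of the original answer.

```r
library(stringr)

df <- data.frame(subject = c('Free ! Free! Free ! Clear Cover with New Phone',
                             'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)

# Case-sensitive count of the pattern "Free" in each subject line
free_counts <- str_count(df$subject, "Free")
free_counts   # 3 0

# Ignoring case, "phone" also matches inside "earphone"
phone_loose <- str_count(df$subject, regex("phone", ignore_case = TRUE))
phone_loose   # 1 2

# Word boundaries restrict the match to the whole word
phone_whole <- str_count(df$subject, regex("\\bphone\\b", ignore_case = TRUE))
phone_whole   # 1 1
```

For the four keywords in the question the plain pattern is enough, but with other keyword lists the word-boundary form may be what you want.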

Answer 2: (score: 2)

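Replace grepl with gregexpr, and then check the length of the 1st list item. gregexpr returns -1 when there is no match, so that case has to be mapped to 0 explicitly. In addition, the for-loop should run over each row of df. Keeping the intent of the OP's for-loop, the modified code looks like this: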

for (i in 1:length(most_freq_words)){
  for(j in 1:nrow(df)){
    df[j,most_freq_words[i]] <- ifelse(gregexpr(tolower(most_freq_words[i]),
       tolower(df$subject[j]))[[1]][[1]] >0,
    length(gregexpr(tolower(most_freq_words[i]), tolower(df$subject[j]))[[1]]), 0)
  }
}  


> df
                                                         subject Free New Limited Offer
1                Free ! Free! Free ! Clear Cover with New  Phone    3   1       0     0
2 Offer ! Buy New phone and get earphone at 1000. Limited Offer!    0   1       1     2
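
As a possible variation on the same idea, the inner loop over rows can be avoided, because gregexpr is vectorised over the text. The sketch below is not part of the original answer: it uses regmatches to extract the matches, which yields an empty vector for a row with no match, so lengths() gives 0 without the ifelse check. It assumes the keywords should be matched as literal strings (fixed = TRUE).

```r
df <- data.frame(subject = c('Free ! Free! Free ! Clear Cover with New Phone',
                             'Offer ! Buy New phone and get earphone at 1000. Limited Offer!'),
                 stringsAsFactors = FALSE)

most_freq_words <- c('Free', 'New', 'Limited', 'Offer')

subject_lower <- tolower(df$subject)

for (w in most_freq_words) {
  # One list element per row, holding the match positions (or -1 if none)
  matches <- gregexpr(tolower(w), subject_lower, fixed = TRUE)
  # regmatches() returns character(0) for rows without a match,
  # so lengths() is the number of occurrences per row
  df[[w]] <- lengths(regmatches(subject_lower, matches))
}

print(df)
```

Like the loop above, this counts case-insensitive substring matches, so a keyword that occurs inside a longer word is counted as well.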