Ngrams in R: counting word frequencies and summing values

Date: 2017-02-18 18:50:13

Tags: r

I would like to perform the following calculation:

Input:

Column_A                    Column_B
Word_A                      10
Word_A Word_B               20
Word_B Word_A               30
Word_A Word_B Word_C        40

Output:

Column_A1                   Column_B1
Word_A                      100 = 10+20+30+40
Word_B                      90  = 20+30+40
Word_C                      40  = 40
Word_A Word_B               90  = 20+30+40
Word_A Word_C               40  = 40
Word_B Word_C               40  = 40
Word_A Word_B Word_C        40  = 40

The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I was able to extract unigrams (single words only); what I need are n-grams with n = 1, 2, 3, and then to compute Column_B1.
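
For reference, a rough sketch of the RWeka/tm route (purely illustrative; even with NGramTokenizer configured for 1- to 3-grams, the order-insensitive Column_B1 sums are still missing):

library(tm)
library(RWeka)

# Hypothetical illustration only: build a term-document matrix of 1- to 3-grams
# from the example strings; summing Column_B per order-insensitive ngram is not
# handled here.
texts <- c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C")
corpus <- VCorpus(VectorSource(texts))
ngram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
tdm <- TermDocumentMatrix(corpus,
                          control = list(tokenize = ngram_tokenizer, tolower = FALSE))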

1 answer:

Answer 0: (score: 1)

A tidy approach:

library(tidyverse)
library(tokenizers)
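
# Example data reconstructed from the question's input table; the answer assumes
# a data frame `df` with these two columns already exists.
df <- tibble(
    Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
    Column_B = c(10L, 20L, 30L, 40L)
)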

df %>%
    rowwise() %>%
    # Collect 1- to 3-grams plus 2-skip-grams so that non-adjacent pairs
    # such as "Word_A Word_C" are also captured
    mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1),
                          tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2),
                          recursive = TRUE)),
           # Sort the words inside each ngram so that word order does not matter,
           # then drop duplicates
           ngram = list(unique(map_chr(strsplit(ngram, ' '),
                                       ~paste(sort(.x), collapse = ' '))))) %>%
    unnest() %>%
    # Sum Column_B over each distinct order-insensitive ngram
    count(ngram, wt = Column_B)

## # A tibble: 7 × 2
##                  ngram     n
##                  <chr> <int>
## 1               Word_A   100
## 2        Word_A Word_B    90
## 3 Word_A Word_B Word_C    40
## 4        Word_A Word_C    40
## 5               Word_B    90
## 6        Word_B Word_C    40
## 7               Word_C    40

Note that at the moment this is only robust for strings of up to three words. For longer strings, you would have to work out how far the skip-grams need to reach, or take a different approach altogether.
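
A minimal sketch of one such alternative, assuming the same df as above: instead of relying on skip-grams, enumerate every subset of the words in each row, which stays robust for strings of any length:

df %>%
    rowwise() %>%
    # Build every (sorted) word subset per row, so no skip distance has to be
    # chosen; unique() drops duplicates within a row
    mutate(ngram = list({
        words <- sort(strsplit(Column_A, ' ')[[1]])
        unique(unlist(lapply(seq_along(words),
                             function(k) combn(words, k, paste, collapse = ' '))))
    })) %>%
    unnest() %>%
    ungroup() %>%
    count(ngram, wt = Column_B)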