I want to perform the following computation:
Input:
Column_A Column_B
Word_A 10
Word_A Word_B 20
Word_B Word_A 30
Word_A Word_B Word_C 40
Output:
Column_A1 Column_B1
Word_A 100 = 10+20+30+40
Word_B 90 = 20+30+40
Word_C 40 = 40
Word_A Word_B 90 = 20+30+40
Word_A Word_C 40 = 40
Word_B Word_C 40 = 40
Word_A Word_B Word_C 40 = 40
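The order of the words in the output does not matter, so Word_A Word_B = 90 = Word_B Word_A. Using the RWeka and tm libraries I was able to extract unigrams (single words), but I need n-grams with n = 1, 2, 3, together with the computed Column_B1 totals.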
Answer 0 (score: 1)
A tidy approach:
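library(tidyverse)
library(tokenizers)

# Input from the question; the original code assumes it is stored in a data frame named df
df <- tibble(
  Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
  Column_B = c(10L, 20L, 30L, 40L)
)

df %>%
  rowwise() %>%
  # contiguous 1- to 3-grams, plus skip-grams to pick up pairs such as "Word_A Word_C"
  mutate(ngram = list(c(tokenize_ngrams(Column_A, lowercase = FALSE, n = 3, n_min = 1),
                        tokenize_skip_ngrams(Column_A, lowercase = FALSE, n = 2),
                        recursive = TRUE)),
         # sort the words inside each n-gram so word order is ignored, then deduplicate
         ngram = list(unique(map_chr(strsplit(ngram, ' '),
                                     ~paste(sort(.x), collapse = ' '))))) %>%
  ungroup() %>%
  unnest(ngram) %>%
  # sum Column_B for each n-gram
  count(ngram, wt = Column_B)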
## # A tibble: 7 × 2
## ngram n
## <chr> <int>
## 1 Word_A 100
## 2 Word_A Word_B 90
## 3 Word_A Word_B Word_C 40
## 4 Word_A Word_C 40
## 5 Word_B 90
## 6 Word_B Word_C 40
## 7 Word_C 40
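Note that, as written, this is only robust for strings of up to three words. For longer strings you would have to work out how far the skip n-grams need to skip, or take a different approach altogether.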
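One possible "different approach" for longer strings, offered as a minimal sketch rather than as part of the original answer: treat each string as a set of words and enumerate every combination of 1 to 3 words with base R's combn(), so no skip distance has to be chosen. The helper name word_combos and the parameter max_n are hypothetical, and the sketch assumes that a word repeated within one string should only count once.

```r
# Input reconstructed from the question; the name df is an assumption.
df <- data.frame(
  Column_A = c("Word_A", "Word_A Word_B", "Word_B Word_A", "Word_A Word_B Word_C"),
  Column_B = c(10L, 20L, 30L, 40L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all order-independent combinations of up to max_n
# distinct words from a single space-separated string.
word_combos <- function(text, max_n = 3) {
  # Sorting first means every combination comes out in one canonical word order.
  words <- sort(unique(strsplit(text, " ", fixed = TRUE)[[1]]))
  sizes <- seq_len(min(max_n, length(words)))
  unlist(lapply(sizes, function(k) combn(words, k, FUN = paste, collapse = " ")))
}

# One row per (combination, source row), carrying that row's Column_B value.
combos <- lapply(df$Column_A, word_combos)
long <- data.frame(
  ngram = unlist(combos),
  Column_B = rep(df$Column_B, lengths(combos)),
  stringsAsFactors = FALSE
)

# Sum Column_B over every row that contains the combination.
result <- aggregate(Column_B ~ ngram, data = long, FUN = sum)
print(result)
```

For the four input rows in the question this gives the same seven totals as the output above. The number of combinations grows quickly with the number of words per string, so max_n is worth keeping small for long strings.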