Using tm, I compare a DocumentTermMatrix against a dictionary list to compute totals:
totals <- inspect(DocumentTermMatrix(x, list(dictionary = d)))
This works for single words, but I would like to include bigrams as well and cannot figure out how to do it.
I tried RWeka:
TrigramTokenizer <- function(x) NGramTokenizer(x,
Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(v.corpus,
control = list(tokenize = TrigramTokenizer))
But I get the following error message:
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j, v' different lengths
In addition: Warning messages:
1: In parallel::mclapply(x, termFreq, control) :
all scheduled cores encountered errors in user code
2: In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'
3: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion.
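Can you help me resolve this error?
Thanks!
Answer 0 (score: 2)
See my answer here.
There seems to be a problem when RWeka is used together with the parallel package. I found a workaround here [1].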
1: http://r.789695.n4.nabble.com/RWeka-and-multicore-package-td4678473.html#a4678948
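The most important point is not to load the RWeka package at all, and instead to call its functions through the namespace (RWeka::) inside the wrapper function.
So your tokenizer should look like this: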
BigramTokenizer <- function(x) {RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))}
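To tie this back to the dictionary counts in the question, here is a minimal sketch of how the namespaced tokenizer might be combined with a dictionary. The sample documents and the dictionary `d` are made up for illustration, and it assumes tm and RWeka (with a working Java setup) are installed. The `options(mc.cores = 1)` line is an extra precaution that is often suggested for this error; it is not part of the workaround linked above.

```r
library(tm)   # note: RWeka is deliberately NOT attached with library()

# Extra precaution (an assumption, not from the linked workaround):
# keep any parallel code path on a single core.
options(mc.cores = 1)

# Hypothetical sample data, for illustration only
docs <- c("the quick brown fox", "the quick red fox jumps")
corpus <- VCorpus(VectorSource(docs))

# Tokenizer from the answer: RWeka is reached only through its namespace
BigramTokenizer <- function(x) {
  RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 2, max = 2))
}

# Hypothetical dictionary of bigrams to count
d <- c("quick brown", "red fox", "the quick")

dtm <- DocumentTermMatrix(
  corpus,
  control = list(tokenize = BigramTokenizer, dictionary = d)
)

# Dense matrix: one row per document, one column per dictionary bigram
totals <- as.matrix(dtm)
print(totals)
```

To count single words and bigrams in one pass, setting `min = 1, max = 2` in `Weka_control` should produce both kinds of terms, so a dictionary mixing the two can be matched against the same matrix.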