Question

I am trying to create a TermDocumentMatrix with the tm package, but I seem to be running into trouble.

Input:

trainDF<-as.matrix(list("I'm going home", "trying to fix this", "when I go home"))

Goal: create a TDM from the input (not all of the control parameters are listed below):

control <- list(
    weight= weightTfIdf, 
    removeNumbers=TRUE, 
    removeStopwords=TRUE, 
    removePunctuation=TRUE,    
    stemWords=TRUE, 
    maxWordLength=maxWordL,
    bounds=list(local=c(minDocFreq, maxDocFreq))
)

tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)),control = control)

The error I get:

Warning message:
In is.na(x) : is.na() applied to non-(list or vector) of type 'NULL'

The tdm object is empty. Any ideas?

Answer 1

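The error suggests there is a problem with the minimum and maximum document frequencies you chose in bounds. For example, the following works: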

control=list(weighting = weightTfIdf,
             removeNumbers=TRUE, 
             removeStopwords=TRUE, 
             removePunctuation=TRUE, 
             bounds=list(local=c(1,3)))
tdm<- TermDocumentMatrix(Corpus(DataframeSource(trainDF)), control=control)

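Note that in the latest version of tm, to specify the weighting you need to use weighting = weightTfIdf rather than weight = weightTfIdf. Similarly, you should use stemming=TRUE in the control list to stem words. I am not sure maxWordLength is currently an option. tm silently ignores invalid options in the control list, so you will not know something went wrong until you go back and inspect the matrix.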
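Because invalid option names are dropped silently, it can help to compare the names in the control list against the documented ones before building the matrix. The following is a minimal sketch, not part of the original answer. It rests on several assumptions: the `known` vector is a hand-written list based on my reading of the `termFreq` and `TermDocumentMatrix` help pages and should be checked against your installed tm version; `stopwords` and `wordLengths` are used as what I believe are the current names for stop-word removal and word-length limits (in place of `removeStopwords` and `maxWordLength`); and `VectorSource` replaces `DataframeSource`, since recent tm releases appear to expect a data frame with `doc_id` and `text` columns there.

```r
library(tm)

# Same three sentences as in the question, as a plain character vector.
docs <- c("I'm going home", "trying to fix this", "when I go home")
corpus <- VCorpus(VectorSource(docs))

control <- list(
  weighting         = weightTfIdf,              # not `weight`
  removeNumbers     = TRUE,
  removePunctuation = TRUE,
  stopwords         = TRUE,                     # assumed replacement for `removeStopwords`
  wordLengths       = c(3, Inf),                # assumed replacement for `maxWordLength`
  bounds            = list(local = c(1, Inf))   # explicit numbers, no undefined variables
  # stemming = TRUE could be added here; it needs the SnowballC package.
)

# Hypothetical guard: option names taken from the tm help pages (an assumption,
# verify with ?termFreq and ?TermDocumentMatrix for your version).
known <- c("tokenize", "tolower", "language", "removePunctuation",
           "removeNumbers", "stopwords", "stemming", "dictionary",
           "bounds", "wordLengths", "weighting")
unknown <- setdiff(names(control), known)
if (length(unknown) > 0) {
  warning("Control options tm may ignore: ", paste(unknown, collapse = ", "))
}

tdm <- TermDocumentMatrix(corpus, control = control)

# A quick look at the result shows whether any terms survived the filters.
dim(tdm)
inspect(tdm)
```

Running the guard on the control list from the question would flag `weight`, `removeStopwords`, `stemWords` and `maxWordLength`, which is consistent with the answer's point that those names have no effect.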
