Question

I am new to text processing with R. I tried the simple code below:

library(RTextTools)
texts <- c("This is the first document.", "This is the second file.", "This is the third text.")
matrix <- create_matrix(texts, ngramLength=3)

This comes from one of the answers to the question Finding 2 & 3 word Phrases Using R TM Package.

However, it gives the error Error in FUN(X[[2L]], ...) : non-character argument.

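I can generate a document-term matrix when I drop the ngramLength argument, but I need to search for phrases of a given length in words. Any suggestions for an alternative or a correction?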

Answer 1

ngramLength doesn't seem to work. Here is a workaround:

library(RTextTools)
library(tm)
library(RWeka) # needed for NGramTokenizer and Weka_control

texts <- c("This is the first document.",
           "Is this a text?",
           "This is the second file.",
           "This is the third text.",
           "File is not this.")

# Emit only three-word phrases (min = max = 3)
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# VCorpus rather than Corpus: in recent tm versions Corpus() may build a
# SimpleCorpus, which ignores a custom tokenizer
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                          control = list(weighting = weightTf,
                                         tokenize = TrigramTokenizer))

as.matrix(dtm)

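The document-term matrix is built with RWeka's NGramTokenizer instead of the tokenizer called by create_matrix. You can now use dtm in other RTextTools functions, for example to train a classification model: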

dtm
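If installing RWeka (which depends on Java) is inconvenient, the same idea can be sketched with the ngrams() and words() helpers that are available once tm is loaded. This is only a minimal sketch under assumptions: the tokenizer name TrigramTokenizer is a hypothetical helper, the sample texts are the ones from the question, and VCorpus is used on the assumption that a custom tokenizer would otherwise be ignored.

```r
library(tm)  # loading tm also attaches NLP, which provides ngrams() and words()

texts <- c("This is the first document.",
           "This is the second file.",
           "This is the third text.")

# Hypothetical helper: split a document into words, then paste every
# consecutive run of three words into a single term
TrigramTokenizer <- function(x)
  unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)

# Build the document-term matrix with the custom trigram tokenizer
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)),
                          control = list(tokenize = TrigramTokenizer))

Terms(dtm)  # lists the three-word phrases found across the documents
```

Changing the 3 in ngrams(words(x), 3) to 2 would give two-word phrases instead.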

Answer 2

I ran into the same error. I found a fix in this pull request: https://github.com/timjurka/RTextTools/pull/5/files. I applied the change via trace(create_matrix, edit = T). Now it works :)
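For anyone who has not used trace() this way, a rough sketch of the session follows. It assumes an interactive R session with RTextTools installed; the actual change to make is whatever the pull request above shows, which is not reproduced here.

```r
library(RTextTools)

if (interactive()) {
  # Opens the source of create_matrix in an editor; apply the change from
  # the pull request by hand, then save and close the editor
  trace("create_matrix", edit = TRUE)

  # The edited copy stays in effect for the rest of the session.
  # To go back to the packaged version, run:
  # untrace("create_matrix")
}
```

The edit only lasts for the current session, so it has to be repeated after restarting R.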

Answer 3

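I don't think this is a problem with the input data type (character). I get the same error with the NYTimes dataset that ships with the package, running the same code as in the help manual.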
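A quick way to rule out the input type, sketched here with the vector from the question, is to check it directly before calling create_matrix:

```r
texts <- c("This is the first document.",
           "This is the second file.",
           "This is the third text.")

# Both checks use base R only; a character vector passes the first one
is.character(texts)
class(texts)
```

If the input passes this check and the error still appears, the cause is more likely inside create_matrix itself than in the data.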

RTextTools create_matrix returns non-character argument error
