Document-term matrix (DTM) - R

Time: 2017-10-01 12:41:51

Tags: r

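I am trying to create a document-term matrix (DTM) from a corpus of 40 different texts, and I want to exclude words longer than 20 characters. How can I do this?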

1 answer:

Answer 0: (score: 1)

You can try passing wordLengths as a control parameter:

library(tm)
# corpus is your existing tm corpus of the 40 texts
DocumentTermMatrix(corpus, control = list(wordLengths = c(1, 20)))

From the documentation:

wordLengths - An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
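
To check the effect on a small case, here is a minimal sketch. The two toy texts and the names toy_corpus, dtm_capped, dtm_uncapped and long_word are made up for illustration, and it assumes the default tokenizer that tm applies to a VCorpus:

```r
library(tm)

# Two toy documents (hypothetical data); the second contains a word
# that is longer than 20 characters.
texts <- c("a short word",
           "pneumonoultramicroscopicsilicovolcanoconiosis is long")
toy_corpus <- VCorpus(VectorSource(texts))

# Keep only terms that are between 1 and 20 characters long.
dtm_capped <- DocumentTermMatrix(toy_corpus,
                                 control = list(wordLengths = c(1, 20)))

# Same corpus with no upper bound, for comparison.
dtm_uncapped <- DocumentTermMatrix(toy_corpus,
                                   control = list(wordLengths = c(1, Inf)))

long_word <- "pneumonoultramicroscopicsilicovolcanoconiosis"

# The long word should be absent from the capped matrix
# and present in the uncapped one.
long_word %in% Terms(dtm_capped)
long_word %in% Terms(dtm_uncapped)

# Length of the longest term that survived the cap.
max(nchar(Terms(dtm_capped)))
```

Note that c(1, 20) also lowers the minimum length from the default of 3 to 1, so one- and two-character words are kept. If you only want to add the upper bound and keep the default minimum, c(3, 20) should do that.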