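I am trying to create a document-term matrix (DTM) for a corpus of 40 different texts. I want to exclude words longer than 20 characters. How can I do this?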
Answer 0 (score: 1)
You can try passing wordLengths as a control parameter:
library(tm)
# Keep only terms that are between 1 and 20 characters long
DocumentTermMatrix(corpus, control = list(wordLengths = c(1, 20)))
From the documentation:
wordLengths - An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
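To check that the filter behaves as described, here is a minimal sketch on a toy corpus. The two sample texts and the variable names (texts, corpus, dtm, terms) are made up for illustration and are not from the question; the real corpus of 40 texts would take their place.

```r
library(tm)

# Hypothetical sample texts; the second one contains a 45-character word
texts <- c("a short sentence",
           "pneumonoultramicroscopicsilicovolcanoconiosis is a long word")
corpus <- VCorpus(VectorSource(texts))

# Keep only terms that are between 1 and 20 characters long
dtm <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, 20)))

# Inspect the surviving terms and confirm none exceeds 20 characters
terms <- Terms(dtm)
print(terms)
print(max(nchar(terms)))
```

Note that c(1, 20) also lowers the minimum length from the default of 3 to 1, so one- and two-letter words are kept. If the default minimum should stay in place, c(3, 20) would probably be the better choice.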