I am currently using the tm package to extract terms for clustering, in order to do duplicate detection on a decent-sized database of 25k items (30 MB) that runs on my desktop, but when I try to run it on my server it seems to take an enormous amount of time. On closer inspection I found that the line apply(posts.TmDoc, 1, sum), which I use to count term frequencies, had blown through 4 GB of swap. Furthermore, even running as.matrix produces a 3 GB object on my desktop; see http://imgur.com/a/wllXv
Is all of this really necessary just to generate frequency counts for 18k terms over 25k items? Is there another way to get the frequency counts without coercing the TermDocumentMatrix to a matrix or a vector?
I cannot drop terms based on sparsity because of how the actual algorithm works: it looks for terms that are shared by at least 2 but no more than 50 items, groups the items on those terms, and computes a similarity value for each group.
Here is the code, for context:
# load the text-mining package used throughout
library(tm)

min_word_length = 5
max_word_length = Inf
max_term_occurance = 50
min_term_occurance = 2
# Get All The Posts
Posts = db.getAllPosts()
posts.corpus = Corpus(VectorSource(Posts[,"provider_title"]))
# remove things we don't want
posts.corpus = tm_map(posts.corpus,content_transformer(tolower))
posts.corpus = tm_map(posts.corpus, removePunctuation)
posts.corpus = tm_map(posts.corpus, removeNumbers)
posts.corpus = tm_map(posts.corpus, removeWords, stopwords('english'))
# keep only words that are at least 5 characters long
posts.TmDoc = TermDocumentMatrix(posts.corpus, control=list(wordLengths=c(min_word_length, max_word_length)))
# get the words that occur at least twice but fewer than 50 times
clustterms = names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance & apply(posts.TmDoc, 1, sum) < max_term_occurance))
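For what it's worth, the TermDocumentMatrix that tm builds is stored as a sparse simple_triplet_matrix from the slam package (a dependency of tm), so the term frequencies can presumably be computed without densifying anything. A minimal sketch, assuming slam is available and posts.TmDoc is built as above (untested against this corpus):

library(slam)
# row_sums operates on the sparse representation directly,
# so no dense 3 GB matrix is ever created
term_freqs = row_sums(posts.TmDoc)
clustterms = names(which(term_freqs >= min_term_occurance & term_freqs < max_term_occurance))

The reason apply() is so expensive here is that it first coerces its argument with as.matrix, which appears to be where the dense 3 GB object comes from.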
Answer 0 (score: 3)
Since I never actually need the frequency counts themselves, I can use the findFreqTerms command:
setdiff(findFreqTerms(posts.TmDoc, 2), findFreqTerms(posts.TmDoc, 50))
which gives the same result as

names(which(apply(posts.TmDoc, 1, sum) >= min_term_occurance & apply(posts.TmDoc, 1, sum) < max_term_occurance))

but runs almost instantly.
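For reference, findFreqTerms also takes a highfreq argument (both bounds are inclusive), so the same set can presumably be produced in a single call; a small sketch reusing the thresholds defined above (untested here):

# terms with frequency in [min_term_occurance, max_term_occurance - 1],
# i.e. at least 2 but fewer than 50 occurrences
findFreqTerms(posts.TmDoc, lowfreq = min_term_occurance, highfreq = max_term_occurance - 1)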