Question

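Sorry for yet another question, but I am new to text mining and need advice from the pros. After a long struggle with http://my.domain.com, I now have a clean corpus. My next question: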

content_transformer

For example, I need something like this:

1. How do I select from `dtm` the words with small frequencies, so that the sum of their frequencies is no more than 1%?

So the total sum of the selected frequencies should be 1%. How can this be done?

Answer 1

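You can look at the TermDocumentMatrix function in the tm package. It gives you a way to count how often each word occurs in each document. Summing those counts over the whole corpus should get you where you want to go.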

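library(tm)
# assumes `corpus` is the cleaned tm corpus from the question
dtm <- DocumentTermMatrix(corpus)
# word counts over the complete corpus (one entry per term)
counts <- colSums(as.matrix(dtm))

# total number of word occurrences in the corpus
# (length(counts) would be the number of distinct terms, not documents)
total <- sum(counts)
# relative frequency of each term; these sum to 1
freqs <- counts / total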
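
To get from per-term frequencies to the "rarest words that together make up at most 1%" selection asked about in the question, one possible approach is to sort the relative frequencies in ascending order and take a cumulative sum. The sketch below is only an illustration under assumptions: the `counts` vector is made-up toy data standing in for `colSums(as.matrix(dtm))`, and `threshold` and `rare_terms` are hypothetical names, not part of tm.

```r
# Toy term counts standing in for colSums(as.matrix(dtm)); values are made up.
counts <- c(the = 500, text = 300, mining = 150, corpus = 44,
            rare = 3, odd = 2, typo = 1)

# Relative frequency of each term over the whole corpus (sums to 1).
freqs <- counts / sum(counts)

# Order terms from rarest to most common and accumulate their frequencies.
sorted <- sort(freqs)
cum <- cumsum(sorted)

# Keep the rarest terms whose combined share stays within the 1% budget.
threshold <- 0.01
rare_terms <- names(sorted)[cum <= threshold]
print(rare_terms)
```

The names in `rare_terms` can then be used to pick out the matching columns of `dtm`. Ties between equally rare words are broken by sort order here, so a different rule may be preferable if that matters for your data.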
