How do I get the top words by tf-idf for each document in R?

Asked: 2014-03-03 18:54:32

Tags: r text tf-idf tm

I have a document-term matrix from the tm package in R:

library(tm)

dd <- Corpus(VectorSource(train$text)) # Make a corpus object from a text vector
# Clean the text (in tm >= 0.6, plain base functions such as tolower must be
# wrapped in content_transformer() so the corpus structure is preserved)
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removePunctuation)
dd <- tm_map(dd, removeWords, stopwords("english"))
dd <- tm_map(dd, stemDocument)
dd <- tm_map(dd, removeNumbers)
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))

I can't find a way to manipulate the document-term matrix to extract the information I want: the top three keywords for each document, ranked by tf-idf. How can I do that?

Edit: sample text (all from the Yelp Review Academic Dataset):

doc1 <- "Luckily, I didn't have to travel far to make my connecting flight. And for this, I thank you, Phoenix.  My brief layover was pleasant as the employees were kind and the flight was on time.  Hopefully, next time I can grace Phoenix with my presence for a little while longer."
doc2 <- "Nobuo shows his unique talents with everything on the menu. Carefully crafted features with much to drink. Start with the pork belly buns and a stout. Then go on until you can no longer."
doc3 <- "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream. Here's the lowdown: Giant ice cream cookie sandwiches for super cheap. The flavor permutations are basically endless. I had snickerdoodle with cookies and cream ice cream. It was marvelous."
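For anyone reproducing this without the full dataset, here is a minimal sketch that runs the same pipeline on just the three sample reviews. Relabelling the rows at the end is my own addition, so the documents print as doc1, doc2, doc3 rather than the 1, 2, 3 that tm assigns by default:

library(tm)

docs <- c(doc1, doc2, doc3)
dd <- Corpus(VectorSource(docs))   # corpus of just the three samples
dd <- tm_map(dd, stripWhitespace)
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removePunctuation)
dd <- tm_map(dd, removeWords, stopwords("english"))
dd <- tm_map(dd, stemDocument)
dd <- tm_map(dd, removeNumbers)
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
rownames(dtm) <- c("doc1", "doc2", "doc3")  # tm numbers documents 1..n by default
inspect(dtm)                                # one row per review, tf-idf weighted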

I should mention that I have over 180,000 documents of this kind, so a solution that scales, rather than one tailored to just these specific examples, would be great.

1 Answer:

Answer 0 (score: 2):

This works (note that x2 >= x2[3] keeps every term tied with the third-highest score, which is why doc2 returns more than three terms):

apply(dtm, 1, function(x) {   # for each document (row of the dtm)
    x2 <- sort(x, TRUE)       # sort its tf-idf scores, highest first
    x2[x2 >= x2[3]]           # keep everything scoring at least the 3rd-highest
})

## $doc1
##   flight  phoenix     time 
## 0.126797 0.126797 0.126797 
## 
## $doc2
##      belli        bun       care      craft      drink    everyth     featur 
## 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 
##       menu       much      nobuo       pork       show      start      stout 
## 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 
##     talent      uniqu 
## 0.08805347 0.08805347 
## 
## $doc3
##     cream     cooki       ice 
## 0.2113283 0.1584963 0.1584963 
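As a side note (my own back-of-the-envelope check, not part of the original answer): with its default normalize = TRUE, tm's weightTfIdf scores a term as (count in document / total tokens in document) * log2(number of documents / number of documents containing the term). The token counts below are inferred from the scores above: after preprocessing, the stem cream occurs 4 times among doc3's 30 remaining tokens, and in only 1 of the 3 documents.

(4 / 30) * log2(3 / 1)  # 0.2113283, cream's score for doc3
(3 / 30) * log2(3 / 1)  # 0.1584963, the score for cooki and ice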

If you want to scale this up, I would use parallel computing.
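Expanding on that, here is one possible sketch of mine (not from the original answer) that should scale better than apply(), which coerces the whole sparse matrix to a dense one and may not fit in memory for 180,000 documents. It works directly on the sparse triplet representation that tm builds on via the slam package, so the whole job is a single pass over the non-zero entries. Unlike the answer above, head(..., n) takes exactly n terms per document and drops ties beyond the n-th place.

library(slam)  # tm's DocumentTermMatrix is a slam simple_triplet_matrix

# dtm$i, dtm$j and dtm$v hold the document index, term index and tf-idf
# value of every non-zero cell; group them by document and sort each group.
top_terms <- function(dtm, n = 3) {
  entries <- data.frame(term  = colnames(dtm)[dtm$j],
                        tfidf = dtm$v,
                        stringsAsFactors = FALSE)
  doc <- factor(dtm$i, levels = seq_len(nrow(dtm)), labels = rownames(dtm))
  lapply(split(entries, doc), function(d) {
    d <- d[order(d$tfidf, decreasing = TRUE), ]
    head(setNames(d$tfidf, d$term), n)
  })
}

top3 <- top_terms(dtm, 3)
top3$doc3  # should match the doc3 scores shown above

If that is still too slow, the per-document list produced by split() can be handed to parallel::mclapply() instead of lapply(), in line with the parallel-computing suggestion above.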