Question

I am doing text mining with the "tm" package in R, and after generating a document-term matrix I can get the word frequencies:

freq <- colSums(as.matrix(dtm))

ord <- order(freq)

freq[head(ord)]   
# abit   acal access accord across acsess     
#    1      1      1      1      1      1 

freq[tail(ord)]    
# direct   save  month   will  thank   list     
#    106    107    116    122    132    154

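This only shows me the word frequencies as an ordered list. Can I check the frequency of a single word on its own? Can I also check the frequency of a phrase? For example, how many times does the word "thank" appear in a text corpus, or how often does the phrase "contact number" appear in it?

Any hints and suggestions are much appreciated.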

Answer 1

I'll demonstrate with the sample data that ships with the tm package:

library(tm)
data(crude)
dtm <- as.matrix(DocumentTermMatrix(crude))

#find the column that contains the word "demand"
columnindices <- which(colnames(dtm)=="demand")

#how often does the word "demand" show up?
sum(dtm[,columnindices])
# [1] 6
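Since the question already builds a named frequency vector with colSums, a single word can also be looked up by name. The following is only a minimal sketch: `term_freq` is a hypothetical helper (not part of tm) that returns 0 instead of NA for words missing from the matrix.

```r
library(tm)
data(crude)

# Term frequencies summed over all documents, as a named numeric vector
dtm_crude <- DocumentTermMatrix(crude)
freq <- colSums(as.matrix(dtm_crude))

# Hypothetical helper: look up terms by name, reporting 0 for absent terms
term_freq <- function(freq, terms) {
  counts <- freq[terms]            # NA for names that are not in freq
  counts[is.na(counts)] <- 0
  names(counts) <- terms
  counts
}

term_freq(freq, c("demand", "notaword"))
```

Indexing by name avoids the intermediate which() call, and the NA handling makes it safe to pass a whole list of predetermined words at once.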

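If you want to do this with phrases, your dtm has to contain those phrases, not just the bag of single words that is used in most cases. If that data is available, the procedure is the same as for single words.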
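One way to get phrases into the matrix is to pass a custom tokenizer that emits two-word phrases (bigrams). The sketch below assumes the corpus is a VCorpus (as `crude` is) and relies on `words()` and `ngrams()` from the NLP package, which is attached together with tm. The names `BigramTokenizer`, `dtm2` and `phrase_freq` are illustrative, and "crude oil" is just an example phrase to look up.

```r
library(tm)   # also attaches NLP, which provides words() and ngrams()
data(crude)

# Tokenizer that emits two-word phrases (bigrams) instead of single words
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2L), paste, collapse = " "),
         use.names = FALSE)
}

# Build a document-term matrix whose columns are bigrams
dtm2 <- DocumentTermMatrix(crude, control = list(tokenize = BigramTokenizer))
phrase_freq <- colSums(as.matrix(dtm2))

# Look up one phrase; NA means the phrase is not a column of the matrix
phrase_freq["crude oil"]
```

Terms are lower-cased by default and punctuation stays attached to words unless it is removed beforehand, so the phrase has to be written the way it ends up in the column names.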
