Text mining with R: counting word frequencies

Time: 2013-12-19 07:53:51

Tags: r text-mining tm

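I want to count occurrences of the word "uncertainty", but only when "economic policy", "legislation", or other policy-related words appear in the same text. So far I have written R code that counts the frequency of every word in the texts, but it does not check whether the counted word appears in the right context. Do you have any suggestions for how to fix this?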

library(tm) #load text mining library
setwd('D:/3_MTICorpus') #sets R's working directory to near where my files are
ae.corpus<-Corpus(DirSource("D:/3_MTICorpus"),readerControl=list(reader=readPlain))
summary(ae.corpus) #check what went in
ae.corpus <- tm_map(ae.corpus, content_transformer(tolower)) # tm >= 0.6 needs base functions wrapped in content_transformer()
ae.corpus <- tm_map(ae.corpus, removePunctuation)
ae.corpus <- tm_map(ae.corpus, removeNumbers)
myStopwords <- c(stopwords('english'), "available", "via")
ae.corpus <- tm_map(ae.corpus, removeWords, myStopwords) # this stopword file is at C:\Users\[username]\Documents\R\win-library\2.13\tm\stopwords 
#library(SnowballC)
#ae.corpus <- tm_map(ae.corpus, stemDocument)

ae.dtm <- DocumentTermMatrix(ae.corpus, control = list(wordLengths = c(3, Inf))) # keep terms of 3 or more characters
inspect(ae.dtm)
findFreqTerms(ae.dtm, lowfreq=2)
findAssocs(ae.dtm, "economic", .7)
d <- c("economic", "uncertainty", "policy") # terms to keep; tm >= 0.6 takes a plain character vector as the dictionary
inspect(DocumentTermMatrix(ae.corpus, list(dictionary = d)))

1 answer:

Answer 0 (score: 0)

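You can convert the term-document matrix into a binary (0/1) matrix that records only whether each term occurs in each document: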

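tdm$v[tdm$v > 0] <- 1 # any positive count becomes 1 (term present)

tdm <- as.matrix(tdm) # dense matrix: rows are terms, columns are documents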
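
To make the 0/1 conversion concrete, here is a minimal base-R sketch on a hand-made count matrix. The matrix `counts`, its document names, and the `presence` variable are hypothetical stand-ins, not objects from the question; a real tm matrix is stored sparsely, and its `v` component holds the non-zero counts, which is why the answer assigns to `$v` before calling `as.matrix`.

```r
# Toy term-document counts (hypothetical data): rows are terms, columns are documents.
counts <- matrix(
  c(3, 0, 1,   # uncertainty
    0, 2, 0),  # economic_policy
  nrow = 2, byrow = TRUE,
  dimnames = list(c("uncertainty", "economic_policy"),
                  c("doc1", "doc2", "doc3"))
)

# Comparing against 0 gives a logical matrix; multiplying by 1L turns
# TRUE/FALSE into 1/0 while keeping the row and column names.
presence <- (counts > 0) * 1L

print(presence)
```

On a dense matrix this one-liner plays the same role as the two statements above; it is shown only to illustrate the idea, not as the answerer's exact method.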

Then you can simply cross-tabulate the two term rows with table:

table(tdm[which(rownames(tdm)=='uncertainty'),], tdm[which(rownames(tdm)=='economic_policy'),])

This should produce something like:

      0   1
  0 105  13
  1   7   5
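
The cross-tabulation counts documents, so it shows how often the two terms co-occur but not how many times "uncertainty" appears in the right context. The sketch below is one possible way to get that number from a count matrix, assuming the terms of interest exist as rows. The data, the `policy_terms` vector, and the variable names are hypothetical. Note that with tm's default single-word tokenization a row such as 'economic_policy' would only exist if the phrase had been joined into one token beforehand.

```r
# Toy term-document counts (hypothetical data): rows are terms, columns are documents.
counts <- matrix(
  c(2, 0, 1, 3,   # uncertainty
    1, 0, 0, 2,   # policy
    0, 1, 0, 0,   # legislation
    0, 4, 0, 1),  # weather
  nrow = 4, byrow = TRUE,
  dimnames = list(c("uncertainty", "policy", "legislation", "weather"),
                  paste0("doc", 1:4))
)

# Assumed list of context words; extend it with any policy-related terms.
policy_terms <- c("policy", "legislation")

# A document is "in context" if at least one policy term occurs in it.
has_policy <- colSums(counts[policy_terms, , drop = FALSE]) > 0

# Count "uncertainty" only in the documents that mention a policy term.
uncertainty_in_context <- sum(counts["uncertainty", has_policy])

# Document-level cross-tabulation, analogous to the table above.
cooccurrence <- table(uncertainty = counts["uncertainty", ] > 0,
                      policy = has_policy)

print(uncertainty_in_context)
print(cooccurrence)
```

With a real tm object, `counts` would be `as.matrix()` of a TermDocumentMatrix built without the 0/1 conversion, since the conversion discards the per-document counts.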