R: Finding word frequency in a single column with the tm package

Asked: 2015-02-25 15:20:34

Tags: r tm qdap

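I have been trying to find word frequencies in a single column of a data.frame in R using the tm package. The data.frame itself has many numeric and character columns, but I am only interested in one column of plain text. Cleaning the text works without any problems, but as soon as I try to pull out word frequencies with findFreqTerms(), I get the following error: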

Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

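I took this to mean that I need to convert my data into a DocumentTermMatrix or a TermDocumentMatrix, but since I am working with only a single column, I can't create either one. The error is as follows: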

> Test <- DocumentTermMatrix(Types)
Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "c('PlainTextDocument', 'TextDocument')"

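Is there a way to get frequency counts from a single column? I have pasted my full code below, with an explanation of each step I took. I appreciate any help you can give me.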

> # extracting the single column I wish to analyse from the data frame
  Types <-Expenses$Types
> # lower all cases
  Types <- tolower(Types)
> # remove punctuation
  Types <- removePunctuation(Types)
> # remove numbers
  Types <- removeNumbers(Types)
> # attempting to find word frequency
  findFreqTerms(Types)
Error: inherits(x, c("DocumentTermMatrix", "TermDocumentMatrix")) is not TRUE

2 Answers:

Answer 0 (score: 3)

You first need a corpus and a term-document matrix...

library(tm)
a <- c("hello man", "how's it going", "just fine")
a <- tolower(a)
a <- removePunctuation(a)
a <- removeNumbers(a)
myCorpus <- Corpus(VectorSource(a))
myTDM <- TermDocumentMatrix(myCorpus)
findFreqTerms(myTDM)
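
Note that findFreqTerms() returns only the terms, not their counts. The following is a minimal sketch of how the same idea might be applied to a data frame column to get actual counts. The Expenses data frame here is a made-up stand-in for the asker's data (its values are illustrative only), and setting wordLengths = c(1, Inf) is an assumption about what is wanted: by default tm drops terms shorter than three characters.

```r
library(tm)

# Hypothetical stand-in for the asker's data frame; only the text column matters.
Expenses <- data.frame(
  Amount = c(10, 20, 30),
  Types  = c("Hello man", "How's it going?", "Just fine, man"),
  stringsAsFactors = FALSE
)

# Build a corpus directly from the character column, then clean it.
corpus <- VCorpus(VectorSource(Expenses$Types))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)

# Keep short words too; the default wordLengths is c(3, Inf).
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))

# Each row of the matrix is a term; summing a row gives its total count.
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
freq

# Terms that occur at least twice across the whole column.
findFreqTerms(tdm, lowfreq = 2)
```

For a large column, as.matrix() can use a lot of memory, so this sketch is best suited to small or medium data.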

Answer 1 (score: 3)

If you use the qdap package, you can find word frequencies directly from a text variable:

library(qdap)
a <- c("hello man", "how's it going", "just fine", "really fine", "man o man!")
a <- tolower(a)
a <- removePunctuation(a)
a <- removeNumbers(a)
freq_terms(a) # there are several additional arguments
  WORD   FREQ
1 man       3
2 fine      2
3 going     1
4 hello     1
5 hows      1
6 it        1
7 just      1
8 o         1
9 really    1
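
As a cross-check that needs no extra packages, the same counts can be reproduced in base R. This is only a sketch: it uses gsub() with POSIX character classes in place of removePunctuation() and removeNumbers(), and splits on whitespace, which is an assumption about how words should be tokenized.

```r
a <- c("hello man", "how's it going", "just fine", "really fine", "man o man!")

# Lower-case, then strip punctuation and digits.
clean <- tolower(a)
clean <- gsub("[[:punct:]]", "", clean)
clean <- gsub("[[:digit:]]", "", clean)

# Split every element on whitespace and flatten into one vector of words.
words <- unlist(strsplit(clean, "\\s+"))
words <- words[nzchar(words)]  # drop empty strings left by the split

# Tabulate and sort by descending frequency.
freq <- sort(table(words), decreasing = TRUE)
freq
```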