Cleaning text with the tm package in R

Time: 2013-11-12 00:03:26

Tags: r nlp tm

I am trying to clean my text corpus with the tm package in R, but I keep getting this error:

no applicable method for 'removePunctuation' applied to an object of class "data.frame"

My data is a chat log read from a text file. In R it looks like this:

     V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.

I use:

tdm <- TermDocumentMatrix(text,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

But I get this error:

Error in UseMethod("TermDocumentMatrix", x) : 
  no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"

It seems I should not be passing a data frame to the function, but what else can I do?

Thanks

3 answers:

Answer 0: (score: 1)

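As @Martin Bel pointed out, this can also be done with qdap version 1.1.0. I have added some support to qdap to make it more compatible with the tm package, including the tdm function, which works fine here: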

First read in your data (I added a colon):

library(qdap)
dat <- read.transcript(text="ID    V1
1   In the process
2   Sorry I had to step away for a moment.
3   I am getting an error page that says QB is currently unavailable.
4   That link gives me the same error message.", header=TRUE, sep="   ")

# Make the term-document matrix:

tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))

# Do the same thing with the tm package:

TermDocumentMatrix(Corpus(VectorSource(dat[, 1])),
    control = list(
        removePunctuation = TRUE,
        stopwords = TRUE
    )
)

Answer 1: (score: 1)

You were very close. The quickest way is to build a corpus object with DataframeSource and then create a term-document matrix from it. Using your example:

Let's enter the data...

Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.

df <- data.frame(V1 = Text, stringsAsFactors = FALSE)

Now convert the data frame to a term-document matrix...

require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))

Now inspect the output...

inspect(tdm)
   A term-document matrix (14 terms, 4 documents)

Non-/sparse entries: 15/41
Sparsity           : 73%
Maximal term length: 11 
Weighting          : term frequency (tf)

             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
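
One caveat, as far as I am aware: newer tm releases (0.7 and later, I believe) expect the data frame passed to DataframeSource to have a `doc_id` column first and a `text` column second, so a one-column data frame like `df` may be rejected there. A minimal sketch under that assumption (the names `df2`, `mycorpus2` and `tdm2` are mine, not from the question):

```r
# Chat lines from the question
Text <- c("In the process",
          "Sorry I had to step away for a moment.",
          "I am getting an error page that says QB is currently unavailable.",
          "That link gives me the same error message.")

# Assumed layout for newer tm versions: doc_id first, text second
df2 <- data.frame(doc_id = as.character(seq_along(Text)),
                  text = Text,
                  stringsAsFactors = FALSE)

# Only run the tm part if the package is installed
if (requireNamespace("tm", quietly = TRUE)) {
  mycorpus2 <- tm::VCorpus(tm::DataframeSource(df2))
  tdm2 <- tm::TermDocumentMatrix(mycorpus2,
                                 control = list(removePunctuation = TRUE,
                                                stopwords = TRUE))
  print(dim(tdm2))  # number of terms, number of documents
}
```

On an older tm installation the original one-column call may still work as written, so treat this as a fallback rather than a replacement.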

Answer 2: (score: -1)

You just need to extract the text from the data frame with text[,1]:

tdm <- TermDocumentMatrix(text[,1],
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))
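
If `text[, 1]` comes back as a factor or a plain character vector rather than a corpus, one option that matches the tm call in the first answer is to convert it to character and wrap it with `VectorSource` before building the matrix. A rough sketch, assuming `text` is the one-column data frame from the question (rebuilt here so the snippet runs on its own; `chat_lines` and `corpus` are illustrative names):

```r
# Rebuild the question's data frame for a self-contained example
text <- data.frame(V1 = c("In the process",
                          "Sorry I had to step away for a moment.",
                          "I am getting an error page that says QB is currently unavailable.",
                          "That link gives me the same error message."))

# Pull out the column and force it to character (it may be a factor)
chat_lines <- as.character(text[, 1])

# Only run the tm part if the package is installed
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(chat_lines))
  tdm <- tm::TermDocumentMatrix(corpus,
                                control = list(removePunctuation = TRUE,
                                               stopwords = TRUE))
  print(dim(tdm))  # number of terms, number of documents
}
```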