I am trying to clean my text corpus using the tm package in R, but I keep getting this error:
no applicable method for 'removePunctuation' applied to an object of class "data.frame"
My data is a chat log read in from a text file, and in R it looks like this:
  V1
1 In the process
2 Sorry I had to step away for a moment.
3 I am getting an error page that says QB is currently unavailable.
4 That link gives me the same error message.
I use:
tdm <- TermDocumentMatrix(text,
control = list(removePunctuation = TRUE,
stopwords = TRUE))
But I get this error:
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "data.frame"
It seems I should not be passing a data frame to the function, but how else can I do this?
Thanks
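Answer 0 (score: 1)
As @Martin Bel pointed out, this can also be done with qdap version 1.1.0. I have added some support to qdap to make it more compatible with the tm package, including the tdm function, which works here:
First read in your data (I added colons):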
library(qdap)
dat <- read.transcript(text="ID: V1
1: In the process
2: Sorry I had to step away for a moment.
3: I am getting an error page that says QB is currently unavailable.
4: That link gives me the same error message.", header=TRUE, sep=":")
# Make a term-document matrix:
tdm(dat$V1, id(dat), stopwords=tm::stopwords("en"))
# Do the same with the tm package:
TermDocumentMatrix(Corpus(VectorSource(dat$V1)),
control = list(
removePunctuation = TRUE,
stopwords = TRUE
)
)
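Answer 1 (score: 1)
You are very close. The quickest way is to use DataframeSource to make a corpus object and then create a term-document matrix from it. Using your example:
Let's enter the data...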
Text <- readLines(n=4)
In the process
Sorry I had to step away for a moment.
I am getting an error page that says QB is currently unavailable.
That link gives me the same error message.
df <- data.frame(V1 = Text, stringsAsFactors = FALSE)
Now convert the data frame to a term-document matrix...
require(tm)
mycorpus <- Corpus(DataframeSource(df))
tdm <- TermDocumentMatrix(mycorpus, control = list(removePunctuation = TRUE, stopwords = TRUE))
Now inspect the output...
inspect(tdm)
A term-document matrix (14 terms, 4 documents)
Non-/sparse entries: 15/41
Sparsity : 73%
Maximal term length: 11
Weighting : term frequency (tf)
             Docs
Terms         1 2 3 4
  away        0 1 0 0
  currently   0 0 1 0
  error       0 0 1 1
  getting     0 0 1 0
  gives       0 0 0 1
  link        0 0 0 1
  message     0 0 0 1
  moment      0 1 0 0
  page        0 0 1 0
  process     1 0 0 0
  says        0 0 1 0
  sorry       0 1 0 0
  step        0 1 0 0
  unavailable 0 0 1 0
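A note for anyone on a more recent tm release: as far as I know, newer versions of DataframeSource expect the data frame to have a first column named doc_id and a second column named text, so a data frame with only a V1 column may be rejected. The following is a minimal sketch under that assumption. The names chat, df2, corpus2 and tdm2 are made up for illustration and are not from the original answer.

```r
library(tm)

# The same four chat lines used above
chat <- c("In the process",
          "Sorry I had to step away for a moment.",
          "I am getting an error page that says QB is currently unavailable.",
          "That link gives me the same error message.")

# Assumption: newer tm wants columns named doc_id and text, in that order
df2 <- data.frame(doc_id = as.character(seq_along(chat)),
                  text = chat,
                  stringsAsFactors = FALSE)

# Build the corpus from the data frame, then the term-document matrix
corpus2 <- VCorpus(DataframeSource(df2))
tdm2 <- TermDocumentMatrix(corpus2,
                           control = list(removePunctuation = TRUE,
                                          stopwords = TRUE))
inspect(tdm2)
```

If renaming columns is not convenient, wrapping the text column in VectorSource, as in the other answer, avoids the column-name requirement altogether.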
Answer 2 (score: -1)
You just need to pass the column itself, by doing text[,1]:
tdm <- TermDocumentMatrix(text[,1],
control = list(removePunctuation = TRUE,
stopwords = TRUE))
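Before relying on text[,1], it may be worth checking what that expression actually returns. Depending on how the file was read, the column can be a factor rather than a character vector. The sketch below uses a hypothetical two-row stand-in for the data frame from the question, built with stringsAsFactors = TRUE to show the factor case, and the name chat_lines is made up for illustration.

```r
# Hypothetical stand-in for the data frame from the question
text <- data.frame(V1 = c("In the process",
                          "Sorry I had to step away for a moment."),
                   stringsAsFactors = TRUE)

class(text)       # class of the whole object (a data frame)
class(text[, 1])  # class of the single column (a factor in this stand-in)

# Convert explicitly so later steps receive plain character strings
chat_lines <- as.character(text[, 1])
```

The resulting chat_lines vector can then be wrapped with VectorSource and Corpus as shown in the other answers.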