In R

Time: 2018-04-17 04:29:16

Tags: r text-mining information-retrieval term-document-matrix

I already have a term-document matrix and want to add a new document to it; in other words, I want to join two term-document matrices.

My term-document matrix is:

     Docs
Term   1
eat    7
food   2
run    2
sick   3

The other document is: watch football match and eat food

After this process, I want my term-document matrix to be:

         Docs
Term     1   2
eat      7   1
food     2   1
run      2   0
sick     3   0
watch    0   1
football 0   1
match    0   1
and      0   1
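
To make the target concrete, here is a minimal base-R sketch of the same merge on plain matrices. It is not the tm approach from the question: it skips stemming and stop word removal, and the names `tdm_old`, `new_doc`, and `merged` are made up for illustration.

```r
# Existing counts: one row per term, one column per document.
tdm_old <- matrix(c(7, 2, 2, 3), ncol = 1,
                  dimnames = list(Terms = c("eat", "food", "run", "sick"),
                                  Docs = "1"))

# Split the new document on spaces and count each term.
new_doc <- "watch football match and eat food"
tokens <- strsplit(new_doc, " ", fixed = TRUE)[[1]]
counts <- table(tokens)

# Vocabulary of the merged matrix: old terms first, then unseen new terms.
all_terms <- union(rownames(tdm_old), unique(tokens))

# Start from zeros so terms missing from a document get a count of 0.
merged <- matrix(0, nrow = length(all_terms), ncol = 2,
                 dimnames = list(Terms = all_terms, Docs = c("1", "2")))
merged[rownames(tdm_old), "1"] <- tdm_old[, "1"]
merged[names(counts), "2"] <- as.numeric(counts)

print(merged)
```

The zero-filled matrix is what produces the 0 entries for terms that occur in only one of the two documents.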

I tried this:

library("SnowballC")
library("NLP")
library("tm")
library("lsa")

# mytermdm is the term-document matrix I already have

text2 <- "watch football match and eat food"
myCorpus <- Corpus(VectorSource(text2))

# Build a term-document matrix for the new document
tdm2 <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))

# Combine the old and the new matrix
mytdm3 <- c(mytermdm, tdm2)
inspect(mytdm3)

I get:

TermDocumentMatrix (terms: 7, document:2)

Error in `[.simple_triplet_matrix`(x,terms,doc)
    Repeated indices currently not allowed.
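
The answer below points to clashing document names as the cause: both matrices call their only document 1. A small sketch of how such a clash could be detected before combining, using plain character vectors as stand-ins (the values of `docs_old` and `docs_new` are assumptions; with the real objects they would be `colnames(mytermdm)` and `colnames(tdm2)`):

```r
# Document names of the two matrices (assumed values for illustration).
docs_old <- c("1")
docs_new <- c("1")

# Names present in both matrices would be duplicated after combining.
clash <- intersect(docs_old, docs_new)

if (length(clash) > 0) {
  message("Duplicate document names: ", paste(clash, collapse = ", "))
}
```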

1 Answer:

Answer 0: (score: 0)

I solved it: before combining the two term-document matrices, I replaced the document name in tdm2. So the complete procedure is:

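library("SnowballC")
library("NLP")
library("tm")
library("lsa")

# mytermdm is the term-document matrix I already have

text2 <- "watch football match and eat food"
myCorpus <- Corpus(VectorSource(text2))

# Build a term-document matrix for the new document
tdm2 <- TermDocumentMatrix(myCorpus, control = list(
  removeNumbers = TRUE,
  removePunctuation = TRUE,
  stopwords = stopwords_en,
  stemming = TRUE
))

# My added fix: rename the new document to the next unused number.
# Convert the names to numbers before taking the maximum, otherwise
# they are compared as strings.
colnames(tdm2) <- as.character(max(as.numeric(colnames(mytermdm))) + 1)

# Combine the old and the new matrix
mytdm3 <- c(mytermdm, tdm2)
inspect(mytdm3)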
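
One detail worth noting when computing the next document name: column names are character strings, so the order of `max` and `as.numeric` matters once there are ten or more documents. A small sketch with hypothetical document names (`doc_ids` is made up for illustration):

```r
# Hypothetical document names, stored as strings like column names are.
doc_ids <- c("1", "2", "9", "10")

# Comparing strings: "9" sorts after "10", so this is not the largest number.
max_as_text <- max(doc_ids)

# Comparing numbers: convert first, then take the maximum.
max_as_number <- max(as.numeric(doc_ids))

# Next unused document name, converted back to a string.
next_id <- as.character(max_as_number + 1)

print(max_as_text)
print(max_as_number)
print(next_id)
```

Here the string comparison yields "9" while the numeric one yields 10, so only the second leads to the unused name "11".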