Inspecting a specific document in a DocumentTermMatrix for specific terms

Time: 2014-12-08 18:20:20

Tags: r text-mining tm

I am using R's tm package for text mining. Here is my code:

library(tm)

# Load the data into R
pathToData = "R/group_data"
newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                    readerControl = list(reader = readPlain))

# Length of the news corpus
length(newsCorpus)

# Preprocess the corpus data
newsCorpus = tm_map(newsCorpus,removePunctuation)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus,removeNumbers)
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
newsCorpus[["103806"]]

newsCorpus = tm_map(newsCorpus, stripWhitespace)
newsCorpus[["103806"]]

# Convert corpus elements to plain text
newsCorpus = Corpus(VectorSource(newsCorpus))

# Document-term matrix with TF-IDF weighting
docTermMatrix = DocumentTermMatrix(newsCorpus,
                                   control = list(weighting = weightTfIdf,
                                                  minWordLength = 1,
                                                  minDocFreq = 1))

# Dimensions of the resulting matrix
dim(docTermMatrix)

docTermMatrix looks like this:

<<DocumentTermMatrix (documents: 1986, terms: 22213)>>
 Non-/sparse entries: 173995/43941023
 Sparsity           : 100%
 Maximal term length: 163
 Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

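Now I want to inspect docTermMatrix for document "101287" and look up the terms "textmining" and "clustering". But because the document-term matrix has renamed the documents (rows) to 1, 2, 3, 4, ..., I can no longer find the document named "101287" and look at its "textmining" and "clustering" columns. Is there a way to preserve the document names? Apologies if I am missing something obvious.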

R output of the code above:

> library(tm)
> pathToData = "R/group_data"
> newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
                      readerControl = list(reader = readPlain))

> length(newsCorpus)
[1] 1986

> newsCorpus[["103806"]]
  <<PlainTextDocument (metadata: 7)>>
  From: cheekeen@tartarus.uwa.edu.au (Desmond Chan)
  Subject: Re: Honda clutch chatter
  Organization: The University of Western Australia
  Lines: 8
  NNTP-Posting-Host: tartarus.uwa.edu.au
  X-Newsreader: NN version 6.4.19 #1

  I also experience this kinda problem in my 89 BMW 318. During cold
  start ups, the clutch seems to be sticky and everytime i drive out, for
  about 5km, the clutch seems to stick onto somewhere that if i depress
  the clutch, the whole chassis moves along. But after preheating, it
  becomes smooth again. I think that your suggestion of being some
  humudity is right but there should be some remedy. I also found out that
  my clutch is already thin but still alright for a couple grand more!

> newsCorpus = tm_map(newsCorpus, removePunctuation)
> newsCorpus = tm_map(newsCorpus, removeNumbers)
> newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
> newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
> newsCorpus = tm_map(newsCorpus, stripWhitespace)

> newsCorpus = Corpus(VectorSource(newsCorpus))

> docTermMatrix = DocumentTermMatrix(newsCorpus, control = list(weighting = weightTfIdf, minWordLength = 1, minDocFreq = 1))

> dim(docTermMatrix)
[1]  1986 22213

> inspect(docTermMatrix["1", "bmw"])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity           : 100%
Maximal term length: 3
Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)

    Terms
Docs bmw
  1   0

> inspect(docTermMatrix["103806", "bmw"])
Error in `[.simple_triplet_matrix`(docTermMatrix, "103806", "bmw") : 
Subscript out of bounds.

1 Answer:

Answer 0: (score: 0)

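You basically want to encode your document IDs in the document-term matrix. You can store them as an attribute in the text corpus. Take a look at this more detailed answer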
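
One way to carry the IDs through might look like the sketch below. It is not the linked answer's code. It assumes the IDs can be captured as a character vector before the documents get renumbered, and it writes them back onto the rows of the matrix with `rownames<-`. The two-document corpus and the names `texts`, `docIds`, `dtm` and `hits` are made up for illustration, and the default term-frequency weighting is used to keep the example small.

```r
library(tm)

# Hypothetical stand-in for the newsgroup files: two short texts whose
# names play the role of the original document IDs.
texts <- c("101287" = "textmining and clustering of news posts",
           "103806" = "my bmw clutch sticks during cold starts")

# Save the IDs before any step that may renumber the documents.
docIds <- names(texts)

corpus <- VCorpus(VectorSource(texts))
dtm <- DocumentTermMatrix(corpus)

# Put the saved IDs back on the rows (the Docs dimension).
rownames(dtm) <- docIds

# Rows can now be addressed by document ID and columns by term.
hits <- as.matrix(dtm["101287", c("textmining", "clustering")])
print(hits)
```

In the question's pipeline the same idea would mean running `docIds <- names(newsCorpus)` just before `Corpus(VectorSource(newsCorpus))` and `rownames(docTermMatrix) <- docIds` after building the matrix, assuming `names(newsCorpus)` returns the file-based IDs in the installed tm version. Since `tolower` is already wrapped in `content_transformer`, it may also be possible to skip the re-wrapping step altogether, which would avoid the renumbering in the first place.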