I am using R's tm package for text mining. Here is my code:
library(tm)
# Load the data in R
pathToData = "R/group_data"
newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
readerControl = list(reader = readPlain))
# Length of the news corpus
length(newsCorpus)
# Preprocess the corpus data
newsCorpus = tm_map(newsCorpus,removePunctuation)
newsCorpus[["103806"]]
newsCorpus = tm_map(newsCorpus,removeNumbers)
newsCorpus[["103806"]]
newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
newsCorpus[["103806"]]
newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
newsCorpus[["103806"]]
newsCorpus = tm_map(newsCorpus, stripWhitespace)
newsCorpus[["103806"]]
# Convert the corpus elements to plain text
newsCorpus = Corpus(VectorSource(newsCorpus))
# Document-term matrix with TF-IDF weighting
docTermMatrix = DocumentTermMatrix(newsCorpus,
control = list(weighting = weightTfIdf,
minWordLength = 1,
minDocFreq = 1))
# Dimensions of the resulting matrix
dim(docTermMatrix)
docTermMatrix looks like this:
<<DocumentTermMatrix (documents: 1986, terms: 22213)>>
Non-/sparse entries: 173995/43941023
Sparsity : 100%
Maximal term length: 163
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
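Now I want to inspect docTermMatrix for document "101287" and look up the terms "textmining" and "clustering". But because the document-term matrix has renamed the documents (rows) to 1, 2, 3, 4, ..., I can no longer find the document named "101287" or look up its "textmining" and "clustering" columns. Is there a way to preserve the document names? Apologies if I am missing something obvious.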
> library(tm)
> pathToData = "R/group_data"
> newsCorpus = Corpus(DirSource(pathToData, recursive = TRUE),
readerControl = list(reader = readPlain))
> length(newsCorpus)
[1] 1986
> newsCorpus[["103806"]]
<<PlainTextDocument (metadata: 7)>>
From: cheekeen@tartarus.uwa.edu.au (Desmond Chan)
Subject: Re: Honda clutch chatter
Organization: The University of Western Australia
Lines: 8
NNTP-Posting-Host: tartarus.uwa.edu.au
X-Newsreader: NN version 6.4.19 #1
I also experience this kinda problem in my 89 BMW 318. During cold
start ups, the clutch seems to be sticky and everytime i drive out, for
about 5km, the clutch seems to stick onto somewhere that if i depress
the clutch, the whole chassis moves along. But after preheating, it
becomes smooth again. I think that your suggestion of being some
humudity is right but there should be some remedy. I also found out that
my clutch is already thin but still alright for a couple grand more!
> newsCorpus = tm_map(newsCorpus,removePunctuation)
> newsCorpus = tm_map(newsCorpus,removeNumbers)
> newsCorpus = tm_map(newsCorpus, content_transformer(tolower))
> newsCorpus = tm_map(newsCorpus, removeWords, stopwords("english"))
> newsCorpus = tm_map(newsCorpus, stripWhitespace)
> newsCorpus = Corpus(VectorSource(newsCorpus))
> docTermMatrix = DocumentTermMatrix(newsCorpus, control = list(weighting = weightTfIdf,minWordLength = 1,minDocFreq = 1))
> dim(docTermMatrix)
[1] 1986 22213
> inspect(docTermMatrix["1","bmw"])
<<DocumentTermMatrix (documents: 1, terms: 1)>>
Non-/sparse entries: 0/1
Sparsity : 100%
Maximal term length: 3
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)
    Terms
Docs bmw
   1   0
> inspect(docTermMatrix["103806", "bmw"])
Error in `[.simple_triplet_matrix`(docTermMatrix, "103806", "bmw") :
Subscript out of bounds.
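
One possible explanation, sketched under assumptions: the numeric row names may come from the `Corpus(VectorSource(newsCorpus))` step, which rebuilds the corpus from scratch and may assign fresh sequential IDs in place of the file names. The example below skips that step and builds the matrix directly from the preprocessed corpus. The temporary directory, the two sample files, and their contents are hypothetical stand-ins for "R/group_data", and `VCorpus` is used in place of `Corpus` so that each document is read as a plain text document carrying its file name as its ID.

```r
library(tm)

# Hypothetical stand-in for "R/group_data": two tiny files named by document ID.
dataDir <- tempfile("group_data_demo")
dir.create(dataDir)
writeLines("I also have this problem with the clutch in my 89 BMW 318.",
           file.path(dataDir, "103806"))
writeLines("The clutch on my Honda chatters during cold starts.",
           file.path(dataDir, "101287"))

# Read the files; each document's ID is taken from its file name.
corpus <- VCorpus(DirSource(dataDir))

# Same preprocessing steps as in the question.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Assumption being tested: no Corpus(VectorSource(corpus)) re-wrap here,
# so the document IDs should carry through to the matrix rows.
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))

# List the row (document) names, then look up one cell by name.
Docs(dtm)
inspect(dtm["103806", "bmw"])
```

If `Docs(dtm)` lists the file names here, the same change (dropping the re-wrap line) may be worth trying on the full corpus; whether it behaves the same way can depend on the installed tm version.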