I am working with data in ".txt" format and trying to do text mining with the 'tm' library in R. My problem is that I always get a document-term matrix with 0% sparsity, regardless of whether the data set is large or small. I cannot get any word associations, and I cannot get a readable cluster dendrogram. When I try to analyse my data with a k-means cluster plot, I get an error message. Here is the code I used:
cname = file.path("F:","texts") #folder containing text data files
dir(cname)
library(tm)
docs <- Corpus(DirSource(cname))
## Preprocessing
docs <- tm_map(docs, removePunctuation)                 # remove punctuation
docs <- tm_map(docs, removeNumbers)                     # remove numbers
docs <- tm_map(docs, tolower)                           # convert to lowercase
docs <- tm_map(docs, removeWords, stopwords("english")) # remove stopwords
library(SnowballC)
docs <- tm_map(docs, stemDocument)                      # stem: remove common word endings
docs <- tm_map(docs, stripWhitespace)                   # strip extra whitespace
docs <- tm_map(docs, PlainTextDocument)
### Staging the Data
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(dtm)
tdm
freq <- colSums(as.matrix(dtm))
# removing sparse terms:
dtms <- removeSparseTerms(dtm, 0.1)
# Word Frequency
freq <- colSums(as.matrix(dtms))
### Term Correlations
findAssocs(dtm, c("young","politics"), corlimit=0.8)
### Hierarchal Clustering
dtms <- removeSparseTerms(dtm, 0.15)
library(cluster)
d <- dist(t(dtms), method="euclidean")
fit <- hclust(d=d, method="ward.D")
plot.new()
plot(fit, hang=-1)
groups <- cutree(fit, k=5) # "k=" defines the number of clusters used
rect.hclust(fit, k=5, border="red")
### K-means clustering
library(fpc)
library(cluster)
dtms <- removeSparseTerms(dtm, 0.15)
d <- dist(t(dtms), method="euclidean")
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
This is the output I get when I inspect the term-document matrix:
<<DocumentTermMatrix (documents: 1, terms: 1850)>>
Non-/sparse entries: 1850/0
Sparsity : 0%
Maximal term length: 23
Weighting : term frequency (tf)
The error when trying to produce the k-means cluster plot:
Error in plot.window(...) : need finite 'xlim' values
In addition: Warning message:
In sqrt(detA * pmax(0, yl2 - y^2)) : NaNs produced
The association output for any word is always empty, and the cluster dendrogram plot makes no sense:
$young
numeric(0)
$politics
numeric(0)
I have also attached the cluster dendrogram.
Answer 0 (score: 0)
Correlations and the other functions only work when the corpus contains more than one file. My corpus had only one file, so they produced no output. Thanks anyway!
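To illustrate the point, here is a minimal sketch, using hypothetical in-memory strings in place of the .txt files (the texts and terms below are invented for the demonstration): with more than one document in the corpus, the matrix gains nonzero sparsity and findAssocs can return results.

```r
library(tm)

# Hypothetical example documents; in practice each would be a separate
# .txt file in the source directory, giving one document per file.
texts <- c("young people discuss politics",
           "politics excites the young voters",
           "sports news and weather today")
docs <- Corpus(VectorSource(texts))   # three documents, not one

dtm <- DocumentTermMatrix(docs)
inspect(dtm)   # sparsity is now > 0%: not every term appears in every document

# With multiple documents, term correlations can be computed:
findAssocs(dtm, "politics", corlimit = 0.8)
```

Here "politics" and "young" co-occur in the first two documents but not the third, so their term-frequency vectors are correlated across documents and findAssocs reports the association. With a single document there is only one observation per term, every correlation is undefined, and numeric(0) comes back, exactly as in the question.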