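I need help figuring out what I am doing wrong. I created a corpus from 2 text files and built a DocumentTermMatrix from it, but findAssocs returns a correlation of 1 for every term, as if each word were being correlated with itself or all the words belonged to a single vector. I suspect it may be a delimiter issue when importing the corpus, but I can't work out what I'm doing wrong.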
setwd("C:/Users/kangom/Documents")
options(stringsAsFactors = FALSE)  # option name is "stringsAsFactors"; "stringAsFactors" is silently ignored
library(tm)
docs <- Corpus(DirSource("cmtxtmining"))
summary(docs)
Summary output:
#A corpus with 2 text documents
#The metadata consists of 2 tag-value pairs and a data frame
#Available tags are:
# create_date creator
#Available variables in the data frame are:
# MetaID
Then I ran:
docs <- tm_map(docs, tolower)
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, stripWhitespace)
dtm <- DocumentTermMatrix(docs)
dtms <- removeSparseTerms(dtm, 0.1)
inspect(dtms[1:2, 1:50])
Inspect output:
#A document-term matrix (2 documents, 50 terms)
#Non-/sparse entries: 100/0
#Sparsity : 0%
#Maximal term length: 20
#Weighting : term frequency (tf)
#                 Terms
#Docs             aaron abilities ability able accelerate accept accepting
#  cmstrenght.txt     2         4     119   16          1      4         1
#  cmweakness.txt     1         2      17   29         13      2         2
#                 Terms
#Docs             accepts accident accidents accomplish accomplishments
#  cmstrenght.txt      10      113        17          3               2
#  cmweakness.txt       2      105        37          2               2
#                 Terms
#Docs             accordingly account accountabilities accountability
#  cmstrenght.txt           1       1                2             54
#  cmweakness.txt           1       2                2            141
The list continues...
findAssocs(dtms, "account", corlimit=0.6) #(similar result with findAssocs(dtm, "account", corlimit=0.6))
# account
#able 1
#accelerate 1
#accepting 1
#accidents 1
#accountability 1
#accountable 1
#accountably 1
#achieve 1
#acquisition 1
#across 1
#acting 1
#action 1
#active 1
#activities 1
#adapting 1
#address 1
#addressing 1
#adjust 1
#aesb 1
#affect 1
#aggressive 1
#aggressively 1
#aggressiveness 1
#aim 1
#alex 1
#align 1
#aligned 1
#alignment 1
#alive 1
#allow 1
#...
#zero 1
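
One way to check whether the all-ones result comes from the import or simply from having only two documents is to compute the correlations by hand. The sketch below uses base R only and copies a few term counts from the inspect output above. It assumes that findAssocs is based on the Pearson correlation of term counts across documents; the names counts, others and assoc are made up for this illustration and are not part of tm.

```r
# Term counts copied from the inspect() output above, one value per document.
counts <- rbind(
  account        = c(1, 2),
  able           = c(16, 29),
  accountability = c(54, 141),
  aaron          = c(2, 1),
  ability        = c(119, 17)
)
colnames(counts) <- c("cmstrenght.txt", "cmweakness.txt")

# Pearson correlation of "account" with every other term across the documents.
# With only two observations per term, two points always lie on a straight
# line, so each result can only be +1 or -1 (or NA when a term has the same
# count in both documents and therefore zero variance).
others <- setdiff(rownames(counts), "account")
assoc <- sapply(others, function(term) cor(counts["account", ], counts[term, ]))
print(round(assoc, 2))
```

In this sketch, terms whose counts rise from the first document to the second (like account itself) come out at 1, and terms whose counts fall come out at -1. That would be consistent with the findAssocs listing above, which contains able and accountability but not aaron or ability, and it suggests that a two-document corpus cannot give correlations other than these extremes. Splitting the text into more documents (for example one per paragraph or comment) would be needed before the correlations become informative.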