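I have been using the tm package to run some text analysis. My problem is creating a matrix of frequent terms by document so that I can build a graph from it. I want the graph to show only the terms that occur more than 20 times.
How do I create this matrix?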
### Stage the Data
dtm <- DocumentTermMatrix(docs)
tdm <- TermDocumentMatrix(docs)
### Explore your data
freq <- colSums(as.matrix(dtm))
length(freq)
ord <- order(freq)
m <- as.matrix(dtm)
dim(m)
write.csv(m, file="DocumentTermMatrix.csv")
termDocMatrix <- as.matrix(tdm)
termDocMatrix
termDocMatrix must contain only the terms that occur more than 20 times.
Thanks.
Answer 0 (score: 1)
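You can use findFreqTerms on the DocumentTermMatrix to find the relevant terms, then use the result to subset the matrix. See the example below. After that, you can run regular matrix calculations on the subset.
Edit based on the OP's comment: added extra lines of code showing how this works for a TermDocumentMatrix.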
library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))
#Based on DocumentTermMatrix
dtm <- DocumentTermMatrix(crude)
# filter the documenttermmatrix to only include items with a frequency of 20 or more
dtm <- dtm[, findFreqTerms(dtm, lowfreq = 20)]
inspect(dtm)
<<DocumentTermMatrix (documents: 20, terms: 9)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

     Terms
Docs  bpd crude dlrs market mln oil opec prices reuter
  127   0     2    2      1   0   5    0      3      1
  144   4     0    0      3   4  12   13      5      1
  191   0     2    1      0   0   2    0      0      1
  194   0     3    2      0   0   1    0      0      1
  211   0     0    2      0   2   1    0      0      1
  236   7     2    2      0   4   7    6      5      1
  237   0     0    1      0   1   3    1      1      1
  242   0     0    0      2   0   3    2      2      1
  246   0     0    0      0   0   5    1      1      1
  248   2     0    4      8   3   9    6      9      1
  273   8     5    2      1   9   5    5      5      1
  349   0     2    0      1   0   4    2      1      1
  352   0     0    0      2   0   5    2      5      1
  353   2     2    0      0   0   4    4      2      1
  368   0     0    0      0   0   3    0      0      1
  489   0     0    1      0   3   4    0      2      1
  502   0     0    1      0   3   5    0      2      1
  543   0     2    5      0   0   3    0      2      1
  704   0     0    0      2   0   3    0      3      1
  708   0     1    0      0   2   1    0      0      1
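One detail worth noting: the question asks for terms that occur more than 20 times, while lowfreq = 20 keeps terms with a total of 20 or more. The following is a minimal sketch of the stricter filter, assuming the tm package is installed. It uses the bundled crude corpus without the cleaning steps for brevity, so the selected terms will differ from the output above; the names dtm_all, frequent_terms and dtm_gt20 are illustrative only.

```r
library(tm)

# Sample corpus shipped with tm (20 Reuters articles about crude oil).
data("crude")

dtm_all <- DocumentTermMatrix(crude)

# Total count of each term across all documents.
freq <- colSums(as.matrix(dtm_all))

# findFreqTerms() treats lowfreq as an inclusive lower bound,
# so lowfreq = 21 selects terms occurring strictly more than 20 times.
frequent_terms <- findFreqTerms(dtm_all, lowfreq = 21)

# Keep only the columns (terms) that passed the filter.
dtm_gt20 <- dtm_all[, frequent_terms]
dim(dtm_gt20)
```

For a TermDocumentMatrix the same idea should apply with the index on the row side, as in tdm[findFreqTerms(tdm, lowfreq = 21), ].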
#based on TermDocumentMatrix
tdm <- TermDocumentMatrix(crude)
# filter the termdocumentmatrix to only include items with a frequency of 20 or more
tdm <- tdm[findFreqTerms(tdm, lowfreq = 20), ]
inspect(tdm)
<<TermDocumentMatrix (terms: 9, documents: 20)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms    127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
  bpd      0   4   0   0   0   7   0   0   0   2   8   0   0   2   0   0   0   0   0   0
  crude    2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2   0   1
  dlrs     2   0   1   2   2   2   1   0   0   4   2   0   0   0   0   1   1   5   0   0
  market   1   3   0   0   0   0   0   2   0   8   1   1   2   0   0   0   0   0   2   0
  mln      0   4   0   0   2   4   1   0   0   3   9   0   0   0   0   3   3   0   0   2
  oil      5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4   5   3   3   1
  opec     0  13   0   0   0   6   1   2   1   6   5   2   2   4   0   0   0   0   0   0
  prices   3   5   0   0   0   5   1   2   1   9   5   1   5   2   0   2   2   2   3   0
  reuter   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
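To get from the filtered matrix to a graph, one option is to sum each term's row and plot the totals. The sketch below is base R only and assumes a simple bar chart is the kind of graph wanted, which the question does not specify. The counts are copied by hand from the inspect(tdm) output above so that the example runs without tm; with tm loaded, the same matrix would come from as.matrix(tdm). The names counts, term_totals and more_than_20 are illustrative.

```r
# Counts copied from the inspect(tdm) output above (9 terms x 20 documents).
doc_ids <- c("127", "144", "191", "194", "211", "236", "237", "242", "246", "248",
             "273", "349", "352", "353", "368", "489", "502", "543", "704", "708")

counts <- rbind(
  bpd    = c(0,  4, 0, 0, 0, 7, 0, 0, 0, 2, 8, 0, 0, 2, 0, 0, 0, 0, 0, 0),
  crude  = c(2,  0, 2, 3, 0, 2, 0, 0, 0, 0, 5, 2, 0, 2, 0, 0, 0, 2, 0, 1),
  dlrs   = c(2,  0, 1, 2, 2, 2, 1, 0, 0, 4, 2, 0, 0, 0, 0, 1, 1, 5, 0, 0),
  market = c(1,  3, 0, 0, 0, 0, 0, 2, 0, 8, 1, 1, 2, 0, 0, 0, 0, 0, 2, 0),
  mln    = c(0,  4, 0, 0, 2, 4, 1, 0, 0, 3, 9, 0, 0, 0, 0, 3, 3, 0, 0, 2),
  oil    = c(5, 12, 2, 1, 1, 7, 3, 3, 5, 9, 5, 4, 5, 4, 3, 4, 5, 3, 3, 1),
  opec   = c(0, 13, 0, 0, 0, 6, 1, 2, 1, 6, 5, 2, 2, 4, 0, 0, 0, 0, 0, 0),
  prices = c(3,  5, 0, 0, 0, 5, 1, 2, 1, 9, 5, 1, 5, 2, 0, 2, 2, 2, 3, 0),
  reuter = rep(1, 20)
)
colnames(counts) <- doc_ids

# Total occurrences of each term across all documents, largest first.
term_totals <- sort(rowSums(counts), decreasing = TRUE)

# Strictly more than 20 occurrences, as the question asks.
more_than_20 <- term_totals[term_totals > 20]
print(more_than_20)

# Draw the chart only in an interactive session, so that running this
# as a script does not write a plot file as a side effect.
if (interactive()) {
  barplot(more_than_20, las = 2,
          ylab = "Total occurrences",
          main = "Terms occurring more than 20 times")
}
```

In this data, market and reuter each total exactly 20, so the strict filter drops them and seven terms remain.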