How to build a term-document matrix of frequent terms from a set of texts

Date: 2016-01-25 11:38:56

Tags: r tm

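I have been using the tm package to run some text analysis. My problem is creating a term-document matrix that holds only the frequent terms, so that I can build a graph from it. I want the graph to show only the terms that appear more than 20 times.

How do I create this matrix?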

### Stage the Data      
dtm <- DocumentTermMatrix(docs)   
tdm <- TermDocumentMatrix(docs)   


### Explore your data      
freq <- colSums(as.matrix(dtm))   
length(freq)   
ord <- order(freq)   
m <- as.matrix(dtm)   
dim(m)  

write.csv(m, file="DocumentTermMatrix.csv")   
termDocMatrix <- as.matrix(tdm)
termDocMatrix

termDocMatrix must contain only the terms that occur more than 20 times.

Thanks.

1 Answer:

Answer 0 (score: 1)

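You can use findFreqTerms on the DocumentTermMatrix to find the relevant terms, then subset the matrix by those terms. See the example below. After that, you can do regular matrix calculations on the subset.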

Edit based on the OP's comment: added extra lines of code showing how this works for a TermDocumentMatrix.

library(tm)
data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))


#Based on DocumentTermMatrix
dtm <- DocumentTermMatrix(crude)
# filter the documenttermmatrix to only include items with a frequency of 20 or more
dtm <- dtm[, findFreqTerms(dtm, lowfreq = 20)]
inspect(dtm)

<<DocumentTermMatrix (documents: 20, terms: 9)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

     Terms
Docs  bpd crude dlrs market mln oil opec prices reuter
  127   0     2    2      1   0   5    0      3      1
  144   4     0    0      3   4  12   13      5      1
  191   0     2    1      0   0   2    0      0      1
  194   0     3    2      0   0   1    0      0      1
  211   0     0    2      0   2   1    0      0      1
  236   7     2    2      0   4   7    6      5      1
  237   0     0    1      0   1   3    1      1      1
  242   0     0    0      2   0   3    2      2      1
  246   0     0    0      0   0   5    1      1      1
  248   2     0    4      8   3   9    6      9      1
  273   8     5    2      1   9   5    5      5      1
  349   0     2    0      1   0   4    2      1      1
  352   0     0    0      2   0   5    2      5      1
  353   2     2    0      0   0   4    4      2      1
  368   0     0    0      0   0   3    0      0      1
  489   0     0    1      0   3   4    0      2      1
  502   0     0    1      0   3   5    0      2      1
  543   0     2    5      0   0   3    0      2      1
  704   0     0    0      2   0   3    0      3      1
  708   0     1    0      0   2   1    0      0      1

#based on TermDocumentMatrix
tdm <- TermDocumentMatrix(crude)
# filter the termdocumentmatrix to only include items with a frequency of 20 or more
tdm <- tdm[findFreqTerms(tdm, lowfreq = 20), ]

inspect(tdm)
<<TermDocumentMatrix (terms: 9, documents: 20)>>
Non-/sparse entries: 107/73
Sparsity           : 41%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms    127 144 191 194 211 236 237 242 246 248 273 349 352 353 368 489 502 543 704 708
  bpd      0   4   0   0   0   7   0   0   0   2   8   0   0   2   0   0   0   0   0   0
  crude    2   0   2   3   0   2   0   0   0   0   5   2   0   2   0   0   0   2   0   1
  dlrs     2   0   1   2   2   2   1   0   0   4   2   0   0   0   0   1   1   5   0   0
  market   1   3   0   0   0   0   0   2   0   8   1   1   2   0   0   0   0   0   2   0
  mln      0   4   0   0   2   4   1   0   0   3   9   0   0   0   0   3   3   0   0   2
  oil      5  12   2   1   1   7   3   3   5   9   5   4   5   4   3   4   5   3   3   1
  opec     0  13   0   0   0   6   1   2   1   6   5   2   2   4   0   0   0   0   0   0
  prices   3   5   0   0   0   5   1   2   1   9   5   1   5   2   0   2   2   2   3   0
  reuter   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
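
As a rough sketch of the "regular matrix calculations" step, the filtered TermDocumentMatrix can be converted to an ordinary matrix, summed per term, and plotted. Two things here are assumptions rather than part of the answer above: `lowfreq = 21` is used because the question asks for terms appearing more than 20 times (`lowfreq = 20` keeps terms appearing 20 times or more), and the `barplot` call with its labels is just one possible way to draw the graph.

```r
library(tm)
data("crude")

# Same preprocessing as in the answer above
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))
crude <- tm_map(crude, removePunctuation)
crude <- tm_map(crude, removeNumbers)
crude <- tm_map(crude, removeWords, stopwords("smart"))

tdm <- TermDocumentMatrix(crude)

# Assumption: "more than 20" is read strictly, so the lower bound is 21
tdm <- tdm[findFreqTerms(tdm, lowfreq = 21), ]

# Ordinary matrix: one row per term, one column per document
termDocMatrix <- as.matrix(tdm)

# Total occurrences of each term across all documents, largest first
freq <- sort(rowSums(termDocMatrix), decreasing = TRUE)

# Illustrative plot; the title and axis label are arbitrary choices
barplot(freq, las = 2,
        ylab = "Total occurrences",
        main = "Terms occurring more than 20 times")
```

For a DocumentTermMatrix the same idea applies with the dimensions swapped: subset the columns (`dtm[, findFreqTerms(dtm, lowfreq = 21)]`) and use `colSums` instead of `rowSums`.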