tm DocumentTermMatrix gives unexpected results for my corpus

Time: 2017-07-28 17:27:28

Tags: r text-mining tm term-document-matrix

Maybe I'm misunderstanding how tm::DocumentTermMatrix works. I have a corpus that, after preprocessing, looks like this:

head(Description.text, 3)
[1] "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram"                    
[2] "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur"     
[3] "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin"

which I process via:

Description.text.features <- DocumentTermMatrix(Corpus(VectorSource(Description.text)), list(
    bounds = list(local = c(3, Inf)),
    tokenize = 'scan'
))
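A likely explanation for the zeros (my reading of `?termFreq`, whose control options `DocumentTermMatrix` passes through, so treat this as an assumption to verify): `bounds = list(local = c(3, Inf))` discards, per document, every term that occurs fewer than 3 times in that document. In short documents like these almost nothing occurs 3 or more times, so the filter can empty an entire row. A base-R sketch of that rule applied to the first document:

```r
# Mimic tm's local bounds filter: keep a term in a document
# only if its count in that document is >= 3.
doc1 <- paste("azi sanitar local to1 presid osp martin presid ospedalier",
              "martin tofan torin tel possibil raggiung ospedal segu bus tram")
counts <- table(strsplit(doc1, " ")[[1]])
counts[counts >= 3]  # empty: no term reaches 3 occurrences in this document
max(counts)          # the most frequent terms (presid, martin) occur only twice
```

If that is indeed the culprit, relaxing the bound (e.g. `local = c(1, Inf)`) or dropping it entirely should restore the expected counts.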

When I inspect the first row of the DTM, I get this:

inspect(Description.text.features[1,])
<<DocumentTermMatrix (documents: 1, terms: 887)>>
Non-/sparse entries: 0/887
Sparsity           : 100%
Maximal term length: 15
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs banc camill mar martin ospedal presid san sanitar torin vittor
   1    0      0   0      0       0      0   0       0     0      0

These sample terms do not correspond to the first document of Description.text: banc, for example, does not appear in the first document at all, and even terms that do appear there, such as martin, are shown with zero counts.

Moreover, if I run:

Description.text.features[1,] %>% as.matrix() %>% sum

I get zero, suggesting that no term in the first document has a frequency greater than zero!

What is going on here?

Thanks.

Update

I wrote my own DTM function, and indeed it gives very different results. Besides the document-term weights being very different from tm's (mine are what you would expect given the corpus), my function also yields far more features (~3000 vs ~800 with tm::DocumentTermMatrix).

1 Answer:

Answer 0 (score: 1)

Here is a workaround using quanteda, an alternative to tm. You may even find that its relative simplicity, combined with its speed and power, makes it worth switching to for the rest of your analysis!

description.text <- 
  c("azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
    "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
    "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")

require(quanteda)
require(magrittr)

qdfm <- dfm(description.text)
head(qdfm, nfeat = 10)
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
# (showing first 3 documents and first 10 features)
#        features
# docs    azi sanitar local to1 presid osp martin ospedalier tofan torin
#   text1   1       1     1   1      2   1      2          1     1     1
#   text2   0       0     0   0      0   0      2          0     1     2
#   text3   0       0     0   0      0   0      2          0     0     0
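A version note, not part of the original answer: per my reading of the current quanteda documentation, in quanteda v3 and later `dfm()` no longer accepts a character vector directly, so the call above becomes a two-step tokenize-then-dfm pipeline:

```r
library(quanteda)

description.text <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")

# quanteda >= 3: tokenize first, then build the document-feature matrix
qdfm <- dfm(tokens(description.text))
```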

qdfm2 <- qdfm %>% dfm_trim(min_count = 3, min_docfreq = 3)
qdfm2
# Document-feature matrix of: 3 documents, 2 features (0% sparse).
# (showing first 3 documents and first 2 features)
#        features
# docs    martin ospedal
#   text1      2       1
#   text2      2       2
#   text3      2       2

Converting back to tm:

convert(qdfm2, to = "tm")
# <<DocumentTermMatrix (documents: 3, terms: 2)>>
# Non-/sparse entries: 6/0
# Sparsity           : 0%
# Maximal term length: 7
# Weighting          : term frequency (tf)

In your example you used tf-idf weighting. That is easy in quanteda too:
dfm_weight(qdfm, "tfidf") %>% head
# Document-feature matrix of: 3 documents, 35 features (56.2% sparse).
# (showing first 3 documents and first 6 features)
#          features
# docs          azi   sanitar     local       to1    presid       osp
#   text1 0.4771213 0.4771213 0.4771213 0.4771213 0.9542425 0.4771213
#   text2 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#   text3 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
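Two related renames in recent quanteda releases (again my reading of the current documentation, not the 2017 API used in this answer): `dfm_trim()`'s `min_count` argument became `min_termfreq`, and `dfm_weight(x, "tfidf")` was split out into a dedicated `dfm_tfidf()`:

```r
library(quanteda)

description.text <- c(
  "azi sanitar local to1 presid osp martin presid ospedalier martin tofan torin tel possibil raggiung ospedal segu bus tram",
  "torin croll controsoffitt repart pediatr martin mag cartell compars sest pian ospedal martin torin ospedal tofan sol due anno riapertur",
  "ospedal martin croll controsoffitt repart pediatr mag ospedal martin croll controsoffitt repart pediatr distacc intonac avven nott mattin")
qdfm <- dfm(tokens(description.text))

# was: dfm_trim(qdfm, min_count = 3, min_docfreq = 3)
qdfm2 <- dfm_trim(qdfm, min_termfreq = 3, min_docfreq = 3)

# was: dfm_weight(qdfm, "tfidf")
qtfidf <- dfm_tfidf(qdfm)
```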