tm package version 0.7 does not preserve intra-word dashes in DocumentTermMatrix

Time: 2018-01-02 11:56:18

Tags: r

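The behavior of the tm package changed between versions 0.6-2 and 0.7-x. In the new version, DocumentTermMatrix no longer preserves intra-word dashes. Is this a bug, or is there a new option to enforce it? Below is an example using the two versions, installed in different paths. I am running R 3.3.3.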

> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
> 
> two_strings <- c(string1, string2)
> 
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity           : 50%
Maximal term length: 18
Weighting          : term frequency (tf)

    Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
   1        1   1        0    2             0        1       1                0
   2        0   0        1    0             1        0       0                1
    Terms
Docs management project project-management
   1          1       1                  0
   2          0       0                  1

So with the old version 0.6-2, the dashes in the second string are correctly preserved. With the new version 0.7-3:

> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs analysis big data learning machine management project
   1        1   1    2        1       1          1       1
   2        1   1    2        1       1          1       1

I tried to force the dashes to be preserved as follows, but to no avail:

> dtm_test <- DocumentTermMatrix(myCorpus, 
+            control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs analysis big data learning machine management project
   1        1   1    2        1       1          1       1
   2        1   1    2        1       1          1       1

Any suggestions? Thanks!

1 answer:

Answer 0: (score: 0)

The answer comes from the tm author himself, Dr. Ingo Feinerer. Thank you! Reproduced here:

Since 0.7 the default corpus is a "SimpleCorpus" (if supported; that depends on the source). See ?SimpleCorpus, which triggers a certain behavior (see ?TermDocumentMatrix).

Use VCorpus instead of Corpus to enforce the old behavior:

inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
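To check which corpus class you actually end up with, something like the following should work. It is only a sketch, assuming tm 0.7 or later is installed in the default library; the variable names (simple_corpus, volatile_corpus, and so on) are mine and not part of the original answer.

```r
library(tm)

two_strings <- c("big data data analysis machine learning project management",
                 "big-data data-analysis machine-learning project-management")

# With tm >= 0.7, Corpus() returns a SimpleCorpus when the source supports it
simple_corpus <- Corpus(VectorSource(two_strings))
# VCorpus() always builds a volatile corpus, as Corpus() did in 0.6-2
volatile_corpus <- VCorpus(VectorSource(two_strings))

class(simple_corpus)
class(volatile_corpus)

# Compare the vocabularies of the two document-term matrices
terms_simple <- Terms(DocumentTermMatrix(simple_corpus))
terms_volatile <- Terms(DocumentTermMatrix(volatile_corpus))

# Terms that only appear when the corpus is a VCorpus
setdiff(terms_volatile, terms_simple)
```

If the explanation above holds, the last call should list only the hyphenated terms.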

Going back to the example above, now using VCorpus:

> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity           : 50%
Maximal term length: 18
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
   1        1   1        0    2             0        1       1                0
   2        0   0        1    0             1        0       0                1
    Terms
Docs management project
   1          1       1
   2          0       0
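As far as I can tell from ?TermDocumentMatrix, a SimpleCorpus is always processed with a fixed built-in tokenizer and does not take custom options, which would explain why the removePunctuation setting in the question had no effect. The base-R sketch below only mimics the two outcomes (splitting on whitespace versus splitting on any non-alphanumeric character). It is not tm's actual code, and both helper functions are hypothetical names I made up for illustration.

```r
string2 <- "big-data data-analysis machine-learning project-management"

# Hypothetical helper: split on whitespace only, so hyphenated words survive
split_on_whitespace <- function(x) {
  strsplit(x, "[[:space:]]+")[[1]]
}

# Hypothetical helper: split on anything that is not a letter or digit,
# so "big-data" falls apart into "big" and "data"
split_on_punctuation <- function(x) {
  tokens <- strsplit(x, "[^[:alnum:]]+")[[1]]
  tokens[nzchar(tokens)]
}

whitespace_tokens <- split_on_whitespace(string2)
punctuation_tokens <- split_on_punctuation(string2)

# Term counts under each splitting rule
table(whitespace_tokens)
table(punctuation_tokens)
```

The second table counts "data" twice for string2, which matches the dtm_0.7 output in the question.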