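The behavior of the tm package changed between versions 0.6-2 and 0.7-x. In the new version, DocumentTermMatrix no longer preserves intra-word dashes. Is this a bug, or is there a new option to enforce the old behavior? Below is an example using the two versions, installed in different paths. I am running R 3.3.3.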
> string1 <- "big data data analysis machine learning project management"
> string2 <- "big-data data-analysis machine-learning project-management"
>
> two_strings <- c(string1, string2)
>
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.6-2")
> myCorpus <- Corpus(VectorSource(two_strings))
> dtm_0.6 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.6)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project project-management
1 1 1 0
2 0 0 1
So with the old version, 0.6-2, the dashes in the second string are preserved correctly. With the new version, 0.7-3:
> detach("package:tm", unload=TRUE)
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> dtm_0.7 <- DocumentTermMatrix(myCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
I tried to force the dashes to be preserved as follows, but to no avail:
> dtm_test <- DocumentTermMatrix(myCorpus,
+ control = list(removePunctuation = list(preserve_intra_word_dashes = TRUE)))
> inspect(dtm_test)
<<DocumentTermMatrix (documents: 2, terms: 7)>>
Non-/sparse entries: 14/0
Sparsity : 0%
Maximal term length: 10
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big data learning machine management project
1 1 1 2 1 1 1 1
2 1 1 2 1 1 1 1
Any suggestions? Thanks!
Answer 0 (score: 0)
The answer comes from the tm author himself, Dr. Ingo Feinerer. Thank you! Reproduced here:
Since 0.7 the default corpus is a "SimpleCorpus" (if supported; that depends on the source). See ?SimpleCorpus,
which triggers a certain behavior (see ?TermDocumentMatrix).
Use VCorpus instead of Corpus to enforce the old behavior:
inspect(TermDocumentMatrix(Corpus(VectorSource(two_strings))))
inspect(TermDocumentMatrix(VCorpus(VectorSource(two_strings))))
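To see which corpus class each constructor actually returns, a quick check along these lines may help. This is a minimal sketch that assumes tm 0.7 or later is the version on the default library path; the variable names `simple_corpus` and `v_corpus` are mine, not from the original post.

```r
library(tm)

two_strings <- c("big data data analysis machine learning project management",
                 "big-data data-analysis machine-learning project-management")

# Corpus() chooses an implementation depending on the source;
# VCorpus() asks for a volatile corpus explicitly.
simple_corpus <- Corpus(VectorSource(two_strings))
v_corpus <- VCorpus(VectorSource(two_strings))

# Print the class vectors so the difference is visible.
print(class(simple_corpus))
print(class(v_corpus))
```

If the first line of output mentions SimpleCorpus, the DocumentTermMatrix call is going through the code path that drops the dashes.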
Returning to the example above, now using VCorpus:
> library("tm", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.3/tm_0.7-3")
> myVCorpus <- VCorpus(VectorSource(two_strings))
> dtm_0.7 <- DocumentTermMatrix(myVCorpus)
> inspect(dtm_0.7)
<<DocumentTermMatrix (documents: 2, terms: 11)>>
Non-/sparse entries: 11/11
Sparsity : 50%
Maximal term length: 18
Weighting : term frequency (tf)
Sample :
Terms
Docs analysis big big-data data data-analysis learning machine machine-learning
1 1 1 0 2 0 1 1 0
2 0 0 1 0 1 0 0 1
Terms
Docs management project
1 1 1
2 0 0
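Note that the header above reports 11 terms while the printed table shows only 10. As far as I can tell, inspect() in 0.7 prints a sample of the matrix rather than all of it, which would explain why project-management is not displayed. To confirm that every hyphenated term survived, the term list can be checked directly instead of relying on the printed sample. A minimal sketch, assuming tm 0.7 or later; `dtm_v` and `hyphenated` are illustrative names:

```r
library(tm)

two_strings <- c("big data data analysis machine learning project management",
                 "big-data data-analysis machine-learning project-management")

# Build the matrix from a VCorpus, as suggested in the answer.
dtm_v <- DocumentTermMatrix(VCorpus(VectorSource(two_strings)))

# Terms() returns the full term vector, not just a printed sample.
hyphenated <- grep("-", Terms(dtm_v), fixed = TRUE, value = TRUE)
print(hyphenated)

# Check one specific term explicitly.
print("project-management" %in% Terms(dtm_v))
```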