TermDocumentMatrix performs unsolicited cleaning (e.g. removal of punctuation)

Asked: 2017-05-07 02:35:13

Tags: r tm

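Based on my reading of the documentation, the TermDocumentMatrix function of the tm package is not behaving as expected. It appears to process the terms in ways I did not ask for.

Here is an example: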

require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf), 
                                                 removePunctuation = FALSE))
rownames(tdm)

The output shows that punctuation has been removed and that the expression "rising...what" has been split:

 [1] "a"         "about"     "am"        "and"       "astrology" "cap"       "capricorn" "does"      "i"         "me"        "moon"      "rising"    "say"       "sun"       "that"     
[16] "what"  

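In a related SO question, the problem was that the tokenizer was removing punctuation. However, I am using the default words tokenizer, which I do not believe does this: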

> sapply(corpus, words)
      [,1]           
 [1,] "Astrology:"   
 [2,] "I"            
 [3,] "am"           
 [4,] "a"            
 [5,] "Capricorn"    
 [6,] "Sun"          
 [7,] "Cap"          
 [8,] "moon"         
 [9,] "and"          
[10,] "cap"          
[11,] "rising...what"
[12,] "does"         
[13,] "that"         
[14,] "say"          
[15,] "about"        
[16,] "me?" 

Is the observed behavior incorrect, or what am I misunderstanding?

1 Answer:

Answer 0: (score: 3)

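You have a SimpleCorpus object, a class that came with tm package version 0.7. According to ?SimpleCorpus, it:

  "takes internally various shortcuts to boost performance and minimize memory pressure"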

class(corpus)
# [1] "SimpleCorpus" "Corpus"  

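Now, as help(TermDocumentMatrix) states:

  "Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp)..."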

So you are not actually using words as the tokenizer. If you were, it would indeed give you:

words(sentence)
 [1] "Astrology:"    "I"             "am"            "a"             "Capricorn"     "Sun"           "Cap"          
 [8] "moon"          "and"           "cap"           "rising...what" "does"          "that"          "say"          
[15] "about"         "me?"  
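To make the difference concrete, here is a minimal base R sketch that does not use tm at all. It contrasts splitting on whitespace only (similar in spirit to what words returns above) with splitting on whitespace and punctuation together. The second regular expression is only an assumption chosen to reproduce the terms shown in the question; it is not tm's actual Boost-based implementation.

```r
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Split on whitespace only: punctuation stays attached to the tokens.
ws_tokens <- strsplit(sentence, "[[:space:]]+")[[1]]

# Split on whitespace AND punctuation (illustrative assumption, not tm's code):
# "rising...what" becomes two tokens and trailing punctuation disappears.
punct_tokens <- strsplit(sentence, "[[:space:][:punct:]]+")[[1]]

# Lower-case and deduplicate, as a term-document matrix would.
punct_terms <- sort(unique(tolower(punct_tokens)))

print(ws_tokens)
print(punct_terms)
```

The second result contains "rising" and "what" as separate terms and no punctuation, which matches the row names reported in the question.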

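As mentioned in the comments, you can explicitly make your corpus a volatile one (see ?VCorpus) to get full flexibility:

  "A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object"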

corpus <- VCorpus(VectorSource(sentence)) 
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
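
As a rough way to check which code path applies on a given installation, the sketch below builds both kinds of corpus from the same sentence and compares the resulting terms. It assumes tm 0.7 or later is installed. The variable names (simple_corpus, volatile_corpus and so on) are made up for illustration, and the SimpleCorpus result may differ between tm versions, so no output is shown for it.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
ctrl <- list(wordLengths = c(1, Inf), removePunctuation = FALSE)

# Corpus() on a VectorSource yields a SimpleCorpus in tm >= 0.7.
simple_corpus <- Corpus(VectorSource(sentence))
is_simple <- inherits(simple_corpus, "SimpleCorpus")

# VCorpus() goes through termFreq(), so the tokenize option is honoured.
volatile_corpus <- VCorpus(VectorSource(sentence))

simple_terms <- Terms(TermDocumentMatrix(simple_corpus, control = ctrl))
volatile_terms <- Terms(TermDocumentMatrix(volatile_corpus,
                                           control = c(ctrl, list(tokenize = "words"))))

# Terms that survive only when the words tokenizer is actually used.
print(is_simple)
print(setdiff(volatile_terms, simple_terms))
```

With the VCorpus, terms such as "rising...what" and "astrology:" should keep their punctuation, because removePunctuation is FALSE and the words tokenizer splits on whitespace only.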