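As I understand the documentation, the TermDocumentMatrix function of the tm package is not behaving correctly. It seems to be processing the terms in ways I did not ask for.
Here is an example: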
require(tm)
sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"
corpus <- Corpus(VectorSource(sentence))
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf),
removePunctuation = FALSE))
rownames(tdm)
We can see from the output that the punctuation has been removed and that the expression "rising...what" has been split:
[1] "a" "about" "am" "and" "astrology" "cap" "capricorn" "does" "i" "me" "moon" "rising" "say" "sun" "that"
[16] "what"
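In a related SO question, the problem was that the tokenizer was removing the punctuation. However, I am using the default words tokenizer, which I don't believe does that: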
> sapply(corpus, words)
[,1]
[1,] "Astrology:"
[2,] "I"
[3,] "am"
[4,] "a"
[5,] "Capricorn"
[6,] "Sun"
[7,] "Cap"
[8,] "moon"
[9,] "and"
[10,] "cap"
[11,] "rising...what"
[12,] "does"
[13,] "that"
[14,] "say"
[15,] "about"
[16,] "me?"
Is the observed behavior incorrect, or what am I misunderstanding?
Answer 0 (score: 3)
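You have a SimpleCorpus object, a class that came with tm package version 0.7. According to ?SimpleCorpus, it
"takes internally various shortcuts to boost performance and minimize memory pressure":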
class(corpus)
# [1] "SimpleCorpus" "Corpus"
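Now, as help(TermDocumentMatrix) states:
"Available local options are documented in termFreq and are internally delegated to a termFreq call. This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost Tokenizer (via Rcpp) ..."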
So you are not actually using words as the tokenizer, which would indeed give you
words(sentence)
[1] "Astrology:" "I" "am" "a" "Capricorn" "Sun" "Cap"
[8] "moon" "and" "cap" "rising...what" "does" "that" "say"
[15] "about" "me?"
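As noted in the comments, you can explicitly make your corpus volatile (see ?VCorpus) to get full flexibility:
"A volatile corpus is fully kept in memory and thus all changes only affect the corresponding R object"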
corpus <- VCorpus(VectorSource(sentence))
Terms(TermDocumentMatrix(corpus, control = list(tokenize = "words")))
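To make the difference visible, here is a minimal sketch that builds both kinds of corpus from the question's sentence and compares the resulting terms. It assumes tm version 0.7 or later, where Corpus() on a VectorSource returns a SimpleCorpus. The variable names (base_ctrl, simple_corpus, volatile_corpus, and so on) are only illustrative, and the exact terms you get may vary with your tm version.

```r
library(tm)

sentence <- "Astrology: I am a Capricorn Sun Cap moon and cap rising...what does that say about me?"

# Options taken from the question: keep short words, do not strip punctuation.
base_ctrl <- list(wordLengths = c(1, Inf), removePunctuation = FALSE)

# Assumption: with tm >= 0.7 this yields a SimpleCorpus.
simple_corpus <- Corpus(VectorSource(sentence))
# A volatile corpus goes through termFreq and honors the tokenize option.
volatile_corpus <- VCorpus(VectorSource(sentence))

class(simple_corpus)
class(volatile_corpus)

# SimpleCorpus: the tokenizer cannot be chosen, so none is passed here.
simple_terms <- Terms(TermDocumentMatrix(simple_corpus, control = base_ctrl))

# VCorpus: explicitly request the words tokenizer.
volatile_terms <- Terms(TermDocumentMatrix(
  volatile_corpus,
  control = c(base_ctrl, list(tokenize = "words"))
))

# Terms that appear only when the words tokenizer is actually used.
setdiff(volatile_terms, simple_terms)
```

If the last call returns terms that still carry their punctuation, that confirms the tokenizer, and not removePunctuation, is what produced the split in the original output.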