Creating N-Grams with tm & RWeka - works with VCorpus but not Corpus

Asked: 2017-03-13 05:33:23

Tags: r tm n-gram term-document-matrix rweka

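Following many guides on creating bigrams with 'tm' and 'RWeka', I was frustrated to find that only 1-grams were returned in the TDM. After a lot of trial and error I discovered that it works with 'VCorpus' but not with 'Corpus'. By the way, I'm pretty sure this worked with 'Corpus' about a month ago, but it no longer does.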

R (3.3.3), RTools (3.4), RStudio (1.0.136) and all packages (tm 0.7-1, RWeka 0.4-31) have been updated to the latest versions.

I would appreciate any insight into getting this to work with Corpus, and into whether others have the same problem.

#A Reproducible example
#
#Weka bi-gram test
#

library(tm)
library(RWeka)

someCleanText <- c("Congress shall make no law respecting an establishment of",
                    "religion, or prohibiting the free exercise thereof or",
                    "abridging the freedom of speech or of the press or the",
                    "right of the people peaceably to assemble and to petition",
                    "the Government for a redress of grievances")

aCorpus <- Corpus(VectorSource(someCleanText))   #With this, only 1-Grams are created
#aCorpus <- VCorpus(VectorSource(someCleanText)) #With this, biGrams are created as desired

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))

aTDM <- TermDocumentMatrix(aCorpus, control=list(tokenize=BigramTokenizer))

print(aTDM$dimnames$Terms)

Result with 'Corpus':

 [1] "congress"      "establishment" "law"           "make"         
 [5] "respecting"    "shall"         "exercise"      "free"         
 [9] "prohibiting"   "religion"      "the"           "thereof"      
[13] "abridging"     "freedom"       "press"         "speech"       
[17] "and"           "assemble"      "peaceably"     "people"       
[21] "petition"      "right"         "for"           "government"   
[25] "grievances"    "redress"

Result with 'VCorpus':

 [1] "a redress"        "abridging the"    "an establishment" "and to"          
 [5] "assemble and"     "congress shall"   "establishment of" "exercise thereof"
 [9] "for a"            "free exercise"    "freedom of"       "government for"  
[13] "law respecting"   "make no"          "no law"           "of grievances"   
[17] "of speech"        "of the"           "or of"            "or prohibiting"  
[21] "or the"           "peaceably to"     "people peaceably" "press or"        
[25] "prohibiting the"  "redress of"       "religion or"      "respecting an"   
[29] "right of"         "shall make"       "speech or"        "the free"        
[33] "the freedom"      "the government"   "the people"       "the press"       
[37] "thereof or"       "to assemble"      "to petition"

2 answers:

Answer 0: (score: 0)

I was using R 3.4.1 and switched to R 3.3.3, and now the VCorpus solution works for me. tm and RWeka together create the bigrams correctly.

sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Answer 1: (score: 0)


I was able to reproduce exactly the same results you got.

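When I started reading about Corpus vs. VCorpus, most references stated that the difference is basically that VCorpus is a volatile corpus that is kept in memory, but that is not the only difference. Corpus uses SimpleCorpus by default, which does not have all the properties of VCorpus, and that is why you can get 2-grams with VCorpus but not with the regular Corpus. For more on this, see this post on stackexchange: https://stats.stackexchange.com/questions/164372/what-is-vectorsource-and-vcorpus-in-tm-text-mining-package-in-r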
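
To make the difference concrete, here is a minimal sketch that inspects the class of each corpus and then builds a bigram TDM from the VCorpus. It assumes tm 0.7 or later, where (as far as I understand the documentation) `Corpus()` on a `VectorSource` yields a SimpleCorpus, and `TermDocumentMatrix` for a SimpleCorpus does not accept custom functions such as a tokenizer in `control`. The two-line sample text and the variable names are made up for illustration, and the tokenizer uses `ngrams()` from the NLP package (loaded with tm) instead of RWeka, so no Java is needed; treat it as an alternative, not as the asker's method.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical sample text, shortened from the question
txt <- c("Congress shall make no law", "the freedom of speech")

simple   <- Corpus(VectorSource(txt))   # assumed to be a SimpleCorpus under tm >= 0.7
volatile <- VCorpus(VectorSource(txt))  # explicitly a VCorpus

print(class(simple))
print(class(volatile))

# Bigram tokenizer without RWeka: split into words, form 2-grams, paste each pair
BigramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}

# The custom tokenizer is passed via control, as in the question
tdm <- TermDocumentMatrix(volatile, control = list(tokenize = BigramTokenizer))
print(Terms(tdm))
```

If the first `class()` call prints something that includes "SimpleCorpus", that would explain why the tokenizer in the question is silently ignored; replacing `Corpus()` with `VCorpus()` is then the simplest fix.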