I've just started using the tm package in R and can't seem to get past this problem, even though my tokenizer functions appear to work correctly:
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
uniTDM <- TermDocumentMatrix(corpus, control=list(tokenize = uniTokenizer))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
triTDM <- TermDocumentMatrix(corpus, control=list(tokenize = triTokenizer))
When I try to pull 2-grams out of biTDM, only 1-grams come back...
findFreqTerms(biTDM, 50)
[1] "after" "and" "most" "the" "were" "years" "love"
[8] "you" "all" "also" "been" "did" "from" "get"
Likewise, the 2-gram tokenizer itself clearly works on its own:
x <- biTokenizer(corpus)
head(x)
[1] "c in" "in the" "the years"
[4] "years thereafter" "thereafter most" "most of"
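One quick diagnostic (a sketch; it assumes the corpus was built with tm's `Corpus()` constructor, which the question does not show) is to check the corpus class, since `Corpus(VectorSource(...))` may return a SimpleCorpus, and the SimpleCorpus path of `TermDocumentMatrix` does not honor a custom `tokenize` control function:

```r
library(tm)

# Hypothetical reproduction of the setup: Corpus() on a VectorSource
# may yield a SimpleCorpus, which ignores custom tokenizers.
text <- c("in the years thereafter most of the time")
corpus <- Corpus(VectorSource(text))
class(corpus)

# An explicit VCorpus preserves the tokenize control:
vcorpus <- VCorpus(VectorSource(text))
class(vcorpus)
```

If `class(corpus)` shows `"SimpleCorpus"`, that would explain why the bigram tokenizer is silently bypassed.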
Answer (score: 0)
I can only assume the problem is here: NGramTokenizer needs a VCorpus object rather than a Corpus object.
library(tm)
library(RWeka)
# some dummy text
text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
"sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
"sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")
# create a VCorpus
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
biTDM <- TermDocumentMatrix(corpus, control=list(tokenize = biTokenizer))
print(biTDM$dimnames$Terms)
[1] "accusam et" "aliquyam erat" "amet consetetur" "at vero" "clita kasd" "consetetur sadipscing" "diam nonumy" "diam voluptua" "dolor sit" "dolore magna"
[11] "dolores et" "duo dolores" "ea rebum" "eirmod tempor" "eos et" "est lorem" "et accusam" "et dolore" "et ea" "et justo"
[21] "gubergren no" "invidunt ut" "ipsum dolor" "justo duo" "kasd gubergren" "labore et" "lorem ipsum" "magna aliquyam" "no sea" "nonumy eirmod"
[31] "sadipscing elitr" "sanctus est" "sea takimata" "sed diam" "sit amet" "stet clita" "takimata sanctus" "tempor invidunt" "ut labore" "vero eos"
[41] "voluptua at"
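With the VCorpus in place, the original `findFreqTerms` call should now return bigrams. A minimal sketch, repeating the setup above for completeness (the frequency threshold of 2 is an assumption chosen to suit this tiny dummy text; the question's threshold of 50 fits a much larger corpus):

```r
library(tm)
library(RWeka)

text <- c("Lorem ipsum dolor sit amet, consetetur sadipscing elitr",
          "sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat",
          "sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum",
          "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet")

# Build the term-document matrix from a VCorpus so the custom
# tokenizer is actually applied.
corpus <- VCorpus(VectorSource(text))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
biTDM <- TermDocumentMatrix(corpus, control = list(tokenize = biTokenizer))

# Bigrams occurring at least twice across the corpus.
findFreqTerms(biTDM, 2)
```

Every term returned should now contain a space, confirming that 2-grams, not 1-grams, populate the matrix.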