TermDocumentMatrix not responding to tokenizer

Asked: 2018-12-12 22:37:46

Tags: r tokenize term-document-matrix

I am very new to R and I am trying to make an n-gram word cloud. However, my results always show 1-grams instead of n-grams. I have been searching the web for an answer for several days and have tried different approaches... the result is still the same. Also, for some reason I do not see the NGramTokenizer function that everyone else seems to be using. However, I found another tokenizing function, which I use here. I hope someone can help me. Thanks in advance!
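For reference, the NGramTokenizer that most examples use comes from the RWeka package, which needs a working Java/rJava installation before it will load. A minimal sketch of that approach, assuming RWeka loads on your machine:

    library(RWeka)

    # bigram tokenizer built on RWeka's NGramTokenizer
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

    BigramTokenizer("this is a test")  # "this is" "is a" "a test"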

library(dplyr)
library(ggplot2)
library(tidytext)
library(wordcloud)
library(tm)
library(RTextTools)

library(readxl)
library(qdap)
library(RWeka)
library(tau)
library(quanteda)


rm(list = ls())

#setwd("C:\\RStatistics\\Data\\")

#allverbatims <-read_excel("RS_Verbatims2018.xlsx") #reads excel files
#selgroup <- subset(allverbatims, FastNPS=="Detractors")
#selcolumns <- selgroup[ ,3:8]

#sample data 
selcolumns <- c("this is a test","my test is not working","sample data here")

Comments <- Corpus(VectorSource(selcolumns))
CommentClean <- tm_map(Comments, removePunctuation)
CommentClean <- tm_map(CommentClean, content_transformer(tolower))
CommentClean <- tm_map(CommentClean, removeNumbers)
CommentClean <- tm_map(CommentClean, stripWhitespace)
CommentClean <- tm_map(CommentClean, removeWords, c(stopwords('english')))

#create a manual tokenizer with tau's textcnt, since NGramTokenizer is not available

tokenize_ngrams <- function(x, n = 2) {
  # count word n-grams with tau::textcnt and return the n-gram strings as tokens
  rownames(as.data.frame(unclass(textcnt(x, method = "string", n = n))))
}

    #test tokenizer
    head(tokenize_ngrams(CommentClean))

    td_mat <- TermDocumentMatrix(CommentClean, control = list(tokenize = tokenize_ngrams))

    inspect(td_mat) #should show bigrams, but the result is only 1-grams

    mat <- as.matrix(td_mat)  # renamed so it does not mask base::matrix
    sorted <- sort(rowSums(mat), decreasing = TRUE)
    data_text <- data.frame(word = names(sorted), freq = sorted)

    set.seed(1234)
    wordcloud(words = data_text$word, freq = data_text$freq, min.freq = 5, max.words = 100, random.order = FALSE, rot.per = 0.1, colors = rainbow(30))
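A likely explanation, going by tm's behavior since version 0.7: Corpus(VectorSource(...)) builds a SimpleCorpus, and TermDocumentMatrix on a SimpleCorpus ignores a custom tokenize control, so the default word tokenizer runs and only 1-grams come out. Building the corpus with VCorpus instead should let the custom tokenizer take effect. A minimal sketch under that assumption (the as.character() call is defensive, since the tokenizer receives a PlainTextDocument rather than a plain string; note also that textcnt's rownames are the distinct n-grams, so per-document counts reflect distinct bigrams):

    CommentsV <- VCorpus(VectorSource(selcolumns))
    CommentsV <- tm_map(CommentsV, removePunctuation)
    CommentsV <- tm_map(CommentsV, content_transformer(tolower))
    CommentsV <- tm_map(CommentsV, removeNumbers)
    CommentsV <- tm_map(CommentsV, stripWhitespace)
    CommentsV <- tm_map(CommentsV, removeWords, stopwords('english'))

    # with a VCorpus the tokenize control is honoured, so bigrams should appear
    td_mat_v <- TermDocumentMatrix(
      CommentsV,
      control = list(tokenize = function(x) tokenize_ngrams(as.character(x)))
    )
    inspect(td_mat_v)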

0 Answers:

No answers