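I want to plot a term-document matrix like the one shown in Figure 6 of the JSS article on the tm package. Article link: https://www.jstatsoft.org/article/view/v025i05
My corpus, Speach-English.txt, is here: https://github.com/yushu-liu/speach-english.git
The plot should look like this:
Here is my code: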
library(tm)
library(stringr)
library(wordcloud)
text <- paste(readLines("D:/Rdata/speach-English.txt"), collapse = " ")
text_tidy <- gsub(pattern = "\\W",replace=" ",text)
text_tidy2 <- gsub(pattern = "\\d",replace=" ",text_tidy)
text_tidy2 <- tolower(text_tidy2)
text_tidy2 <- removeWords(text_tidy2,stopwords())
text_tidy2 <- gsub(pattern = "\\b[A-z]\\b{1}",replace=" ", text_tidy2 )
text_tidy2 <- stripWhitespace(text_tidy2)
textbag <- str_split(text_tidy2,pattern = "\\s+")
textbag <- unlist(textbag)
tdm <- TermDocumentMatrix(textbag, control = list(removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE))
plot(tdm, terms = findFreqTerms(tdm, lowfreq = 6)[1:25], corThreshold = 0.5)
But I get an error:
Error in UseMethod("TermDocumentMatrix", x) :
no applicable method for 'TermDocumentMatrix' applied to an object of class "character"
Why is that? Thanks!
Answer 0 (score: 2)
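The problem is that you have not created an object of class Corpus, which is the type of object TermDocumentMatrix() expects. You passed it a plain character vector, and there is no TermDocumentMatrix method for that class, hence the error. See the example below.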
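One more thing I would like to point out: in the line str_split(text_tidy2,pattern = "\\s+") you split the text into unigrams (individual terms). Each resulting document therefore contains only a single term, and building a tdm from that structure does not make much sense. What is the purpose of that line? Perhaps I can point you toward what you actually want.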
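To make the unigram point concrete, here is a minimal sketch. The two sentences are made-up stand-ins for the speech text, not taken from the corpus. It builds one term-document matrix from single words and one from whole sentences so the two shapes can be compared.

```r
library(tm)

# Hypothetical input, standing in for the speech text
sentences <- c("the quick brown fox", "the lazy dog sleeps")

# Splitting into single words gives one "document" per word
words <- unlist(strsplit(sentences, "\\s+"))
tdm_words <- TermDocumentMatrix(VCorpus(VectorSource(words)))

# Keeping each sentence whole gives one document per sentence
tdm_sentences <- TermDocumentMatrix(VCorpus(VectorSource(sentences)))

dim(tdm_words)       # terms x documents: one column per word
dim(tdm_sentences)   # terms x documents: one column per sentence

# Every column of the word-level matrix holds exactly one term occurrence
colSums(as.matrix(tdm_words))
```

With one term per document, no two terms ever occur together, so a plot based on correlations between terms has nothing to show. The example that follows builds the corpus from the lines of the speech instead.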
library(tm)
text <- readLines("https://raw.githubusercontent.com/yushu-liu/speach-english/master/speach-English.txt")
#first define the type of source you want to use and how it shall be read
x <- VectorSource(text)
#create a corpus object
x <- VCorpus(x)
#feed it to tdm
tdm <- TermDocumentMatrix(x)
tdm
#<<TermDocumentMatrix (terms: 4159, documents: 573)>>
#Non-/sparse entries: 14481/2368626
#Sparsity : 99%
#Maximal term length: 21
#Weighting : term frequency (tf)
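If the goal is the cleaning done with gsub() in the question, the same steps can probably be applied to the corpus itself with tm_map() before the matrix is built. The sketch below is one way to do that, not necessarily what the JSS article does. The three sentences are hypothetical stand-ins for the speech, and the thresholds are chosen only to suit such a tiny input. As far as I know, the plot method for a TermDocumentMatrix also needs the Rgraphviz package from Bioconductor, so that call is guarded.

```r
library(tm)

# Hypothetical stand-in for the speech; in practice use readLines() on the file
text <- c("We choose to go to the moon in this decade.",
          "We choose the moon, and the moon is hard to reach.",
          "This decade we reach the moon.")

corpus <- VCorpus(VectorSource(text))

# Cleaning steps roughly equivalent to the gsub() calls in the question
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

tdm <- TermDocumentMatrix(corpus)

# Terms that occur at least twice across the whole corpus
freq_terms <- findFreqTerms(tdm, lowfreq = 2)
freq_terms

# Plot only if Rgraphviz is installed (assumption: it is required for plotting)
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  plot(tdm, terms = freq_terms, corThreshold = 0.5)
}
```

On the real corpus you would raise lowfreq again, for example to the value 6 used in the question. Note that indexing the result with [1:25] yields NA entries whenever fewer than 25 terms reach the threshold, so it is safer to use head(freq_terms, 25).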