Using the tm library, the corpus contains words that come from the VectorSource structure:
library(tm)
text <- readLines("some.txt")
# newCorpus is not defined in this snippet
finalCorpus <- Corpus(VectorSource(newCorpus))
finalCorpus <- tm_map(finalCorpus, stripWhitespace)
save(finalCorpus, file = "data/DEBUG.Rda")  # DEBUG
df <- data.frame(lapply(finalCorpus, as.character), stringsAsFactors = FALSE)
df
>protracted periods meditation fasting prayer ennui fever energy vigor
>married joseph lee dollars million canadian dollars gbp pastored african
>american church snow hill jersey children died infancy **meta list author
>character datetimestamp list sec min hour mday mon year wday yday isdst
>description character heading character id language en origin character
>X2 X3
>1 list list**
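The words between the ** markers come from the corpus, not from the imported text. Why am I getting them, and how can I remove them (without using the tm removeWords function)?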
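One possible explanation, which the snippet does not confirm, is that newCorpus is itself a tm corpus: coercing document objects to character can turn their metadata fields (author, datetimestamp, id, language, and so on) into text. A minimal sketch of one way to avoid this, assuming the corpus is built directly from the character vector and the data frame is filled only from each document's content(). The sample strings and the names corpus and docs are hypothetical, chosen for illustration:

```r
library(tm)

# Hypothetical sample lines standing in for readLines("some.txt")
text <- c("protracted  periods   meditation", "married joseph   lee")

# Build the corpus from the character vector, not from an existing corpus
corpus <- VCorpus(VectorSource(text))
corpus <- tm_map(corpus, stripWhitespace)

# Extract only the text of each document with content(), leaving metadata out
docs <- vapply(
  seq_len(length(corpus)),
  function(i) paste(content(corpus[[i]]), collapse = " "),
  character(1)
)

# One row per document, a single text column
df <- data.frame(text = docs, stringsAsFactors = FALSE)
df
```

Indexing documents with [[ and reading them through content() keeps the metadata list out of the result, so no word removal step is needed afterwards.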