I am trying to remove punctuation, numbers, and whitespace from a corpus.
My code is:
# Create a corpus
bd_corpus = Corpus(VectorSource(bd_text))
# Clean the corpus by removing puncuation, numbers, and white spaces
bd_clean <- tm_map(bd_corpus,removePunctuation)
bd_clean <- tm_map(bd_corpus,removeNumbers)
bd_clean <- tm_map(bd_corpus,removeStripwhitespace)
wordcloud(bd_clean)
#modify your word cloud
wordcloud(bd_clean, random.order = F, max.words = 25, scale = c(7, 0.5))
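It produces a word cloud, but the words in it still carry colons, backslashes, periods, and so on, for example "here", "hey" and "people".
Also, this is the console output: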
# Clean the corpus by removing puncuation, numbers, and white spaces
> bd_clean <- tm_map(bd_corpus,removePunctuation)
Warning message:
In tm_map.SimpleCorpus(bd_corpus, removePunctuation) :
transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeNumbers)
Warning message:
In tm_map.SimpleCorpus(bd_corpus, removeNumbers) :
transformation drops documents
> bd_clean <- tm_map(bd_corpus,removeStripwhitespace)
Error in tm_map.SimpleCorpus(bd_corpus, removeStripwhitespace) :
object 'removeStripwhitespace' not found
Answer 0 (score: 0)
From @Gregor's comment above:
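Suppose I have x <- 1. Then I run y <- x + 1, y <- x + 2, y <- x + 3. What is y at the end? 4 is the correct answer, because when we run y <- x + 3 it does not matter what y was before. You are doing the same thing: bd_clean <- tm_map(bd_corpus, removePunctuation) removes punctuation from bd_corpus. The next line, bd_clean <- tm_map(bd_corpus, removeNumbers), removes numbers from bd_corpus and overwrites the version without punctuation. Instead, you need bd_clean <- tm_map(bd_clean, removeNumbers), so that each step builds on the work already done.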
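The chaining described above could look roughly like the sketch below. Two things in it are not from the question. The sample bd_text is a hypothetical stand-in for the asker's data. The third step uses stripWhitespace, which is the name tm gives its whitespace transformation; removeStripwhitespace does not exist, which would explain the "object not found" error in the console output.

```r
library(tm)  # Corpus, VectorSource, tm_map and the transformations used below

# Hypothetical sample text standing in for the asker's bd_text
bd_text <- c("Hey: here are 3 people.", "People,   here; hey!  42")

# Create a corpus
bd_corpus <- Corpus(VectorSource(bd_text))

# The first step reads the raw corpus; every later step reads bd_clean,
# so the earlier cleaning is kept instead of being overwritten
bd_clean <- tm_map(bd_corpus, removePunctuation)
bd_clean <- tm_map(bd_clean, removeNumbers)
bd_clean <- tm_map(bd_clean, stripWhitespace)

# Inspect the cleaned documents
print(content(bd_clean))
```

With bd_clean built this way, the original wordcloud(bd_clean, ...) calls can be run unchanged.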