I am currently working on a text-mining case (internet comments), and I would like to remove meaningless words such as aaaaawwwwww, weeeeeeeeeeeeeee, etc.
I want to write a rule that removes any word containing the same letter repeated 3 or more times in a row.
Could someone help me achieve this via `tm_map`, or based on the `DocumentTermMatrix`?
Here is what I have done so far:
library(tm)

tm_test = tm_map(tm_test, removePunctuation)

# Replace /, @ and | with a space
toSpace = content_transformer(function(x, pattern) gsub(pattern, " ", x))
tm_test = tm_map(tm_test, toSpace, "/")
tm_test = tm_map(tm_test, toSpace, "@")
tm_test = tm_map(tm_test, toSpace, "\\|")

tm_test = tm_map(tm_test, removeNumbers)
# Wrap base functions in content_transformer so the corpus structure is preserved
tm_test = tm_map(tm_test, content_transformer(tolower))
tm_test = tm_map(tm_test, removeWords, stopwords("english"))
# tm_test = tm_map(tm_test, stemDocument)
tm_test = tm_map(tm_test, stripWhitespace)

dtm_test = DocumentTermMatrix(tm_test)
Mock input data:
text = c('apple', 'banana', 'orange', 'travelling', 'esteem', 'woooo','awwwwwwww','waaaaakakakakaka')
Expected output:
out = c('apple', 'banana', 'orange', 'travelling', 'esteem')
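One possible approach, sketched below under the assumption that "3 or more consecutive letters" means the same letter repeated three or more times in a row: a backreference regex `([[:alpha:]])\1{2,}` matches any letter followed by at least two more copies of itself, so words matching it can be filtered out with `grepl`, or stripped from whole documents inside `tm_map` via `content_transformer` (the helper name `removeLongRepeats` is made up for illustration):

```r
library(tm)

text = c('apple', 'banana', 'orange', 'travelling', 'esteem',
         'woooo', 'awwwwwwww', 'waaaaakakakakaka')

# Keep only words that do NOT contain the same letter 3+ times in a row
out = text[!grepl("([[:alpha:]])\\1{2,}", text)]
# "travelling" and "esteem" survive: doubled letters (ll, ee) are allowed

# The same rule applied to a corpus: delete every word containing
# a letter repeated 3+ times, then collapse leftover whitespace
removeLongRepeats = content_transformer(function(x)
  gsub("\\S*([[:alpha:]])\\1{2,}\\S*", "", x))
tm_test = tm_map(tm_test, removeLongRepeats)
tm_test = tm_map(tm_test, stripWhitespace)
```

Alternatively, the same `grepl` filter can be applied to `Terms(dtm_test)` after building the `DocumentTermMatrix`, dropping the matching columns instead of editing the text.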