How can I remove words that contain more than 3 consecutive repeated letters?

Asked: 2019-03-30 09:37:53

Tags: r regex text-mining

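I am currently working on a text-mining case (internet comments), and I would really like to remove meaningless words such as aaaaawwwwww, weeeeeeeeeeeeeee, and so on.

I want to define a rule that removes any word containing the same letter 3 or more times in a row.

Can someone help me achieve this with tm_map or based on the DocumentTermMatrix?

Here is what I have done so far: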

tm_test = tm_map(tm_test, removePunctuation)
for (i in seq(tm_test)) {
  tm_test[[i]] = gsub("/", " ", tm_test[[i]])
  tm_test[[i]] = gsub("@", " ", tm_test[[i]])
  tm_test[[i]] = gsub("\\|", " ", tm_test[[i]])
} 
tm_test = tm_map(tm_test, removeNumbers) 
tm_test = tm_map(tm_test, tolower) 
tm_test = tm_map(tm_test, PlainTextDocument) 
tm_test = tm_map(tm_test, removeWords, stopwords("english")) 
tm_test = tm_map(tm_test, PlainTextDocument)
# tm_test = tm_map(tm_test, stemDocument)    
# tm_test = tm_map(tm_test, PlainTextDocument)
tm_test = tm_map(tm_test, stripWhitespace)
tm_test = tm_map(tm_test, PlainTextDocument)
dtm_test = DocumentTermMatrix(tm_test)
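
One way the rule might be slotted into a pipeline like the one above is a small custom transformer. The following is only a sketch: `remove_repeated_words` is a hypothetical helper name, not part of tm, and the sample sentences are made up for illustration. It assumes words are separated by whitespace and that a "repeat" means the same character three or more times in a row.

```r
# Hypothetical helper (not part of tm): drops every word that contains
# the same word character three or more times in a row.
remove_repeated_words <- function(x) {
  # (\\w) captures one character; \\1{2,} requires at least two more copies of it.
  x <- gsub("\\b\\w*(\\w)\\1{2,}\\w*\\b", "", x, perl = TRUE)
  # Collapse the gaps left behind by the removed words.
  trimws(gsub("\\s+", " ", x))
}

# Made-up sample comments, for illustration only.
docs <- c("the food was sooooo good",
          "aaaaawwwwww what a cute weeeeeee dog")
cleaned <- remove_repeated_words(docs)
print(cleaned)

# If tm is installed, the same helper can be wrapped with content_transformer()
# so that tm_map() applies it to each document.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(docs))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_repeated_words))
  print(sapply(corpus, as.character))
}
```

Alternatively, the same pattern could probably be matched against Terms(dtm_test) to drop the unwanted columns after the DocumentTermMatrix has been built.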

Mock input data:

text = c('apple', 'banana', 'orange', 'travelling', 'esteem', 'woooo','awwwwwwww','waaaaakakakakaka')

Expected output:

out = c('apple', 'banana', 'orange', 'travelling', 'esteem')
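
For a plain character vector like the mock input, the rule can be sketched in base R with a backreference. This is a minimal illustration rather than a definitive solution; `has_long_run` is a name introduced here, and the threshold follows the "3 or more" wording in the question (changing `{2,}` to `{3,}` would require more than 3).

```r
text <- c("apple", "banana", "orange", "travelling", "esteem",
          "woooo", "awwwwwwww", "waaaaakakakakaka")

# TRUE for words in which some letter appears three or more times in a row:
# ([[:alpha:]]) captures one letter, \\1{2,} requires at least two more copies.
has_long_run <- grepl("([[:alpha:]])\\1{2,}", text, perl = TRUE)

# Keep only the words without such a run.
out <- text[!has_long_run]
print(out)
```

Words with a merely doubled letter, such as 'apple', 'travelling', and 'esteem', are kept because the pattern needs at least three identical letters in a row.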

0 answers:

No answers yet