I'm new to R, and I'm looking for a way to remove certain English words using stop words.
Here is the function I wrote:
cleanfunction <- function(test) {
test <-removeWords(test,stopwords("en"))
test<-gsub("\\b[A-z]\\b{1}"," ",test)
test<-gsub("\\W"," ",test)
test<-gsub("\\d"," ",test)
test<-stripWhitespace(test)
return (test)
}
Mdatasub2 <-aggregate(Reviews ~ Product.Name,data =Mdatasub2,FUN=cleanfunction)
The problem is that it does not remove "the", "just", "this", or "got".
Thanks in advance.
Answer 0 (score: 0)
You need to make a few changes to your code. You need the tm library and its tm_map function, as shown below:
library(tm)
cleanfunction <- function(test) {
## tm_map could do this cleanup too, but the gsub calls are kept here
## Drop single letters; [A-Za-z] is used because [A-z] also matches [ \ ] ^ _ `
test <- gsub("\\b[A-Za-z]\\b", " ", test)
## Replace non-word characters and digits with spaces
test <- gsub("\\W", " ", test)
test <- gsub("\\d", " ", test)
## Convert the character vector to a corpus so tm_map can be applied
myCorpus <- Corpus(VectorSource(test))
## Add any extra words you want to exclude to myStopwords.
## stopwords("english") is a default list that covers many common words
## but not all of them, so myStopwords fills in the ones you still want removed.
myStopwords <- c("got", "just", "this", "the")
myCorpus <- tm_map(myCorpus, removeWords, c(myStopwords, stopwords("english")))
## Strip extra whitespace; note that the result is a corpus, not a character vector
test <- tm_map(myCorpus, stripWhitespace)
return (test)
}
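The comment about stopwords("english") not covering every common word can be checked directly. The snippet below is only a quick diagnostic sketch, not part of the original answer: the variable names `words` and `in_default` and the sample sentence are made up for illustration. It also shows that removeWords matches case-sensitively, which may be one reason a capitalised "The" or "This" survives in the question's output.

```r
library(tm)

words <- c("the", "just", "this", "got")

# TRUE where a word is already in tm's built-in English list
in_default <- words %in% stopwords("english")
names(in_default) <- words
print(in_default)

# removeWords() matches case-sensitively, so the capitalised "The" is kept
removeWords("The cat and the dog", "the")

# Lowercasing first lets the same stop word match both occurrences
removeWords(tolower("The cat and the dog"), "the")
```

If a word shows FALSE in the first check, adding it to a custom list such as myStopwords is the way to get it removed.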
For more details on tm_map, you can run ?tm_map and read the documentation.
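If you would rather get a plain character vector back (which is easier to use with aggregate than a corpus), something along these lines may work. This is a sketch under assumptions, not the answer's method: `clean_text`, `extra_stopwords`, and the sample `reviews` data frame are hypothetical names and data, and the cleanup assumes English-only text, since everything outside a-z is dropped after lowercasing.

```r
library(tm)

# Hypothetical helper: takes a character vector, returns a character vector
clean_text <- function(x, extra_stopwords = c("got", "just")) {
  x <- tolower(x)                              # removeWords() is case-sensitive
  # Remove stop words before punctuation so entries like "don't" still match
  x <- removeWords(x, c(extra_stopwords, stopwords("english")))
  x <- gsub("[^a-z]+", " ", x)                 # digits, punctuation, other non-letters
  x <- gsub("\\b[a-z]\\b", " ", x, perl = TRUE) # stray single letters
  x <- gsub("\\s+", " ", x)                    # collapse runs of whitespace
  trimws(x)
}

# Made-up sample data using the column names from the question
reviews <- data.frame(
  Product.Name = c("A", "A", "B"),
  Reviews = c("The case just broke", "This is fine", "Got it today!"),
  stringsAsFactors = FALSE
)

# Join each product's reviews into one string, then clean that string
per_product <- aggregate(
  Reviews ~ Product.Name,
  data = reviews,
  FUN = function(r) clean_text(paste(r, collapse = " "))
)
print(per_product)
```

Collapsing each group to a single string keeps the aggregate result a plain character column, one row per product.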