R removeWords() not working

Time: 2017-09-30 14:33:20

Tags: r

I am new to R, and I am looking for a way to remove some English words using stop words.

Here is the function I wrote:

cleanfunction <- function(test) {
  test <- removeWords(test, stopwords("en"))
  test <- gsub("\\b[A-z]\\b{1}", " ", test)
  test <- gsub("\\W", " ", test)
  test <- gsub("\\d", " ", test)
  test <- stripWhitespace(test)

  return(test)
}

Mdatasub2 <- aggregate(Reviews ~ Product.Name, data = Mdatasub2, FUN = cleanfunction)

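The problem is that it does not remove "the", "just", "this", or "got".

Thanks in advance.

1 Answer:

Answer 0: (score: 0)

You need to make a few changes to your code. Load the tm library and apply removeWords through tm_map on a corpus, as shown below: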

library(tm)

cleanfunction <- function(test) {

    ## tm_map could do this too, but the gsub calls are kept here.
    ## [A-Za-z] replaces [A-z], which also matches the characters [ \ ] ^ _ `
    test <- gsub("\\b[A-Za-z]\\b", " ", test)   # single-letter words
    test <- gsub("\\W", " ", test)              # non-word characters
    test <- gsub("\\d", " ", test)              # digits

    ## Convert the character vector to a corpus
    myCorpus <- Corpus(VectorSource(test))

    ## stopwords("english") is a default list that does not cover every
    ## common word, so add any extra words to exclude in myStopwords.
    myStopwords <- c("got", "just", "this", "the")
    myCorpus <- tm_map(myCorpus, removeWords, c(myStopwords, stopwords("english")))

    ## Strip extra whitespace
    myCorpus <- tm_map(myCorpus, stripWhitespace)

    ## Note: this returns a corpus, not a character vector
    return(myCorpus)

}
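
The comment in the function above notes that stopwords("english") does not cover every common word. A quick way to see which of the words from the question are already covered, and which need to go into myStopwords, might look like the sketch below. The names candidates, in_default and all_stopwords are illustrative and not part of the original answer.

```r
library(tm)

# Words the question reports as not being removed
candidates <- c("the", "just", "this", "got")

# TRUE where a word is already in tm's built-in English list
in_default <- candidates %in% stopwords("english")
print(setNames(in_default, candidates))

# Whatever is not covered goes into the custom list
myStopwords <- candidates[!in_default]
all_stopwords <- c(myStopwords, stopwords("english"))
```

Printing the named logical vector shows at a glance which words the default list misses, so the custom list only needs to hold those.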

For details on tm_map, run ?tm_map and read the documentation.
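
If a plain character vector is more convenient than a corpus (for example as the FUN passed to aggregate in the question), a variant along the following lines may help. It is only a sketch under a few assumptions: clean_text and extra_stopwords are hypothetical names, whitespace is collapsed with base R gsub rather than tm's stripWhitespace, and the text is lower-cased first because removeWords matches case-sensitively while the built-in list is lower case, so a capitalised "The" or "Just" would otherwise survive.

```r
library(tm)

# Hypothetical helper, not from the original answer: cleans a character
# vector and returns a character vector of the same length.
clean_text <- function(x, extra_stopwords = c("got", "just", "this", "the")) {
  x <- tolower(x)                                    # match the lower-case stop word list
  x <- removeWords(x, c(extra_stopwords, stopwords("english")))
  x <- gsub("\\b[a-z]\\b", " ", x, perl = TRUE)      # single-letter words
  x <- gsub("\\W", " ", x)                           # non-word characters
  x <- gsub("\\d", " ", x)                           # digits
  x <- gsub("\\s+", " ", x)                          # collapse runs of whitespace
  trimws(x)                                          # drop leading/trailing spaces
}

cleaned <- clean_text("The phone just got GREAT 5 stars")
print(cleaned)
```

Stop words are removed before punctuation here so that contractions in the stop word list can still match; whether that order suits the review data is a judgment call.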