Question

I have a data frame with this structure:

Note.Reco Review Review.clean.lower
10 Good Products  good products
9 Nice film      nice film
....         ....

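The first column is the movie's rating, the second column is the customer review, and the third column is the review in lower case.

I am now trying to remove the stop words with this: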

Data_clean$Raison.Reco.clean1 <- Corpus(VectorSource(Data_clean$Review.clean.lower))
Data_clean$Review.clean.lower1 <- tm_map(Data_clean$Review.clean.lower1, removeWords, stopwords("english"))

But RStudio crashes.

Can you help me solve this problem?

Thanks

Edit:

#clean up
# remove grammar/punctuation
Data_clean$Review.clean.lower <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))

Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

Data_clean <- tm_map(Data_corpus,  removeWords, stopwords("french"))

train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]

When I run the last 2 instructions, I get an error.

Answer 1

Try the following. You can do the cleaning directly on the corpus rather than on the column.

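library(tm)

# Build a corpus from the lower-cased reviews
Data_corpus <- Corpus(VectorSource(Data_clean$Review.clean.lower))

# Remove the stop words; note that Data_clean is now a corpus, not a data frame
Data_clean <- tm_map(Data_corpus, removeWords, stopwords("english"))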
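The error in the question's last two lines probably comes from the fact that `tm_map()` returns a corpus, which cannot be indexed with `[rows, ]` like a data frame. A minimal sketch of one way around this, assuming a toy two-row data frame (hypothetical values) and `VCorpus()` in place of `Corpus()`, is to keep the corpus in its own variable and copy the cleaned text back into a new column:

```r
library(tm)

# Toy stand-in for Data_clean (hypothetical values, not the asker's data)
Data_clean <- data.frame(
  Review.clean.lower = c("good products", "is a nice film"),
  stringsAsFactors = FALSE
)

# Keep the corpus in its own variable so the data frame is not overwritten
Data_corpus <- VCorpus(VectorSource(Data_clean$Review.clean.lower))
Data_corpus <- tm_map(Data_corpus, removeWords, stopwords("english"))

# The result is a corpus, not a data frame
stopifnot(!is.data.frame(Data_corpus))

# Pull the text out of each document, then tidy the spaces left by removed words
cleaned <- vapply(Data_corpus, as.character, character(1), USE.NAMES = FALSE)
Data_clean$Review.clean.lower1 <- trimws(gsub("\\s+", " ", cleaned))
```

Because `Data_clean` stays a data frame here, row indexing such as `Data_clean[train.index, ]` should keep working after the cleaning step.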

Edit: Since you mentioned that you want to be able to access the output after removing the stop words, try the following instead of the above:

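library(tm)

# English stop-word list shipped with tm
stopWords <- stopwords("en")

Data_clean$Review.clean.lower <- as.character(Data_clean$Review.clean.lower)

# "Not in" operator: TRUE for elements absent from the right-hand side
`%nin%` <- Negate(`%in%`)

# Split each review on spaces, drop the stop words, and paste the rest back together.
# vapply() returns a plain character vector (lapply() would create a list column).
Data_clean$Review.clean.lower1 <- vapply(Data_clean$Review.clean.lower, function(x) {
  chk <- unlist(strsplit(x, " "))
  p <- chk[chk %nin% stopWords]
  paste(p, collapse = " ")
}, character(1), USE.NAMES = FALSE)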

Sample output of the code above:

>  print(Data_clean)
>       note Note.Reco.Review Review.clean.lower Review.clean.lower1
>     1   10    Good Products      good products       good products
>     2    9        Nice film     is a nice film           nice film
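As an alternative to splitting the strings by hand, `removeWords()` from tm also accepts a plain character vector, so the column can be cleaned without building a corpus at all. The following is only a sketch under assumptions: the four-row data frame and the `train.index` values are made up for illustration and are not the asker's data.

```r
library(tm)

# Toy stand-in for Data_clean (hypothetical values)
Data_clean <- data.frame(
  note = c(10, 9, 8, 7),
  Review.clean.lower = c("good products", "is a nice film",
                         "the plot was dull", "i liked it"),
  stringsAsFactors = FALSE
)

# removeWords() works directly on a character vector
no_stop <- removeWords(Data_clean$Review.clean.lower, stopwords("english"))

# Collapse the spaces left behind by the removed words
Data_clean$Review.clean.lower1 <- trimws(gsub("\\s+", " ", no_stop))

# Hypothetical split: Data_clean is still a data frame, so row indexing works
train.index <- c(1, 3)
test.index <- setdiff(seq_len(nrow(Data_clean)), train.index)
train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]
```

This keeps every step vectorised over the column, which tends to be simpler than an explicit loop when no other corpus features are needed.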

See also: R remove stopwords from a character vector using %in%

Removing stop words in R
