Package tm: how can removeWords avoid removing CERTAIN (specifically negation) "english" stopwords when they are specified?

Time: 2015-10-27 08:16:00

Tags: r tm stop-words corpus

I want to use stopwords("english") with removeWords, via corpus <- tm_map(corpus, removeWords, stopwords("english")), but there are some words, such as "not" and other negations, that I would like to keep.

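Is it possible to use removeWords with stopwords("english") but exclude certain words from that list when specified?

How can I prevent "not" from being removed, for example?

(Secondary) Is it possible to set up this kind of control list to cover all "negations"?

I would rather not build my own custom list from the words in the stop list that I am interested in.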

1 answer:

Answer 0: (score: 5)

You can create a custom stopword list by taking the set difference between stopwords("en") and the list of words you want to exclude:

exceptions   <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)
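To see the effect, here is a minimal sketch that passes the custom list to removeWords. The two toy sentences are made up for illustration and are not from the question; the calls themselves (VCorpus, VectorSource, tm_map, stripWhitespace) are standard tm functions.

```r
library(tm)

# Toy corpus (hypothetical example text, not from the question)
docs   <- c("this is not good", "i do not like it")
corpus <- VCorpus(VectorSource(docs))

# Standard English stopwords minus the words we want to keep
exceptions   <- c("not")
my_stopwords <- setdiff(stopwords("en"), exceptions)

# Remove the remaining stopwords, then collapse the leftover whitespace
corpus <- tm_map(corpus, removeWords, my_stopwords)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned <- sapply(corpus, as.character)
print(cleaned)
```

"not" should survive in both documents, while ordinary stopwords such as "this" and "is" are dropped.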

If you need to keep all the negations (that is, drop them from the stopword list), you can grep them out of the stopwords() list:

exceptions <- grep(pattern = "not|n't", x = stopwords(), value = TRUE)
# [1] "isn't"     "aren't"    "wasn't"    "weren't"   "hasn't"    "haven't"   "hadn't"    "doesn't"   "don't"     "didn't"   
# [11] "won't"     "wouldn't"  "shan't"    "shouldn't" "can't"     "cannot"    "couldn't"  "mustn't"   "not"
my_stopwords <- setdiff(stopwords("en"), exceptions)
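One caveat: the pattern "not|n't" matches anywhere inside a word, and it does not pick up "no" or "nor", which, as far as I recall, are also in the English list. If you want to treat those as negations too (that is my assumption, not something the question asked for), an anchored pattern is a possible variant:

```r
library(tm)

# Whole-word no / nor / not / cannot, plus any contraction ending in n't.
# Counting "no" and "nor" as negations is an assumption made for this sketch.
exceptions <- grep(pattern = "^(no|nor|not|cannot)$|n't$",
                   x = stopwords("en"), value = TRUE)

my_stopwords <- setdiff(stopwords("en"), exceptions)

# Sanity check: none of the kept words remain in the stopword list
stopifnot(!any(exceptions %in% my_stopwords))
```

It is worth printing exceptions once to confirm that it contains exactly the words you intend to keep before using my_stopwords on a real corpus.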