Question

I am trying to use the tm package to filter stopwords out of the following documents.

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(stopwords=TRUE))

However, when I run this code, I still get the following terms in the DocumentTermMatrix.

colnames(matrix)
[1] "brown"  "dog"    "fox"    "jumps"  "lazy"   "over"   "quick"  "the"    "walrus"

"the" is listed as a stopword in the list that tm uses. Am I doing something wrong with the stopwords parameter, or is this a bug in the tm package?

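EDIT: I contacted Ingo Feinerer, and he noted that it is technically not a bug:

User-provided options are processed first, and then all remaining options. Hence stopword removal is done before tokenization (as already written by Vincent Zoonekynd on stackoverflow.com), which gives exactly your result.

The solution is therefore to list the default tokenization option explicitly before the stopwords parameter, for example: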

library(tm)
documents <- c("the quick brown fox jumps over the lazy dog", "i am the walrus")
corpus <- Corpus(VectorSource(documents))
matrix <- DocumentTermMatrix(corpus,control=list(tokenize=scan_tokenizer,stopwords=TRUE))
colnames(matrix)

Answer 1

You can also try removing the stopwords from the corpus before creating the term matrix.

# text_corpus is assumed to be an existing tm corpus, e.g. Corpus(VectorSource(documents))
text_corpus <- tm_map(text_corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(text_corpus)

This usually works for me.

Answer 2

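This is a bug: you may want to report it to the package authors. The termFreq function applies various filters to the text, but not always in the correct order. In your example, the code tries to remove stopwords before tokenization, i.e., before the text has been cut into words. It should do so afterwards, once we know what the words are.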
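To see why the order matters, here is a minimal base R sketch. It does not call tm at all; the tiny stop_words vector and the space-based split are my own assumptions for illustration, not what termFreq actually does internally.

```r
# Hypothetical mini stopword list, for illustration only (not tm's list).
stop_words <- c("the", "over", "i", "am")
doc <- "the quick brown fox jumps over the lazy dog"

# Filtering BEFORE tokenization: the document is still one string, so it is
# compared as a whole against the list and nothing is dropped.
filtered_first <- doc[!doc %in% stop_words]
tokens_wrong <- strsplit(filtered_first, " ", fixed = TRUE)[[1]]

# Filtering AFTER tokenization: each word is compared individually.
tokens <- strsplit(doc, " ", fixed = TRUE)[[1]]
tokens_right <- tokens[!tokens %in% stop_words]

print(tokens_wrong)  # still contains "the" and "over"
print(tokens_right)  # stopwords removed
```

In the first case the filter only ever sees the full sentence, which is never equal to a single stopword, so every word survives into the term list.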

Answer 3

A quick fix is to run this afterwards:

matrix <- matrix[, !colnames(matrix) %in% stopwords()]
