Question

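I have a data frame with 60,000 rows of words and phrases that I want to use as stopwords and remove from a text.

After reading the stopword list from a CSV file, I use the tm package and run this line: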

corpus <- tm_map(corpus, removeWords, df$mylistofstopwords)

But I get this error:

In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE),  :
  PCRE pattern compilation error
    'regular expression is too large'
    at ''

Is the problem that the list is too large? Is there anything I can do to fix it?

Answer 1

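You can work around the problem by splitting the stopword list into several chunks, so that each call builds a smaller regular expression: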

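# Remove the stopwords 1000 at a time so each regex stays small enough to compile.
# as.character() guards against the column having been read as a factor.
stopwords <- as.character(df$mylistofstopwords)
chunk <- 1000
i <- 0
n <- length(stopwords)
while (i < n) {
    i2 <- min(i + chunk, n)
    corpus <- tm_map(corpus, removeWords, stopwords[(i + 1):i2])
    i <- i2
}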
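
To make the chunking idea concrete outside of tm, here is a minimal base R sketch. It builds the same kind of pattern that appears in the warning message, but one chunk at a time. The function name remove_words_chunked and its arguments are hypothetical, and the sketch assumes the stopwords contain no regex metacharacters (they are not escaped here).

```r
# Hypothetical helper: strips `words` from a character vector `texts`,
# compiling one alternation regex per chunk instead of one huge regex.
remove_words_chunked <- function(texts, words, chunk_size = 1000) {
  words <- as.character(words)
  # Longest entries first, so longer phrases are tried before shorter ones.
  words <- words[order(nchar(words), decreasing = TRUE)]
  # Split the list into groups of at most `chunk_size` entries.
  chunks <- split(words, ceiling(seq_along(words) / chunk_size))
  for (w in chunks) {
    pattern <- sprintf("(*UCP)\\b(%s)\\b", paste(w, collapse = "|"))
    texts <- gsub(pattern, "", texts, perl = TRUE)
  }
  texts
}

texts <- c("the quick brown fox", "a lazy dog")
cleaned <- remove_words_chunked(texts, c("the", "a", "lazy"), chunk_size = 2)
```

A smaller chunk_size means more passes over the text but smaller patterns; if 1000 still triggers the PCRE error for very long phrases, lowering it is the obvious knob to turn.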

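Alternatively, you can use a package that can handle long stopword lists. corpus is one such package; quanteda is another. Here is how to get a document-by-term matrix with the corpus package: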

library(corpus)
x <- term_matrix(corpus, drop = df$mylistofstopwords)

Here, the input argument corpus can be a tm corpus.
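
For quanteda, the other package mentioned above, a rough equivalent might look like the sketch below. It assumes the quanteda package is installed and uses a small made-up character vector and stopword list in place of the real data. Fixed matching is chosen so the stopwords are compared as literal strings rather than compiled into one large regular expression; multi-word phrases would additionally need to be wrapped in phrase().

```r
library(quanteda)

# Stand-in data (hypothetical); replace with your own texts and
# as.character(df$mylistofstopwords).
txt <- c(d1 = "the quick brown fox", d2 = "a lazy dog")
stop_list <- c("the", "a", "lazy")

toks <- tokens(txt)
# valuetype = "fixed" matches tokens literally instead of via regex.
toks_clean <- tokens_remove(toks, pattern = stop_list, valuetype = "fixed")

# Document-feature matrix built from the remaining tokens.
x <- dfm(toks_clean)
```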
