Question

I want to use quanteda to remove stopwords based on a custom list.

I am using this:

df <- data.frame(data = c("Here is an example text and why I write it", "I can explain and here you but I can help as I would like to help"))
mystopwords <- c("is","an")
corpus<- dfm(tokens_remove(tokens(df$data, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE), remove = c(stopwords(language = "el", source = "misc"), mystopwords), ngrams = c(4,6)))

But I get this error:

> Error in tokens_select(x, ..., selection = "remove") : 
  unused arguments (remove = c(stopwords(language = "en", source = "misc"), stopwords1), ngrams = c(4, 6))

What is the correct way to use the mystopwords list in quanteda?

Answer 1

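Building on @phiver's answer, this is the standard way to remove specific tokens in quanteda. You don't need stopwords(), because you are supplying the vector of tokens to remove yourself and the pattern argument accepts a vector; use valuetype = 'fixed' instead so each entry is matched literally.

I use dplyr (for its %>% pipe) to make the code more readable, but you don't have to.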

library(quanteda)
library(dplyr)
df <- data.frame(data = c("Here is an example text and why I write it", 
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

mystopwords <- c("is","an")
corpus <- 
  tokens(df$data,
         remove_punct = TRUE, 
         remove_numbers = TRUE, 
         remove_symbols = TRUE) %>%
  tokens_remove(pattern = mystopwords,
                valuetype = 'fixed') %>%
  dfm(ngrams = c(4,6))
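
To illustrate why valuetype matters here, a minimal sketch comparing 'fixed' with the default 'glob' matching. The wildcard pattern "an*" is only an illustration and is not part of the original question; the object names are made up for this example.

```r
library(quanteda)

toks <- tokens("Here is an example text and why I write it")

# "fixed" matches each pattern literally against the whole token,
# so "an" removes only the token "an" and leaves "and" alone
toks_fixed <- tokens_remove(toks, pattern = c("is", "an"), valuetype = "fixed")

# "glob" (the default) treats * and ? as wildcards,
# so the illustrative pattern "an*" also removes "and"
toks_glob <- tokens_remove(toks, pattern = c("is", "an*"), valuetype = "glob")

print(toks_fixed)
print(toks_glob)
```

With a plain word list like mystopwords both settings should behave the same; 'fixed' just makes sure nothing in the list is ever read as a wildcard.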

Answer 2

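This will work. First, I added stringsAsFactors = FALSE to the data.frame, because the text passed to tokens must be a character vector, not a factor. Next, I changed remove = in your code to pattern =, which is the argument name tokens_remove expects. Finally, the ngrams part needs to be inside the dfm function, not the tokens_remove function.

When nesting functions, it helps to format the code across several lines. That makes it much easier to spot mistakes like these.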
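
As a quick check of the factor point, a small base R sketch; the example strings and object names are made up for illustration. Note that since R 4.0.0 the default is already stringsAsFactors = FALSE, so the explicit argument mainly matters on older R versions.

```r
# Column stored as a factor (the default behaviour before R 4.0.0)
df_factor <- data.frame(data = c("first text", "second text"),
                        stringsAsFactors = TRUE)

# Column stored as plain character strings
df_char <- data.frame(data = c("first text", "second text"),
                      stringsAsFactors = FALSE)

print(class(df_factor$data))
print(class(df_char$data))

# An existing factor column can be converted before tokenizing
texts <- as.character(df_factor$data)
```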

library(quanteda)
df <- data.frame(data = c("Here is an example text and why I write it", 
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

mystopwords <- c("is","an")
corpus <- dfm(tokens_remove(tokens(df$data, 
                                   remove_punct = TRUE, 
                                   remove_numbers = TRUE, 
                                   remove_symbols = TRUE), 
                            pattern = c(stopwords(language = "el", source = "misc"), 
                                       mystopwords) 
                            ), 
              ngrams = c(4,6)
              )
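
A caveat for newer quanteda releases: as far as I know, dfm() there no longer accepts an ngrams argument, and n-grams are built on the tokens object with tokens_ngrams() instead. A sketch of the same steps under that assumption (the intermediate names toks and dfmat are mine, not from the question):

```r
library(quanteda)

df <- data.frame(data = c("Here is an example text and why I write it",
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

mystopwords <- c("is", "an")

# 1. tokenize and drop punctuation, numbers and symbols
toks <- tokens(df$data,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# 2. remove the built-in list plus the custom list in one call
toks <- tokens_remove(toks,
                      pattern = c(stopwords(language = "el", source = "misc"),
                                  mystopwords))

# 3. build 4-grams and 6-grams from the remaining tokens
toks <- tokens_ngrams(toks, n = c(4, 6))

# 4. convert to a document-feature matrix
dfmat <- dfm(toks)
print(dfmat)
```

Splitting the steps like this also lets you print toks after each stage to check that the stopwords were actually removed.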
