Using a specific list of stopwords in quanteda

Date: 2019-01-14 14:05:17

Tags: r quanteda

I want to remove stopwords with quanteda using a specific list of my own.

I use this:

df <- data.frame(data = c("Here is an example text and why I write it", "I can explain and here you but I can help as I would like to help"))
mystopwords <- c("is","an")
corpus<- dfm(tokens_remove(tokens(df$data, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE), remove = c(stopwords(language = "el", source = "misc"), mystopwords), ngrams = c(4,6)))

But I get this error:

> Error in tokens_select(x, ..., selection = "remove") : 
  unused arguments (remove = c(stopwords(language = "en", source = "misc"), stopwords1), ngrams = c(4, 6))

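What is the correct way to use the mystopwords list in quanteda?

2 answers:

Answer 0: (score: 1)

Building on @phiver's answer, this is the standard way to remove specific tokens in quanteda. There is no need to call stopwords(), because you are already supplying a vector of tokens to remove and the pattern argument accepts a vector. You should also set valuetype = 'fixed', so that each entry is matched as a literal whole token.

I use dplyr to make the code more readable (it provides the %>% pipe), but you don't have to.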

library(quanteda)
library(dplyr)
df <- data.frame(data = c("Here is an example text and why I write it", 
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

mystopwords <- c("is","an")
corpus <- 
  tokens(df$data,
         remove_punct = TRUE, 
         remove_numbers = TRUE, 
         remove_symbols = TRUE) %>%
  tokens_remove(pattern = mystopwords,
                valuetype = 'fixed') %>%
  dfm(ngrams = c(4,6))
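
A note for readers on newer quanteda releases: as far as I know, the ngrams argument shown above is no longer accepted there, so that last step may fail. Below is a minimal sketch of the same pipeline without the pipe operator, assuming a recent quanteda version and using tokens_ngrams() to build the 4-grams and 6-grams before calling dfm(). The names txt, toks, toks_ng and dfmat are only illustrative and do not come from the question.

```r
library(quanteda)

txt <- c("Here is an example text and why I write it",
         "I can explain and here you but I can help as I would like to help")
mystopwords <- c("is", "an")

# Tokenize, dropping punctuation, numbers and symbols
toks <- tokens(txt,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Remove the custom stopwords; "fixed" matches whole tokens literally,
# so "an" is removed but "and" is kept
toks <- tokens_remove(toks, pattern = mystopwords, valuetype = "fixed")

# Build 4-grams and 6-grams on the tokens (assumption: recent quanteda,
# where n-grams are formed with tokens_ngrams() rather than inside dfm())
toks_ng <- tokens_ngrams(toks, n = c(4, 6))

# Document-feature matrix of the n-grams
dfmat <- dfm(toks_ng)
```

Because the stopwords are removed before the n-grams are formed, the n-grams span the gaps left by the removed words.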

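Answer 1: (score: 0)

This will work. First, I added stringsAsFactors = FALSE to the data.frame call, because the text supplied to tokens must be a character vector, not a factor. Next, I changed remove = in your code, because the argument must be pattern =. Finally, the ngrams part needs to go in the dfm function, not in the tokens_remove function.

When nesting functions, it helps to spread the code over more lines. That makes possible mistakes easier to spot.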

library(quanteda)
df <- data.frame(data = c("Here is an example text and why I write it", 
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)

mystopwords <- c("is","an")
corpus <- dfm(tokens_remove(tokens(df$data, 
                                   remove_punct = TRUE, 
                                   remove_numbers = TRUE, 
                                   remove_symbols = TRUE), 
                            pattern = c(stopwords(language = "el", source = "misc"), 
                                       mystopwords) 
                            ), 
              ngrams = c(4,6)
              )
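
The point about factors can be checked in base R before anything is passed to tokens. This is a small sketch using the data from the question; the names df_factor, df_char and txt are made up for illustration. Note that from R 4.0.0 onwards stringsAsFactors defaults to FALSE, so setting it explicitly mainly matters on older R versions.

```r
texts <- c("Here is an example text and why I write it",
           "I can explain and here you but I can help as I would like to help")

# With stringsAsFactors = TRUE the column is stored as a factor
df_factor <- data.frame(data = texts, stringsAsFactors = TRUE)
is.factor(df_factor$data)

# With stringsAsFactors = FALSE the column stays a character vector
df_char <- data.frame(data = texts, stringsAsFactors = FALSE)
is.character(df_char$data)

# An existing factor column can be converted back with as.character()
txt <- as.character(df_factor$data)
```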