I want to use quanteda to remove a specific list of stopwords.
I am using this:
df <- data.frame(data = c("Here is an example text and why I write it", "I can explain and here you but I can help as I would like to help"))
mystopwords <- c("is","an")
corpus<- dfm(tokens_remove(tokens(df$data, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE), remove = c(stopwords(language = "el", source = "misc"), mystopwords), ngrams = c(4,6)))
But I get this error:
> Error in tokens_select(x, ..., selection = "remove") :
unused arguments (remove = c(stopwords(language = "en", source = "misc"), stopwords1), ngrams = c(4, 6))
What is the correct way to use the mystopwords list in quanteda?
Answer 0: (score: 1)
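Building on @phiver's answer, this is the standard way to remove specific tokens in quanteda. There is no need to call stopwords() here, because you are supplying the vector of tokens to remove yourself and the pattern argument accepts a vector. Use valuetype = 'fixed' so that the entries are matched as exact strings.
I use dplyr for the pipe to make the code more readable, but you do not have to.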
library(quanteda)
library(dplyr)  # only needed for the %>% pipe

df <- data.frame(data = c("Here is an example text and why I write it",
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)
mystopwords <- c("is", "an")

# tokenise, remove the custom stopwords, then build the dfm
corpus <-
  tokens(df$data,
         remove_punct = TRUE,
         remove_numbers = TRUE,
         remove_symbols = TRUE) %>%
  tokens_remove(pattern = mystopwords,
                valuetype = 'fixed') %>%
  dfm(ngrams = c(4,6))
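The same pipeline can also be written without the pipe and with the n-gram step made explicit. This is only a sketch, assuming a recent quanteda release in which n-grams are built with tokens_ngrams() instead of being passed through dfm(); the object names texts, toks, toks_clean, toks_ngrams and dfm_ngrams are made up for illustration and are not part of the original question.

```r
library(quanteda)

texts <- c("Here is an example text and why I write it",
           "I can explain and here you but I can help as I would like to help")
mystopwords <- c("is", "an")

# Tokenise, dropping punctuation, numbers and symbols
toks <- tokens(texts,
               remove_punct = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)

# Remove exact matches of the custom stopwords
toks_clean <- tokens_remove(toks, pattern = mystopwords, valuetype = "fixed")

# Build 4-grams and 6-grams from the remaining tokens (assumed replacement
# for the ngrams = c(4,6) argument used above)
toks_ngrams <- tokens_ngrams(toks_clean, n = c(4, 6))

# Document-feature matrix of the n-grams
dfm_ngrams <- dfm(toks_ngrams)
```

Keeping each step in its own object makes it easy to inspect the tokens before and after removal, for example with print(toks_clean).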
Answer 1: (score: 0)
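This will work. First, I added stringsAsFactors = FALSE to the data.frame. The text supplied to tokens must be a character vector, not a factor. Next, I changed remove = in your code, because it has to be pattern =. Finally, the ngrams part needs to go in the dfm function, not in the tokens_remove function.
When using nested functions, it helps to format the code more carefully. That makes possible mistakes easier to spot.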
library(quanteda)

df <- data.frame(data = c("Here is an example text and why I write it",
                          "I can explain and here you but I can help as I would like to help"),
                 stringsAsFactors = FALSE)
mystopwords <- c("is","an")

corpus <- dfm(tokens_remove(tokens(df$data,
                                   remove_punct = TRUE,
                                   remove_numbers = TRUE,
                                   remove_symbols = TRUE),
                            pattern = c(stopwords(language = "el", source = "misc"),
                                        mystopwords)
                            ),
              ngrams = c(4,6)
              )