I use this approach to remove stopwords from text:
dfm <- tokens(df$text,
              remove_punct = TRUE,
              remove_numbers = TRUE,
              remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()
But in the result I still find stopwords like this one:
dont
Is it possible to remove it without using a custom stopword list?
Answer 0: (score: 3)
When you say "remove them", I assume you mean removing dont from the tokens, while the existing stopword list only removes don't. (Although this is not entirely clear from your question, or from how some of the answers have interpreted it.) There are two simple solutions within the quanteda framework.
First, you can append additional removal patterns to the tokens_remove() call.
Second, you can process the character vector returned by stopwords() so that it also includes the versions without apostrophes.
Illustration:
library("quanteda")
## Package version: 1.5.1
toks <- tokens("I don't know what I dont or cant know.")
# original
tokens_remove(toks, c(stopwords("en")))
## tokens from 1 document.
## text1 :
## [1] "know" "dont" "cant" "know" "."
# manual addition
tokens_remove(toks, c(stopwords("en"), "dont", "cant"))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."
# automatic addition to stopwords
tokens_remove(toks, c(
stopwords("en"),
stringi::stri_replace_all_fixed(stopwords("en"), "'", "")
))
## tokens from 1 document.
## text1 :
## [1] "know" "know" "."
Answer 1: (score: 1)
The stopwords function itself cannot do this. However, you can easily build your own list from the "smart" dictionary and then drop the words you do not want (filter() here comes from dplyr):
library(dplyr)
my_stopwords <- data.frame(word = stopwords(source = "smart")) %>%
  filter(word != "dont")
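Note that tokens_remove() expects a character vector of patterns rather than a data frame, so the word column has to be pulled out before use. A minimal sketch, assuming dplyr is loaded and using a hypothetical toks tokens object (not from the question):

```r
library(dplyr)
library(quanteda)

# build the trimmed stopword list from the "smart" dictionary
my_stopwords <- data.frame(word = stopwords(source = "smart")) %>%
  filter(word != "dont")

# hypothetical example input, for illustration only
toks <- tokens("I dont know what I don't know.")

# pass the character vector, not the data frame, to tokens_remove()
tokens_remove(toks, pattern = my_stopwords$word)
```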
Answer 2: (score: 1)
You can try a couple of packages and functions to manage this. You seem confident with the tidyverse, so here is a solution along those lines.
Keep in mind that this is not a perfect method. If you have very few (short) texts, I think you can manage the errors by hand; if you do not know how many errors there are, or what they are, my solution may help.
library(quanteda) # for your purposes
library(qdap) # to detect errors
library(tidytext) # lovely package about tidy texts
Since you have not shared your data, here is some fake data:
df <- data.frame(id = c(1:2),text = c("dont panic", "don't panic"), stringsAsFactors = F)
df
id text
1 1 dont panic
2 2 don't panic
Now, the first step is to fix the errors:
unnested <- df %>% unnest_tokens(not.found,text) # one line per words
errors <- data.frame(check_spelling(unnested$not.found)) # check the errors, it could take time
full <- unnested %>% left_join(errors) # join them!
Here is the result:
full
id not.found row word.no suggestion more.suggestions
1 1 dont 1 1 don't donut, don, dot, docent, donate, donuts, dopant
2 1 panic NA <NA> <NA> NULL
3 2 don't NA <NA> <NA> NULL
4 2 panic NA <NA> <NA> NULL
Now it is easy to tidy it up:
full <- full %>%
# if there is a correction, replace the wrong word with it
mutate(word = ifelse(is.na(suggestion), not.found, suggestion)) %>%
# select useful columns
select(id,word) %>%
# group them and create the texts
group_by(id) %>%
summarise(text = paste(word, collapse = ' '))
full
# A tibble: 2 x 2
id text
<int> <chr>
1 1 don't panic
2 2 don't panic
Now you can run your original pipeline:
tokens(as.character(full$text),
       remove_punct = TRUE,
       remove_numbers = TRUE,
       remove_symbols = TRUE) %>%
  tokens_remove(pattern = stopwords(source = "smart")) %>%
  tokens_wordstem()
tokens from 2 documents.
text1 :
[1] "panic"
text2 :
[1] "panic"