I have a character vector of stopwords in R:
stopwords = c("a" ,
"able" ,
"about" ,
"above" ,
"abst" ,
"accordance" ,
...
"yourself" ,
"yourselves" ,
"you've" ,
"z" ,
"zero")
Suppose I have this string:
str <- c("I have zero a accordance")
How can I remove the stopwords I defined from str? I think gsub or another grep-family tool might be a good choice, though other suggestions are welcome.
Answer 0 (score: 14)
Try this:
str <- c("I have zero a accordance")
stopwords = c("a", "able", "about", "above", "abst", "accordance", "yourself",
"yourselves", "you've", "z", "zero")
x <- unlist(strsplit(str, " "))
x <- x[!x %in% stopwords]
paste(x, collapse = " ")
# [1] "I have"
Addendum: writing a "removeWords" function is simple, so there is no need to load an external package for this purpose:
removeWords <- function(str, stopwords) {
  # Split on single spaces, drop the stopwords, and rejoin
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}
removeWords(str, stopwords)
# [1] "I have"
Answer 1 (score: 14)
You can use the tm library, whose removeWords() function removes a given set of words from text.
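A minimal sketch of that call, assuming the tm package is installed. The whitespace cleanup at the end is an addition for illustration, since removeWords() deletes the words but leaves the spaces around them in place; the variable name cleaned is hypothetical.

```r
# Assumes the tm package is available: install.packages("tm")
library(tm)

str <- "I have zero a accordance"
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

# removeWords() strips the listed words but keeps the surrounding spaces
cleaned <- removeWords(str, stopwords)

# Collapse runs of whitespace and trim both ends
cleaned <- trimws(gsub("\\s+", " ", cleaned))
cleaned
```

The extra cleanup step is only needed if leftover spaces matter for later processing.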
Answer 2 (score: 0)
If stopwords is long, the removeWords() solution should be much faster than any regex-based solution.
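For comparison, a regex-based approach with gsub (the tool the question mentions) might look like the sketch below. It assumes the stopwords contain no regex metacharacters that would need escaping, and the names pattern and cleaned are illustrative only.

```r
str <- "I have zero a accordance"
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

# Build one alternation anchored on word boundaries: \b(a|able|...)\b
pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")

# Delete every match, then collapse the leftover whitespace
cleaned <- gsub(pattern, "", str, perl = TRUE)
cleaned <- trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
cleaned
```

The pattern grows with the length of stopwords, which is one reason a split-and-match approach may scale better for long lists.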
For completeness, if str is a vector of strings, you can write:
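library("magrittr")
library("stringr")
library("purrr")

remove_words <- function(x, .stopwords) {
  x %>%
    stringr::str_split(" ") %>%
    purrr::flatten_chr() %>%
    # discard() keeps repeated words; setdiff() would also drop duplicates
    purrr::discard(~ .x %in% .stopwords) %>%
    stringr::str_c(collapse = " ")
}

# Apply remove_words() to each element of the character vector
purrr::map_chr(str, remove_words, .stopwords = stopwords)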
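A dependency-free variant of the same idea can be sketched in base R. The function name remove_words_base and the two-element input vector are hypothetical, chosen only for illustration; like the answers above, it splits on single spaces and matches case-sensitively.

```r
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

remove_words_base <- function(x, stopwords) {
  # strsplit() is vectorised: it returns one token vector per input string
  tokens <- strsplit(x, " ", fixed = TRUE)
  # Drop stopwords from each token vector and glue the rest back together
  vapply(tokens,
         function(w) paste(w[!w %in% stopwords], collapse = " "),
         character(1))
}

result <- remove_words_base(c("I have zero a accordance", "zero is a number"),
                            stopwords)
result
# [1] "I have"    "is number"
```

Using vapply() with character(1) guarantees one output string per input string, even when every word in an input is removed.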