Remove the words in a character vector from a string

Time: 2016-03-04 07:46:00

Tags: r

I have a character vector of stopwords in R:

stopwords = c("a" ,
            "able" ,
            "about" ,
            "above" ,
            "abst" ,
            "accordance" ,
            ...
            "yourself" ,
            "yourselves" ,
            "you've" ,
            "z" ,
            "zero")

Suppose I have the string:

str <- c("I have zero a accordance")

How can I remove the stopwords I defined from str?

I think gsub or another grep-style tool might be a good fit, though other suggestions are welcome.

3 answers:

Answer 0: (score: 14)

Try this:

str <- c("I have zero a accordance")

stopwords = c("a", "able", "about", "above", "abst", "accordance", "yourself",
"yourselves", "you've", "z", "zero")

x <- unlist(strsplit(str, " "))

x <- x[!x %in% stopwords]

paste(x, collapse = " ")

# [1] "I have"

Addendum: writing a "removeWords" function is simple, so there is no need to load an external package for this purpose:

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(str, stopwords)
# [1] "I have"
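
Because the function above pastes everything back into one string, it assumes str holds a single string. As a minimal base R sketch of the same split / filter / paste idea applied element by element, assuming words are separated by single spaces (the names remove_words_vec and demo_stopwords are hypothetical and not from the original answer):

```r
# Hypothetical helper: applies the split / filter / paste steps
# to each element of a character vector separately.
remove_words_vec <- function(strings, stopwords) {
  vapply(
    strsplit(strings, " ", fixed = TRUE),
    # Keep only the words that are not in the stopword vector.
    function(words) paste(words[!words %in% stopwords], collapse = " "),
    character(1)
  )
}

# Shortened stopword vector, for illustration only.
demo_stopwords <- c("a", "accordance", "zero")
remove_words_vec(c("I have zero a accordance", "zero is a number"), demo_stopwords)
# [1] "I have"    "is number"
```

Matching is exact and case-sensitive here, so "Zero" or "zero," would be kept unless the input is normalized first.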

Answer 1: (score: 14)

You can use the tm library:

library(tm)
removeWords(str, stopwords)

Answer 2: (score: 0)

If stopwords is long, the removeWords() solution should be much faster than any regex-based solution.

For completeness, if str is a vector of strings, you can write:
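
For comparison, a regex-based variant along the lines the question suggests might look like the sketch below. It assumes the stopwords contain no regex metacharacters, and the names remove_words_regex and demo_stopwords are hypothetical:

```r
# Hypothetical helper: regex-based removal using word boundaries.
# Assumes no stopword contains regex metacharacters.
remove_words_regex <- function(strings, stopwords) {
  # Build one alternation such as \b(a|able|...)\b from the stopword vector.
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  stripped <- gsub(pattern, "", strings, perl = TRUE)
  # Collapse the runs of whitespace left behind and trim both ends.
  trimws(gsub("\\s+", " ", stripped))
}

# Shortened stopword vector, for illustration only.
demo_stopwords <- c("a", "able", "accordance", "z", "zero")
remove_words_regex("I have zero a accordance", demo_stopwords)
# [1] "I have"
```

Word-boundary matching is not the same as exact token matching: with the stopword "a", for instance, the token "a-b" would be reduced to "-b". To check the speed claim on your own data, each approach can be wrapped in system.time().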

library("magrittr")
library("stringr")
library("purrr")

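# Split one string on spaces, drop the stopwords, and re-join the rest.
remove_words <- function(x, .stopwords) {
  x %>%
    stringr::str_split(" ") %>%
    purrr::flatten_chr() %>%
    # Drop stopwords; repeated non-stopwords and word order are preserved.
    purrr::discard(~ .x %in% .stopwords) %>%
    stringr::str_c(collapse = " ")
}
# Apply to each element of str, returning a character vector.
purrr::map_chr(str, remove_words, .stopwords = stopwords)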