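I have a data frame that contains text, and I am trying to remove certain words, stored in a vector, from that text. Please help me achieve this!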
stopwords <- c("today","hot","outside","so","its")
df <- data.frame(a = c("a1", "a2", "a3"), text = c("today the weather looks hot", "its so rainy outside", "today its sunny"))
Expected output:
   a                        text          new_text
1 a1 today the weather looks hot the weather looks
2 a2        its so rainy outside             rainy
3 a3             today its sunny             sunny
Answer 0 (score: 1)
Paste all the stopwords together into a single regular expression, then use gsub to remove them.
df$new_text <- trimws(gsub(paste0(stopwords, collapse = "|"), "", df$text))
df
#   a                        text          new_text
#1 a1 today the weather looks hot the weather looks
#2 a2        its so rainy outside             rainy
#3 a3             today its sunny             sunny
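To see what gsub is actually matching, it can help to print the pattern on its own. The sketch below reuses only the stopwords vector from the question; the names pattern and cleaned are introduced here purely for illustration.

```r
stopwords <- c("today", "hot", "outside", "so", "its")

# Collapse the vector into one alternation: a match on any single word counts.
pattern <- paste0(stopwords, collapse = "|")
pattern
# [1] "today|hot|outside|so|its"

# gsub() deletes every match; trimws() strips only leading/trailing whitespace.
cleaned <- trimws(gsub(pattern, "", "its so rainy outside"))
cleaned
# [1] "rainy"
```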
Or with str_remove_all:
stringr::str_remove_all(df$text, paste0(stopwords, collapse = "|"))
To be safer, add word boundaries around each stopword so that the "so" inside "some" or "something" is not removed.
df$new_text <- trimws(gsub(paste0("\\b", stopwords, "\\b", collapse = "|"),
                           "", df$text))
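A small comparison may make the effect of the word boundaries easier to see. This is only a sketch: the two test sentences are made up for illustration and do not come from the question, and the final whitespace-collapsing step is an optional extra rather than part of the answer above.

```r
stopwords <- c("today", "hot", "outside", "so", "its")
x <- "something so hot"  # made-up sentence, not from the question

plain   <- paste0(stopwords, collapse = "|")
bounded <- paste0("\\b", stopwords, "\\b", collapse = "|")

# Without boundaries, the "so" at the start of "something" is removed as well.
without_b <- trimws(gsub(plain, "", x))
without_b
# [1] "mething"

# With boundaries, only whole words are removed.
with_b <- trimws(gsub(bounded, "", x))
with_b
# [1] "something"

# Optional: collapse runs of spaces left in the middle of a string,
# which trimws() alone does not touch.
y <- "today so the hot weather"  # also made up
squished <- trimws(gsub("\\s+", " ", gsub(bounded, "", y)))
squished
# [1] "the weather"
```

Note that matching is case-sensitive here; passing ignore.case = TRUE to gsub is one way to also catch capitalised words such as "Today", if that is wanted.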