I need to remove a set of predefined strings from the sentences in a data frame:
sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
"love my samsung galaxy tabinch gb whitethis is the first"), user = c(1,2))
Sentences user
bad printer for the money wireless setup was surprisingly easy 1
love my samsung galaxy tabinch gb whitethis is the first 2
The strings to exclude are defined like this, for example:
stop_words <- c("bad", "money", "love", "is", "the")
I was thinking of something like this:
library(stringr)
words1 <- (str_split(unlist(sent1$Sentences)," "))
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]
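But this only handles the first element, and I need it for every item in the list. I also need the output table to have the same structure as the input table sent1, just without the defined strings.
I would really appreciate any help or suggestions.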
Answer 0: (score: 5)
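You can combine the stop words into a single regular expression pattern, so that one gsub call is enough. The \\b anchors restrict matches to whole words (the "is" inside "whitethis" is left alone), and the optional trailing space removes the gap each deleted word would otherwise leave behind.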
# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"
# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"
# store result in a data frame
data.frame(Sentences = res)
# Sentences
# 1 printer for wireless setup was surprisingly easy
# 2 my samsung galaxy tabinch gb whitethis first
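The question asks for output with the same structure as sent1, whereas data.frame(Sentences = res) keeps only the Sentences column. One way to preserve the other columns is to copy the data frame and overwrite just that column. The following is a minimal sketch under a few assumptions that are not in the original answer: sent2 is a hypothetical name for the copy, perl = TRUE is set so the (?:...) group and \\b are handled by the PCRE engine, stringsAsFactors = FALSE keeps the sentences as character vectors on older R versions, and trimws() removes the trailing space left when a stop word ends a sentence.

```r
# Same input as in the question; stringsAsFactors = FALSE is an added assumption
sent1 <- data.frame(
  Sentences = c("bad printer for the money wireless setup was surprisingly easy",
                "love my samsung galaxy tabinch gb whitethis is the first"),
  user = c(1, 2),
  stringsAsFactors = FALSE
)
stop_words <- c("bad", "money", "love", "is", "the")

# One alternation of all stop words, matched as whole words only
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")

# Copy the data frame and replace only the Sentences column,
# so the user column (and any others) is carried over unchanged
sent2 <- sent1
sent2$Sentences <- trimws(gsub(pattern, "", sent1$Sentences, perl = TRUE))

sent2
```

Note that the stop words are pasted into the pattern verbatim, so a stop word containing regex metacharacters (such as "." or "+") would need to be escaped first.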