Remove defined strings from sentences in a data frame

Date: 2015-02-04 15:36:39

Tags: regex r string

I need to remove a defined set of strings from the sentences in a data frame:

sent1 = data.frame(Sentences=c("bad printer for the money wireless setup was surprisingly easy",
                           "love my samsung galaxy tabinch gb whitethis is the first"), User = c(1,2))

Sentences                                                            User
bad printer for the money wireless setup was surprisingly easy        1
love my samsung galaxy tabinch gb whitethis is the first              2

The strings to exclude are defined as, for example:

stop_words <- c("bad", "money", "love", "is", "the")

I was thinking of something like this:

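library(stringr)
# split each sentence into words (one list element per sentence)
words1 <- (str_split(unlist(sent1$Sentences)," "))
# positions of the stop words in the first sentence only
ddd = which(words1[[1]] %in% stop_words)
words1[[1]][-ddd]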

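But this only handles the first element, and I need it for every item in the list. The output table should then have the same structure as the input table sent1, just without the defined strings.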

I would really appreciate any help or suggestions.

1 answer:

Answer 0: (score: 5)

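You can combine the stop words into a single regular expression pattern, so that one gsub call is enough. The \\b anchors restrict matches to whole words, which is why the "is" and "the" inside "whitethis" are left alone, and the trailing " ?" also removes the space that follows a removed word.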

# create regex pattern
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
# [1] "\\b(?:bad|money|love|is|the)\\b ?"

# remove stop words
res <- gsub(pattern, "", sent1$Sentences)
# [1] "printer for wireless setup was surprisingly easy"
# [2] "my samsung galaxy tabinch gb whitethis first"

# store result in a data frame
data.frame(Sentences = res)
#                                          Sentences
# 1 printer for wireless setup was surprisingly easy
# 2     my samsung galaxy tabinch gb whitethis first
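
The question also asks for output with the same structure as sent1, while the data frame above keeps only the Sentences column. A minimal sketch of one way to keep the other columns is to overwrite Sentences in a copy of the input. The names out_regex, out_split and remove_stop_words are made up for this illustration, and perl = TRUE, trimws and stringsAsFactors = FALSE are additions that the answer itself does not use. The second variant follows the split-and-filter idea from the question and applies it to every row.

```r
# Example data from the question; stringsAsFactors = FALSE (an assumption here)
# keeps Sentences as a character vector on older R versions.
sent1 <- data.frame(
  Sentences = c("bad printer for the money wireless setup was surprisingly easy",
                "love my samsung galaxy tabinch gb whitethis is the first"),
  User = c(1, 2),
  stringsAsFactors = FALSE
)
stop_words <- c("bad", "money", "love", "is", "the")

# Variant 1: the regex from the answer, written back into a copy of sent1.
# trimws() drops the space left behind when a stop word ends a sentence.
pattern <- paste0("\\b(?:", paste(stop_words, collapse = "|"), ")\\b ?")
out_regex <- sent1
out_regex$Sentences <- trimws(gsub(pattern, "", sent1$Sentences, perl = TRUE))

# Variant 2: split on spaces and drop stop words, applied to every sentence.
# remove_stop_words is a hypothetical helper, not part of any package.
remove_stop_words <- function(sentence, stop_words) {
  words <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  paste(words[!words %in% stop_words], collapse = " ")
}
out_split <- sent1
out_split$Sentences <- vapply(sent1$Sentences, remove_stop_words,
                              character(1), stop_words = stop_words,
                              USE.NAMES = FALSE)

print(out_regex)
```

Both variants give the same sentences for this input and leave the User column untouched. The regex variant treats each stop word as a pattern, so stop words containing regex metacharacters (for example "." or "+") would need escaping first; the split variant compares literal strings and avoids that issue.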