It treats "Samsung" and "samsung" as duplicates

Question

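This is just to help someone who has voluntarily deleted their question, following a request for the code they had tried and other comments. Let's assume they tried something like this: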

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and were hoping to learn a better way. So what is the best way to remove duplicated words from a string?

Answer 1

If you are still interested in alternative solutions, you can use unique, which simplifies your code slightly.

paste(unique(d), collapse = ' ')
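Besides being shorter, `unique` sidesteps an edge case in the `-which(duplicated(d))` form from the question: when no word is duplicated, `which()` returns an empty integer vector, and negative indexing with it selects nothing at all. A small sketch of the problem (the input string `s` is made up for illustration):

```r
s <- "no repeated words here"   # hypothetical input without duplicates
d <- unlist(strsplit(s, split = " "))

# -which(...) is integer(0) here, so the subset is empty
broken <- paste(d[-which(duplicated(d))], collapse = " ")

# Logical negation keeps every element when nothing is duplicated
fixed <- paste(d[!duplicated(d)], collapse = " ")
```

Here `broken` ends up as an empty string, while `fixed` (like `unique(d)`) returns the input unchanged.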

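As per Thomas's comment, you probably do want to remove punctuation. R's gsub supports some handy built-in character classes (such as [[:punct:]]) that you can use instead of spelling out a strict regular expression. Of course, you can always specify particular characters if you want a more refined regex.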

d <- gsub("[[:punct:]]", "", d)
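Putting the two steps together, a minimal sketch might look like the following. The function name `remove_dup_words` is invented here, and stripping punctuation before comparing words is an assumption about what should count as a duplicate:

```r
remove_dup_words <- function(s) {
  d <- unlist(strsplit(s, split = " "))   # split on single spaces
  d <- gsub("[[:punct:]]", "", d)         # strip punctuation from each word
  paste(unique(d), collapse = " ")        # keep first occurrence of each word
}

cleaned <- remove_dup_words(
  "How do I best try and try and try and find a way to to improve this code?"
)
```

Note that `[[:punct:]]` also matches apostrophes, so a word like "here's" would become "heres" with this approach.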

Answer 2

No additional packages are needed.

str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

Function for a single string:

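rem_dup.one <- function(x) {
  # Split on spaces and punctuation, but keep apostrophes inside words
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", fixed = FALSE, perl = TRUE))
  # Lowercase, trim, and keep only the first occurrence of each word
  paste(unique(tolower(trimws(words))), collapse = " ")
}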
rem_dup.one("And and here's a second one one and not a third One.")

Vectorized version:

rem_dup.vector <- Vectorize(rem_dup.one,USE.NAMES = F)
rem_dup.vector(str)

Result:

"how do i best try and find a way to improve this code" "and here's a second one not third"

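`Vectorize` is a wrapper around `mapply`; a roughly equivalent sketch with `vapply` makes the return type explicit. This variant also drops the empty tokens that `strsplit` produces when two delimiters are adjacent (for example a comma followed by a space), which would otherwise leave a double space in the output. The name `rem_dup_vapply` is made up for illustration:

```r
rem_dup_vapply <- function(x) {
  vapply(x, function(s) {
    # Same split pattern as above: spaces and punctuation except apostrophes
    words <- unlist(strsplit(s, split = "(?!')[ [:punct:]]", perl = TRUE))
    words <- tolower(words[nzchar(words)])  # drop empty tokens, then lowercase
    paste(unique(words), collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

res <- rem_dup_vapply(c("One, two, two, three", "And and here's a second one."))
```
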
Answer 3

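I'm not sure whether string case is a concern. This solution uses qdap together with the add-on qdapRegex package to make sure that punctuation and the case of the first letter do not interfere with the removal but are still preserved: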

str <- c("How do I best try and try and try and find a way to to improve this code?",
    "And and here's a second one one and not a third One.")

library(qdap)
library(dplyr) # provides the pipe operator (%>%)

str %>% 
    tolower() %>%
    word_split() %>% 
    sapply(., function(x) unbag(unique(x))) %>% 
    rm_white_endmark() %>%  
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()

## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."

Answer 4

To remove duplicated words while leaving any special characters untouched, use this function:

rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)
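Lowercasing the whole string also lowercases the output. If the original casing should be preserved, one option is to compare on a lowercased copy only and keep the words as written. A sketch, using the hypothetical name `rem_dup_keep_case`:

```r
rem_dup_keep_case <- function(x) {
  words <- unlist(strsplit(trimws(x), split = " +"))
  # duplicated() runs on the lowercased copy; the original words are kept
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

kept <- rem_dup_keep_case(
  "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
)
```

With this variant the first spelling of each word wins, so "Samsung" is kept and the later "samsung" is dropped.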

Removing duplicated words in a string in R
