Removing duplicate words from a string in R

Time: 2013-11-29 10:31:33

Tags: r duplicates

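This is just to help someone who voluntarily deleted their question after being asked, in the comments, for the code they had tried. Let's assume they tried something like this: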

str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')

and wanted to learn a better way. So what is the best way to remove duplicate words from a string?

4 answers:

Answer 0: (score: 4)

If you are still interested in alternative solutions, you can use unique, which simplifies your code slightly.

paste(unique(d), collapse = ' ')
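One more reason to prefer unique (or logical indexing) over the -which(...) form in the question: when the string has no repeated words, which() returns an empty vector and the negative index drops everything. A small sketch, using a made-up input string:

```r
d <- unlist(strsplit("no repeated words here", split = " "))

# which() returns integer(0) when nothing is duplicated, and indexing
# with -integer(0) selects nothing at all
broken <- d[-which(duplicated(d))]
length(broken)

# Logical indexing keeps the first occurrence of every word
kept <- d[!duplicated(d)]

# unique() gives the same result
identical(kept, unique(d))
```

So d[!duplicated(d)] and unique(d) are safe whether or not duplicates exist.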

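As Thomas's comment suggests, you probably do want to remove punctuation. R's gsub supports some handy built-in character classes, such as [[:punct:]], that you can use instead of spelling out every character in the regular expression. Of course, if you want a more refined regex, you can always list the specific characters yourself.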

d <- gsub("[[:punct:]]", "", d)
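Putting the pieces together, the punctuation has to go before the call to unique, otherwise "code?" and "code" would count as different words. A minimal sketch on the string from the question:

```r
str <- "How do I best try and try and try and find a way to to improve this code?"

d <- unlist(strsplit(str, split = " "))
d <- gsub("[[:punct:]]", "", d)   # "code?" becomes "code"

result <- paste(unique(d), collapse = " ")
result
# [1] "How do I best try and find a way to improve this code"
```

Note that [[:punct:]] also matches apostrophes, so a word like "here's" would become "heres" with this approach.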

Answer 1: (score: 2)

No additional packages needed

str <- c("How do I best try and try and try and find a way to to improve this code?",
         "And and here's a second one one and not a third One.")

Function for a single string:

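rem_dup.one <- function(x) {
  # Split on spaces and punctuation, but not on apostrophes
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  # Lower-case so that "One" and "one" count as the same word
  paste(unique(tolower(trimws(words))), collapse = " ")
}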
rem_dup.one("And and here's a second one one and not a third One.")

Vectorized:

rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)

Result:

"how do i best try and find a way to improve this code" "and here's a second one not third" 

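One thing to watch with this split pattern: a comma followed by a space produces an empty token between the two delimiters, which would show up as a doubled space in the result. Below is a hedged variant that drops empty tokens and uses vapply instead of Vectorize. The name rem_dup_words and the test strings are made up for illustration.

```r
# Hypothetical variant of the function above: same split pattern, but
# empty tokens (from adjacent delimiters such as ", ") are discarded
rem_dup_words <- function(x) {
  tokens <- strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE)
  vapply(tokens, function(words) {
    words <- tolower(words)
    words <- words[nzchar(words)]   # drop empty strings
    paste(unique(words), collapse = " ")
  }, character(1))
}

out <- rem_dup_words(c("One, one, two", "To to be"))
out
# [1] "one two" "to be"
```

vapply also guarantees that every element comes back as a single character string.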
Answer 2: (score: 1)

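I'm not sure whether string case is a concern. This solution uses qdap together with the add-on qdapRegex package so that punctuation and the capital letter at the start of the string do not interfere with the removal but are still preserved: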

str <- c("How do I best try and try and try and find a way to to improve this code?",
    "And and here's a second one one and not a third One.")

library(qdap)
library(dplyr) # provides the pipe operator %>%

str %>% 
    tolower() %>%
    word_split() %>% 
    sapply(., function(x) unbag(unique(x))) %>% 
    rm_white_endmark() %>%  
    rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
    unname()

## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
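If installing qdap is not an option, something similar can be sketched in base R by comparing words on a normalized key (lower-cased, punctuation stripped) while keeping the first occurrence exactly as written. The helper name rem_dup_keep_case is an assumption for illustration, not part of any package:

```r
# Hypothetical helper: drop repeated words, comparing them without case
# or punctuation, but keep the first occurrence as originally written
rem_dup_keep_case <- function(x) {
  words <- strsplit(x, " ", fixed = TRUE)[[1]]
  key <- tolower(gsub("[[:punct:]]", "", words))
  paste(words[!duplicated(key)], collapse = " ")
}

res <- rem_dup_keep_case("And and here's a second one one and not a third One.")
res
# [1] "And here's a second one not third"
```

Unlike the qdap pipeline, this sketch loses the final period, because the last token "One." is itself a duplicate and is dropped together with its punctuation.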

Answer 3: (score: 0)

To remove duplicate words while leaving any special characters in place, use this function:

rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ", fixed = FALSE, perl = TRUE)))), collapse = " ")
}

Input data:

duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"

rem_dup_word(duptest)

Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)

It treats "Samsung" and "samsung" as duplicates.