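This is just to help someone who has voluntarily deleted their question, following a request for the code they tried and some other comments. Let's assume they tried something like this: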
str <- "How do I best try and try and try and find a way to to improve this code?"
d <- unlist(strsplit(str, split=" "))
paste(d[-which(duplicated(d))], collapse = ' ')
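and wanted to learn a better way. So what is the best way to remove duplicate words from a string?
Answer 0 (score: 4)
If you are still interested in alternative solutions, you can use unique, which simplifies your code slightly.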
paste(unique(d), collapse = ' ')
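Besides being shorter, unique also avoids a pitfall in the -which(duplicated(d)) idiom from the question: when there are no duplicates at all, which() returns an empty index and the negative subscript drops every element. A minimal sketch, where the input string is made up for illustration:

```r
# Hypothetical input with no repeated words
d <- unlist(strsplit("no repeated words here", split = " "))

# which() returns integer(0), and d[-integer(0)] selects nothing
dropped <- d[-which(duplicated(d))]

# Logical indexing and unique() both keep all four words
kept_logical <- d[!duplicated(d)]
kept_unique <- unique(d)

length(dropped)
paste(kept_unique, collapse = " ")
```

So d[!duplicated(d)] or unique(d) is the safer choice whenever the input might contain no repeats.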
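As per Thomas's comment, you probably do want to remove punctuation. R's gsub supports some handy built-in character classes, such as [[:punct:]], that you can use instead of spelling out a strict regular expression. Of course, if you want a more refined regex, you can always list the specific characters yourself.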
d <- gsub("[[:punct:]]", "", d)
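Putting the pieces together, the punctuation has to be stripped before the duplicates are removed, otherwise a token such as "code?" would not match "code". A minimal end-to-end sketch using the string from the question (the variable name result is only for illustration):

```r
str <- "How do I best try and try and try and find a way to to improve this code?"

# Split on single spaces
d <- unlist(strsplit(str, split = " "))

# Strip punctuation from every token first
d <- gsub("[[:punct:]]", "", d)

# Then keep the first occurrence of each word
result <- paste(unique(d), collapse = " ")
result
```

Note that [[:punct:]] also removes apostrophes, so "here's" would become "heres" with this approach.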
Answer 1 (score: 2)
No additional packages are needed.
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
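Function for a single string:
rem_dup.one <- function(x) {
  # Split on spaces and punctuation, but keep apostrophes (e.g. "here's")
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  paste(unique(tolower(trimws(words))), collapse = " ")
}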
rem_dup.one("And and here's a second one one and not a third One.")
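One thing to watch with this split pattern: it matches one delimiter character at a time, so a comma followed by a space leaves an empty token between them, which then shows up as a double space in the result. The following is a sketch of one possible variant, not the answer author's code; the name rem_dup.clean and the test sentence are made up for illustration:

```r
rem_dup.clean <- function(x) {
  # Same split as rem_dup.one: spaces and punctuation, apostrophes kept
  words <- unlist(strsplit(x, split = "(?!')[ [:punct:]]", perl = TRUE))
  words <- tolower(trimws(words))
  # Drop the empty tokens produced by adjacent delimiters such as ", "
  words <- words[nzchar(words)]
  paste(unique(words), collapse = " ")
}

out <- rem_dup.clean("One, two, two, and one.")
out
```

The two example strings in this answer contain no adjacent delimiters, so both versions give the same result for them.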
Vectorized:
rem_dup.vector <- Vectorize(rem_dup.one, USE.NAMES = FALSE)
rem_dup.vector(str)
Result:
"how do i best try and find a way to improve this code" "and here's a second one not third"
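Answer 2 (score: 1)
I'm not sure whether letter case is a concern. This solution uses qdap together with its companion qdapRegex package so that punctuation and the capitalization at the start of each string do not interfere with the removal but are still preserved: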
str <- c("How do I best try and try and try and find a way to to improve this code?",
"And and here's a second one one and not a third One.")
library(qdap)
library(dplyr) # provides the pipe operator %>%
str %>%
tolower() %>%
word_split() %>%
sapply(., function(x) unbag(unique(x))) %>%
rm_white_endmark() %>%
rm_default(pattern="(^[a-z]{1})", replacement = "\\U\\1") %>%
unname()
## [1] "How do i best try and find a way to improve this code?"
## [2] "And here's a second one not third."
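For readers who would rather not install qdap, the final re-capitalization step can be approximated in base R. This is only a sketch of that one step, assuming a Perl-style replacement with \\U does the same job here; capitalize_first is a hypothetical helper name and not part of qdap or qdapRegex:

```r
capitalize_first <- function(x) {
  # \\U upper-cases the captured first letter; this requires perl = TRUE
  sub("^([a-z])", "\\U\\1", x, perl = TRUE)
}

capped <- capitalize_first(c("how do i best try", "and here's a second one"))
capped
```

As in the qdap output above, only the first letter of each string is restored, so a mid-sentence "I" stays lower-case.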
Answer 3 (score: 0)
To remove duplicate words while leaving any special characters untouched, use this function:
rem_dup_word <- function(x) {
  x <- tolower(x)
  paste(unique(trimws(unlist(strsplit(x, split = " ")))), collapse = " ")
}
Input data:
duptest <- "Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)"
rem_dup_word(duptest)
Output: samsung wa80e5lec top loading with diamond drum, 6 kg (silver)
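If lower-casing the whole string is not wanted, the comparison can be made case-insensitive while the first occurrence keeps its original spelling. A minimal sketch under that assumption; rem_dup_keep_case is a hypothetical name, not part of the answer:

```r
rem_dup_keep_case <- function(x) {
  words <- unlist(strsplit(x, split = " "))
  # Compare in lower case, but return the words as originally written
  paste(words[!duplicated(tolower(words))], collapse = " ")
}

kept <- rem_dup_keep_case("Samsung WA80E5LEC samsung Top Loading with Diamond Drum, 6 kg (Silver)")
kept
```

Here the second "samsung" is dropped because it matches "Samsung" once both are lower-cased, and the remaining words keep their capitalization.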