Question

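I'm not quite sure how to phrase this question. I've just started working with a bunch of tweets, and after some basic cleaning some of them now look like this: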

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

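Basically, I want to remove the duplicates by checking whether the beginning of one string matches another string, and keep only the longest version. In this case, my result should be: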

[1] "stackoverflow is a great site"
[2] "omg it is friday and so sunny"
[3] "arggh how annoying"

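because all the others are truncated duplicates of the strings above. I tried the unique() function, but it doesn't return what I want because it compares the full length of each string. Any pointers?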

I'm using R version 3.1.1 on Mac OS X 10.7...

Thanks!

Answer 1

Here's another option. I added one more string to your sample data.

x <- c("stackoverflow is a great site",
"stackoverflow is a great si",
"stackoverflow is a great",
"stackoverflow is an OK site",
"omg it is friday and so sunny",
"omg it is friday and so",
"arggh how annoying")

Filter(function(y) {
    x2 <- sapply(setdiff(x, y), substr, start=1, stop=nchar(y))
    ! duplicated(c(y, x2), fromLast=TRUE)[1]
}, x)


# [1] "stackoverflow is a great site" "stackoverflow is an OK site"   "omg it is friday and so sunny"
# [4] "arggh how annoying"

Answer 2

Here's my attempt:

library(stringr)
x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
[1] "stackoverflow is a great site" "omg it is friday and so sunny" "arggh how annoying"

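Basically, I exclude the strings that are already contained in any of the other strings. This may differ slightly from what you described, but it is roughly the same and very simple.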
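For the stricter reading of the question, where a string is dropped only if it is the beginning of another string (not merely contained somewhere inside it), a base R sketch could look like the following. The helper name `drop_truncated` is hypothetical and not part of the answer above; it also collapses exact duplicates first, which is an assumption about the desired behaviour.

```r
# Hypothetical helper (not from the original answer): keep only the strings
# that are NOT a prefix of some other string in the vector.
drop_truncated <- function(x) {
  x <- unique(x)  # assumption: exact duplicates should collapse to one copy
  is_prefix <- vapply(seq_along(x), function(i) {
    others <- x[-i]
    # Compare x[i] with the first nchar(x[i]) characters of every other string
    any(substr(others, 1, nchar(x[i])) == x[i])
  }, logical(1))
  x[!is_prefix]
}

x <- c("stackoverflow is a great site",
       "stackoverflow is a great si",
       "stackoverflow is a great",
       "omg it is friday and so sunny",
       "omg it is friday and so",
       "arggh how annoying")

drop_truncated(x)

# "great" occurs inside the first string but is not a prefix of it,
# so this variant keeps both strings.
drop_truncated(c("a great site", "great"))
```

Unlike the containment check, this keeps a short tweet that happens to appear in the middle of a longer one, which may or may not be what you want for your data.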

Answer 3

@tonytonov's solution is nice, but I'd suggest using the stringi package :)

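library(stringi)
library(stringr)

# stri_detect_fixed() matches x[i] literally
stringi <- function(x){
  x[!sapply(seq_along(x), function(i) any(stri_detect_fixed(x[-i], x[i])))]
}

# str_detect() treats x[i] as a regular expression by default
stringr <- function(x){
  x[!sapply(seq_along(x), function(i) any(str_detect(x[-i], x[i])))]
}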

require(microbenchmark)
microbenchmark(stringi(x), stringr(x))
Unit: microseconds
       expr     min       lq   median       uq      max neval
 stringi(x)  52.482  58.1760  64.3275  71.9630  120.374   100
 stringr(x) 538.482 551.0485 564.3445 602.7095 1736.601   100
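Apart from speed, the two versions differ in how they read the pattern: `stri_detect_fixed()` matches it literally, while `str_detect()` interprets it as a regular expression by default. That matters for tweets containing characters such as `+` or `?`. The sketch below illustrates the difference using base R's `grepl()` as a stand-in, only so that it runs without extra packages; the helper `drop_contained` and the sample strings are made up for illustration.

```r
# Hypothetical helper: drop every string that is contained in another one.
# fixed = TRUE  -> literal matching (the behaviour of stri_detect_fixed)
# fixed = FALSE -> regex matching (the default behaviour of str_detect)
drop_contained <- function(x, fixed = TRUE) {
  contained <- vapply(seq_along(x), function(i) {
    any(grepl(x[i], x[-i], fixed = fixed))
  }, logical(1))
  x[!contained]
}

tweets <- c("1+1=2 obviously", "1+1=2")

# Literal matching finds "1+1=2" inside the longer tweet and drops it.
drop_contained(tweets, fixed = TRUE)

# As a regex, "1+1=2" means "one or more 1s, then 1=2", which does not
# occur in the longer tweet, so the truncated copy is not detected.
drop_contained(tweets, fixed = FALSE)
```

If the tweets may contain regex metacharacters, literal matching is probably the safer choice.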

Getting unique strings from a vector of similar strings
