Question

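I have a character vector and want to drop every string that is duplicated, removing both copies, so that only the strings occurring exactly once remain. That is:

I have the vector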

MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")

I want to remove both members of each duplicated pair and end up with:

[1] "bbb" "ccc"

I tried this, without luck:

unique(MyString)

Answer 1

You can tabulate the values and keep the names that occur exactly once:

x <- table(MyString)
names(x[x==1])
[1] "bbb" "ccc"
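
As a small sketch of the flexibility of the table-based approach, the same counts can select strings with any number of occurrences, for example those that occur exactly twice. The variable names counts and twice are only illustrative.

```r
MyString <- c("aaa", "bbb", "ccc", "ddd", "aaa", "ddd")

# Count how often each distinct string occurs (names come back sorted)
counts <- table(MyString)

# Keep the strings that occur exactly twice
twice <- names(counts)[counts == 2]
print(twice)   # "aaa" "ddd"
```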

Answer 2

Find the set of duplicates

dups = MyString[ duplicated(MyString) ]

and drop every element that matches the set

MyString[ !MyString %in% dups ]

Alternative:

setdiff(MyString, dups)

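The table-based solution from @Moody_Mudskipper offers more flexibility, e.g., selecting strings that occur exactly twice. Another alternative (probably faster than the table() solution when MyString is long) is to create an index of unique strings, count how many times each unique string is matched (tabulate() == 1 marks those matched exactly once), and use the result to subset the unique strings:

UString = unique(MyString)
UString[ tabulate(match(MyString, UString)) == 1 ]

or without creating UString

MyString[ which(tabulate(match(MyString, MyString)) == 1) ]

Alternative: sort, then find runs of length 1.

r = rle(sort(MyString))
r$values[ r$lengths == 1 ]

For performance, here are some functions implementing the various solutions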

f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f1 = function(x) setdiff( x, x[duplicated(x)] )
f2 = function(x) { ux = unique(x); ux[ tabulate(match(x, ux)) == 1 ] }
f3 = function(x) x[ which( tabulate( match(x, x) ) == 1 ) ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }
f6 = function(x) x[ !duplicated(x) & !duplicated(x, fromLast = TRUE) ]

with evidence that they produce the same results on the example (assuming x = MyString)

> identical(f0(x), f1(x))
[1] TRUE
> identical(f0(x), f2(x))
[1] TRUE
> identical(f0(x), f3(x))
[1] TRUE
> identical(f0(x), f4(x))
[1] TRUE
> identical(f0(x), f5(x))
[1] TRUE
> identical(f0(x), f6(x))
[1] TRUE

f5() (also the original implementation) fails with x = character(0)

> f1(character(0))
character(0)
> f5(character(0))
NULL

f4() and f5() return values in alphabetical order, whereas the others preserve the order of the input, like unique(). All methods other than f5() work with other types of vectors, e.g., integer() (f5() always returns a character vector; the other methods return a vector of the same type as the input). f4() and f5() fail to recognize unique occurrences of NA.
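
The differences in ordering, return type and NA handling can be checked on a small integer vector. This is only a sketch: the vector y is made up for illustration, and f0(), f4() and f5() are copied from the definitions above.

```r
f0 = function(x) x[ !x %in% x[duplicated(x)] ]
f4 = function(x) { r = rle(sort(x)); r$values[ r$lengths == 1] }
f5 = function(x) { x = table(x); names(x)[x==1] }

# Illustrative input: 3 is duplicated; 2, 1 and NA each occur once
y = c(3L, 2L, 3L, 1L, NA)

f0(y)   # 2 1 NA : integer, input order kept, NA kept
f4(y)   # 1 2    : integer, sorted, NA dropped by sort()
f5(y)   # "1" "2": character, sorted, NA dropped by table()
```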

And timings:

> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: microseconds
  expr     min       lq      mean   median       uq      max neval
 f0(x)   9.195  10.9730  12.35724  11.8120  13.0580   29.100   100
 f1(x)  20.471  22.6625  50.15586  24.6750  25.9915 2600.307   100
 f2(x)  13.708  15.2265  58.58714  16.8180  18.4685 4180.829   100
 f3(x)   7.533   8.8775  52.43730   9.9855  11.0060 4252.063   100
 f4(x)  74.333  79.4305 124.26233  83.1505  87.4455 4091.371   100
 f5(x) 147.744 154.3080 196.05684 158.4880 163.6625 3721.522   100
 f6(x)  12.458  14.2335  58.11869  15.4805  17.0440 4250.500   100

Here is the performance with 10,000 different words

> x = readLines("/usr/share/dict/words", 10000)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x), times = 10)
Unit: microseconds
  expr       min        lq       mean    median        uq       max neval
 f0(x)   848.086   871.359   880.8841   873.637   899.669   916.528    10
 f1(x)  1440.904  1460.704  1556.7154  1589.405  1607.048  1640.347    10
 f2(x)  2143.997  2257.041  2288.1878  2288.329  2334.494  2372.639    10
 f3(x)  1420.144  1548.055  1547.8093  1562.927  1596.574  1601.176    10
 f4(x) 11829.680 12141.870 12369.5407 12311.334 12716.806 12952.950    10
 f5(x) 15796.546 15833.650 16176.2654 15858.629 15913.465 18604.658    10
 f6(x)  1219.036  1356.807  1354.3578  1363.276  1372.831  1407.077    10

and with many replicates

> x = sample(head(x, 1000), 10000, TRUE)
> microbenchmark(f0(x), f1(x), f2(x), f3(x), f4(x), f5(x), f6(x))
Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval
 f0(x)  1.914699  1.922925  1.992511  1.945807  2.030469  2.246022   100
 f1(x)  1.888959  1.909469  2.097532  1.948002  2.031083  5.310342   100
 f2(x)  1.396825  1.404801  1.447235  1.420777  1.479277  1.820402   100
 f3(x)  1.248126  1.257283  1.295493  1.285652  1.329139  1.427220   100
 f4(x) 24.075280 24.298454 24.562576 24.459281 24.700579 25.752481   100
 f5(x)  4.044137  4.120369  4.307893  4.174639  4.283030  7.740830   100
 f6(x)  1.221024  1.227792  1.264572  1.243201  1.295888  1.462007   100

f0() seems to be speedy when replicates are rare

> x = readLines("/usr/share/dict/words", 100000)
> microbenchmark(f0(x), f1(x), f3(x), f6(x))
Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval
 f0(x) 11.03298 11.17124 12.17688 11.36114 11.62769 19.83124   100
 f1(x) 21.16154 21.33792 22.76237 21.67234 22.26473 31.99544   100
 f3(x) 21.15801 21.49355 22.60749 21.77821 22.54203 31.17288   100
 f6(x) 18.72260 18.97623 20.29060 19.46875 19.94892 28.17551   100

f3() and f6() seem to be correct and fast; f6() may be easier to understand (but only handles the special case of keeping words that occur exactly once).
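
When a count other than one is needed, the tabulate() idea can be parameterized, which the duplicated()-based f6() cannot do directly. A minimal sketch, assuming a helper called keep_n() with an argument n; both names are hypothetical and do not appear in the answers.

```r
# Hypothetical helper: keep the distinct values of x that occur exactly n times,
# in order of first appearance and with the same type as the input
keep_n = function(x, n = 1) {
    ux = unique(x)
    ux[ tabulate(match(x, ux)) == n ]
}

v = c(5L, 3L, 5L, 3L, 7L)   # made-up input
keep_n(v, 2)   # 5 3
keep_n(v)      # 7
```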
