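R: Removing bidirectional hits from a table

Time: 2014-04-07 09:41:20

Tags: r performance

I have the following table, in which V2 and V3 contain bidirectional hits within each unique V1. I want to delete one row (chosen at random) from each bidirectional hit.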

V1 V2 V3
1  T  Y
1  Y  T
1  O  P
2  Q  E
2  E  Q
2  C  V
2  V  C
2  Y  T

The resulting table should look like this:

V1 V2 V3
1  T  Y
1  O  P
2  E  Q
2  V  C
2  Y  T

This can be done with a for loop, but I need a more efficient approach.

What is the fastest way to do this in R?

2 answers:

Answer 0 (score: 1)

I assume that "chosen at random" means it does not matter which row of each bidirectional match we keep:

df <- read.table(textConnection("V1 V2 V3
1  T  Y
1  Y  T
1  O  P
2  Q  E
2  E  Q
2  C  V
2  V  C
2  Y  T"), header=TRUE)

rows1 <- apply(df, 1, paste0, collapse="")
## swap the order of column 2,3
rows2 <- apply(df[, c(1, 3:2)], 1, paste0, collapse="")

rows <- rbind(rows1, rows2)
rows
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]
# rows1 "1TY" "1YT" "1OP" "2QE" "2EQ" "2CV" "2VC" "2YT"
# rows2 "1YT" "1TY" "1PO" "2EQ" "2QE" "2VC" "2CV" "2TY"

vrows <- as.vector(rows)
vrows
# [1] "1TY" "1YT" "1YT" "1TY" "1OP" "1PO" "2QE" "2EQ"
# [9] "2EQ" "2QE" "2CV" "2VC" "2VC" "2CV" "2YT" "2TY"

iunique <- which(!duplicated(vrows))
iunique
#  [1]  1  2  5  6  7  8 11 12 15 16

## because of the rbind above, each row of df contributes two consecutive
## entries to vrows; keep every second unique index and divide it by 2
i <- iunique[seq(2, length(iunique), by=2)]/2

df[i, ]
#   V1 V2 V3
# 1  1  T  Y
# 3  1  O  P
# 4  2  Q  E
# 6  2  C  V
# 8  2  Y  T
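
The same idea can be wrapped in a function without building the interleaved vector. The following is only a minimal sketch, not part of the steps above: `drop_mirrored` is a hypothetical name, and the "|" separator assumes that this character never occurs in the data. A row is kept unless its swapped key has already appeared as the key of an earlier row, so the first row of each pair survives.

```r
df <- read.table(text = "V1 V2 V3
1  T  Y
1  Y  T
1  O  P
2  Q  E
2  E  Q
2  C  V
2  V  C
2  Y  T", header = TRUE)

## Hypothetical helper (illustration only).
drop_mirrored <- function(d) {
  ## key of each row as it stands, and with V2 and V3 swapped
  fwd <- paste(d$V1, d$V2, d$V3, sep = "|")
  rev <- paste(d$V1, d$V3, d$V2, sep = "|")
  ## index of the first row whose key equals this row's swapped key
  ## (NA when no mirrored row exists)
  first_mirror <- match(rev, fwd)
  ## drop a row only if its mirror occurs earlier in the table
  keep <- is.na(first_mirror) | first_mirror >= seq_along(fwd)
  d[keep, , drop = FALSE]
}

drop_mirrored(df)
## rows 1, 3, 4, 6 and 8 remain
```

Unlike the index arithmetic above, this version also copes with a row in which V2 equals V3, because such a row is its own mirror and is kept.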

Answer 1 (score: 0)

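Not sure whether this is the fastest (that will depend on the number of duplicates, etc.), but you can shuffle the rows and then drop duplicates on a key that ignores the order of V2 and V3; the shuffle makes the surviving member of each pair genuinely arbitrary. Note that row-binding dframe to a mirrored copy dframe[, c(1, 3, 2)] and calling duplicated() does not work here: rbind() matches data frame columns by name, which undoes the swap, and both orientations of each pair would survive anyway.

## shuffle rows so that the surviving member of each pair is random
shuffled <- dframe[sample(nrow(dframe)), ]
## order-independent key: V1 plus the sorted (V2, V3) pair
a <- as.character(shuffled$V2); b <- as.character(shuffled$V3)
key <- paste(shuffled$V1, pmin(a, b), pmax(a, b))
dedup <- shuffled[!duplicated(key), ]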
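
Because the choice is random, the output differs between runs, so comparing against one fixed table is not a useful check. A minimal sketch of a sanity check instead: `has_mirrored_pair` is a hypothetical helper (not from the question or the answers) that reports whether any row still has its swapped counterpart within the same V1. The "|" separator is an assumption that this character does not occur in the data.

```r
## Hypothetical helper: TRUE if some row's (V2, V3) pair also occurs
## swapped in another row with the same V1.
has_mirrored_pair <- function(d) {
  v2 <- as.character(d$V2)
  v3 <- as.character(d$V3)
  fwd <- paste(d$V1, v2, v3, sep = "|")
  rev <- paste(d$V1, v3, v2, sep = "|")
  ## rows with V2 == V3 are their own mirror and are not counted
  any(rev %in% fwd & v2 != v3)
}

## the table from the question
original <- read.table(text = "V1 V2 V3
1  T  Y
1  Y  T
1  O  P
2  Q  E
2  E  Q
2  C  V
2  V  C
2  Y  T", header = TRUE)

## the desired result from the question
expected <- read.table(text = "V1 V2 V3
1  T  Y
1  O  P
2  E  Q
2  V  C
2  Y  T", header = TRUE)

has_mirrored_pair(original)  # TRUE
has_mirrored_pair(expected)  # FALSE
```

Any candidate result, whichever member of each pair it keeps, should give FALSE here and have five rows for this data.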