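I have the following table, which contains bidirectional hits between V2 and V3 for each unique V1. I want to remove one row from each bidirectional hit (chosen at random).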
V1 V2 V3
1 T Y
1 Y T
1 O P
2 Q E
2 E Q
2 C V
2 V C
2 Y T
The resulting table should look like this:
V1 V2 V3
1 T Y
1 O P
2 E Q
2 V C
2 Y T
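This can be done with a for loop, but I need a more efficient approach.
What is the fastest way to do this in R?
Answer 0: (score: 1)
I assume that "chosen at random" means it does not matter which of the two bidirectional matches we keep: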
df <- read.table(textConnection("V1 V2 V3
1 T Y
1 Y T
1 O P
2 Q E
2 E Q
2 C V
2 V C
2 Y T"), header=TRUE)
rows1 <- apply(df, 1, paste0, collapse="")
## swap the order of columns 2 and 3
rows2 <- apply(df[, c(1, 3:2)], 1, paste0, collapse="")
rows <- rbind(rows1, rows2)
rows
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# rows1 "1TY" "1YT" "1OP" "2QE" "2EQ" "2CV" "2VC" "2YT"
# rows2 "1YT" "1TY" "1PO" "2EQ" "2QE" "2VC" "2CV" "2TY"
vrows <- as.vector(rows)
vrows
# [1] "1TY" "1YT" "1YT" "1TY" "1OP" "1PO" "2QE" "2EQ"
# [9] "2EQ" "2QE" "2CV" "2VC" "2VC" "2CV" "2YT" "2TY"
iunique <- which(!duplicated(vrows))
iunique
# [1] 1 2 5 6 7 8 11 12 15 16
## because of the rbind above, each kept row contributes two consecutive
## entries; take every second one and divide it by 2 to get the row index
i <- iunique[seq(2, length(iunique), by=2)]/2
df[i, ]
# V1 V2 V3
# 1 1 T Y
# 3 1 O P
# 4 2 Q E
# 6 2 C V
# 8 2 Y T
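The same idea can be written without the row-wise `apply()` calls. This is a minimal sketch, assuming V2 and V3 hold plain strings: it builds an order-independent key with `pmin()`/`pmax()`, so a row and its mirror image get the same key, and then keeps the first row seen for each key. The names `a`, `b`, `key` and `result` and the `|` separator are chosen only for illustration.

```r
# Example data from the question
df <- read.table(text = "V1 V2 V3
1 T Y
1 Y T
1 O P
2 Q E
2 E Q
2 C V
2 V C
2 Y T", header = TRUE, stringsAsFactors = FALSE)

# Convert to character so pmin()/pmax() compare strings, not factor codes
a <- as.character(df$V2)
b <- as.character(df$V3)

# Order-independent key: a row and its mirror image produce the same string
key <- paste(df$V1, pmin(a, b), pmax(a, b), sep = "|")

# Keep the first row seen for each key
result <- df[!duplicated(key), ]
result
```

Like the code above, this keeps the first member of each pair rather than a random one, and it works on whole columns at once, which tends to matter on large tables.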
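Answer 1: (score: 0)
Not sure whether it is the fastest (that will depend on the number of duplicates, etc.), but you could join two copies of the data together and then remove the duplicates (the middle line randomizes the row order, so which one gets picked is truly arbitrary):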
mirrored <- rbind (dframe, dframe[,c(1,3,2)])
mirrored <- mirrored[sample(nrow(mirrored)),]
dedup <- mirrored[!duplicated(mirrored),]
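One thing to watch with the snippet above: `rbind()` for data frames matches columns by name, so the column-swapped copy is realigned to the original column order, and a row and its mirror image still count as different rows for `duplicated()`. The following is a minimal sketch, not the answerer's code, that keeps the shuffle-then-deduplicate idea but compares an order-independent key instead. The names `shuffled` and `key`, the `|` separator and the `set.seed()` call are assumptions made for illustration.

```r
set.seed(1)  # only to make the example reproducible; drop it for a fresh draw

# Example data from the question
dframe <- read.table(text = "V1 V2 V3
1 T Y
1 Y T
1 O P
2 Q E
2 E Q
2 C V
2 V C
2 Y T", header = TRUE, stringsAsFactors = FALSE)

# Shuffle the rows so that which member of a pair comes first is random
shuffled <- dframe[sample(nrow(dframe)), ]

# Order-independent key: a row and its mirror image produce the same string
a <- as.character(shuffled$V2)
b <- as.character(shuffled$V3)
key <- paste(shuffled$V1, pmin(a, b), pmax(a, b), sep = "|")

# Keep the first row of each key in the shuffled order
dedup <- shuffled[!duplicated(key), ]

# Optional: put the surviving rows back in their original order
dedup <- dedup[order(as.integer(rownames(dedup))), ]
dedup
```

Rows without a mirror image within the same V1 (here `1 O P` and `2 Y T`) are always kept; for each bidirectional pair, the row that happens to come first after shuffling survives.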