Selecting representative elements from a data frame

Time: 2015-08-17 11:44:18

Tags: r

I sorted a data frame by its second and third columns using the following code:

EXP[rev(order(EXP[, 2], EXP[, 3])), ]

where EXP is the name of the data frame.

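Now I need only the first row for each representative identifier in the second column, taken in that sorted order. What is the best way to do this in R?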

The data is structured as follows:

A_1784709     10007 0.40446362
B_2329958     10006 0.22501015
A_1739081     10006 0.10621801
B_1679600     10005 0.51709792
A_1770963     10004 0.21095531
A_2067520 100033416 0.08301735
A_1740024     10003 0.40881969
B_1751882     10002 0.09964711
A_1667906     10002 0.08826233
B_1791916     10002 0.08408508
A_1775734     10044 0.28613624
B_1674440     10044 0.16204336
B_2321648     10044 0.15484888
B_1654543     10001 0.27293547
B_1733559 100008589 1.03071504
A_2325610     10000 0.29913509
A_1733598     10000 0.14406499
B_1757130     10000 0.12600686
A_1779228      1000 0.37764131
A_1803686       100 0.62712817
A_1670903        10 0.09947230

I need a result like this:

A_1784709     10007 0.40446362
B_2329958     10006 0.22501015
B_1679600     10005 0.51709792
A_1770963     10004 0.21095531
A_2067520 100033416 0.08301735
A_1740024     10003 0.40881969
B_1751882     10002 0.09964711
A_1775734     10044 0.28613624
B_1654543     10001 0.27293547
B_1733559 100008589 1.03071504
A_2325610     10000 0.29913509
A_1779228      1000 0.37764131
A_1803686       100 0.62712817
A_1670903        10 0.09947230

2 answers:

Answer 0: (score: 1)

We can use the negation (!) of duplicated() here:

> EXP[!duplicated(EXP[,2]),]
          V1        V2         V3
1  A_1784709     10007 0.40446362
2  B_2329958     10006 0.22501015
4  B_1679600     10005 0.51709792
5  A_1770963     10004 0.21095531
6  A_2067520 100033416 0.08301735
7  A_1740024     10003 0.40881969
8  B_1751882     10002 0.09964711
11 A_1775734     10044 0.28613624
14 B_1654543     10001 0.27293547
15 B_1733559 100008589 1.03071504
16 A_2325610     10000 0.29913509
19 A_1779228      1000 0.37764131
20 A_1803686       100 0.62712817
21 A_1670903        10 0.09947230
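This works because duplicated() returns FALSE for the first occurrence of each value and TRUE for every later one, so the row that survives depends on the current row order. A minimal sketch on a made-up data frame (the name df and its values are assumptions for illustration, not the EXP data), sorting first and then keeping the top row of each group:

```r
# Toy data frame; column names V1..V3 mirror the data in this answer,
# but the values are invented for illustration.
df <- data.frame(
  V1 = c("a", "b", "c", "d", "e"),
  V2 = c(2L, 1L, 2L, 1L, 3L),
  V3 = c(0.1, 0.5, 0.9, 0.2, 0.4),
  stringsAsFactors = FALSE
)

# Sort by V2, then by V3 in decreasing order within each V2 value.
sorted <- df[order(df$V2, -df$V3), ]

# duplicated() is FALSE for the first occurrence of each V2 value,
# so negating it keeps the first (here: largest V3) row of every group.
first_per_group <- sorted[!duplicated(sorted$V2), ]
print(first_per_group)
```

If the data frame is already in the desired order, as in the question, the sorting step can be skipped.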

Data

EXP <- structure(list(V1 = structure(c(9L, 21L, 4L, 15L, 6L, 11L, 5L, 
     17L, 1L, 19L, 7L, 14L, 20L, 13L, 16L, 12L, 3L, 18L, 8L, 10L, 2L), 
     .Label = c("A_1667906", "A_1670903", "A_1733598", "A_1739081", 
     "A_1740024", "A_1770963", "A_1775734", "A_1779228", "A_1784709", 
     "A_1803686", "A_2067520", "A_2325610", "B_1654543", "B_1674440", 
     "B_1679600", "B_1733559", "B_1751882", "B_1757130", "B_1791916", 
     "B_2321648", "B_2329958"), class = "factor"), V2 = c(10007L, 
     10006L, 10006L, 10005L, 10004L, 100033416L, 10003L, 10002L, 10002L, 
     10002L, 10044L, 10044L, 10044L, 10001L, 100008589L, 10000L, 10000L, 
     10000L, 1000L, 100L, 10L), V3 = c(0.40446362, 0.22501015, 0.10621801, 
     0.51709792, 0.21095531, 0.08301735, 0.40881969, 0.09964711, 0.08826233, 
     0.08408508, 0.28613624, 0.16204336, 0.15484888, 0.27293547, 1.03071504, 
     0.29913509, 0.14406499, 0.12600686, 0.37764131, 0.62712817, 0.0994723)), 
     .Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -21L))

Answer 1: (score: 1)

Suppose you have an extra column containing 1, 2, 3 and one duplicated row:

EXP$V4 <- c(rep(c(1,2,3),nrow(EXP)/3))
EXP <- rbind(EXP,data.frame(V1="B_1791916",V2=10002,V3=0.08408508,V4=3))

If you want the rows that are not duplicated in column 2 and in this extra column, !duplicated() gives you this:

EXP[!duplicated(EXP[,2]) & !duplicated(EXP[,4]),]
         V1    V2        V3 V4
1 A_1784709 10007 0.4044636  1
2 B_2329958 10006 0.2250101  2
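Only two rows survive because each duplicated() call is evaluated on its own column, and a row is kept only when it holds the first occurrence in both columns at once. A small sketch with the first four values of each column (the vector names v2 and v4 are assumptions for illustration):

```r
# First four values of V2 and V4 from the example above.
v2 <- c(10007L, 10006L, 10006L, 10005L)
v4 <- c(1, 2, 3, 1)

# Per-column flags: TRUE where the value appears for the first time.
first_v2 <- !duplicated(v2)   # TRUE TRUE FALSE TRUE
first_v4 <- !duplicated(v4)   # TRUE TRUE TRUE FALSE

# Combining with & keeps a row only if it is "first" in both columns.
keep <- first_v2 & first_v4   # TRUE TRUE FALSE FALSE
print(keep)
```

Since V4 has only three distinct values, at most three rows can ever pass the V4 test, however long the data frame is.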

whereas unique() gives you the following:

unique(EXP[c("V4", "V2")])
   V4        V2
1   1     10007
2   2     10006
3   3     10006
4   1     10005
5   2     10004
6   3 100033416
7   1     10003
8   2     10002
9   3     10002
10  1     10002
11  2     10044
12  3     10044
13  1     10044
14  2     10001
15  3 100008589
16  1     10000
17  2     10000
18  3     10000
19  1      1000
20  2       100
21  3        10

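unique() removes rows that repeat in the combination of the two columns. duplicated(), on the other hand, is good at detecting which observations are repeats.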
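The two approaches can be combined: duplicated() also accepts a data frame, in which case it tests whole rows of the selected columns, just as unique() does. A minimal sketch on a made-up data frame (df, key_cols and the values are assumptions for illustration), which keeps all columns while deduplicating on the V4/V2 combination:

```r
# Toy data frame; the last row repeats the (V4, V2) pair of the first row.
df <- data.frame(
  V1 = c("a", "b", "c", "d"),
  V2 = c(10L, 10L, 20L, 10L),
  V4 = c(1, 2, 1, 1),
  stringsAsFactors = FALSE
)

key_cols <- c("V4", "V2")

# unique() returns the distinct key combinations only.
uniq_keys <- unique(df[key_cols])

# duplicated() on the same columns flags repeated combinations,
# so negating it keeps the first full row for each combination.
first_rows <- df[!duplicated(df[key_cols]), ]
print(first_rows)
```

Both results refer to the same rows; the difference is that the duplicated() form returns every column of the original data frame.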