Keep rows in a data frame that satisfy multiple constraints

Time: 2016-06-16 15:14:18

Tags: r dataframe

I have a data frame as follows:

> sampledput
           V1                                     V2             V3
1  GSM1010983                                adipose  Bisulfite-Seq
2  GSM1120330                                adipose  Bisulfite-Seq
3  GSM1120331                                adipose  Bisulfite-Seq
4  GSM1282348                                adipose  Bisulfite-Seq
5  GSM1282357                                adipose  Bisulfite-Seq
6   GSM906416                                adipose ChIP-Seq input
7   GSM906394                                adipose        H3K27ac
8  GSM1010958                                adipose       mRNA-Seq
9  GSM1120304                                adipose       mRNA-Seq
10 GSM1120305                                adipose       mRNA-Seq
11  GSM621443 adipose derived mesenchymal stem cells ChIP-Seq input
12  GSM621420 adipose derived mesenchymal stem cells       H3K27me3
13  GSM621446 adipose derived mesenchymal stem cells       H3K36me3
14  GSM621418 adipose derived mesenchymal stem cells        H3K4me1
15  GSM621458 adipose derived mesenchymal stem cells        H3K4me3
16  GSM670020 adipose derived mesenchymal stem cells         H3K9ac
17  GSM621398 adipose derived mesenchymal stem cells        H3K9me3

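I want to keep the rows that share the same value in column V2 (for example, adipose) and whose values in column V3 cover Bisulfite-Seq, H3K27ac, ChIP-Seq input, and mRNA-Seq. If V3 contains duplicate values, just take one of them; as you can see, I picked only one row for mRNA-Seq and one for Bisulfite-Seq. So in this case I would get the output: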

5  GSM1282357                                adipose  Bisulfite-Seq
6   GSM906416                                adipose ChIP-Seq input
7   GSM906394                                adipose        H3K27ac
8  GSM1010958                                adipose       mRNA-Seq

Here is the dput:

structure(list(V1 = structure(c(2L, 5L, 6L, 7L, 8L, 17L, 16L, 
1L, 3L, 4L, 12L, 11L, 13L, 10L, 14L, 15L, 9L), .Label = c("GSM1010958", 
"GSM1010983", "GSM1120304", "GSM1120305", "GSM1120330", "GSM1120331", 
"GSM1282348", "GSM1282357", "GSM621398", "GSM621418", "GSM621420", 
"GSM621443", "GSM621446", "GSM621458", "GSM670020", "GSM906394", 
"GSM906416"), class = "factor"), V2 = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("adipose", 
"adipose derived mesenchymal stem cells"), class = "factor"), 
    V3 = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 3L, 10L, 10L, 10L, 
    2L, 4L, 5L, 6L, 7L, 8L, 9L), .Label = c("Bisulfite-Seq", 
    "ChIP-Seq input", "H3K27ac", "H3K27me3", "H3K36me3", "H3K4me1", 
    "H3K4me3", "H3K9ac", "H3K9me3", "mRNA-Seq"), class = "factor")), .Names = c("V1", 
"V2", "V3"), class = "data.frame", row.names = c(NA, -17L))

3 Answers:

Answer 0: (score: 1)

Edit: a "better" solution

I actually prefer this one, because I think the code is more logical:

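library(dplyr)
sampledput %>% group_by(V2) %>% 
    # keep only the V2 groups that contain every required V3 value
    filter(all(c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq") %in% V3)) %>%
    # .keep_all = TRUE retains V1; on dplyr >= 0.5.0 distinct() otherwise returns only V2 and V3
    distinct(V2, V3, .keep_all = TRUE)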

Source: local data frame [4 x 3]
Groups: V2 [1]

          V1      V2             V3
      (fctr)  (fctr)         (fctr)
1 GSM1010983 adipose  Bisulfite-Seq
2  GSM906416 adipose ChIP-Seq input
3  GSM906394 adipose        H3K27ac
4 GSM1010958 adipose       mRNA-Seq

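This tests, for each value of V2, whether all of the desired V3 values are present in that group. It then removes any duplicates.

Original solution

dplyr solution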
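To make the group-level test concrete, here is a minimal base R sketch of the same check without dplyr. It assumes character columns rather than the factors in the dput, and the names `wanted`, `has_all`, and `keep` are only illustrative.

```r
# Rebuild the question's data with character columns (the dput uses factors)
sampledput <- data.frame(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305",
         "GSM621443", "GSM621420", "GSM621446", "GSM621418", "GSM621458",
         "GSM670020", "GSM621398"),
  V2 = rep(c("adipose", "adipose derived mesenchymal stem cells"),
           times = c(10, 7)),
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3), "ChIP-Seq input", "H3K27me3", "H3K36me3",
         "H3K4me1", "H3K4me3", "H3K9ac", "H3K9me3"),
  stringsAsFactors = FALSE
)

wanted <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")

# One TRUE/FALSE per V2 group: does the group contain every wanted V3 value?
has_all <- vapply(split(sampledput$V3, sampledput$V2),
                  function(v3) all(wanted %in% v3),
                  logical(1))
print(has_all)

# Keep all rows belonging to the groups that passed the test
keep <- sampledput[sampledput$V2 %in% names(has_all)[has_all], ]
print(keep)
```

Only the adipose group passes, because the mesenchymal stem cell group has ChIP-Seq input but none of the other three assays. Duplicates are still present in `keep` at this point; removing them is a separate step.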

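library(dplyr)
sampledput %>% group_by(V2) %>% 
    # keep only rows whose V3 is one of the required values
    filter(V3 %in% c("Bisulfite-Seq","H3K27ac","ChIP-Seq input","mRNA-Seq")) %>%
    # one row per (V2, V3); .keep_all = TRUE retains V1 on dplyr >= 0.5.0
    distinct(V2, V3, .keep_all = TRUE) %>%
    # keep only groups in which all four required values were found
    filter(length(unique(V3))==4)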

Source: local data frame [4 x 3]
Groups: V2 [2]

          V1                                     V2             V3
      (fctr)                                 (fctr)         (fctr)
1 GSM1010983                                adipose  Bisulfite-Seq
2  GSM906416                                adipose ChIP-Seq input
3  GSM906394                                adipose        H3K27ac
4 GSM1010958                                adipose       mRNA-Seq

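Note that distinct(V2,V3) keeps the first occurrence of each duplicate. Your desired output lists GSM1282357, whereas my solution returns GSM1010983. Not sure whether that matters to you.

You will have to test whether this generalizes to your whole data set, but it does produce the output you want.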
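The first-versus-last behaviour noted above can be sketched in base R with `duplicated()`. This is an illustration on the adipose rows only, not the answer's own code; the names `adipose`, `key`, `first_rows`, and `last_rows` are made up for the example.

```r
# The ten adipose rows from the question, as character columns
adipose <- data.frame(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305"),
  V2 = "adipose",
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3)),
  stringsAsFactors = FALSE
)

key <- adipose[c("V2", "V3")]

# Keep the first row of each (V2, V3) pair
first_rows <- adipose[!duplicated(key), ]

# Keep the last row of each (V2, V3) pair instead
last_rows <- adipose[!duplicated(key, fromLast = TRUE), ]

print(first_rows$V1)
print(last_rows$V1)
```

The desired output in the question contains GSM1282357 (the last Bisulfite-Seq row) together with GSM1010958 (the first mRNA-Seq row), so neither rule reproduces it exactly. Since the question says any one of the duplicates will do, either choice should be acceptable.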

Answer 1: (score: 1)

Perhaps a bit too simple, but...

This returns the last GSM of each group, as in your desired output.

Answer 2: (score: 0)

We can also use data.table:

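library(data.table)
# setDT() converts sampledput to a data.table by reference;
# the result has one row per (V2, V3) pair, holding the last V1 of that pair
setDT(sampledput)[,  .(V1 = last(V1)), .(V2, V3)]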
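As written, the one-liner above returns one row for every (V2, V3) pair, including pairs from groups that lack some of the required assays. A possible extension that adds the group-level check is sketched below; it runs on a trimmed copy of the data, and `dt`, `wanted`, `complete`, and `res` are illustrative names, not part of the original answer.

```r
library(data.table)

# Trimmed copy of the question's data: all adipose rows plus three rows
# from the mesenchymal stem cell group
dt <- data.table(
  V1 = c("GSM1010983", "GSM1120330", "GSM1120331", "GSM1282348", "GSM1282357",
         "GSM906416", "GSM906394", "GSM1010958", "GSM1120304", "GSM1120305",
         "GSM621443", "GSM621420", "GSM621398"),
  V2 = rep(c("adipose", "adipose derived mesenchymal stem cells"),
           times = c(10, 3)),
  V3 = c(rep("Bisulfite-Seq", 5), "ChIP-Seq input", "H3K27ac",
         rep("mRNA-Seq", 3), "ChIP-Seq input", "H3K27me3", "H3K9me3")
)

wanted <- c("Bisulfite-Seq", "H3K27ac", "ChIP-Seq input", "mRNA-Seq")

# Step 1: keep only the V2 groups that contain every wanted V3 value
complete <- dt[, if (all(wanted %in% V3)) .SD, by = V2]

# Step 2: within those groups, take the last V1 for each wanted (V2, V3) pair
res <- complete[V3 %in% wanted, .(V1 = last(V1)), by = .(V2, V3)]
print(res)
```

Building `dt` with `data.table()` leaves any existing data frame untouched, whereas `setDT()` modifies its argument in place.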