R: filtering a data set based on unique columns

Time: 2011-05-19 13:33:58

Tags: r select unique subset

  

Possible duplicate:
  R: Finding patterns across multiple columns- possibly duplicated()?

Dear all,

Below is part of my data set:

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG

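I want to filter the data set so that each combination of the start, stop and alias columns appears only once. The final result should look like this: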

         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG

Does anyone know of a solution? Thanks!

2 Answers:

Answer 0 (score: 7)

Use the duplicated function:

Recreate the data:

x <- "         name   chr     start      stop strand   alias 
60 uc003vqx.2  chr7 130835560 130891916      -   PODXL
61 uc003xlp.1  chr8  38387812  38445509      -     FLG
62 uc003xlu.1  chr8  38400008  38445509      -     FLG
63 uc003xlv.1  chr8  38400008  38445509      -     FLG
64 uc003xtz.1  chr8  61263976  61356508      -     CA8
65 uc003xua.1  chr8  61283183  61356508      -     CA8
66 uc010lwg.1  chr8  38387812  38445509      -     FLG
67 uc010lwh.1  chr8  38387812  38445509      -     FLG
68 uc010lwj.1  chr8  38387812  38445509      -     FLG"

dat <- read.table(textConnection(x), header=TRUE)

Remove the duplicates:

dat[!duplicated(dat[, c("start", "stop", "alias")]), ]

         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8
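
To see why this works, it may help to look at what duplicated returns on its own. The sketch below uses a small made-up data frame (toy, with hypothetical columns id, start and stop, not the asker's data): duplicated gives one logical value per row, TRUE when the same key combination already appeared in an earlier row, so negating it keeps the first occurrence of each combination.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  id    = c("a", "b", "c", "d"),
  start = c(1, 1, 2, 1),
  stop  = c(5, 5, 5, 5),
  stringsAsFactors = FALSE
)

# One logical value per row: TRUE if this (start, stop) pair
# was already seen in an earlier row
dup <- duplicated(toy[, c("start", "stop")])
print(dup)

# Negate the vector to keep only the first occurrence of each pair
first_rows <- toy[!dup, ]
print(first_rows)

# fromLast = TRUE scans from the bottom up, so the last
# occurrence of each pair is kept instead of the first
last_rows <- toy[!duplicated(toy[, c("start", "stop")], fromLast = TRUE), ]
print(last_rows)
```

If it matters which of the duplicated rows survives (for example a particular transcript name), sorting the data frame before calling duplicated, or using fromLast, is one way to control that.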

Answer 1 (score: 1)

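I think your example output is wrong (row 66 has the same start, stop and alias as row 61). Try this, where dfrm is your data frame: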

dfrm$comb <-  with(dfrm, paste(start,stop, alias, sep="+"))
dfrm[!duplicated(dfrm$comb), 1:6]
#---
         name  chr     start      stop strand alias
60 uc003vqx.2 chr7 130835560 130891916      - PODXL
61 uc003xlp.1 chr8  38387812  38445509      -   FLG
62 uc003xlu.1 chr8  38400008  38445509      -   FLG
64 uc003xtz.1 chr8  61263976  61356508      -   CA8
65 uc003xua.1 chr8  61283183  61356508      -   CA8
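
As a rough check, the pasted-key approach can be compared with calling duplicated on the columns directly. This is only a sketch on a made-up data frame (toy and its values are hypothetical); it keeps the key in a separate vector rather than adding a column, so nothing has to be dropped afterwards. It also assumes the separator "+" never occurs inside the values, since otherwise two different combinations could produce the same key.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(
  name  = c("t1", "t2", "t3", "t4"),
  start = c(10, 10, 20, 10),
  stop  = c(50, 50, 50, 50),
  alias = c("G1", "G1", "G1", "G2"),
  stringsAsFactors = FALSE
)

# Build one string key per row from the three columns
# (assumes "+" does not appear in any of the values)
key <- with(toy, paste(start, stop, alias, sep = "+"))

# Keep the first row for each key
by_key <- toy[!duplicated(key), ]

# Same filter, applied to the columns directly
by_cols <- toy[!duplicated(toy[, c("start", "stop", "alias")]), ]

print(by_key)
print(by_cols)
```

On this toy input both filters keep the same rows, which suggests the two answers are interchangeable for data like the asker's.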