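For each unique field 1, collapse the non-unique entries in another field

Time: 2013-10-02 06:32:31

Tags: linux r bash while-loop

I have a dataset that is the left outer join of two datasets. Each entry from the first dataset now appears once for every entry of the second dataset that it overlaps. Note that Assembly.1000 is repeated three times; I would like to collapse it into a single row.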

Assembly.1000 chrX 560000 575000 ABC1   20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1  20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN  3800
Assembly.1338 chrX 960000 999000 ACTIN  4000

As you can see, the file 1 entry for Assembly.1000 is repeated three times, once for each file 2 entry (ABC1, IL15RA, BRCA1).

I would like to parse the output into

Assembly.1000 chrX 560000 575000 ABC1;IL15RA;BRCA1   20;3.2;20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN;ACTIN 3800;4000

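I could use a `while read` loop and compare each row with the previous entry, but for large files (~1e6 entries) that is simply not efficient enough. Does anyone have suggestions for doing this efficiently?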
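For comparison with the `while read` loop, the same compare-with-the-previous-entry idea can be sketched as a single awk pass. This is only a sketch, assuming that rows sharing the first four fields are adjacent (as in the sample above) and that fields are whitespace-separated. The file names joined.txt and collapsed.txt are made up for illustration.

```shell
# Hypothetical sample input reproducing the rows from the question.
cat > joined.txt <<'EOF'
Assembly.1000 chrX 560000 575000 ABC1   20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1  20
Assembly.1038 chrX 780000 829000 .      .
Assembly.1338 chrX 960000 999000 ACTIN  3800
Assembly.1338 chrX 960000 999000 ACTIN  4000
EOF

# Single pass; assumes rows with the same first four fields are adjacent.
awk '
{
    key = $1 OFS $2 OFS $3 OFS $4
    if (key == prev) {
        # Same region as the previous row: append to both lists.
        names = names ";" $5
        vals  = vals  ";" $6
    } else {
        # New region: print the finished one (if any) and start over.
        if (NR > 1) print prev, names, vals
        prev = key; names = $5; vals = $6
    }
}
END { if (NR > 0) print prev, names, vals }   # flush the last region
' joined.txt > collapsed.txt

cat collapsed.txt
```

Because awk keeps only the current group in memory, this should scale to files with around 1e6 rows. If the input is not already grouped, sorting it on the first four fields beforehand would be needed.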

2 answers:

Answer 0: (score: 4)

Assuming your data.frame is called "mydf" and is defined as follows:

mydf <- structure(list(V1 = c("Assembly.1000", "Assembly.1000", 
    "Assembly.1000", "Assembly.1038", "Assembly.1338", "Assembly.1338"), 
    V2 = c("chrX", "chrX", "chrX", "chrX", "chrX", "chrX"), 
    V3 = c(560000L, 560000L, 560000L, 780000L, 960000L, 960000L), 
    V4 = c(575000L, 575000L, 575000L, 829000L, 999000L, 999000L), 
    V5 = c("ABC1", "IL15RA", "BRCA1", ".", "ACTIN", "ACTIN"), 
    V6 = c("20", "3.2", "20", ".", "3800", "4000")), 
    .Names = c("V1", "V2", "V3", "V4", "V5", "V6"), 
    class = "data.frame", row.names = c(NA, -6L))
mydf
#              V1   V2     V3     V4     V5   V6
# 1 Assembly.1000 chrX 560000 575000   ABC1   20
# 2 Assembly.1000 chrX 560000 575000 IL15RA  3.2
# 3 Assembly.1000 chrX 560000 575000  BRCA1   20
# 4 Assembly.1038 chrX 780000 829000      .    .
# 5 Assembly.1338 chrX 960000 999000  ACTIN 3800
# 6 Assembly.1338 chrX 960000 999000  ACTIN 4000

Here is the aggregate approach (use collapse = ";" instead if you do not want a space after each separator):

aggregate(cbind(V5, V6) ~ ., mydf, paste, collapse = "; ")
#              V1   V2     V3     V4                  V5          V6
# 1 Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2 Assembly.1038 chrX 780000 829000                   .           .
# 3 Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000

Here is the "data.table" approach, using the same "mydf" as the starting point:

library(data.table)
DT <- data.table(mydf)
DT[, lapply(.SD, paste, collapse = "; "), by = c("V1", "V2", "V3", "V4")]
#               V1   V2     V3     V4                  V5          V6
# 1: Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2: Assembly.1038 chrX 780000 829000                   .           .
# 3: Assembly.1338 chrX 960000 999000        ACTIN; ACTIN  3800; 4000
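Since the question is also tagged bash, the same group-by-and-paste idea can be sketched outside R with awk associative arrays. This is a rough sketch, not part of the original answer: `collapse_groups` is a hypothetical helper name, fields are assumed to be whitespace-separated, and unlike a streaming approach it holds every group in memory, so rows of one group do not need to be adjacent.

```shell
# collapse_groups: hypothetical helper; reads rows from stdin or from files.
collapse_groups() {
    awk -v sep=';' '
    {
        key = $1 OFS $2 OFS $3 OFS $4
        if (!(key in names)) {
            order[++n] = key          # remember first-seen order of the groups
            names[key] = $5
            vals[key]  = $6
        } else {
            names[key] = names[key] sep $5
            vals[key]  = vals[key]  sep $6
        }
    }
    END {
        for (i = 1; i <= n; i++)
            print order[i], names[order[i]], vals[order[i]]
    }
    ' "$@"
}

# Rows are deliberately interleaved to show that adjacency is not required.
result=$(printf '%s\n' \
    'Assembly.1338 chrX 960000 999000 ACTIN 3800' \
    'Assembly.1000 chrX 560000 575000 ABC1 20' \
    'Assembly.1338 chrX 960000 999000 ACTIN 4000' \
    'Assembly.1000 chrX 560000 575000 IL15RA 3.2' | collapse_groups)
printf '%s\n' "$result"
```

Groups are printed in the order in which they first appear, which differs from aggregate, where the result comes back sorted by the grouping columns.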

Answer 1: (score: 1)

This follows @AnandaMahto's suggestion to use data.table, but with slightly simpler syntax.

library(data.table)

dataset <- data.table(
   a1 = c(1,1,3,3,5,5),
   b1 = c(1,1,3,3,5,5),
   c1 = c("a","b","c","d","e","f"),
   d1 = c("a","b","c","d","e","f")
)

dataset2 <- dataset[,
   list(
      c1d1 = paste(c1,d1, sep = "", collapse = ""),
      d1 = paste(d1, collapse = ""),
      c1 = paste(c1, collapse = "")
   ),
   by = c("a1","b1")
]


#> dataset
#   a1 b1 c1 d1
#1:  1  1  a  a
#2:  1  1  b  b
#3:  3  3  c  c
#4:  3  3  d  d
#5:  5  5  e  e
#6:  5  5  f  f
#> dataset2
#   a1 b1 c1d1 d1 c1
#1:  1  1 aabb ab ab
#2:  3  3 ccdd cd cd
#3:  5  5 eeff ef ef