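I have a dataset that is the result of a left outer join of two datasets. As a result, I have multiple entries from the first dataset, one for each overlap with the second. Note that Assembly.1000 is repeated three times; I would like to collapse it into a single line.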
Assembly.1000 chrX 560000 575000 ABC1 20
Assembly.1000 chrX 560000 575000 IL15RA 3.2
Assembly.1000 chrX 560000 575000 BRCA1 20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN 3800
Assembly.1338 chrX 960000 999000 ACTIN 4000
As you can see, the file 1 entry for Assembly.1000 is repeated three times, once for each file 2 entry (ABC1, IL15RA, BRCA1).
I would like to parse the output into
Assembly.1000 chrX 560000 575000 ABC1;IL15RA;BRCA1 20;3.2;20
Assembly.1038 chrX 780000 829000 . .
Assembly.1338 chrX 960000 999000 ACTIN;ACTIN 3800;4000
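I could use a shell while read loop and compare each line with the previous entry, but for large files (~1e6 entries) that is simply not efficient enough. Does anyone have suggestions for doing this efficiently?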
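For comparison with the while read idea, the same "look at the previous entry" logic could be sketched as a single awk pass, which avoids the per-line overhead of a shell loop. This is only a sketch under some assumptions: the function name collapse_groups is made up for illustration, the input is whitespace-separated with exactly six fields, and rows that share the first four fields are adjacent (as they are in the join output shown above).

```shell
#!/bin/sh
# collapse_groups (hypothetical helper name): reads rows on stdin and merges
# consecutive rows whose first four fields match, joining fields 5 and 6 with ";".
# Assumes six whitespace-separated fields and that matching rows are adjacent.
collapse_groups() {
  awk '
    {
      key = $1 " " $2 " " $3 " " $4
      if (NR == 1 || key != prev) {
        # A new group starts: flush the previous one, if any.
        if (NR > 1) print prev, names, values
        names = $5
        values = $6
      } else {
        # Same group as the previous row: append to the running lists.
        names = names ";" $5
        values = values ";" $6
      }
      prev = key
    }
    # Flush the last group once all input has been read.
    END { if (NR > 0) print prev, names, values }
  '
}

# Example call (joined.txt is a placeholder file name):
#   collapse_groups < joined.txt > collapsed.txt
```

Because only the current group is kept in memory, this should scale to large files; if matching rows are not adjacent, the input would need to be sorted on the first four fields first.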
Answer 0 (score: 4)
Assuming your data.frame is called "mydf" and is defined as follows:
mydf <- structure(list(V1 = c("Assembly.1000", "Assembly.1000",
"Assembly.1000", "Assembly.1038", "Assembly.1338", "Assembly.1338"),
V2 = c("chrX", "chrX", "chrX", "chrX", "chrX", "chrX"),
V3 = c(560000L, 560000L, 560000L, 780000L, 960000L, 960000L),
V4 = c(575000L, 575000L, 575000L, 829000L, 999000L, 999000L),
V5 = c("ABC1", "IL15RA", "BRCA1", ".", "ACTIN", "ACTIN"),
V6 = c("20", "3.2", "20", ".", "3800", "4000")),
.Names = c("V1", "V2", "V3", "V4", "V5", "V6"),
class = "data.frame", row.names = c(NA, -6L))
mydf
# V1 V2 V3 V4 V5 V6
# 1 Assembly.1000 chrX 560000 575000 ABC1 20
# 2 Assembly.1000 chrX 560000 575000 IL15RA 3.2
# 3 Assembly.1000 chrX 560000 575000 BRCA1 20
# 4 Assembly.1038 chrX 780000 829000 . .
# 5 Assembly.1338 chrX 960000 999000 ACTIN 3800
# 6 Assembly.1338 chrX 960000 999000 ACTIN 4000
Here is the aggregate approach:
aggregate(cbind(V5, V6) ~ ., mydf, paste, collapse = "; ")
# V1 V2 V3 V4 V5 V6
# 1 Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2 Assembly.1038 chrX 780000 829000 . .
# 3 Assembly.1338 chrX 960000 999000 ACTIN; ACTIN 3800; 4000
Here is the "data.table" approach, using the same "mydf" as the starting point:
library(data.table)
DT <- data.table(mydf)
DT[, lapply(.SD, paste, collapse = "; "), by = c("V1", "V2", "V3", "V4")]
# V1 V2 V3 V4 V5 V6
# 1: Assembly.1000 chrX 560000 575000 ABC1; IL15RA; BRCA1 20; 3.2; 20
# 2: Assembly.1038 chrX 780000 829000 . .
# 3: Assembly.1338 chrX 960000 999000 ACTIN; ACTIN 3800; 4000
Answer 1 (score: 1)
This follows @AnandaMahto's suggestion to use data.table, but with slightly simpler syntax.
library(data.table)
dataset <- data.table(
a1 = c(1,1,3,3,5,5),
b1 = c(1,1,3,3,5,5),
c1 = c("a","b","c","d","e","f"),
d1 = c("a","b","c","d","e","f")
)
dataset2 <- dataset[,
list(
c1d1 = paste(c1, d1, sep = "", collapse = ""),
d1 = paste(d1, collapse = ""),
c1 = paste(c1, collapse = "")
),
by = c("a1","b1")
]
#> dataset
# a1 b1 c1 d1
#1: 1 1 a a
#2: 1 1 b b
#3: 3 3 c c
#4: 3 3 d d
#5: 5 5 e e
#6: 5 5 f f
#> dataset2
# a1 b1 c1d1 d1 c1
#1: 1 1 aabb ab ab
#2: 3 3 ccdd cd cd
#3: 5 5 eeff ef ef