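I want to merge two files. For each of the 22 chromosomes, I want to combine the start and end positions (second and third columns) from both files, and for every resulting interval find the corresponding Cn value (the Cn column) from each file.
File 1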
Chromosome Start End lengthMB probes snps imba log2 Cn mCn
chr1 0 121184898 121.185 11403 3272 0.263868683 -0.03922829 2 1
chr1 144028314 147376741 3.348 392 55 0.666732903 0.149629608 4 0
chr1 147376741 149815307 2.439 45 1 NA 0.081578404 3 0
chr1 149815307 152973261 3.158 355 98 NA 0.175954714 4 0
chr1 152973261 153223301 0.25 32 1 NA 0.250464238 5 0
chr1 153223301 164587468 11.364 910 270 NA 0.169542015 4 0
chr1 164587468 164680884 0.093 11 7 NA 0.110598177 3 0
chr1 164680884 167797512 3.117 265 82 0.619468523 0.178797081 4 0
chr1 167797512 168022812 0.225 10 1 NA 0.262534983 5 0
File 2
Chromosome Start End lengthMB probes snps imba log2 Cn mCn
chr1 0 121184898 121.185 11405 3273 0.267231258 -0.040215328 2 1
chr1 144028314 147376741 3.348 393 55 0.649314008 0.156409264 4 0
chr1 147376741 149573557 2.197 44 1 NA 0.118886434 4 0
chr1 149573557 158729529 9.156 837 221 NA 0.193681628 4 0
chr1 158729529 158809353 0.08 13 1 NA 0.031239059 4 0
chr1 158809353 164628199 5.819 451 141 0.610374455 0.182849884 4 0
chr1 164628199 164836103 0.208 25 12 NA 0.253876895 4 0
chr1 164836103 165418619 0.583 61 16 NA 0.186622113 4 0
Desired output
Chromosome Start End Cn_File_1 mCn_File_1 Cn_File_2 mCn_File_2
chr1 0 121184898 2 1 2 1
chr1 144028314 147376741 4 0 4 0
chr1 147376741 149573557 3 0 4 0
chr1 149573557 149815307 3 0 4 0
chr1 149815307 152973261 4 0 4 0
chr1 152973261 153223301 5 0 4 0
chr1 153223301 158729529 4 0 4 0
chr1 158729529 158809353 4 0 4 0
chr1 158809353 164587468 4 0 4 0
chr1 164587468 164628199 3 0 4 0
chr1 164628199 164680884 3 0 4 0
chr1 164680884 164836103 4 0 4 0
chr1 164836103 165418619 4 0 4 0
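So far I loop over all chromosomes to find the start and end values for each chromosome in both files. I then combine the starts and ends from the two files, but I don't know how to look up the correct Cn value (the Cn column of each file).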
# accumulators for the combined breakpoints
start <- c()
stop <- c()
chr <- c()
for (i in 1:22) {
  chrom <- paste0("chr", i)
  rows1 <- file1$Chromosome == chrom
  rows2 <- file2$Chromosome == chrom
  # union of the breakpoints from both files for this chromosome
  both_starts <- sort(unique(c(file1$Start[rows1], file2$Start[rows2])))
  both_stops <- sort(unique(c(file1$End[rows1], file2$End[rows2])))
  start <- append(start, both_starts)
  stop <- append(stop, both_stops)
  chr <- append(chr, rep(chrom, length(both_starts)))
  # unfinished: this is where the Cn lookup should go
  for (j in seq_along(both_starts)) {
    print(file1$Start[rows1][j])
  }
}
Any ideas?
Answer 0 (score: 1)
You can use the survSplit function from the survival package and then still use merge.
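# d1 and d2 are the data frames read from File 1 and File 2
lst1 <- split(d1, d1$Chromosome)
lst2 <- split(d2, d2$Chromosome)
library(survival)
# per chromosome: split each file at the other file's End positions, then merge
# (mapply pairs the list elements by position, so both files need the same chromosomes)
do.call(rbind, mapply(FUN = function(x, y)
{
  # survSplit expects an event column; a dummy one is enough here
  x$event <- y$event <- 0
  d1.spl <- survSplit(x, cut=y$End, start='Start', end='End', event='event')
  d2.spl <- survSplit(y, cut=x$End, start='Start', end='End', event='event')
  mrg <- merge(d1.spl, d2.spl,
               by=c('Chromosome', 'Start', 'End'),
               #all=TRUE,
               suffixes = c("_File_1","_File_2"))
  mrg[c('Chromosome', 'Start', 'End', 'Cn_File_1', 'mCn_File_1', 'Cn_File_2', 'mCn_File_2')]
},
lst1, lst2, SIMPLIFY=FALSE))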
## Chromosome Start End Cn_File_1 mCn_File_1 Cn_File_2 mCn_File_2
## chr1.1 chr1 0 121184898 2 1 2 1
## chr1.2 chr1 144028314 147376741 4 0 4 0
## chr1.3 chr1 147376741 149573557 3 0 4 0
## chr1.4 chr1 149573557 149815307 3 0 4 0
## chr1.5 chr1 149815307 152973261 4 0 4 0
## chr1.6 chr1 152973261 153223301 5 0 4 0
## chr1.7 chr1 153223301 158729529 4 0 4 0
## chr1.8 chr1 158729529 158809353 4 0 4 0
## chr1.9 chr1 158809353 164587468 4 0 4 0
## chr1.10 chr1 164587468 164628199 3 0 4 0
## chr1.11 chr1 164628199 164680884 3 0 4 0
## chr1.12 chr1 164680884 164836103 4 0 4 0
## chr1.13 chr1 164836103 165418619 4 0 4 0
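To make the splitting step easier to follow, here is a minimal base R sketch of the same idea without the survival package. It is not the code from the answer above: split_segments is a hypothetical helper, and the two small data frames only reuse a few chr1 rows from the question. It assumes that segments within one file do not overlap each other.

```r
# Hypothetical helper: cut every [Start, End] row of `seg` at each value of
# `cuts` that lies strictly inside it, copying the other columns unchanged.
split_segments <- function(seg, cuts) {
  pieces <- lapply(seq_len(nrow(seg)), function(i) {
    s <- seg$Start[i]
    e <- seg$End[i]
    inner <- sort(unique(cuts[cuts > s & cuts < e]))
    bounds <- c(s, inner, e)
    n <- length(bounds) - 1
    out <- seg[rep(i, n), , drop = FALSE]
    out$Start <- bounds[-length(bounds)]
    out$End <- bounds[-1]
    out
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# A few chr1 segments taken from the question's two files
f1 <- data.frame(Chromosome = "chr1",
                 Start = c(147376741, 149815307),
                 End   = c(149815307, 152973261),
                 Cn = c(3, 4), mCn = c(0, 0))
f2 <- data.frame(Chromosome = "chr1",
                 Start = c(147376741, 149573557),
                 End   = c(149573557, 158729529),
                 Cn = c(4, 4), mCn = c(0, 0))

# Split both files at every breakpoint seen in either file, then merge on
# the now-identical intervals (inner join, as in the answer's merge call)
cuts <- c(f1$Start, f1$End, f2$Start, f2$End)
merged <- merge(split_segments(f1, cuts), split_segments(f2, cuts),
                by = c("Chromosome", "Start", "End"),
                suffixes = c("_File_1", "_File_2"))
print(merged)
```

Because the merge is an inner join, the piece of the second file that extends past the last segment of the first file is dropped. For the full data the same two steps would be applied per chromosome, for example on the result of split(d1, d1$Chromosome).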
Answer 1 (score: 1)
This should work; you only need to rename the columns. The same approach lets you pull out any other columns you want:
# join on the three key columns; non-key columns get the default .x / .y suffixes
file_merged <- merge(file1, file2, by = c("Chromosome", "Start", "End"))
file_merged[, c("Chromosome", "Start", "End", "Cn.x", "mCn.x", "Cn.y", "mCn.y")]
Chromosome Start End Cn.x mCn.x Cn.y mCn.y
1 chr1 0 121184898 2 1 2 1
2 chr1 144028314 147376741 4 0 4 0
3 chr1 147376741 149815307 3 0 4 0
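A plain merge on Chromosome, Start, and End pairs up only rows whose three key values are identical in both files, so segments with different breakpoints are not split. As a rough sketch of the breakpoint-union idea from the question, the code below builds the combined intervals for one chromosome and looks up each file's Cn with findInterval. The helper lookup_value and the toy coordinates are made up for illustration, and the sketch assumes that segments within one file do not overlap.

```r
# Hypothetical helper: for each sub-interval, return the value of column `col`
# from the segment of `seg` that fully contains it, or NA if none does
# (for example in a gap between segments).
lookup_value <- function(seg, sub_start, sub_end, col) {
  seg <- seg[order(seg$Start), ]
  idx <- findInterval(sub_start, seg$Start)  # last segment starting at or before sub_start
  idx[idx == 0] <- NA
  inside <- !is.na(idx) & sub_end <= seg$End[idx]
  ifelse(inside, seg[[col]][idx], NA)
}

# Toy data for a single chromosome (made-up coordinates, gap between 100 and 150)
f1 <- data.frame(Chromosome = "chr1", Start = c(0, 150),
                 End = c(100, 300), Cn = c(2, 3))
f2 <- data.frame(Chromosome = "chr1", Start = c(0, 150, 200),
                 End = c(100, 200, 300), Cn = c(2, 4, 5))

# Union of all breakpoints; consecutive pairs form the candidate sub-intervals
breaks <- sort(unique(c(f1$Start, f1$End, f2$Start, f2$End)))
sub_start <- breaks[-length(breaks)]
sub_end <- breaks[-1]

res <- data.frame(Chromosome = "chr1", Start = sub_start, End = sub_end,
                  Cn_File_1 = lookup_value(f1, sub_start, sub_end, "Cn"),
                  Cn_File_2 = lookup_value(f2, sub_start, sub_end, "Cn"))

# Keep only intervals covered by both files
res <- res[!is.na(res$Cn_File_1) & !is.na(res$Cn_File_2), ]
print(res)
```

The mCn columns can be filled in the same way by passing "mCn" as col, and the whole block can be run once per chromosome inside the loop from the question.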