Merging files in R without merge()

Asked: 2014-09-15 12:18:10

Tags: r merge

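I want to merge two files. For each of the 22 chromosomes, the start and stop positions (second and third columns) of both files should be combined into one set of segments, and for every resulting segment I want the corresponding Cn value from each file (the Cn column, ninth in the tables below).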

File 1

Chromosome  Start   End lengthMB    probes  snps    imba    log2    Cn  mCn
chr1    0   121184898   121.185 11403   3272    0.263868683 -0.03922829 2   1
chr1    144028314   147376741   3.348   392 55  0.666732903 0.149629608 4   0
chr1    147376741   149815307   2.439   45  1   NA  0.081578404 3   0
chr1    149815307   152973261   3.158   355 98  NA  0.175954714 4   0
chr1    152973261   153223301   0.25    32  1   NA  0.250464238 5   0
chr1    153223301   164587468   11.364  910 270 NA  0.169542015 4   0
chr1    164587468   164680884   0.093   11  7   NA  0.110598177 3   0
chr1    164680884   167797512   3.117   265 82  0.619468523 0.178797081 4   0
chr1    167797512   168022812   0.225   10  1   NA  0.262534983 5   0

File 2

Chromosome  Start   End lengthMB    probes  snps    imba    log2    Cn  mCn
chr1    0   121184898   121.185 11405   3273    0.267231258 -0.040215328    2   1
chr1    144028314   147376741   3.348   393 55  0.649314008 0.156409264 4   0
chr1    147376741   149573557   2.197   44  1   NA  0.118886434 4   0
chr1    149573557   158729529   9.156   837 221 NA  0.193681628 4   0
chr1    158729529   158809353   0.08    13  1   NA  0.031239059 4   0
chr1    158809353   164628199   5.819   451 141 0.610374455 0.182849884 4   0
chr1    164628199   164836103   0.208   25  12  NA  0.253876895 4   0
chr1    164836103   165418619   0.583   61  16  NA  0.186622113 4   0
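
For anyone who wants to reproduce the problem, the tables can be read into data frames roughly like this. This is only a sketch: it pastes the first three rows of each file into read.table() through its text argument, and it assumes the data frames are called file1 and file2, as in the loop further down.

```r
# Sketch: load (part of) each table from inline text.
# Only the first three rows of each file are used here.
file1 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Chromosome Start End lengthMB probes snps imba log2 Cn mCn
chr1 0 121184898 121.185 11403 3272 0.263868683 -0.03922829 2 1
chr1 144028314 147376741 3.348 392 55 0.666732903 0.149629608 4 0
chr1 147376741 149815307 2.439 45 1 NA 0.081578404 3 0
")

file2 <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
Chromosome Start End lengthMB probes snps imba log2 Cn mCn
chr1 0 121184898 121.185 11405 3273 0.267231258 -0.040215328 2 1
chr1 144028314 147376741 3.348 393 55 0.649314008 0.156409264 4 0
chr1 147376741 149573557 2.197 44 1 NA 0.118886434 4 0
")

# Inspect the column types; "NA" in imba is read as a missing value.
str(file1)
```

With real files on disk, read.table("path/to/file", header = TRUE) would take the place of the text argument.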

Desired output

Chromosome  Start   End Cn_File_1   mCn_File_1  Cn_File_2   mCn_File_2
chr1    0   121184898   2   1   2   1
chr1    144028314   147376741   4   0   4   0
chr1    147376741   149573557   3   0   4   0
chr1    149573557   149815307   3   0   4   0
chr1    149815307   152973261   4   0   4   0
chr1    152973261   153223301   5   0   4   0
chr1    153223301   158729529   4   0   4   0
chr1    158729529   158809353   4   0   4   0
chr1    158809353   164587468   4   0   4   0
chr1    164587468   164628199   3   0   4   0
chr1    164628199   164680884   3   0   4   0
chr1    164680884   164836103   4   0   4   0
chr1    164836103   165418619   4   0   4   0

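So far I am looping over all chromosomes to collect the start and stop values for the corresponding chromosome in both files. I then put the starts and stops of both files together, but I don't know how to look up the correct Cn value (the Cn column of each file) for each combined segment.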

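chromosome <- "chr"   # assumed prefix of the values in the Chromosome column
start <- c(); stop <- c(); chr <- c()   # result vectors, filled inside the loop
for (i in 1:22) {
    # logical masks: rows of each file that belong to chromosome i
    start1 <- file1$Chromosome == paste(chromosome, i, sep="")
    start2 <- file2$Chromosome == paste(chromosome, i, sep="")
    # sorted, de-duplicated start and stop positions from both files
    both_starts <- unique(sort(c(file1$Start[start1], file2$Start[start2])))
    both_stops <- unique(sort(c(file1$End[start1], file2$End[start2])))
    start <- append(start, both_starts)
    stop <- append(stop, both_stops)
    chr <- append(chr, rep(paste(chromosome, i, sep=""), length(both_starts)))
    # separate index j, so the outer loop variable i is not overwritten
    for (j in seq_along(both_starts)) {
        print(file1$Start[start1][j])
    }
}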

Any ideas?

2 answers:

Answer 0 (score: 1)

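You can use the survSplit function from the survival package to split the segments of each file at the other file's end points, and then still use merge: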

# d1 and d2 are the two data frames (file1 and file2 in the question)
lst1 <- split(d1, d1$Chromosome)
lst2 <- split(d2, d2$Chromosome)
library(survival)
# split each file at the other file's end points, then merge per chromosome
do.call(rbind, mapply(FUN = function(x, y) {
  x$event <- y$event <- 0   # dummy event column, needed by survSplit
  d1.spl <- survSplit(x, cut=y$End, start='Start', end='End', event='event')
  d2.spl <- survSplit(y, cut=x$End, start='Start', end='End', event='event')
  mrg <- merge(d1.spl, d2.spl,
               by=c('Chromosome', 'Start', 'End'),
               #all=TRUE,
               suffixes = c("_File_1","_File_2"))
  mrg[c('Chromosome', 'Start', 'End', 'Cn_File_1', 'mCn_File_1', 'Cn_File_2', 'mCn_File_2')]
},
lst1, lst2, SIMPLIFY=FALSE))


##          Chromosome     Start       End Cn_File_1 mCn_File_1 Cn_File_2 mCn_File_2
##  chr1.1        chr1         0 121184898         2          1         2          1
##  chr1.2        chr1 144028314 147376741         4          0         4          0
##  chr1.3        chr1 147376741 149573557         3          0         4          0
##  chr1.4        chr1 149573557 149815307         3          0         4          0
##  chr1.5        chr1 149815307 152973261         4          0         4          0
##  chr1.6        chr1 152973261 153223301         5          0         4          0
##  chr1.7        chr1 153223301 158729529         4          0         4          0
##  chr1.8        chr1 158729529 158809353         4          0         4          0
##  chr1.9        chr1 158809353 164587468         4          0         4          0
##  chr1.10       chr1 164587468 164628199         3          0         4          0
##  chr1.11       chr1 164628199 164680884         3          0         4          0
##  chr1.12       chr1 164680884 164836103         4          0         4          0
##  chr1.13       chr1 164836103 165418619         4          0         4          0
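
If you would rather avoid both survival and merge, the same idea can be sketched in base R with findInterval. This is not the approach above, just an alternative under some assumptions: the segments within each file are sorted by Start and do not overlap, both files contain the same chromosomes, and only segments covered by both files are kept. The helper names lookup and combine_segments are made up for this example, and the data are the first rows of the two tables from the question.

```r
# First rows of the question's tables (only the columns needed here).
file1 <- data.frame(Chromosome = "chr1",
                    Start = c(0, 144028314, 147376741),
                    End   = c(121184898, 147376741, 149815307),
                    Cn = c(2, 4, 3), mCn = c(1, 0, 0))
file2 <- data.frame(Chromosome = "chr1",
                    Start = c(0, 144028314, 147376741, 149573557),
                    End   = c(121184898, 147376741, 149573557, 158729529),
                    Cn = c(2, 4, 4, 4), mCn = c(1, 0, 0, 0))

# For each query segment [start, end], return the row of `seg` that fully
# contains it, or NA. Assumes `seg` is sorted by Start and non-overlapping.
lookup <- function(seg, start, end) {
  idx <- findInterval(start, seg$Start)   # last row starting at or before `start`
  ok <- idx > 0
  ok[ok] <- end[ok] <= seg$End[idx[ok]]   # that row must also cover `end`
  idx[!ok] <- NA
  idx
}

# Combine the segments of two files for a single chromosome.
combine_segments <- function(f1, f2) {
  # every start/stop position from either file is a breakpoint
  brk <- sort(unique(c(f1$Start, f1$End, f2$Start, f2$End)))
  start <- head(brk, -1)
  end <- tail(brk, -1)
  i1 <- lookup(f1, start, end)
  i2 <- lookup(f2, start, end)
  keep <- !is.na(i1) & !is.na(i2)         # drop pieces not covered by both files
  data.frame(Chromosome = f1$Chromosome[i1[keep]],
             Start = start[keep], End = end[keep],
             Cn_File_1 = f1$Cn[i1[keep]], mCn_File_1 = f1$mCn[i1[keep]],
             Cn_File_2 = f2$Cn[i2[keep]], mCn_File_2 = f2$mCn[i2[keep]])
}

# Apply per chromosome and stack the results.
res <- do.call(rbind, Map(combine_segments,
                          split(file1, file1$Chromosome),
                          split(file2, file2$Chromosome)))
print(res)
```

Because each lookup is a binary search over sorted starts, this avoids the nested loop from the question while still working chromosome by chromosome.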

Answer 1 (score: 1)

This should work; you only need to rename the columns afterwards. This way you can also pull out any other columns you want:

file_merged <- merge(file1, file2, by = c("Chromosome", "Start", "End"))
file_merged[, c("Chromosome", "Start", "End", "Cn.x", "mCn.x", "Cn.y", "mCn.y")]

  Chromosome     Start       End Cn.x mCn.x Cn.y mCn.y
1       chr1         0 121184898    2     1    2     1
2       chr1 144028314 147376741    4     0    4     0
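
The renaming can also be done inside merge itself through its suffixes argument. The sketch below uses the first three rows of each table from the question and also illustrates a caveat: merge matches the key columns exactly, so only segments whose Start and End are identical in both files are paired up. Passing all = TRUE keeps the unmatched segments, with NA on the side where they are missing. The names keys, exact and outer are made up for this example.

```r
# First three rows of each table (only the columns needed here).
file1 <- data.frame(Chromosome = "chr1",
                    Start = c(0, 144028314, 147376741),
                    End   = c(121184898, 147376741, 149815307),
                    Cn = c(2, 4, 3), mCn = c(1, 0, 0))
file2 <- data.frame(Chromosome = "chr1",
                    Start = c(0, 144028314, 147376741),
                    End   = c(121184898, 147376741, 149573557),
                    Cn = c(2, 4, 4), mCn = c(1, 0, 0))

keys <- c("Chromosome", "Start", "End")

# Inner join: only segments with identical Start and End in both files.
exact <- merge(file1, file2, by = keys,
               suffixes = c("_File_1", "_File_2"))

# Full outer join: unmatched segments are kept and padded with NA.
outer <- merge(file1, file2, by = keys, all = TRUE,
               suffixes = c("_File_1", "_File_2"))

print(exact)
print(outer)
```

The third segment ends at a different position in each file, so it appears in outer as two separate rows and not at all in exact; segments like this have to be split at the other file's breakpoints before an exact merge can pair them.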