Finding overlaps/ranges between two datasets, or enumerating them, using pandas

Asked: 2017-05-12 14:07:34

Tags: python intervals

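I am trying to perform some interval-range operations on two files, under the following conditions: check whether chrom is equal, then check whether the start and end from my coordinates file fall within the start and end from the gene_annotation file (with the caveat that for a "+" strand the start and end will be, for example, 10-20, while for a "-" strand they will be 20-10). If there is a match, print start, end and strand from the coordinates file and gene_id, gene_name from the gene annotation file. (For illustration, I am showing only the head of the annotation file.)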

Number of rows in the annotation file: ~50,000. Number of rows in the coordinates file: ~200,000.

gene_annotationfile

chrom     start       end             gene_id    gene_name strand
17  71223692  71274336  ENSMUSG00000085299      Gm16627      -
17  18186448  18211184  ENSMUSG00000067978  Vmn2r-ps113      +
11  84645863  84684319  ENSMUSG00000020530       Ggnbp2      -
 7  51097639  51106551  ENSMUSG00000074155         Klk5      +
13  31711037  31712238  ENSMUSG00000087276      Gm11378      +

coordinates_file

  chrom start   end strand
  1 4247322 4247912 -
  1 4427449 4432604 +
  1 4763414 4764404 -
  1 4764597 4767606 -
  1 4764597 4766491 -
  1 4766882 4767606 -
  1 4767729 4772649 -
  1 4767729 4768829 -
  1 4767729 4775654 -
  1 4772382 4772649 -
  1 4772814 4774032 -
  1 4772814 4774159 -
  1 4772814 4775654 -
  1 4772814 4774032 +
  1 4774186 4775654 -
  1 4774186 4775654 
  1 4774186 4775699 -

Expected output

 chrom, start, end,strand, gene_id, gene_name
 1      4427432 4432686 + ENSMUSG0001 abcd

Another issue is that in some cases a match may map to more than one gene_id; in that case I would like to write:

 chrom, start, end,strand, gene_id, gene_name
 1      4427432 4432686 + ENSMUSG0001,ENSMUSG0002 abcd,efgh
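
One way to get this comma-separated form is to collect every matching gene per coordinate row before printing. A minimal sketch in Python 3, assuming the matches have already been found and are available as (coordinate, gene_id, gene_name) tuples; `matches` and `collapse_matches` are hypothetical names, and the values are only the placeholders from the example above:

```python
from collections import OrderedDict

def collapse_matches(matches):
    """Group matched genes by coordinate row and join them with commas."""
    grouped = OrderedDict()
    for coord, gene_id, gene_name in matches:
        ids, names = grouped.setdefault(coord, ([], []))
        ids.append(gene_id)
        names.append(gene_name)
    # One output row per coordinate, genes joined in the order they were found.
    return [coord + (",".join(ids), ",".join(names))
            for coord, (ids, names) in grouped.items()]

# Placeholder values taken from the expected output above.
matches = [
    (("1", 4427432, 4432686, "+"), "ENSMUSG0001", "abcd"),
    (("1", 4427432, 4432686, "+"), "ENSMUSG0002", "efgh"),
]
collapsed = collapse_matches(matches)
for row in collapsed:
    print("\t".join(str(field) for field in row))
```

Keying on the whole coordinate tuple keeps rows that share a start but differ in end or strand apart.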

My code so far:

import csv

with open('coordinates.txt', 'r') as source:
    coordinates = list(csv.reader(source, delimiter="\t"))

with open('/gene_annotations.txt', 'rU') as source:
    # if I do not use 'rU' I get this error: Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
    annotations = list(csv.reader(source, delimiter="\t"))

for index, line in enumerate(coordinates):

    for index2, line2 in enumerate(annotations):


        if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2]:
            print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0], coordinates[line][1], coordinates[line][2], annotations[line2][3], annotations[line2][4])
            break

I get this error:

---> 15         if coordinates[line][0] == annotations[line2][0] and coordinates[line][1] <= annotations[line2][1] and annotations[line2][2] >= coordinates[line][2] :
16              print "%s\t%s\t%s\t%s\t%s" % (coordinates[line][0],coordinates[line][1],coordinates[line][2], annotations[line2][3], annotations[line2][4])
17              break

TypeError: list indices must be integers, not list
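
The TypeError comes from the loop variables: `enumerate` yields `(index, row)` pairs, so `line` and `line2` are already rows (lists), and `coordinates[line]` tries to index a list with a list. A minimal corrected sketch in Python 3 follows. It makes several assumptions: the files are tab-separated with a header row; the in-memory strings stand in for the real files (the `GENE_A` annotation row is made up); "within" means the coordinate interval lies inside the gene interval once each start/end pair is put in ascending order; and strands are not required to match. It also converts start and end to `int`, since `csv.reader` returns strings, and drops the `break` so that more than one gene can match:

```python
import csv
import io

# Hypothetical in-memory stand-ins for coordinates.txt and gene_annotations.txt.
COORDS = "chrom\tstart\tend\tstrand\n1\t4427449\t4432604\t+\n1\t4247322\t4247912\t-\n"
GENES = "chrom\tstart\tend\tgene_id\tgene_name\tstrand\n1\t4427000\t4433000\tGENE_A\tabcd\t+\n"

def read_rows(text):
    """Parse tab-separated text and skip the header row."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    return rows[1:]

def find_matches(coordinates, annotations):
    """Return coordinate rows that lie inside a gene on the same chromosome."""
    results = []
    for coord in coordinates:  # iterate over the rows directly, no index lookup
        chrom, c_start, c_end = coord[0], int(coord[1]), int(coord[2])
        c_lo, c_hi = min(c_start, c_end), max(c_start, c_end)
        for gene in annotations:
            # Order the gene interval so rows written as 20-10 still compare correctly.
            g_lo, g_hi = sorted((int(gene[1]), int(gene[2])))
            if chrom == gene[0] and g_lo <= c_lo and c_hi <= g_hi:
                results.append((chrom, c_start, c_end, coord[3], gene[3], gene[4]))
    return results

found = find_matches(read_rows(COORDS), read_rows(GENES))
for row in found:
    print("\t".join(str(field) for field in row))
```

With roughly 200,000 × 50,000 row pairs this nested loop will be slow; it is meant to show the indexing fix rather than a scalable solution.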

Would pandas be a good way to do this?
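
pandas can express this as a join on `chrom` followed by a filter. The sketch below (Python 3) assumes both files load into DataFrames with the column names shown above; the small frames and the `GENE_A`/`GENE_B`/`GENE_C` rows are made-up sample data, and `annotate` is a hypothetical helper. A plain merge on `chrom` builds every coordinate/gene pair per chromosome, so with files of this size it may need to be run one chromosome at a time to keep memory in check:

```python
import pandas as pd

# Made-up sample data using the column layout of the two files.
coords = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "start": [4427449, 4247322, 100],
    "end": [4432604, 4247912, 200],
    "strand": ["+", "-", "+"],
})
genes = pd.DataFrame({
    "chrom": ["1", "1", "2"],
    "start": [4427000, 4433500, 500],   # GENE_B is written end-first ("-" strand)
    "end": [4433000, 4420000, 900],
    "gene_id": ["GENE_A", "GENE_B", "GENE_C"],
    "gene_name": ["abcd", "efgh", "ijkl"],
    "strand": ["+", "-", "+"],
})

def annotate(coords, genes):
    """Attach comma-joined gene_id/gene_name to coordinates lying inside a gene."""
    g = genes.copy()
    # Order each gene interval so rows written as 20-10 still compare correctly.
    g["g_lo"] = g[["start", "end"]].min(axis=1)
    g["g_hi"] = g[["start", "end"]].max(axis=1)
    g = g[["chrom", "g_lo", "g_hi", "gene_id", "gene_name"]]

    c = coords.copy()
    c["c_lo"] = c[["start", "end"]].min(axis=1)
    c["c_hi"] = c[["start", "end"]].max(axis=1)

    pairs = c.merge(g, on="chrom")  # every coordinate/gene pair per chromosome
    inside = pairs[(pairs["g_lo"] <= pairs["c_lo"]) & (pairs["c_hi"] <= pairs["g_hi"])]

    # Collapse multiple matching genes into one comma-separated row.
    return (inside.groupby(["chrom", "start", "end", "strand"], sort=False)
                  .agg({"gene_id": ",".join, "gene_name": ",".join})
                  .reset_index())

result = annotate(coords, genes)
print(result.to_csv(sep="\t", index=False))
```

If the real files are tab-separated, they could be loaded with `pd.read_csv(path, sep="\t", dtype={"chrom": str})`; reading `chrom` as a string on both sides keeps the merge keys comparable.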

1 Answer:

Answer 0 (score: 0)