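I am trying to perform some operations on a file containing a list of tab-separated values, using Python 2.7.2. For context, the file format is called BED and represents a list of genes, with each gene on its own line. The first three fields of each line are the coordinates. Another field holds a description, which may be identical across several lines.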
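For illustration, one line of such a file could be read as follows. This is only a sketch: the `Gene` record and the `parse_bed_line` helper are hypothetical names, and the six columns are assumed to be chromosome, start, end, description, score and strand, as in the example lines in this post.

```python
from collections import namedtuple

# Hypothetical record type; the attribute names mirror those used later in the post.
Gene = namedtuple("Gene", ["chr", "start", "end", "description", "score", "strand"])

def parse_bed_line(line):
    """Split one tab-separated BED line into a Gene record."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end, description, score, strand = fields[:6]
    # Coordinates are converted to integers so they can be compared numerically.
    return Gene(chrom, int(start), int(end), description, score, strand)

gene = parse_bed_line("chr1\t1\t3\tgeneA\t1000\t+\n")
```

Stripping the trailing newline before splitting avoids having to clean it out of the strand field afterwards.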
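Among the lines that share a description, I need to group together all lines with overlapping coordinates and give each such subgroup an unambiguous name. The catch is that every line whose coordinates overlap must end up in the same group. For example, these lines: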
chr1 1 3 geneA 1000 +
chr1 3 5 geneA 1000 +
chr1 4 6 geneA 1000 +
chr1 8 9 geneA 1000 +
should be grouped as follows:
chr1 1 3 geneA 1000 +
chr1 3 5 geneA 1000 +
chr1 4 6 geneA 1000 +
and
chr1 8 9 geneA 1000 +
The final goal is to output one (new) line per subgroup, e.g.:
chr1 1 6 geneA 1000 +
chr1 8 9 geneA 1000 +
The value of the first field (chr) varies, and subgroups should only be built among lines that have the same chr value.
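The grouping described above can be sketched as a sort-and-sweep merge. This is an illustration under stated assumptions, not the code from this post: `merge_overlapping` is a hypothetical helper, genes are plain tuples, two lines count as overlapping when one starts at or before the end of the other (so 1-3 and 3-5 are grouped, as in the example), and score and strand are taken from the first line of each subgroup.

```python
from collections import defaultdict

def merge_overlapping(genes):
    """Merge overlapping genes that share both chr and description.

    `genes` is an iterable of (chr, start, end, description, score, strand) tuples.
    """
    # Bucket the lines by (chr, description) so unrelated lines never merge.
    buckets = defaultdict(list)
    for gene in genes:
        buckets[(gene[0], gene[3])].append(gene)

    merged = []
    for key in sorted(buckets):
        # Sorting by start lets a single left-to-right sweep find every subgroup.
        group = sorted(buckets[key], key=lambda g: (g[1], g[2]))
        current = list(group[0])
        for gene in group[1:]:
            if gene[1] <= current[2]:
                # Assumption: touching coordinates count as overlapping.
                current[2] = max(current[2], gene[2])
            else:
                merged.append(tuple(current))
                current = list(gene)
        merged.append(tuple(current))
    return merged

lines = [
    ("chr1", 1, 3, "geneA", "1000", "+"),
    ("chr1", 3, 5, "geneA", "1000", "+"),
    ("chr1", 4, 6, "geneA", "1000", "+"),
    ("chr1", 8, 9, "geneA", "1000", "+"),
]
result = merge_overlapping(lines)
# result == [("chr1", 1, 6, "geneA", "1000", "+"), ("chr1", 8, 9, "geneA", "1000", "+")]
```

Because each bucket is sorted once and then scanned once, no pairwise comparison of every gene against every other gene is needed.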
So far I have tried to solve the problem with this (wrong) approach:
# key = description
# values = list of lines (genes) with the same description
# self.raw_groups_of_overlapping = attribute (dict) containing, for a given description, all genes whose description matches the key
# self.picked_cherries = attribute (dict) in which I would like to store, for a given unique identifier, all genes in a specific subgroup (sub-grouping lines according to the aforementioned rule)
# self.__overlappingGenes__(j, k) = method evaluating whether lines (genes) j and k overlap
for key,values in self.raw_groups_of_overlapping.items():
    for j in values:
        #Remove gene j from list:
        is_not_j = lambda x: x is not j
        other_genes = filter(is_not_j, values)
        for k in other_genes:
            if self.__overlappingGenes__(j,k):
                intersection = [x for x in j.overlaps_with if x in k.overlaps_with]
                identifier = ''
                for gene in intersection:
                    identifier += gene.chr.replace('chr', '') + str(gene.start) + str(gene.end) + gene.description + gene.strand.replace('\n', '')
                try:
                    self.picked_cherries[identifier].append(j)
                except:
                    self.picked_cherries[identifier] = []
                    self.picked_cherries[identifier].append(j)
                break
As far as I can tell, I am not taking all the genes into account. I would appreciate your input.