Putting different genes of the same chromosomal region on a single line in Python

Time: 2014-06-26 09:03:08

Tags: python string list printing

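Hello, I am new to Python. I have a file with chromosomal regions and the corresponding genes in each region, and I need to put the different genes of the same region on a single line. For example: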

chr12   10954262    10962540    chr12   10880241    11502235    100.0       ACACB   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       RAD52   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       RAD52   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       TAS2R8  -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       TAS2R9  -

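From the lines above, I want a single line (as below) that lists all the gene names for that chromosomal region, instead of multiple lines: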

chr12   10954262    10962540    chr12   10880241    11502235    100.0 ACACB, RAD52, RAD52, TAS2R8, TAS2R9

Thank you very much for your help.

Jyothi

2 answers:

Answer 0: (score: 0)

Assume filename is a file containing the following:

chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -  
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -  
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -  
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R8 - 
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R9 - 
chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1 -   
chr12 10977955 10999847 chr12 10880241 11502235 100.0 KCTD10 - 
chr12 10977955 10999847 chr12 10880241 11502235 100.0 MMAB -   
chr12 10977955 10999847 chr12 10880241 11502235 100.0 MYO1H -  
chr12 10977955 10999847 chr12 10880241 11502235 100.0 PRR4 -   
chr12 10977955 10999847 chr12 10880241 11502235 100.0 RAD52 -  

script.py

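from collections import defaultdict

# Map each region (every column except gene name and trailing "-") to its genes.
genes_dict = defaultdict(list)
with open("filename", "r") as handle:
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        # Split from the right: region key, gene name, trailing "-" column.
        key, val, _ = line.rstrip().rsplit(None, 2)
        genes_dict[key].append(val)

# Regions come out in dict iteration order, which is arbitrary on older Pythons.
for key in genes_dict:
    print("%s %s" % (key, ",".join(genes_dict[key])))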

Output

chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1,KCTD10,MMAB,MYO1H,PRR4,RAD52
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB,RAD52,RAD52,TAS2R8,TAS2R9
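
As a variation on the dictionary approach above, here is a minimal sketch that keeps regions in the order they first appear and joins genes with ", " as in the question's desired output. The function name merge_genes, the sep parameter, and the sample lines are illustrative assumptions, not part of the original answer.

```python
from collections import OrderedDict


def merge_genes(lines, sep=", "):
    """Group gene names by region, keeping regions in first-seen order."""
    regions = OrderedDict()
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # Assumed layout: last column is "-", second to last is the gene name.
        key = " ".join(fields[:-2])
        regions.setdefault(key, []).append(fields[-2])
    return ["%s %s" % (key, sep.join(genes)) for key, genes in regions.items()]


# Hypothetical sample rows in the same layout as the question's file.
sample = [
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -",
    "chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1 -",
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -",
]
for merged in merge_genes(sample):
    print(merged)
```

Because it accepts any iterable of lines, the same function could be fed an open file object instead of a list.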

Answer 1: (score: 0)

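Here is one approach using itertools.groupby(). It is more tolerant of whitespace variation between the data columns, but because groupby() needs its input sorted first, it may not work well on very large input files.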

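from itertools import groupby

def keyfunc(row):
    # key is assumed to be all fields excluding the gene identifier
    return row[:-1]

# Drop the trailing "-" column from each row; skip blank lines.
with open('chr_regions.txt') as handle:
    rows = [line.split()[:-1] for line in handle if line.strip()]

# groupby() only merges adjacent rows, so sort by the key first.
for k, g in groupby(sorted(rows, key=keyfunc), keyfunc):
    print('%s %s' % ('    '.join(k), ', '.join(x[-1] for x in g)))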

Given this unsorted input (chr_regions.txt):

chr15   58887403    59042177    chr15   58887403    59042177    100.0       ADAM10  -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       ACACB   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       RAD52   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       RAD52   -
chr21   43619799    43717354    chr21   43619799    43717354    100.0       ABCG1   -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       TAS2R8  -
chr12   10954262    10962540    chr12   10880241    11502235    100.0       TAS2R9  -
chr21   43619799    43717354    chr21   43619799    43717354    100.0       ABCG2   -

Produces:

chr12    10954262    10962540    chr12    10880241    11502235    100.0 ACACB, RAD52, RAD52, TAS2R8, TAS2R9
chr15    58887403    59042177    chr15    58887403    59042177    100.0 ADAM10
chr21    43619799    43717354    chr21    43619799    43717354    100.0 ABCG1, ABCG2
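
If the input is already grouped, with rows for the same region on adjacent lines as in the question's example, the sort can be skipped and groupby() can stream the file one row at a time. The following is a minimal sketch under that assumption. The name merge_adjacent and the sample rows are illustrative, and a region that reappears later on non-adjacent lines would produce a separate output line.

```python
from itertools import groupby


def merge_adjacent(lines):
    """Yield one merged line per run of adjacent rows sharing a region."""
    # Drop the trailing "-" column; skip blank lines. This stays lazy.
    rows = (line.split()[:-1] for line in lines if line.strip())
    for key, group in groupby(rows, key=lambda row: row[:-1]):
        genes = ", ".join(row[-1] for row in group)
        yield "%s %s" % ("    ".join(key), genes)


# Hypothetical pre-grouped sample in the same layout as chr_regions.txt.
sample = [
    "chr12   10954262    10962540    chr12   10880241    11502235    100.0       ACACB   -",
    "chr12   10954262    10962540    chr12   10880241    11502235    100.0       RAD52   -",
    "chr21   43619799    43717354    chr21   43619799    43717354    100.0       ABCG1   -",
]
result = list(merge_adjacent(sample))
for merged in result:
    print(merged)
```

Since nothing is sorted or held in memory beyond the current group, this form should cope better with very large files, at the cost of relying on the input order.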