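Hello, I am new to Python. I have a file with chromosome regions and the corresponding genes for each region, and I need to put the different genes of the same region on a single line. For example: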
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R8 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R9 -
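I want to collapse the lines above into a single line (as below), with all the gene names listed after the chromosome region instead of spread over multiple lines: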
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB, RAD52, RAD52, TAS2R8, TAS2R9
Thank you very much for your help.
Jyothi
Answer 0 (score: 0)
Assume filename is a file containing the following:
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R8 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R9 -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1 -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 KCTD10 -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 MMAB -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 MYO1H -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 PRR4 -
chr12 10977955 10999847 chr12 10880241 11502235 100.0 RAD52 -
script.py
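from collections import defaultdict

genes_dict = defaultdict(list)
with open("filename", "r") as handle:
    for line in handle:
        # Reverse the line and split on the first two spaces to peel off
        # the trailing "-" and the gene name; the remainder is the region key.
        _, val, key = line.rstrip("\n")[::-1].split(" ", 2)
        genes_dict[key[::-1]].append(val[::-1])

# Print each region once, followed by its comma-separated gene names.
for key in genes_dict:
    print("%s %s" % (key, ",".join(genes_dict[key])))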
Output (the order of the regions may differ depending on the Python version):
chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1,KCTD10,MMAB,MYO1H,PRR4,RAD52
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB,RAD52,RAD52,TAS2R8,TAS2R9
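The reverse, split, reverse-again step in the script can also be written with str.rsplit, which splits from the right directly. The following is only a sketch of that idea, not the answer's own code: the helper names split_region_and_gene and group_genes and the in-memory sample list are made up for illustration, and it assumes the last two whitespace-separated fields are always the gene name and the trailing "-".

```python
from collections import defaultdict


def split_region_and_gene(line):
    """Split one line into (region, gene), dropping the trailing '-' column.

    Assumption: the last two whitespace-separated fields are the gene
    name and the '-' marker, as in the sample data.
    """
    region, gene, _ = line.rsplit(None, 2)
    return region, gene


def group_genes(lines):
    """Collect gene names per region (hypothetical helper).

    On Python 3.7+ the regions keep the order in which they first appear.
    """
    genes = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        region, gene = split_region_and_gene(line)
        genes[region].append(gene)
    return genes


# Illustrative sample standing in for the contents of the input file.
sample = [
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -\n",
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -\n",
    "chr12 10977955 10999847 chr12 10880241 11502235 100.0 ERC1 -\n",
]
for region, names in group_genes(sample).items():
    print("%s %s" % (region, ",".join(names)))
```

To run it on a real file, pass an open file object to group_genes instead of the sample list.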
Answer 1 (score: 0)
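Here is one approach using itertools.groupby(). It is more tolerant of whitespace variations between the data columns, but because groupby() needs its input sorted first, the whole file has to be read into memory, so it may not work well on very large input files.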
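from itertools import groupby

def keyfunc(row):
    # key is assumed to be all fields excluding the gene identifier
    return row[:-1]

# Drop the trailing "-" column so the gene name becomes the last field.
with open('chr_regions.txt') as handle:
    rows = [line.split()[:-1] for line in handle]

# groupby() only merges adjacent rows, hence the sort on the same key.
for k, g in groupby(sorted(rows, key=keyfunc), keyfunc):
    print('%s %s' % (' '.join(k), ', '.join(x[-1] for x in g)))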
Given this unsorted input (chr_regions.txt):
chr15 58887403 59042177 chr15 58887403 59042177 100.0 ADAM10 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -
chr21 43619799 43717354 chr21 43619799 43717354 100.0 ABCG1 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R8 -
chr12 10954262 10962540 chr12 10880241 11502235 100.0 TAS2R9 -
chr21 43619799 43717354 chr21 43619799 43717354 100.0 ABCG2 -
it produces:
chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB, RAD52, RAD52, TAS2R8, TAS2R9
chr15 58887403 59042177 chr15 58887403 59042177 100.0 ADAM10
chr21 43619799 43717354 chr21 43619799 43717354 100.0 ABCG1, ABCG2
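If the sort is the concern for very large files, one possible variation is to skip it and let groupby() work on the lines as they stream in. This is only a sketch under a strong assumption: all rows for a given region must already be adjacent in the input (as in the question's example, but not in the unsorted input above). The function name merge_contiguous and the sample list are hypothetical.

```python
from itertools import groupby


def merge_contiguous(lines):
    """Yield one merged line per run of consecutive rows sharing a region.

    Assumption: rows for the same region are already adjacent in the
    input. If they are not, that region is emitted more than once.
    """
    # Drop the trailing "-" column; skip blank lines.
    rows = (line.split()[:-1] for line in lines if line.strip())
    # The key is every field except the gene name (the last one left).
    for region, group in groupby(rows, key=lambda row: row[:-1]):
        genes = ", ".join(row[-1] for row in group)
        yield "%s %s" % (" ".join(region), genes)


# Illustrative sample in which each region's rows are adjacent.
sample = [
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 ACACB -\n",
    "chr12 10954262 10962540 chr12 10880241 11502235 100.0 RAD52 -\n",
    "chr21 43619799 43717354 chr21 43619799 43717354 100.0 ABCG1 -\n",
    "chr21 43619799 43717354 chr21 43619799 43717354 100.0 ABCG2 -\n",
]
for merged in merge_contiguous(sample):
    print(merged)
```

Because merge_contiguous is a generator that holds only one group at a time, it can be fed an open file object directly without loading the whole file.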