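I have a list of genes (gene1, gene2, ...) that contains all the genes I am interested in. I now want to extract the free energy data for each gene separately so that I can process it on its own.
My dataset looks like this and contains information for more than 500 genes: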
==> data/gene1_free_energy.dat <==
0 0 0
1 0 0
2 0 2.3
3 0 5.4
.
.
.
==> data/gene1_rare_enrichment.dat <==
7 0.166667 0.939498
8 0.222222 0.930714
9 0.0555556 0.998125
10 0.166667 0.826133
.
.
.
==> data/gene2_free_energy.dat <==
0 0 0
1 0 0
2 0 2.3
3 0 5.4
.
.
.
==> data/gene2_rare_enrichment.dat <==
7 0.166667 0.939498
8 0.222222 0.930714
9 0.0555556 0.998125
10 0.166667 0.826133
.
.
.
To extract the data between two delimiters, I found this answer very useful: Repeatedly extract a line between two delimiters in a text file, Python. However, I cannot figure out how to make the gene name a variable.
import re
with open(input1) as fp:
    for result in re.findall('==> data/gene1_free_energy.dat <==(.*?)==> data/gene1_rare_enrichment.dat <==', fp.read(), re.S):
        print(result)  # or save this in a dictionary or whatever
This prints the data for gene1 nicely.
I tried the following, but it does not work.
import re
for name in gene_list:  # this is my list of included genes
    with open(input1) as fp:
        for result in re.findall('==> data/' + name + '_free_energy.dat <==(.*?)==> data/' + name + '_rare_enrichment.dat <==', fp.read(), re.S):
            print(result)
Is there a way to write such a loop? Or is there another, smarter way to extract the data I need?
Answer 0 (score: 0)
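with open('data.txt') as f:
    RC = False  # True while inside a free_energy block
    D = []      # one list of rows per gene
    key = []    # gene names, in file order
    d = []      # rows of the block currently being read
    for line in f:
        if 'free_energy' in line:
            # header such as '==> data/gene1_free_energy.dat <=='
            RC = True
            key.append(line.split('/')[1].split('_')[0])
        if RC:
            if '==>' not in line:
                d.append(line.split())
        if 'rare_enrichment' in line:
            # the rare_enrichment header closes the free_energy block
            RC = False
            D.append(d)
            d = []
data = {k: a for k, a in zip(key, D)}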
output: {'gene1': [['0', '0', '0'],
['1', '0', '0'],
['2', '0', '2.3'],
['3', '0', '5.4']],
'gene2': [['0', '0', '0'],
['1', '0', '0'],
['2', '0', '2.3'],
['3', '0', '5.4']]}
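If you would rather stay with the regex approach from the question, here is a minimal sketch of a loop over the gene names. The function name extract_free_energy is made up for illustration, and the sketch rests on some assumptions: each header sits on its own line, every data row has exactly three whitespace-separated columns, and the file is small enough to read into memory once. It uses re.escape so that the "." in ".dat" is matched literally, and it stops at the next "==> " header (or the end of the text) instead of requiring the matching rare_enrichment header.

```python
import re

def extract_free_energy(text, gene_list):
    """Return {gene: [[col1, col2, col3], ...]} for every gene found in text.

    Hypothetical helper; assumes three columns per data row.
    """
    data = {}
    for name in gene_list:
        # re.escape keeps '.' and any other metacharacters in the header literal.
        header = re.escape('==> data/' + name + '_free_energy.dat <==')
        # Capture lazily up to the next '==> ' header line or the end of the text.
        match = re.search(header + r'\n(.*?)(?=^==> |\Z)', text, re.S | re.M)
        if match is None:
            continue  # this gene has no free_energy block in the text
        rows = [line.split() for line in match.group(1).splitlines()]
        # Assumption: keep only three-column rows, dropping blank or filler lines.
        data[name] = [row for row in rows if len(row) == 3]
    return data
```

You would call it as data = extract_free_energy(open(input1).read(), gene_list), so the file is read once rather than once per gene. The values are left as strings, as in the output above; convert them with float() if you need numbers.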