I need to parse a file whose contents look like this:
20 31022550 G 1396 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50 T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20 31022551 A 1271 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57 +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24
After parsing, I would like it to look like this:
20 31022550 G 1396 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 A 2 60 33 37 2 0 0.02 0.02 40 2 0.98 126
20 31022550 G 1396 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 G 1391 60 36.08 36.97 719 672 0.51 0.01 7.59 719 0.49 126
20 31022550 G 1396 T 1 60 33 37 0 1 0.37 0.02 47 0 0 126
20 31022550 G 1396 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022550 G 1396 +A 2 60 0 37 2 0 0.67 0.01 0 2 0.65 126
20 31022551 A 1271 = 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 A 960 60 35.23 36.99 496 464 0.5 0 6.38 496 0.49 126
20 31022551 A 1271 C 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 G 13 60 35 35.92 4 9 0.13 0.02 44.92 4 0.98 126
20 31022551 A 1271 T 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 N 0 0 0 0 0 0 0 0 0 0 0 0
20 31022551 A 1271 +G 288 60 0 37 171 117 0.57 0.01 8.17 171 0.54 126
20 31022551 A 1271 +GG 9 60 0 37 5 4 0.71 0.03 23.67 5 0.5 126
20 31022551 A 1271 +GGG 1 60 0 37 1 0 0.51 0.03 14 1 0.24 126
I have many more rows, distinguished by column[1]:
31022550 ... 31022NNN
What I am trying to do here is print only certain parts of the file, using the pseudocode below with column[1] as the key:
from collections import defaultdict

ids = defaultdict(list)
with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {
                'pos': pos[0].strip(),
                'base': base,
                'count': ele.split(':')[1],
                '_pos': ele.split(':')[5],
                '_neg': ele.split(':')[6]
            }

for k,v in ids.iteritems():
    print k,v
31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
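I'm not sure why I don't see all the fields for 31022550 saved as key/value pairs.
Answer 0 (score: 1)
You are only ever assigning the last dictionary to the p key: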
ids[p] = {
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
}
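This bypasses the defaultdict factory for new keys entirely; you are simply assigning a dictionary value, and each assignment replaces the previous one. If you want to build a list of dictionaries for each key, you need to use list.append():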
ids[p].append({
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
})
This looks up the ids[p] value (creating it as an empty list if the key doesn't exist yet), then appends the dictionary to the end of that list.
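To see the difference in isolation, here is a minimal sketch written in Python 3 syntax. The records list is hypothetical sample data standing in for the parsed fields, not something read from the question's file.

```python
from collections import defaultdict

# Hypothetical (key, base) pairs standing in for the parsed fields.
records = [("31022550", "A"), ("31022550", "G"), ("31022551", "+G")]

overwritten = defaultdict(list)
appended = defaultdict(list)
for key, base in records:
    # Plain assignment replaces whatever was stored under the key,
    # so the list factory is never called.
    overwritten[key] = {"base": base}
    # Indexing first calls the factory for a missing key,
    # then append() adds to the list that was returned.
    appended[key].append({"base": base})

print(dict(overwritten))
print(dict(appended))
```

After the loop, overwritten holds a single dictionary per key (the last one seen), while appended holds a list with every dictionary for that key.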
I'd use the csv module to simplify the code a little by letting it handle splitting the lines:
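import csv
import os
from collections import defaultdict

ids = defaultdict(list)
# open() does not expand '~', so resolve the home directory explicitly
with open(os.path.expanduser('~/file.tsv'), 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        pos, key = row[:2]
        # columns 4 onwards each hold one colon-separated record
        for elems in row[4:]:
            elems = elems.split(':')
            ids[key].append({
                'pos': pos,
                'base': elems[0],
                'count': elems[1],
                '_pos': elems[5],
                '_neg': elems[6]
            })

# print every stored dictionary on its own line, prefixed with its key
for key, rows in ids.iteritems():
    for row in rows:
        print '{}\t{}'.format(key, row)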
This produces:
31022550 {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '2', 'base': 'A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022550 {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '1391', 'base': 'G', 'pos': '20', '_neg': '672', '_pos': '719'}
31022550 {'count': '1', 'base': 'T', 'pos': '20', '_neg': '1', '_pos': '0'}
31022550 {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '960', 'base': 'A', 'pos': '20', '_neg': '464', '_pos': '496'}
31022551 {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '13', 'base': 'G', 'pos': '20', '_neg': '9', '_pos': '4'}
31022551 {'count': '0', 'base': 'T', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551 {'count': '288', 'base': '+G', 'pos': '20', '_neg': '117', '_pos': '171'}
31022551 {'count': '9', 'base': '+GG', 'pos': '20', '_neg': '4', '_pos': '5'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}
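If the goal is the flattened layout shown in the question rather than dictionaries, one possible approach is sketched below in Python 3 syntax. The names tidy and flatten are hypothetical helpers, and two things are assumptions inferred from the desired output in the question: trailing zeros are trimmed (60.00 becomes 60), and only the first 12 colon-separated values after the base are kept, so the last value is dropped.

```python
def tidy(value):
    """Render '60.00' as '60' and '0.50' as '0.5'."""
    number = float(value)
    return str(int(number)) if number.is_integer() else str(number)


def flatten(line, kept=12):
    """Yield one space-separated row per record of a tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    prefix = fields[:4]  # the four leading columns, repeated on every row
    for field in fields[4:]:
        base, *values = field.split(":")
        # Assumption: keep only the first `kept` values, as in the question.
        yield " ".join(prefix + [base] + [tidy(v) for v in values[:kept]])


# Sample record copied from the second input line in the question.
sample = ("20\t31022551\tA\t1271\t"
          "G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37")
for row in flatten(sample):
    print(row)
```

Calling flatten on each line of the file and printing the rows gives one output line per record, which avoids building the intermediate dictionary when nothing needs to be looked up by key.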