使用defaultdict解析多分隔符文件

时间:2017-09-17 13:06:19

标签: python python-2.7 defaultdict

我需要解析一个内容如下的文件:

20  31022550    G   1396    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:2:60.00:33.00:37.00:2:0:0.02:0.02:40.00:2:0.98:126.00:0.98    C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:1391:60.00:36.08:36.97:719:672:0.51:0.01:7.59:719:0.49:126.00:0.50    T:1:60.00:33.00:37.00:0:1:0.37:0.02:47.00:0:0.00:126.00:0.18    N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +A:2:60.00:0.00:37.00:2:0:0.67:0.01:0.00:2:0.65:126.00:0.65
20  31022551    A   1271    =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  A:960:60.00:35.23:36.99:496:464:0.50:0.00:6.38:496:0.49:126.00:0.52 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  G:13:60.00:35.00:35.92:4:9:0.13:0.02:44.92:4:0.98:126.00:0.37   T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00  +G:288:60.00:0.00:37.00:171:117:0.57:0.01:8.17:171:0.54:126.00:0.53 +GG:9:60.00:0.00:37.00:5:4:0.71:0.03:23.67:5:0.50:126.00:0.57   +GGG:1:60.00:0.00:37.00:1:0:0.51:0.03:14.00:1:0.24:126.00:0.24

解析后我希望它看起来

20  31022550    G   1396    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    A   2   60  33  37  2   0   0.02    0.02    40  2   0.98    126
20  31022550    G   1396    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    G   1391    60  36.08   36.97   719 672 0.51    0.01    7.59    719 0.49    126
20  31022550    G   1396    T   1   60  33  37  0   1   0.37    0.02    47  0   0   126
20  31022550    G   1396    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022550    G   1396    +A  2   60  0   37  2   0   0.67    0.01    0   2   0.65    126
20  31022551    A   1271    =   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    A   960 60  35.23   36.99   496 464 0.5 0   6.38    496 0.49    126
20  31022551    A   1271    C   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    G   13  60  35  35.92   4   9   0.13    0.02    44.92   4   0.98    126
20  31022551    A   1271    T   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    N   0   0   0   0   0   0   0   0   0   0   0   0
20  31022551    A   1271    +G  288 60  0   37  171 117 0.57    0.01    8.17    171 0.54    126
20  31022551    A   1271    +GG 9   60  0   37  5   4   0.71    0.03    23.67   5   0.5 126
20  31022551    A   1271    +GGG    1   60  0   37  1   0   0.51    0.03    14  1   0.24    126

我有更多行根据column[1] 31022550 ... 31022NNN

递增

代码

我在这里要做的是仅使用此伪代码打印文件的某些部分,并将column[1]作为密钥

from collections import defaultdict
ids = defaultdict(list)

with open('~/file.tsv', 'r') as f:
    for line in f:
        lines = line.strip().split('\t')
        pos = (lines[0:3])
        for ele in lines[4:]:
            # print pos
            p = pos[1].strip()
            base = ele.split(':')[0]
            ids[p] = {
                'pos': pos[0].strip(),
                'base': base,
                'count': ele.split(':')[1],
                '_pos': ele.split(':')[5],
                '_neg': ele.split(':')[6]
                }
\
for k,v in ids.iteritems():
    print k,v

输出

31022550 {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551 {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}

不确定为什么我没有看到31022550作为键值对保存的所有字段。

1 个答案:

答案 0 :(得分:1)

您只将最后一个字典分配给p键:

ids[p] = {
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
}

这完全绕过工厂的新钥匙;你只需要分配字典值。如果您想为每个密钥构建一个列表字典,则需要使用list.append()

ids[p].append({
    'pos': pos[0].strip(),
    'base': base,
    'count': ele.split(':')[1],
    '_pos': ele.split(':')[5],
    '_neg': ele.split(':')[6]
})

这将查找ids[p]值(如果该键尚不存在,则将其创建为空列表),然后将字典附加到该列表的末尾。

我使用csv模块稍微简化代码来处理行的分割:

import csv
from collections import defaultdict
ids = defaultdict(list)

with open('~/file.tsv', 'rb') as f:
    reader = csv.reader(f, delimiter='\t')
    for row in reader:
        pos, key = row[:2]
        for elems in row[4:]:
            elems = elems.split(':')
            ids[key].append({
                'pos': pos,
                'base': elems[0],
                'count': elems[1],
                '_pos': elems[5],
                '_neg': elems[6]
            })

for key, rows in ids.iteritems():
    for row in rows:
        print '{}\t{}'.format(key, row)

这会产生:

31022550    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': 'A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022550    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '1391', 'base': 'G', 'pos': '20', '_neg': '672', '_pos': '719'}
31022550    {'count': '1', 'base': 'T', 'pos': '20', '_neg': '1', '_pos': '0'}
31022550    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022550    {'count': '2', 'base': '+A', 'pos': '20', '_neg': '0', '_pos': '2'}
31022551    {'count': '0', 'base': '=', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '960', 'base': 'A', 'pos': '20', '_neg': '464', '_pos': '496'}
31022551    {'count': '0', 'base': 'C', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '13', 'base': 'G', 'pos': '20', '_neg': '9', '_pos': '4'}
31022551    {'count': '0', 'base': 'T', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '0', 'base': 'N', 'pos': '20', '_neg': '0', '_pos': '0'}
31022551    {'count': '288', 'base': '+G', 'pos': '20', '_neg': '117', '_pos': '171'}
31022551    {'count': '9', 'base': '+GG', 'pos': '20', '_neg': '4', '_pos': '5'}
31022551    {'count': '1', 'base': '+GGG', 'pos': '20', '_neg': '0', '_pos': '1'}