Question

I have a file containing this information:

#chrom    start    end    isoform
chr1    75  90  NM_100
chr1    100 120 NM_100
chr2    25  50  NM_200
chr2    55  75  NM_200
chr2    100 125 NM_200
chr2    155 200 NM_200

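From this file I want to create a dictionary in which the NM_ isoform IDs are the keys and the start and end positions are the values. Like this: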

dictionary = {NM_100: [(75, 90), (100,120)], NM_200: [(25, 50), (55,75), (100, 125), (155, 200)]}

I have been trying to write a function that zips the starts and ends together, using the code below, but I can't seem to get it to work.

def read_exons(line):
    parts = iter(line.split())
    chrom = next(parts)
    start = next(parts)
    end = next(parts)
    isoform = next(parts)
    return isoform, [(s, e) for s, e in zip(start, end)]

with open('test_coding.txt') as f:
    exons = dict(read_exons(line) for line in f
        if not line.strip().startswith('#'))

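As far as I can tell, the function doesn't let me append values, but I'm struggling to work out how to get the start and end of each line into the dictionary correctly. Any ideas? Is the problem iter() or zip?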

Answer 1

collections.defaultdict may help here:

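import collections

exons = collections.defaultdict(list)
with open('test_coding.txt') as f:
    for line in f:
        # Skip blank lines and the '#' header line.
        if not line.strip() or line.lstrip().startswith('#'):
            continue
        chrom, start, end, isoform = line.split()
        # Store the coordinates as ints rather than strings.
        exons[isoform].append((int(start), int(end)))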

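Simple!

This takes advantage of a few things:

It uses tuple unpacking to split each line into its columns, instead of the iter() approach above. In general, tuple unpacking is simpler and easier to read.
It builds the dictionary incrementally instead of trying to do it all at once, as your current solution does. Note that if you process the data line by line, you cannot collect all of an isoform's start/end pairs from a single line!
It uses collections.defaultdict to effectively map every key to an (initially) empty list, which saves you from checking whether each key is already mapped. Without defaultdict, you would have to write: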
```
exons = {}
...
    if isoform not in exons:
        exons[isoform] = []
    exons[isoform].append(...)
```
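
To tie the points above together, here is a minimal, self-contained sketch run against the sample data from the question. The io.StringIO stand-in for the file, the blank-line and header skipping, and the final conversion to a plain dict are my own assumptions, not part of the original answer. The last line shows why the original attempt fails: zip() on two strings pairs up their individual characters. On top of that, dict() built from (key, value) pairs keeps only the last value for each repeated key, so earlier exons would be overwritten anyway.

```python
import collections
import io

# Sample data copied from the question; io.StringIO stands in for the real file.
SAMPLE = """\
#chrom    start    end    isoform
chr1    75  90  NM_100
chr1    100 120 NM_100
chr2    25  50  NM_200
chr2    55  75  NM_200
chr2    100 125 NM_200
chr2    155 200 NM_200
"""


def read_exons(lines):
    """Group (start, end) pairs by isoform; `lines` is any iterable of text lines."""
    exons = collections.defaultdict(list)
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '#'-prefixed header.
        if not line or line.startswith('#'):
            continue
        chrom, start, end, isoform = line.split()
        exons[isoform].append((int(start), int(end)))
    # Return a plain dict so that looking up a missing key raises KeyError again.
    return dict(exons)


exons = read_exons(io.StringIO(SAMPLE))
print(exons)

# Why zip(start, end) goes wrong: zipping two strings pairs up characters.
print(list(zip('75', '90')))  # [('7', '9'), ('5', '0')]
```

With a real file you would pass the open file object instead, for example read_exons(f) inside a with open('test_coding.txt') as f: block.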
