Hi, I have a text file like this:
>NM_145914.2:212
TCTGATGGTAAAAGTCGAGGAGAAAGAAGA
>NM_000614.3:1086
ATTCAATTTAAAATCAGACTCTTTAGTTGA
>NM_012096.2:2808
CAGTTAAGGTTTCAAATTGTGGCAGGTGGT
>NM_173465.3:1682
GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC
>NM_001198858.1:490
CAACCACCACAACCTGCTGGTCTGCTCGGT
......more lines in same style......
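What I want is:
Read the file above, turning lines 1, 3, 5, 7 ... into dictionary keys and lines 2, 4, 6, 8 ... into the corresponding dictionary values.
My code is: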
import linecache

query_dict = {}
nameAt = 1
sequenceAt = 2
# filename and totalLines are defined elsewhere
while sequenceAt <= totalLines:
    line1 = linecache.getline(filename, nameAt)
    line2 = linecache.getline(filename, sequenceAt)
    query_dict[line1] = line2
    nameAt = nameAt + 2
    sequenceAt = sequenceAt + 2
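The code works, but it is slow, because my text files have at least 200,000 lines. Does anyone have a better way to do this?
Thanks a lot.
============== Follow-up question added ==================
Here is the FASTQ format, where each read is a 4-line record: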
@>NM_052972.2:11:1054:1780:889
CTTCGACATCTCCGGCAACCCCTGGATCTG
+>NM_052972.2:11:1054:1780:889
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@>NM_080660.3:12:914:1802:542
CCTGTATGGCTACTGCAACCTCAAGGATAA
+>NM_080660.3:12:914:1802:542
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@>NM_176814.3:712:2706:4242:98
ACAGAGTAAAAGAGAGGCTGACTTAATAAA
+>NM_176814.3:712:2706:4242:98
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII
...... more lines in same style ......
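I want to create a dictionary where, for each 4-line record, the key is line 1 and the value is line 2.
The dictionary should look like: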
{'@>NM_052972.2:11:1054:1780:889':'CTTCGACATCTCCGGCAACCCCTGGATCTG',
'@>NM_080660.3:12:914:1802:542':'CCTGTATGGCTACTGCAACCTCAAGGATAA',
'@>NM_176814.3:712:2706:4242:98':'ACAGAGTAAAAGAGAGGCTGACTTAATAAA',
..... more keys and values ......
}
Thanks.
Answer 0 (score: 5)
Something like this:
with open('filename') as f:
    query_dict = {line.strip(): next(f).strip() for line in f}
Output:
>>> from pprint import pprint
>>> pprint(query_dict)
{'>NM_000614.3:1086': 'ATTCAATTTAAAATCAGACTCTTTAGTTGA',
'>NM_001198858.1:490': 'CAACCACCACAACCTGCTGGTCTGCTCGGT',
'>NM_012096.2:2808': 'CAGTTAAGGTTTCAAATTGTGGCAGGTGGT',
'>NM_145914.2:212': 'TCTGATGGTAAAAGTCGAGGAGAAAGAAGA',
'>NM_173465.3:1682': 'GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC'}
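One caveat with the comprehension above: if the file ends on a header line with no sequence after it, `next(f)` raises `StopIteration`. Below is a minimal sketch of a more forgiving variant. `pairs_to_dict` is a hypothetical helper name, and `io.StringIO` only stands in for the open file so the example is self-contained.

```python
import io

def pairs_to_dict(lines):
    """Pair each odd line (key) with the line that follows it (value)."""
    it = iter(lines)
    # next(it, '') supplies an empty value if the input ends on a header line,
    # instead of raising StopIteration.
    return {header.strip(): next(it, '').strip() for header in it}

# Stand-in for open('filename'); the sample lines are copied from the question.
sample = io.StringIO(
    ">NM_145914.2:212\n"
    "TCTGATGGTAAAAGTCGAGGAGAAAGAAGA\n"
    ">NM_000614.3:1086\n"
    "ATTCAATTTAAAATCAGACTCTTTAGTTGA\n"
)
query_dict = pairs_to_dict(sample)
print(query_dict['>NM_145914.2:212'])
```

The file is still read once, front to back, so this should scale to the 200,000-line case in the same way as the one-liner.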
Update (for the 4-line FASTQ records):
with open('foo.txt') as f:
    dic = {}
    for line in f:
        dic[line.strip()] = next(f).strip()
        next(f); next(f)  # Drop the next two lines ('+' line and quality line)
from pprint import pprint
pprint(dic)
Output:
{'@>NM_052972.2:11:1054:1780:889': 'CTTCGACATCTCCGGCAACCCCTGGATCTG',
'@>NM_080660.3:12:914:1802:542': 'CCTGTATGGCTACTGCAACCTCAAGGATAA',
'@>NM_176814.3:712:2706:4242:98': 'ACAGAGTAAAAGAGAGGCTGACTTAATAAA'}
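The loop above assumes the file contains only complete 4-line records; a truncated last record would make one of the `next(f)` calls raise `StopIteration`. As a sketch of an alternative, the records can be read in chunks of four with `itertools.islice`. `fastq_to_dict` is a hypothetical name, and the in-memory list stands in for the file.

```python
from itertools import islice

def fastq_to_dict(lines):
    """Map line 1 of each 4-line record to line 2 of the same record."""
    it = iter(lines)
    result = {}
    while True:
        # Take up to four lines from the shared iterator.
        record = [line.strip() for line in islice(it, 4)]
        if len(record) < 4:
            break  # end of input, or a truncated final record (ignored)
        header, sequence = record[0], record[1]
        result[header] = sequence
    return result

# Sample lines copied from the question; a real call would pass the open file.
sample = [
    "@>NM_052972.2:11:1054:1780:889\n",
    "CTTCGACATCTCCGGCAACCCCTGGATCTG\n",
    "+>NM_052972.2:11:1054:1780:889\n",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n",
    "@>NM_080660.3:12:914:1802:542\n",
    "CCTGTATGGCTACTGCAACCTCAAGGATAA\n",
    "+>NM_080660.3:12:914:1802:542\n",
    "IIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n",
]
dic = fastq_to_dict(sample)
print(dic)
```

Silently ignoring a truncated record is a design choice made for this sketch; raising an error instead may be preferable when the input is expected to be well formed.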
Answer 1 (score: 5)
This is a FASTA file. Install Biopython (pip install biopython) and parse it:
from Bio import SeqIO
with open('filename.fasta') as handle:  # the old 'rU' mode was removed in Python 3.11
    for record in SeqIO.parse(handle, 'fasta'):
        print(record)
Look at that readable output:
ID: NM_145914.2:212
Name: NM_145914.2:212
Description: NM_145914.2:212
Number of features: 0
Seq('TCTGATGGTAAAAGTCGAGGAGAAAGAAGA', SingleLetterAlphabet())
...
Answer 2 (score: 3)
Or, instead of a dict comprehension:
# On Python 2, use itertools.izip here; Python 3's built-in zip is already lazy.
with open('somefile') as fin:
    lines = (line.strip() for line in fin)
    query_dict = dict(zip(lines, lines))
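The trick here is that the same iterator is passed to zip twice, so each output tuple consumes two consecutive lines. A small sketch of that idea, generalized to groups of `n` lines, is shown below. `grouped` is a hypothetical helper name, and Python 3 is assumed (where the built-in `zip` is lazy).

```python
def grouped(iterable, n):
    """Yield consecutive n-tuples; leftover items that do not fill a tuple are dropped."""
    it = iter(iterable)
    # [it] * n holds n references to the SAME iterator,
    # so zip advances it n times to build each tuple.
    return zip(*[it] * n)

lines = [">a", "AAA", ">b", "CCC"]
pairs = dict(grouped(lines, 2))
print(pairs)  # {'>a': 'AAA', '>b': 'CCC'}

# With n=4 the same helper covers 4-line records: keep items 0 and 1 of each tuple.
records = ["@r1", "ACGT", "+r1", "IIII", "@r2", "TTGA", "+r2", "IIII"]
by_header = {rec[0]: rec[1] for rec in grouped(records, 4)}
```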
Answer 3 (score: 1)
>>> s = """>NM_145914.2:212
... TCTGATGGTAAAAGTCGAGGAGAAAGAAGA
... >NM_000614.3:1086
... ATTCAATTTAAAATCAGACTCTTTAGTTGA
... >NM_012096.2:2808
... CAGTTAAGGTTTCAAATTGTGGCAGGTGGT
... >NM_173465.3:1682
... GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC
... >NM_001198858.1:490
... CAACCACCACAACCTGCTGGTCTGCTCGGT""".splitlines()
>>> {i: j for i, j in zip(s[::2], s[1::2])}
{'>NM_145914.2:212': 'TCTGATGGTAAAAGTCGAGGAGAAAGAAGA', '>NM_000614.3:1086': 'ATTCAATTTAAAATCAGACTCTTTAGTTGA', '>NM_001198858.1:490': 'CAACCACCACAACCTGCTGGTCTGCTCGGT', '>NM_012096.2:2808': 'CAGTTAAGGTTTCAAATTGTGGCAGGTGGT', '>NM_173465.3:1682': 'GTGCGTCGGGTGAGAGAGGCCCCAGCGGCC'}
If memory is a concern, use itertools.islice:
from itertools import islice
{i: j for i, j in zip(islice(s, 0, len(s), 2), islice(s, 1, len(s), 2))}