I have a DNA file in the following format:
>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC
How can I read this file and extract just the DNA sequence (ACCAGAGCGG...) without any line breaks, for example:
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA
Ideally without needing a regular expression?
Answer 0: (score: 8)
If there is always exactly one header line:
dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)
where text is the contents of the file (e.g., text = open('yourfile').read()).
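A slightly more defensive variant of the same idea could look like the sketch below. It assumes the file holds a single record; the function name read_single_fasta is hypothetical and not part of the original answer. Rather than slicing off the first line, it skips any line that starts with '>' and strips trailing whitespace, so Windows line endings and a trailing blank line do not end up in the result.

```python
def read_single_fasta(path):
    """Return the sequence of a single-record FASTA-style file as one string.

    Hypothetical helper for illustration; assumes one record per file.
    """
    parts = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # Skip the '>' header line and any blank lines.
            if not line or line.startswith(">"):
                continue
            parts.append(line)
    return "".join(parts)
```

Calling read_single_fasta('yourfile') should then give the same string as the two-line version above for a well-formed file with Unix line endings.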
Answer 1: (score: 3)
I ran some tests, and the following appears to be faster than delroth's answer:
text.split('\n', 1)[1].replace('\n', '')
Edit: wait, it isn't that simple. I timed each of the two methods twice on a ~30 MB file, using both Python 2.6.4 and 3.1.1:
Python 2.6.4, my version:
$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 221 msec per loop
$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 219 msec per loop
Python 2.6.4, delroth's version:
$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 392 msec per loop
$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 390 msec per loop
Python 3.1.1, my version:
$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 803 msec per loop
$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 798 msec per loop
Python 3.1.1, delroth's version:
$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop
$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop
Conclusion: apart from Python 3 being slower overall in these runs, which of the two snippets is faster depends on the Python version!
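To repeat the comparison on another interpreter, something like the following sketch may help. It uses the timeit module from inside a script rather than the command line, and it times the two expressions on a synthetic in-memory string (an assumption: the header and sequence lines are made up), so file reading is excluded and the numbers will not be directly comparable to the ones above. The function names join_version and replace_version are labels chosen here for illustration.

```python
import timeit

# Synthetic stand-in for the file contents: one header line followed by
# many 60-character sequence lines (made-up data, not the original file).
text = ">gi|example| hypothetical header\n" + ("ACGTACGTAC" * 6 + "\n") * 10000


def join_version(text):
    # delroth's approach: split into lines, drop the header, join the rest.
    return "".join(text.split("\n")[1:])


def replace_version(text):
    # Second answer's approach: split off the header once, then delete newlines.
    return text.split("\n", 1)[1].replace("\n", "")


# Both approaches should produce the same sequence on this input.
assert join_version(text) == replace_version(text)

for func in (join_version, replace_version):
    # Best of 3 repeats, 10 calls per repeat.
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3))
    print(f"{func.__name__}: {best / 10 * 1000:.2f} ms per call")
```

Which version comes out ahead may well differ between interpreter versions and machines, as the timings above already suggest.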