File:
>1
ATTTTttttGGGG
ccCgCgGAgggGGT
gggggttttTTTTTTTTT
>2
ATcggGGGGGGA
>3
ATCGGGGGGATTT
gggggttAGTAttt
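I am building a function that reads a file in this format. The file holds several records, each introduced by a header line made of '>' plus a name (for example '>1', '>2').
I am trying to collect the header lines that start with '>' and to join the sequence lines of each section into a single string,
so the result would look like this: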
name_list = ['>1','>2','>3']
sequence_list = ['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT','ATcggGGGGGGA','ATCGGGGGGATTTgggggttAGTAttt']
import os
import re
# Open File
in_file=open(FASTA,'r')
dir,file=os.path.split(FASTA)
temp = os.path.join(dir,output)
out_file=open(temp,'w')
# Generating lines
lines = []
name_list = []
seq_list = []
for line in in_file:
    line = line.strip()
    lines.append(line)
in_file.close()
indx = range(0,len(lines))
# Organizing the elements
for line in lines:
    for i in line:
        if i == '>':
            name_list.append(line)
        else:
            break
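I don't know what to do in the else: branch. I tried creating an index with range(0, len(lines)), thinking I could find each '>' and join all of the lines at the following indices until the next '>' is found, then add the result to a list called seq_list.
Any help would be greatly appreciated.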
Answer 0 (score: 2)
You should look at Biopython, which has a FASTA parser, but here is an example using the standard library:
import re
with open('filename') as f:
    # Split on the '>' + digits headers, drop empty pieces, and strip newlines.
    print([i.replace('\n', '') for i in re.split(r'>\d+', f.read()) if i])
Output:
['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT',
'ATcggGGGGGGA',
'ATCGGGGGGATTTgggggttAGTAttt']
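The split above discards the header names, while the question asks for both lists. As a rough sketch, a regex with two capture groups can keep them together. The names fasta_text and record_pattern are made up for illustration, the sample text is copied from the question, and the pattern assumes that '>' never appears inside a sequence line:

```python
import re

# Sample text copied from the question; in practice this would come from f.read().
fasta_text = """>1
ATTTTttttGGGG
ccCgCgGAgggGGT
gggggttttTTTTTTTTT
>2
ATcggGGGGGGA
>3
ATCGGGGGGATTT
gggggttAGTAttt
"""

# Group 1: a header line starting with '>'.
# Group 2: everything after that line up to the next '>' (or the end of the text).
record_pattern = re.compile(r'^(>[^\n]*)\n([^>]*)', re.MULTILINE)

name_list = []
sequence_list = []
for header, body in record_pattern.findall(fasta_text):
    name_list.append(header.strip())
    # split() with no arguments drops newlines and any stray whitespace.
    sequence_list.append(''.join(body.split()))

print(name_list)
print(sequence_list)
```

Unlike the r'>\d+' split, this pattern does not require the names to be numeric, which may matter for real FASTA headers.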
Using Biopython (sudo pip install biopython):
from Bio import SeqIO
# The "rU" mode was removed in Python 3.11; the default text mode is enough.
with open("example.fasta") as handle:
    print(list(SeqIO.parse(handle, "fasta")))
Output:
[SeqRecord(seq=Seq('ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT', SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]),
SeqRecord(seq=Seq('ATcggGGGGGGA', SingleLetterAlphabet()), id='2', name='2', description='2', dbxrefs=[]),
SeqRecord(seq=Seq('ATCGGGGGGATTTgggggttAGTAttt', SingleLetterAlphabet()), id='3', name='3', description='3', dbxrefs=[])]
Answer 1 (score: 1)
A dictionary makes life easier:
>>> d = {}
>>> with open('t.txt') as f:
...     for line in f:
...         if line.startswith('>'):
...             key = line.strip()
...             if key not in d:
...                 d[key] = []
...         else:
...             d[key].append(line.strip())
...
>>> d
>>> d
{'>1': ['ATTTTttttGGGG', 'ccCgCgGAgggGGT', 'gggggttttTTTTTTTTT'],
'>2': ['ATcggGGGGGGA'], '>3': ['ATCGGGGGGATTT', 'gggggttAGTAttt']}
>>> sequence_list = [''.join(k) for k in d.values()]
>>> sequence_list
['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT',
'ATcggGGGGGGA', 'ATCGGGGGGATTTgggggttAGTAttt']
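For completeness, here is a minimal sketch of the same idea wrapped in a function that returns the two lists the question asks for. The function name parse_fasta and the sample variable are hypothetical, not part of any library. It uses parallel lists instead of a dictionary, so records stay in file order on any Python version (plain dicts only guarantee insertion order from Python 3.7 on) and repeated header names are not merged:

```python
def parse_fasta(lines):
    """Return (name_list, sequence_list) from an iterable of FASTA lines."""
    name_list = []
    chunks = []  # one list of sequence fragments per header, in file order
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A header opens a new record.
            name_list.append(line)
            chunks.append([])
        elif chunks:
            # A sequence line belongs to the most recent header;
            # lines before the first header are ignored.
            chunks[-1].append(line)
    sequence_list = [''.join(parts) for parts in chunks]
    return name_list, sequence_list


# Sample lines copied from the question; an open file object works the same way.
sample = """>1
ATTTTttttGGGG
ccCgCgGAgggGGT
gggggttttTTTTTTTTT
>2
ATcggGGGGGGA
>3
ATCGGGGGGATTT
gggggttAGTAttt
""".splitlines()

name_list, sequence_list = parse_fasta(sample)
print(name_list)
print(sequence_list)
```

With a real file, the call would look like name_list, sequence_list = parse_fasta(open('t.txt')), ideally inside a with block so the file is closed afterwards.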