我有一个包含多个序列的FASTA文件,如下所示:
> AT1G01250.1 | Symbols: | Integrase-type DNA-binding superfamily protein
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEI
REPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQV
AAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVD
GDLYQYQFMACL
> AT1G03800.1 | Symbols: ERF10, ATERF10 | ERF domain protein 10
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRP
WGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSAT
PVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAV
AGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLD
LNASP
我想从一个范围内的文件中提取不同序列的不同部分,如下所示:
AT1G01250 45 102
AT1G03800 65 109
我发现了一个基于Python的程序atgc-tools,但是采用任何输入文件格式都非常挑剔,所以对大型数据集来说不方便。任何人都可以建议使用基于Perl的解决方案吗?
答案 0 :(得分:0)
使用Biopython很容易完成此任务:
from Bio import SeqIO
records = SeqIO.parse('sequences.fasta', 'fasta')
record_dict = SeqIO.to_dict(records) # dict of records, accessible by accession
ranges_by_accession = {'AT1G01250.1': (45, 102),
'AT1G03800.1': (65, 109)}
for accession, (start, end) in ranges_by_accession.items():
record = record_dict[accession]
subseq_record = record[start:end] # For the full record (with header)
subseq = record.seq[start:end] # For only the subsequence
print subseq