提取FASTA序列子集

时间:2013-10-03 12:21:43

标签: perl bioinformatics fasta

我有一个包含多个序列的FASTA文件,如下所示:

> AT1G01250.1 | Symbols: | Integrase-type DNA-binding superfamily protein
MSPQRMKLSSPPVTNNEPTATASAVKSCGGGGKETSSSTTRHPVYHGVRKRRWGKWVSEI
REPRKKSRIWLGSFPVPEMAAKAYDVAAFCLKGRKAQLNFPEEIEDLPRPSTCTPRDIQV
AAAKAANAVKIIKMGDDDVAGIDDGDDFWEGIELPELMMSGGGWSPEPFVAGDDATWLVD
GDLYQYQFMACL

> AT1G03800.1 | Symbols: ERF10, ATERF10 | ERF domain protein 10
MTTEKENVTTAVAVKDGGEKSKEVSDKGVKKRKNVTKALAVNDGGEKSKEVRYRGVRRRP
WGRYAAEIRDPVKKKRVWLGSFNTGEEAARAYDSAAIRFRGSKATTNFPLIGYYGISSAT
PVNNNLSETVSDGNANLPLVGDDGNALASPVNNTLSETARDGTLPSDCHDMLSPGVAEAV
AGFFLDLPEVIALKEELDRVCPDQFESIDMGLTIGPQTAVEEPETSSAVDCKLRMEPDLD
LNASP

我想从一个范围内的文件中提取不同序列的不同部分,如下所示:

AT1G01250   45  102
AT1G03800   65  109

我发现了一个基于Python的程序atgc-tools,但是采用任何输入文件格式都非常挑剔,所以对大型数据集来说不方便。任何人都可以建议使用基于Perl的解决方案吗?

1 个答案:

答案 0 :(得分:0)

使用Biopython很容易完成此任务:

from Bio import SeqIO


records = SeqIO.parse('sequences.fasta', 'fasta')
record_dict = SeqIO.to_dict(records)  # dict of records, accessible by accession

ranges_by_accession = {'AT1G01250.1': (45, 102),
                       'AT1G03800.1': (65, 109)}

for accession, (start, end) in ranges_by_accession.items():
    record = record_dict[accession]
    subseq_record = record[start:end]  # For the full record (with header)
    subseq = record.seq[start:end]  # For only the subsequence
    print subseq

如果您更喜欢Perl,我建议您查看BioPerl及其相应的模块SeqIO