How to extract FASTA sequences from a genome FASTA based on a GFF file, then merge them under one transcript for output

Time: 2014-10-29 09:52:37

Tags: python merge biopython fasta

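Thanks in advance for your help. I want to extract the sequences of specific introns from a genome FASTA, then merge those intron sequences with the CDS sequences of the same transcript and write out one record per transcript. Can I do this with Biopython or plain Python?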

An example of my GFF file:

1   ensembl intron  7904    9192    .   -   .   Parent=GRMZM2G059865_T01;Name=intron.71462
1   ensembl intron  6518    6638    .   -   .   Parent=GRMZM2G059865_T01;Name=intron.71465
1   ensembl intron  6266    6361    .   -   .   Parent=GRMZM2G059865_T01;Name=intron.71466
1   ensembl intron  5976    6107    .   -   .   Parent=GRMZM2G059865_T01;Name=intron.71467
1   ensembl intron  5189    5341    .   -   .   Parent=GRMZM2G059865_T01;Name=intron.71469
1   ensembl CDS 9193    9519    .   -   .   Parent=GRMZM2G059865_T01;Name=CDS.71479
1   ensembl CDS 7594    7903    .   -   0   Parent=GRMZM2G059865_T01;Name=CDS.71480
1   ensembl CDS 6918    7120    .   -   1   Parent=GRMZM2G059865_T01;Name=CDS.71481
1   ensembl CDS 6639    6797    .   -   0   Parent=GRMZM2G059865_T01;Name=CDS.71482
1   ensembl CDS 6362    6517    .   -   0   Parent=GRMZM2G059865_T01;Name=CDS.71483
1   ensembl CDS 6108    6265    .   -   0   Parent=GRMZM2G059865_T01;Name=CDS.71484
1   ensembl CDS 5857    5975    .   -   2   Parent=GRMZM2G059865_T01;Name=CDS.71485
1   ensembl CDS 5342    5407    .   -   1   Parent=GRMZM2G059865_T01;Name=CDS.71486
1   ensembl CDS 5127    5188    .   -   1   Parent=GRMZM2G059865_T01;Name=CDS.71487
1   ensembl intron  39443409    39443716    .   +   .   Parent=GRMZM2G441511_T01;Name=intron.100057
1   ensembl intron  39445109    39445314    .   +   .   Parent=GRMZM2G441511_T01;Name=intron.100061
1   ensembl intron  39450586    39450706    .   +   .   Parent=GRMZM2G441511_T01;Name=intron.100066
1   ensembl CDS 39443355    39443408    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100082    
1   ensembl CDS 39443717    39443785    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100083
1   ensembl CDS 39444013    39444161    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100084
1   ensembl CDS 39444634    39444721    .   +   2   Parent=GRMZM2G441511_T01;Name=CDS.100085
1   ensembl CDS 39445026    39445108    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100086
1   ensembl CDS 39445315    39445486    .   +   2   Parent=GRMZM2G441511_T01;Name=CDS.100087
1   ensembl CDS 39447442    39447548    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100088
1   ensembl CDS 39449775    39449850    .   +   2   Parent=GRMZM2G441511_T01;Name=CDS.100089
1   ensembl CDS 39449938    39450049    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100090
1   ensembl CDS 39450433    39450585    .   +   1   Parent=GRMZM2G441511_T01;Name=CDS.100091
1   ensembl CDS 39450707    39450822    .   +   1   Parent=GRMZM2G441511_T01;Name=CDS.100092
1   ensembl CDS 39450992    39451159    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100093
1   ensembl CDS 39451204    39451266    .   +   0   Parent=GRMZM2G441511_T01;Name=CDS.100094
........

1 answer:

Answer 0: (score: 0)

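The question is rather vague, so this answer is too. You can use a simple Seq object from Biopython: load the initial or source (complete genome?) sequence, then slice out the regions you want and concatenate them: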

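from Bio.Seq import Seq

# Bio.Alphabet was removed in Biopython 1.78, so Seq takes only the string.
seq = Seq("ATCAGCATCAGCATCGACTAGCATCGCATCAGC")
# Select this ^^^^^^^          ^^^

# Python slices are 0-based and end-exclusive; + concatenates the two pieces.
print(seq[3:10] + seq[20:23])
# AGCATCAGCA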
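
The same slice-and-concatenate idea can be applied per transcript. What follows is only a minimal sketch in plain Python, under several assumptions that are not in the question: the genome is already in a dict of chromosome name to sequence string, the GFF columns are whitespace-separated with no spaces inside the attribute column, features of one transcript should be joined in transcript order (descending coordinates and reverse-complemented on the minus strand), and every listed intron and CDS of a transcript is wanted. The function names (parse_gff, merge_features, to_fasta), the toy genome, and the transcripts T1 and T2 are all made up for illustration.

```python
from collections import defaultdict

# Translation table for complementing DNA (plain-Python stand-in for
# Biopython's Seq.reverse_complement()).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def parse_gff(lines, wanted=("intron", "CDS")):
    """Group (chrom, start, end, strand) tuples by their Parent attribute."""
    features = defaultdict(list)
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 9 or fields[2] not in wanted:
            continue
        chrom, start, end, strand, attrs = (
            fields[0], fields[3], fields[4], fields[6], fields[8])
        attributes = dict(
            item.split("=", 1) for item in attrs.split(";") if "=" in item)
        parent = attributes.get("Parent")
        if parent is not None:
            features[parent].append((chrom, int(start), int(end), strand))
    return features


def merge_features(genome, features):
    """Concatenate each transcript's features in transcript order."""
    merged = {}
    for parent, feats in features.items():
        strand = feats[0][3]  # assumes one strand per transcript
        ordered = sorted(feats, key=lambda f: f[1], reverse=(strand == "-"))
        pieces = []
        for chrom, start, end, _ in ordered:
            # GFF coordinates are 1-based and inclusive.
            piece = genome[chrom][start - 1:end]
            if strand == "-":
                piece = reverse_complement(piece)
            pieces.append(piece)
        merged[parent] = "".join(pieces)
    return merged


def to_fasta(merged, width=60):
    """Format the merged sequences as FASTA text, wrapped at `width`."""
    lines = []
    for name, seq in merged.items():
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


# Hypothetical toy data: a 24-base "chromosome 1" and two made-up transcripts.
genome = {"1": "AAAACCCCGGGGTTTTAAGGCACT"}
gff_text = """\
1 ensembl CDS 1 4 . + 0 Parent=T1;Name=CDS.1
1 ensembl intron 5 8 . + . Parent=T1;Name=intron.1
1 ensembl CDS 9 12 . + 0 Parent=T1;Name=CDS.2
1 ensembl intron 17 20 . - . Parent=T2;Name=intron.2
1 ensembl CDS 21 24 . - 0 Parent=T2;Name=CDS.3
"""

merged = merge_features(genome, parse_gff(gff_text.splitlines()))
print(to_fasta(merged), end="")
```

With Biopython installed, the genome dict could be built from a real file with `{rec.id: str(rec.seq) for rec in SeqIO.parse("genome.fa", "fasta")}` after `from Bio import SeqIO`, where "genome.fa" is a placeholder path. If only some introns are listed for a transcript, as in the question's example, the merged sequence simply skips the unlisted regions.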