Perl basics: separating FASTA headers and sequences

Date: 2014-02-10 13:36:34

Tags: arrays perl lines fasta

I have a FASTA file in a format like the following:

>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

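From each header I extract the ID, which is the 4th pipe-delimited (|) field (DU991381.1), and the info (e.g. KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence) by splitting the header on the pipe character.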
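
The header split described above can be sketched as a small helper. This is only an illustration: the name `parse_header` is hypothetical, and the limit of 5 passed to `split` is an added assumption so that any later pipe characters stay inside the description.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): split a FASTA
# header line into its ID (4th field) and info (5th field).
sub parse_header {
    my ($header) = @_;
    # A limit of 5 keeps any further '|' characters inside the info text.
    my @fields = split /\|/, $header, 5;
    my ($id, $info) = @fields[3, 4];
    $info =~ s/^\s+// if defined $info;   # drop the space after the last pipe
    return ($id, $info);
}

my ($id, $info) = parse_header(
    '>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library'
);
print "$id\n";     # DU991381.1
print "$info\n";   # KBrH087L18R KBrH (HindIII) BAC library
```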

However, I cannot work out how to join the lines of each sequence into a single line and push the result onto an array.

Here is my code:

#!/usr/bin/perl

use warnings;
use strict;

my $stData = "C:\\data.txt";

open (DATA, $stData);

my @stID   = "";
my @stInfo = "";
my @stSeq  = "";

while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
    }
    else #This is the part I cannot figure out
    {
        my $stSeq = $stLine.$stLine;
        print $stSeq,"\n";
    }
}


close (DATA);

So the result I want looks like this: an array in which each sequence is a single line.

@stSeq  = [ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC, GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF]

Please help! Cheers.

3 answers:

Answer 0 (score: 1)

...

my @stID   = (); # empty arrays
my @stInfo = ();
my @stSeq  = ();
my $tmpStSeq = ""; # temporary empty string


while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);

        if ($tmpStSeq ne "") { #previous seq is pushed when a next ">" is found
          push @stSeq, $tmpStSeq;
          $tmpStSeq = ""; # and previous accumulations are flushed
        }
        next; # adding this you avoid `else block`
    }
    # else {
    $tmpStSeq .= $stLine; # accumulates into temporary (of course, no `my` needed)
    # }        
}

if ($tmpStSeq ne "") { # pushes last seq into stSeq
  push @stSeq, $tmpStSeq;
  $tmpStSeq = "";
}

# here you can use your arrays.

...

Perhaps using an array of hashes would be better.
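
A minimal sketch of what an array of hashes could look like, assuming one hash per record with the keys `id`, `info` and `seq`. The helper name `read_fasta`, the key names and the short in-memory sample data are all made up for illustration. Because each sequence line is appended to the most recent record, no temporary buffer or final flush is needed.

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA lines from a filehandle and return a
# list of records, each a hash reference with id, info and seq keys.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';                  # skip blank lines
        if ($line =~ /^>/) {
            # A header starts a new record with an empty sequence.
            my ($id, $info) = (split /\|/, $line, 5)[3, 4];
            push @records, { id => $id, info => $info, seq => '' };
        }
        elsif (@records) {
            # Any other line extends the most recent record.
            $records[-1]{seq} .= $line;
        }
    }
    return @records;
}

# Made-up sample data read through an in-memory filehandle.
my $fasta = ">gi|1|gb|ID1.1| first\nACGT\nTTGA\n>gi|2|gb|ID2.1| second\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my @records = read_fasta($fh);
close $fh;

print "$_->{id}\t$_->{seq}\n" for @records;
```

Each record keeps its ID, info and sequence together, so the three parallel arrays cannot drift out of step.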

Answer 1 (score: 1)

I would do it this way:

use Data::Dump qw(dump);   # provides dump(); a CPAN module, not part of core Perl

my %data;
my ($id, $info);
while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        ($id, $info) = (split/\|/,$stLine)[3,4];
    }
    else {
        $data{$id}{$info} .= $stLine;
    }
}
dump %data;

__DATA__
>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF

Output:

(
  "DU991381.1",
  {
    " KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence" => "ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC",
  },
  "DU991380.1",
  {
    " KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence" => "GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF",
  },
)
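
A short sketch of how a nested hash of this shape (`$data{ID}{INFO} = SEQUENCE`) might be walked afterwards. The helper name `flatten_data` and the sample entries are hypothetical. A plain Perl hash does not keep its keys in file order, so the sketch sorts by ID. It also strips the leading space that the info text carries over from the header.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a hash of the form $data{ID}{INFO} = SEQ
# into a list of [id, info, seq] rows, sorted by ID.
sub flatten_data {
    my ($data) = @_;
    my @rows;
    for my $id (sort keys %$data) {
        for my $info (sort keys %{ $data->{$id} }) {
            (my $clean = $info) =~ s/^\s+//;   # info begins with a space
            push @rows, [ $id, $clean, $data->{$id}{$info} ];
        }
    }
    return @rows;
}

# Made-up stand-in for the %data hash, with shortened values.
my %data = (
    'ID2.1' => { ' second record' => 'GGCC' },
    'ID1.1' => { ' first record'  => 'ACGTTTGA' },
);

for my $row (flatten_data(\%data)) {
    my ($id, $info, $seq) = @$row;
    printf "%s\t%s\t%d bp\n", $id, $info, length $seq;
}
```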

Answer 2 (score: 0)

my $stSeq = "";                     # Declared once, outside the loop (required under strict).

while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
        print "$stSeq\n";           # You could print here to get the whole last
                                    # seq before you start a new one.
                                    # Of course this would be the one from before
                                    # the header parsed here.

        $stSeq = "";                # Start a new sequence when you hit a header
    }
    else #This is the part I cannot figure out
    {
        $stSeq = $stSeq.$stLine;    # No my here and append to the sequence so far.
        print $stSeq,"\n";          # This will of course only print the seq so far.
                                    # Probably the above print is better.
    }
}
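
For comparison, here is a sketch of a different technique that none of the answers above use: setting Perl's input record separator `$/` to `"\n>"`, so that each read returns one whole FASTA record. The last record is then handled like any other, with no flush after the loop. The helper name `read_fasta_records` and the sample data are hypothetical, and the sketch assumes Unix line endings.

```perl
use strict;
use warnings;

# Hypothetical helper: read whole FASTA records by changing the input
# record separator, and return references to three parallel arrays.
sub read_fasta_records {
    my ($fh) = @_;
    local $/ = "\n>";              # a record ends where the next header begins
    my (@ids, @infos, @seqs);
    while (my $record = <$fh>) {
        chomp $record;             # removes a trailing "\n>" if present
        $record =~ s/^>//;         # only the first record still has its '>'
        my ($header, @lines) = split /\n/, $record;
        next unless defined $header;
        my ($id, $info) = (split /\|/, $header, 5)[3, 4];
        push @ids,   $id;
        push @infos, $info;
        push @seqs,  join '', @lines;   # sequence lines joined into one string
    }
    return (\@ids, \@infos, \@seqs);
}

# Made-up sample data read through an in-memory filehandle.
my $fasta = ">gi|1|gb|ID1.1| first\nACGT\nTTGA\n>gi|2|gb|ID2.1| second\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my ($ids, $infos, $seqs) = read_fasta_records($fh);
close $fh;

print "$ids->[$_]\t$seqs->[$_]\n" for 0 .. $#$ids;
```

Using `local` restores `$/` when the sub returns, so other reads in the program keep their normal line-by-line behaviour.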