I have a FASTA file in a format like the following:
>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF
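By splitting the header on the pipe character (|), I extract the ID from the fourth field (DU991381.1) and the description from the field after it (e.g. KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence).
However, I cannot work out how to join each sequence into a single line and push it into an array.
Here is my code: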
#!/usr/bin/perl
use warnings;
use strict;
my $stData = "C:\\data.txt";
open (DATA, $stData);
my @stID = "";
my @stInfo = "";
my @stSeq = "";
while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
    }
    else #This is the part I cannot figure out
    {
        my $stSeq = $stLine.$stLine;
        print $stSeq,"\n";
    }
}
close (DATA);
So the result I want looks like this: an array in which each sequence is a single line.
@stSeq = [ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC, GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF]
Please help me! Cheers.
Answer 0 (score: 1)
...
my @stID = ();       # empty arrays
my @stInfo = ();
my @stSeq = ();
my $tmpStSeq = "";   # temporary empty string
while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
        if ($tmpStSeq ne "") {      # previous seq is pushed when a next ">" is found
            push @stSeq, $tmpStSeq;
            $tmpStSeq = "";         # and previous accumulations are flushed
        }
        next;                       # adding this you avoid `else block`
    }
    # else {
    $tmpStSeq .= $stLine;           # accumulates into temporary (of course, no `my` needed)
    # }
}
if ($tmpStSeq ne "") {              # pushes last seq into stSeq
    push @stSeq, $tmpStSeq;
    $tmpStSeq = "";
}
# here you can use your arrays.
...
Maybe using an array of hashes would be better.
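As a rough illustration of the array-of-hashes idea, here is a minimal sketch. The `@lines` sample data, the `@records` name, and the `id`/`info`/`seq` keys are made up for this example and are not part of the question's code; in practice the lines would be read from the FASTA file.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines of the FASTA file.
my @lines = (
    ">gi|1|gb|ID_A.1| first record",
    "ATGC",
    "GGTA",
    ">gi|2|gb|ID_B.1| second record",
    "TTAA",
);

my @records;    # array of hashes: one { id, info, seq } hash per FASTA entry

for my $line (@lines) {
    if ($line =~ /^>/) {
        # Header line: start a new record with an empty sequence.
        my ($id, $info) = (split /\|/, $line)[3, 4];
        $info =~ s/^\s+// if defined $info;    # drop the space after the last pipe
        push @records, { id => $id, info => $info, seq => "" };
    }
    elsif (@records) {
        # Sequence line: append it to the most recently started record.
        $records[-1]{seq} .= $line;
    }
}

print "$_->{id}\t$_->{seq}\n" for @records;

Keeping the ID, description, and sequence in one hash per entry means the three values cannot drift out of step, which can happen with three parallel arrays if an entry has no sequence lines.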
Answer 1 (score: 1)
I would do it like this:
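use Data::Dump qw(dump);    # provides dump(), used at the end

my %data;
my ($id, $info);
while (my $stLine = <DATA>) {
    chomp($stLine);
    if ($stLine =~ /^>+/) {
        # header line: remember the current ID and description
        ($id, $info) = (split/\|/,$stLine)[3,4];
    }
    else {
        # sequence line: append to the entry for the current header
        $data{$id}{$info} .= $stLine;
    }
}
dump %data;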
__DATA__
>gi|84341511|gb|DU991381.1| KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence
ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGT
AAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGG
AGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTAC
GAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAA
CTTCAAAGTAAACAATGTGTTGTGTC
>gi|84341510|gb|DU991380.1| KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence
GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTA
TCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCC
CAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTT
ATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF
Output:
(
"DU991381.1",
{
" KBrH087L18R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087L18, genomic survey sequence" => "ATGCCTTCAATGTCAAAGGCCGGTGTGATTTTGATCTTTACTTCAATCAGGTTCTCTTCTTTTCTTTGGTAAGAACTTTTGTTCAGTTTATTTTGATCCTTACATGCTTCGTTTTGTGCTTTACAGAGGAACCCTATAGGAGCTGCAGAGTTTGCCTGGAACATAATGAATTTTAAGGAAGATCAGGATGTTAGGATCAAAGTTGGCTACGAAATGTTTGATAAGGTATCTCTCTCTCTCTCTCTCTCTCTAGTAGCTGAAGCATGTTAACTGTTCCAAACTTCAAAGTAAACAATGTGTTGTGTC",
},
"DU991380.1",
{
" KBrH087D08R KBrH (HindIII) BAC library Brassica rapa subsp. pekinensis Brassica rapa subsp. pekinensis genomic clone KBrH087D08, genomic survey sequence" => "GGATAACTTCTTCTTGCCAACTCCTATGAGATTTATTCAACTTCCTGGTGATTCTCCACCACTTTATGTATCCAAATCAAGTTTCTCACAAAGTGAGTCATCCTGGTTTGATTGGAACGACGAAGAATCTGTCTCATTCCCAAACTGGGAAACTGGAATCACCTGATTTCAAAGTGGGATAACTTCGTCTTGCTAACTCCTATGATATTTATTCAACTTCCTGGTGATTCTCCACCAGTTTATEESSF",
},
)
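If the flat arrays from the question are still needed, they can be rebuilt from a hash shaped like this one. The following is only a sketch: the small `%data` literal is a hypothetical stand-in for the hash built above, and the keys are sorted because Perl hashes do not preserve insertion order.

use strict;
use warnings;

# Hypothetical stand-in for %data (id => { info => sequence }).
my %data = (
    "ID_A.1" => { " first record"  => "ATGCGGTA" },
    "ID_B.1" => { " second record" => "TTAA" },
);

my (@stID, @stInfo, @stSeq);

# Sort the keys so the three arrays come out in a stable, repeatable order.
for my $id (sort keys %data) {
    for my $info (sort keys %{ $data{$id} }) {
        push @stID,   $id;
        push @stInfo, $info;
        push @stSeq,  $data{$id}{$info};    # each sequence is already one line
    }
}

print "$stID[$_]: $stSeq[$_]\n" for 0 .. $#stSeq;

Note that the sorted order is not necessarily the order of the entries in the file; if file order matters, an array-based structure is the safer choice.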
Answer 2 (score: 0)
my $stSeq = "";    # declare the accumulator once, before the loop
while (my $stLine = <DATA>)
{
    chomp($stLine);
    if ($stLine =~ /^>+/)
    {
        my @lArray = split/\|/,$stLine;
        push (@stID,$lArray[3]);
        push (@stInfo,$lArray[4]);
        print "$stSeq\n" if length $stSeq;  # You could print here to get the whole last
                                            # seq before you start a new one.
                                            # Of course this would be the one from before
                                            # the header parsed here.
        $stSeq = "";                        # Start a new sequence when you hit a header
    }
    else #This is the part I cannot figure out
    {
        $stSeq = $stSeq.$stLine;   # No my here and append to the sequence so far.
        print $stSeq,"\n";         # This will of course only print the seq so far.
                                   # Probably the above print is better.
    }
}
print "$stSeq\n" if length $stSeq; # The final sequence has no following header to print it.
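An alternative to accumulating line by line, offered only as a sketch, is to change Perl's input record separator `$/` so that each read returns one whole FASTA entry. The in-memory `$fasta` string and the `$fh` handle below are assumptions made to keep the example self-contained; a real script would open the FASTA file instead.

use strict;
use warnings;

# Hypothetical FASTA text, opened as an in-memory filehandle for this example.
my $fasta = ">gi|1|gb|ID_A.1| first record\nATGC\nGGTA\n"
          . ">gi|2|gb|ID_B.1| second record\nTTAA\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";

my (@stID, @stInfo, @stSeq);
{
    local $/ = "\n>";                 # read up to the start of the next header
    while (my $record = <$fh>) {
        chomp $record;                # removes a trailing "\n>" if present
        $record =~ s/^>//;            # only the first record still starts with ">"
        my ($header, @seqLines) = split /\n/, $record;
        my ($id, $info) = (split /\|/, $header)[3, 4];
        push @stID,   $id;
        push @stInfo, $info;
        push @stSeq,  join("", @seqLines);   # sequence as a single line
    }
}
close $fh;

print "$_\n" for @stSeq;

Because `$/` is set with `local` inside a block, the default line-by-line behaviour is restored afterwards. This sketch assumes Unix line endings; files with Windows line endings would also need the carriage returns removed.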