我在下面有这个perl脚本来计算序列长度及其频率以及核苷酸频率(A
,T
,G
和C
)。此脚本适用于具有大量序列的文件,但它不能为这样的小文件提供正确的结果:
infile.fasta
>NC_013116_1051_1114
TTGTCCCTTTGAGTCTCTGG
>NC_013116_1051_1114
GCGCAGCCGATATGGATAA
>NC_013116_1051_1114
TCGAGACTTTGTAATGTTTGGG
>NC_013116_1051_1114
TATTCCACGTCAGGTGCTTTT
>NC_013116_1051_1114
TAGAGCCGATTCCAGACTGTTCC
>NC_013116_1051_1114
TACAGGACCAAGCTCTTCACTC
>NC_013116_1051_1114
CCGTCAAGTTCAGCTCCAATAA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_6_301
GGACACTTCCAACTATAAATA
>NC_013116_6_301
CCACGCAACGGACAATCAAACA
>NC_013116_1051_1114
GCTCTTCACTCTTCCTCGTCT
>NC_013116_1051_1114
TTGGGAAAAAGAAGTTGCTGCAGC
>NC_013116_1051_1114
TCGCAGTATCTCTGAAGTTG
count.pl
#!/usr/bin/perl -w
#usage ./count.pl infile min_length max_length
#usage ./count.pl infile 18 34
my $min_len = $ARGV[1];
my $max_len = $ARGV[2];
my $read_len = 0;
my @lines = ("header1","sequence","header2","quality");
my @lray = ();
my $count = 0;
my $total = 0;
my $i = 0;
my @Aray = ();
my @Cray = ();
my @Gray = ();
my @Tray = ();
my$FN = "";
for ($i=$min_len; $i<=$max_len; $i++){
$lray[$i] = 0;
}
open (INFILE, "<$ARGV[0]") || die "couldn't open input file!";
while (<INFILE>) {
$lines[$count] = $_;
chomp($lines[$count]);
$count++;
if($count eq 4){
$read_len = length($lines[1]);
# print "$read_len $lines[1]\n";
$FN = substr $lines[1], 0, 1;
$lray[$read_len]++;
if ($FN eq "T") { $Tray[$read_len]++;}
else {
if ($FN eq "A"){ $Aray[$read_len]++;}
else {
if ($FN eq "C"){ $Cray[$read_len]++;}
else {
if ($FN eq "G"){ $Gray[$read_len]++;}
}
}
}
$count = 0;
}
}
print "length\tnumber\tA\tC\tG\tT\n";
for ($i=$min_len; $i<=$max_len; $i++){
print "$i\t$lray[$i]\t$Aray[$i]\t$Cray[$i]\t$Gray[$i]\t$Tray[$i]\n";
}
exit;
这是我从一个包含许多序列的大文件中得到的结果类型。
length number A C G T
18 4473 542 710 471 2750
19 12647 990 1680 1103 8874
20 31194 3010 3354 2743 22087
21 61214 6288 7196 5784 41946
22 128642 14596 11902 12518 89626
23 65190 6859 6525 7773 44033
24 10012 611 1401 1112 6888
25 1406 231 192 435 548
26 661 169 91 105 296
27 407 126 81 65 135
28 602 391 49 68 94
29 520 54 30 370 66
30 175 26 93 18 38
31 156 35 28 29 64
32 106 22 16 24 44
33 97 45 17 16 19
34 0
如果您能帮助我更正此代码,我将非常感激。谢谢
答案 0 :(得分:3)
尝试不重新发明轮子,因此,使用FAST模块得到:
use 5.014;
use warnings;
use FAST::Bio::SeqIO;
my $fasta = FAST::Bio::SeqIO->new(-file => "infile.fasta", -format => 'Fasta');
my $seqnum=0;
while ( my $seq = $fasta->next_seq() ) {
my $stats;
$stats->{len} = length($seq->seq);
$stats->{$_}++ for split //, $seq->seq;
say ++$seqnum, " @$stats{qw(len A C G T)}";
}
以上,对于您的演示infile.fasta
打印:
1 20 1 5 5 9
2 19 6 4 6 3
3 22 4 2 7 9
4 21 3 5 4 9
5 23 5 7 5 6
6 22 6 8 3 5
7 22 7 7 3 5
8 22 10 8 3 1
9 21 9 5 2 5
10 22 10 8 3 1
11 21 1 9 2 9
12 24 8 3 8 5
13 20 4 4 5 7
或
use 5.014;
use warnings;
use FAST::Bio::SeqIO;
my $fasta = FAST::Bio::SeqIO->new(-file => "file.fasta", -format => 'Fasta');
my $stats;
while ( my $seq = $fasta->next_seq() ) {
my $len = length($seq->seq);
$stats->{$len}{count}++;
$stats->{$len}{$_}++ for split //, $seq->seq;
}
say "Length $_ ($stats->{$_}->{count} times) Letters freq: @{$stats->{$_}}{qw(A C G T)}" for sort { $a <=> $b } keys %$stats;
产生
Length 19 (1 times) Letters freq: 6 4 6 3
Length 20 (2 times) Letters freq: 5 9 10 16
Length 21 (3 times) Letters freq: 13 19 8 23
Length 22 (5 times) Letters freq: 37 33 19 21
Length 23 (1 times) Letters freq: 5 7 5 6
Length 24 (1 times) Letters freq: 8 3 8 5
依旧......