Counting nucleotide frequencies in a fasta file using Perl

Time: 2014-01-13 12:17:45

Tags: perl fasta

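Please help me improve the following code. I am unable to print the sequence on a single line. I would like the output printed as four lines, each giving the frequency of one of the four nucleotide characters. Thanks in advance.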

#!/usr/bin/perl
use strict;
use warnings;
my $A;    
my $T;
my $G;
my $C;
my $fileIN;
my $fileOUT;

my $seq ;
open ($fileIN, "basecount.nfasta") or die "can't open file ";
open ($fileOUT, ">basecount.out") or die "can't open file ";

while (<$fileIN>)
{

             if ($_ =~/^>/)  #ignore header line
             {next;}

             else
                   {
                    $seq  = $_; #copy the all line with only nucleotide characters ATGC
                   }
            $seq  =~ s/\n//g; #create one single line containing all ATGC characters

             print "$seq\n"; # verify previous step

             my @dna = split ("",$seq); #create an array to include each nucleotide as array element

             foreach my $element (@dna)

            {
            if ($element =~/A/) # match nucleotide pattern and count
                            {
                             $A++;
                            }
             if ($element =~/T/)
                            {
                             $T++;
                            }
             if ($element =~/G/)
                            {
                             $G++;
                            }
             if ($element =~/C/)
                            {
                             $C++;
                            }

            }

            print $fileOUT "A=$A\n";
            print $fileOUT "T=$T\n";
            print $fileOUT "G=$G\n";
            print $fileOUT "C=$C\n";
}

close ($fileIN);
close ($fileOUT);

1 answer:

Answer 0 (score: 1)

First, I would use a few shortcuts. They make the code easier to read:

use strict;
use warnings;
use feature 'say';
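# Start the counters at 0 so a base that never occurs prints as 0
# instead of triggering an "uninitialized value" warning.
my $A = 0;
my $T = 0;
my $G = 0;
my $C = 0;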
my $fileIN;
my $fileOUT;

open $fileIN,  '<',"basecount.nfasta" or die "can't open file basecount.nfasta for reading";
open $fileOUT, '>','basecount.out' or die "can't open file basecount.out for writing";

while ( my $seq = <$fileIN> ) {

  next if $seq =~ /^>/;
  $seq =~ s/\n//g;
  say $seq;

  my @dna = split //, $seq;

  foreach my $element ( @dna ) {
    $A++ if $element =~ m/A/;
    $T++ if $element =~ m/T/;
    $G++ if $element =~ m/G/;
    $C++ if $element =~ m/C/;
  }

  say $fileOUT "A=$A";
  say $fileOUT "T=$T";
  say $fileOUT "G=$G";
  say $fileOUT "C=$C";
}

close $fileIN;
close $fileOUT;
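
Note that the version above still prints the counts inside the while loop, so the output file gets four lines per sequence line rather than four lines in total. Here is a minimal sketch of one way to get a single four-line summary, assuming the counts should be accumulated over every non-header line and that the bases are upper case. The count_bases helper and the in-memory sample input are hypothetical and only there to keep the example self-contained:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Hypothetical helper: count A/T/G/C over all sequence lines of a FASTA handle.
sub count_bases {
    my ($fh) = @_;
    # Start at 0 so a base that never occurs is reported as 0
    my %count = ( A => 0, T => 0, G => 0, C => 0 );
    while ( my $line = <$fh> ) {
        next if $line =~ /^>/;    # skip header lines
        chomp $line;
        # tr/// with an empty replacement list counts the matching
        # characters and leaves the string unchanged
        $count{A} += ( $line =~ tr/A// );
        $count{T} += ( $line =~ tr/T// );
        $count{G} += ( $line =~ tr/G// );
        $count{C} += ( $line =~ tr/C// );
    }
    return %count;
}

# Illustrative input read from a string instead of basecount.nfasta
my $fasta = ">seq1\nATGC\nAATT\n>seq2\nGGCC\n";
open my $in, '<', \$fasta or die "can't open in-memory sample: $!";
my %count = count_bases($in);
close $in;

# Print once, after the whole input has been read: four lines in total
say "$_=$count{$_}" for qw(A T G C);
```

For the sample input this prints A=3, T=3, G=3 and C=3 on four separate lines. To write to basecount.out instead, the same say statement can be given an output filehandle.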

It is also recommended to use the three-argument form of open (along with a good die message).
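
A minimal sketch of that form, with the system error variable $! added to the die message so the reason for a failure is reported. The temporary file is only an assumption to make the example runnable anywhere; in the script it would be basecount.nfasta:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file so the example runs anywhere (illustrative only)
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
print {$tmp_fh} ">seq1\nATGC\n";
close $tmp_fh or die "can't close $tmp_name: $!";

# Three-argument open: the mode is separate from the file name, so
# characters such as '>' or '|' in the name are never read as a mode.
open my $in, '<', $tmp_name
    or die "can't open $tmp_name for reading: $!";
my @lines = <$in>;
close $in;

print "read ", scalar(@lines), " lines\n";
```

Using a lexical filehandle (my $in) also means the handle is closed automatically when it goes out of scope.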

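Edit: I enabled the say feature (use feature 'say') here because every one of your print statements ends with a newline. say behaves exactly like print, except that it appends a newline at the end.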
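
A small sketch of the difference, writing to an in-memory buffer (an assumption made only so the result can be inspected; a file handle behaves the same way):

```perl
use strict;
use warnings;
use feature 'say';    # say is available from Perl 5.10 onwards

my $out = '';
open my $fh, '>', \$out or die "can't open in-memory buffer: $!";
print {$fh} "A=1\n";  # newline written explicitly
say   {$fh} "T=2";    # say appends the newline itself
close $fh;

print $out;           # both lines end with a newline
```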