从每个不同的生物获得序列长度 - genbank文件

时间:2016-05-29 19:37:50

标签: perl

我希望从包含不同生物类型的genbank文件中获取最大,最小和平均长度,我想为每个生物体获取此信息。

示例:

Organism: Homo sapiens
average length = 160 
shortest length = 20 | id of shortest seq
longest length = 500 | id of longest seq

Organism: Caenorhabditis elegans
average length = 140 
shortest length = 40 | id of shortest seq
longest length = 300 | id of longest seq

我设法得到了长度,但我无法通过有机体将它们分开

use strict;
use warnings;

print "enter file path: ";
my $filename = <>;
chomp ($filename);
open(IN, $filename) or die "\n error opening file \n;/\n";

$/ = "//";

my %organisms ;
while (my $block = <IN>)  {
    next if $block =~ /^\s*\n\s*$/;
    my ($definition , $sequence) = split "ORIGIN", $block; 

    my $accession; 
    $definition =~ m/(ACCESSION.+[0-9])/x
        ? $accession = $1
        : die "No ACCESSION";

    my $organism;
    $definition =~ m/(ORGANISM\s+.+\n)/x
        ? $organism = $1
        : die "No ORGANISM";


    $sequence =~ s/[\d\n\s\t\/]//g;
    $organisms{ $sequence } = [ $organism, $accession ];
    }

my $sum = 0;
foreach my $sequence( keys %organisms) {

    my $current_len = length($sequence);
         $sum += $current_len;
};
my $number_seqs = scalar keys %organisms;
my $average = ($sum / $number_seqs);
print "average length = $average \n";

my @sorted_keys = sort { length $a  <=>  length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest  = $sorted_keys[-1];
my $short = length $shortest;
my $long = length $longest;
my $short_id = $organisms{$shortest}->[1];
my $long_id = $organisms{$longest}->[1];
my $short_type = $organisms{$shortest}->[0];
my $long_type = $organisms{$longest}->[0];

print "shortest length = $short | $short_id | $short_type\n";
print "longest length = $long | $long_id | $long_type\n";


exit;

我应该更改哪个部分来打印每个生物的长度?

输入示例:

LOCUS       NM_001112            40 bp    mRNA    linear   PRI 20-AP
DEFINITION  Homo sapiens transcript variant 5, mRNA.
ACCESSION   NM_001112
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
ORIGIN      
        1 actgggcggc ccttagaccc
       //
LOCUS       NM_854               212 bp    mRNA    linear   PRI 20-AP
DEFINITION  Homo sapiens transcript variant 1, mRNA.
ACCESSION   NM_854
KEYWORDS    RefSeq.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
ORIGIN      
        1 gggcaaaag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag

//
LOCUS       AW057463                 469 bp    mRNA    linear   EST 29-SE
DEFINITION  ca03d09.x1 C elegans fem3 Q23 S1 Caenorhabditis elegans cDNA 3'
ACCESSION   AW057463
VERSION     AW057463.1  GI:5933102
DBLINK      BioSample: LIBEST_002392
KEYWORDS    EST.
SOURCE      Caenorhabditis elegans
  ORGANISM  Caenorhabditis elegans
ORIGIN      
        1 ttttactcaa aactatctat ccaagttaat cagtagtgtt agttctagtt aagttattaa
       61 ggcgcacggt ctgtctcctt gcttcttctc tttgtatccc ctttctcctt tttcaaaact
      121 tcactttcat caataattgg ttctttagaa tacagttttc caatttccac gtactctctt
      181 ctcttccgat ccttgtcaaa ctttttcttc gggagctcat cttctggaac tactttcaca
//
LOCUS       AW04463                 259 bp    mRNA    linear   EST 14-SE
DEFINITION  ca02d86.x1 C elegans fem6 S12 Q3 Caenorhabditis elegans cDNA 3'
ACCESSION   AW04463
VERSION     AW04463.1  GI:90872
DBLINK      BioSample: LIBEST_004372
KEYWORDS    EST.
SOURCE      Caenorhabditis elegans
  ORGANISM  Caenorhabditis elegans
ORIGIN      
      241 tttttcgatg gaaccaaacg ggaacgagtt ggcttttcca ccaaaagatt agcgtactcc
      301 gaactgtatt tccccttctt tttcttttca agaggaacat tttctcgttg agtatcatcg
      361 tcctccaaac tttgttgagt agtcatggac tgggtccgag agaattcaac ggtaggcatg
      421 gaacctttgc tcttgtcgtc gtttgccttt ggtgcctttc ccttttgaa
//

1 个答案:

答案 0 :(得分:0)

  

我应该更改哪个部分来打印每个生物的长度?

我认为您应该更改程序的许多部分(主要是我会更改使用序列作为哈希键的逻辑,而是将有机体用作哈希键)。这是一个建议:

use feature qw(say);
use strict;
use warnings;

my $filename = 'datafile.txt';
chomp ($filename);
open(IN, $filename) or die "\n error opening file \n;/\n";

$/ = "//";

my %organisms ;
while (my $block = <IN>)  {
    next if $block =~ /^\s*\n\s*$/;
    my ($definition , $sequence) = split "ORIGIN", $block; 

    my $accession; 
    $definition =~ m/ACCESSION\s*(\S+)/x
        ? $accession = $1
        : die "No ACCESSION";

    my $organism;
    $definition =~ m/ORGANISM\s*(\S.+)\n/x
        ? $organism = $1
        : die "No ORGANISM";

    $sequence =~ s/[\d\n\s\t\/]//g;
    push @{ $organisms{ $organism }{info} }, [ $sequence, $accession ];
}

for my $organism ( keys %organisms ) {
    say "Organism: $organism";
    my $info = $organisms{ $organism }{info};
    my $sum = 0;
    map { $sum += length $_->[0] } @$info;
    my ( $min_len, $min_id, $max_len, $max_id ) = map { length $_->[0], $_->[1] }
      (sort { length $a->[0] <=>  length $b->[0]} @$info)[0,-1];

    say "average length = " .  $sum / ( scalar @$info );
    say "shortest length = $min_len | $min_id";
    say "longest length = $max_len | $max_id";
    say "";
}

现在输出:

Organism: Homo sapiens
average length = 39.5
shortest length = 20 | NM_001112
longest length = 59 | NM_854

Organism: Caenorhabditis elegans
average length = 234.5
shortest length = 229 | AW04463
longest length = 240 | AW057463