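Get the shortest and longest sequences in a file

Date: 2016-05-28 00:34:47

Tags: perl

I am trying to get the shortest and longest sequences in a file that contains multiple GenBank-like entries. Example file: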

LOCUS       NM_182854               2912 bp    mRNA    linear   PRI 20-APR-2016
DEFINITION  Homo sapiens mRNA.
ACCESSION   NM_182854
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.

ORIGIN      
        1 gggcgatcag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
       61 tggcagcccc tcggcttgtc gccggagctg agaaccaaga gctcgaaggg gccatatgac
      //

LOCUS       NM_001323410            6992 bp    mRNA    linear   PRI 20-APR-2016
DEFINITION  Homo sapiens  mRNA.
ACCESSION   NM_001323410
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.

ORIGIN      
        1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
       61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggc
      //

I want to print the accession number and the organism for the shortest sequence and for the longest sequence.

My code so far:

#!/usr/bin/perl

use strict;
use warnings;

print "enter file path\n";

while (my $line = <>){
    chomp $line;
    my @record = ($line);

    foreach my $file(@record){
    open(IN, "$file") or die "\n error opening file \n;/\n";

    $/="//";

    while (my $line = <IN>){
        my @gb_seq = split ("ORIGIN", $line);
        my $definition = $gb_seq[0];
        my $sequence = $gb_seq[1];

        $definition =~ m/ORGANISM[\s\t]+(.+)[\n\s\t]+/;
        my $organism = $1;

        if ($definition =~ m/ACCESSION[\s\t]+(\D\D_\d\d\d\d\d\d(\d*))[\n\s\t]+/){
        my $accession = $1;

            $sequence =~ s/\d//g;
            $sequence =~ s/[\n\s\t]//g;
            my $size = length($sequence);
            my @sorted_keys = sort { $a <=> $b } keys my %size;
            my $shortest = $sorted_keys[0];
            my $longest = $sorted_keys[-1];

            print "this is the shortest: $accession $organism size: $shortest\n";
            print "this is the longest: $accession $organism size: $longest\n";
    }
    }}}
    exit;

I thought about putting the lengths in a hash to get the shortest and the longest, but something is wrong there. I get these errors:

Use of uninitialized value $organism in concatenation (.) or string at test.pl line 39, <IN> chunk 1
Use of uninitialized value $shortest in concatenation (.) or string at test.pl line 39, <IN> chunk 1.
Use of uninitialized value $longest in concatenation (.) or string at test.pl line 40, <IN> chunk 1.

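Which part should I change? Thanks.

2 answers:

Answer 0 (score: 2)

We need to find the entries of extreme length while still being able to identify the records they belong to. Reading the file record by record, split on //, is a good idea in itself. However, each record is then a single string, and pulling the sequence straight out of it is harder than working through it line by line. So we may as well read line by line, since everything we need has a clear marker.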

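The choice of data structure matters and depends on the purpose. Here I organize the data so that it is easy to work with, using a hash with elements of the form

%block = ( 'accession' => { 'type' => type, 'sequence' => sequence }, ... )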
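
A minimal sketch of that layout, for illustration only: the accession keys are taken from the sample data, while the sequence strings are truncated to the first one or two groups of each record.

```perl
use strict;
use warnings;

# Hash of hashes keyed by accession; sequences are truncated for illustration
my %block = (
    'NM_182854'    => { 'type' => 'Homo sapiens', 'sequence' => 'gggcgatcag aagcaggtca' },
    'NM_001323410' => { 'type' => 'Homo sapiens', 'sequence' => 'actacttccg' },
);

# Append to a record's sequence the same way a line-by-line reader would
$block{'NM_001323410'}{'sequence'} .= ' gcttccccgc';

# Any field can be looked up directly by accession
for my $acc (sort keys %block) {
    printf "%s: type=%s, length=%d\n",
        $acc, $block{$acc}{'type'}, length $block{$acc}{'sequence'};
}
```

Keeping the accession as the top-level key means every other field of a record is one lookup away, which is the convenience this answer favors over raw search speed.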

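Once the data has been read in, the search itself would be helped greatly by organizing the data by sequence (rather than by accession), but that would make the structure much harder to work with. I expect this may end up being used for more than this one task, so a slight loss of speed does not matter. If the only goal were to answer this specific question with the best performance, other approaches would be more suitable. Comments follow the code.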

use warnings;
use strict;
use feature qw(say);

my $file = 'data_seqs.txt';
open my $fh, '<', $file or die "Can't open $file -- $!";

# Hash, helper variables, flag (inside a sequence?), sequence-end marker
my (%block, $accession, $sequence);
my $is_seq = 0;
my $end_marker = qr(\s*//); # marks end of sequence: //

while (my $line = <$fh>) 
{
    chomp($line);
    next if $line =~ /^\s*$/;       # skip empty lines

    if ($line =~ /$end_marker/) {  # done with the sequence
        $is_seq = 0;
        $sequence = ''; 
        next;
    }   

    if ($line =~ /^\s*ACCESSION\s*(\w+)/) { 
        $accession = $1; 
    }   
    elsif ($line =~ /^\s*ORGANISM\s*(.+)/) {
        $block{$accession}{'type'} = $1; 
    }   
    elsif ($line =~ /^\s*ORIGIN/) {  # start sequence on next line
        $is_seq = 1;
    }   
    elsif ($is_seq) {                # read (and add to) sequence
        if ($line =~ /^\s*\d+\s*(.*)/) {
            $block{$accession}{'sequence'} .= $1; 
        }
        else { warn "Not sequence? Line: $line " }
    }   
}

# Identify keys for max and min length. Initialize both with the same (any) key,
# so that a file with a single record also works
my ($max) = keys %block;
my $min = $max;

foreach my $acc (keys %block) 
{
    my $current_len = length($block{$acc}{'sequence'});
    if ( $current_len > length($block{$max}{'sequence'}) ) { 
        $max = $acc;
    }
    if ( $current_len < length($block{$min}{'sequence'}) ) {
        $min = $acc;
   }
}

say "Maximum length sequence:  ACCESSION: $max, ORGANISM: " . $block{$max}{'type'};
say "Minimum length sequence:  ACCESSION: $min, ORGANISM: " . $block{$min}{'type'};

use Data::Dumper;
print Dumper(\%block);

This prints (with the Dumper printout omitted)

Maximum length sequence:  ACCESSION: NM_182854, ORGANISM: Homo sapiens
Minimum length sequence:  ACCESSION: NM_001323410, ORGANISM: Homo sapiens

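Comments on the efficiency of the search

A common approach is to first build a reverse-lookup hash, then use a library (for example List::Util) to find the maximum and the minimum, and then look up which records they belong to. That requires building the lookup hash and calling the library twice, whereas the manual search above walks the structure once and is also simpler. Another option is to use the sequences themselves as the top-level hash keys and then find the max and min directly. However, such a hash would be much harder to work with.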
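
As a rough sketch of the reverse-lookup approach just described, assuming a small made-up %block (the accessions ACC_A, ACC_B, ACC_C and their sequences are invented for illustration), it could look like this. Note the caveat in the comments: records of equal length overwrite each other in the lookup hash.

```perl
use strict;
use warnings;
use List::Util qw(max min);

# Illustrative stand-in for the %block hash (accessions and sequences are made up)
my %block = (
    'ACC_A' => { 'type' => 'Homo sapiens', 'sequence' => 'acgtacgt' },
    'ACC_B' => { 'type' => 'Homo sapiens', 'sequence' => 'acg' },
    'ACC_C' => { 'type' => 'Homo sapiens', 'sequence' => 'acgta' },
);

# Reverse lookup: sequence length => accession.
# Caveat: two records with the same length would overwrite each other here.
my %acc_by_len;
$acc_by_len{ length $block{$_}{'sequence'} } = $_ for keys %block;

# Two library calls over the lengths, then two lookups
my $max_len  = max(keys %acc_by_len);
my $min_len  = min(keys %acc_by_len);
my $longest  = $acc_by_len{$max_len};
my $shortest = $acc_by_len{$min_len};

print "longest:  $longest ($max_len)\n";
print "shortest: $shortest ($min_len)\n";
```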

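Another approach would be to organize the data in a structure that allows more efficient retrieval of this particular information, possibly one based on arrays.

However, the gain in efficiency does not seem to justify the considerable loss of convenience. It should be considered if speed turns out to be a problem.

If you need to process multiple files, just change the loop to while (my $line = <>), drop the explicit open, and supply the file names on the command line. All lines of all the files are then read one by one, and the rest of the code stays the same.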
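
A small sketch of reading several files through <>, under the assumption that two throwaway files with made-up content stand in for real inputs; @ARGV is set by hand here only to simulate file names given on the command line.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create two throwaway input files (made-up content) so the sketch is self-contained
my @files;
for my $acc (qw(ACC_A ACC_B)) {
    my ($tmp_fh, $name) = tempfile(UNLINK => 1);
    print {$tmp_fh} "ACCESSION   $acc\n";
    close $tmp_fh or die "Can't close $name: $!";
    push @files, $name;
}

# On the command line this would be:  perl script.pl file1 file2
# Here @ARGV is filled in by hand to simulate that.
my @seen;
{
    local @ARGV = @files;
    while (my $line = <>) {    # reads every file named in @ARGV, line by line
        push @seen, $1 if $line =~ /^\s*ACCESSION\s+(\w+)/;
    }
}
print "accessions seen: @seen\n";
```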

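I may have misunderstood some of the terms. For example, I do not remove spaces from the "sequence", and I use only the first line for the "type". These are easy to adjust; let me know.

Answer 1 (score: 1)

You state that you need two pieces of data, the accession and the organism, for the longest and the shortest sequence. That means the values in your hash need to store two elements. Besides that, when you use '//' as the record separator, the '//' still appears at the end of each record. So after you filter whitespace and digits out of the sequence, you are still left with '//' at the end. When I ran the code through the debugger, I found that the lengths were all 2 for this reason.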
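
The separator behavior can be checked in isolation. The sketch below reads two tiny made-up records from an in-memory filehandle and compares the cleaned sequence with and without the trailing '//'; it relies on chomp removing whatever $/ is currently set to.

```perl
use strict;
use warnings;

# Two tiny made-up records, read from an in-memory filehandle
my $data = "ORIGIN\n 1 acgt\n//\nORIGIN\n 1 ac\n//\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my (@raw, @clean);
{
    local $/ = "//";    # record separator, restored when the block ends
    while (my $chunk = <$fh>) {
        next unless $chunk =~ /ORIGIN/;    # skip the whitespace after the last //

        # Strip digits and whitespace only: the separator is still there
        (my $raw = (split /ORIGIN/, $chunk)[1]) =~ s/[\d\s]//g;
        push @raw, $raw;

        # With $/ set to "//", chomp removes the trailing "//" from the chunk
        chomp $chunk;
        (my $clean = (split /ORIGIN/, $chunk)[1]) =~ s/[\d\s]//g;
        push @clean, $clean;
    }
}
print "with separator:    @raw\n";
print "without separator: @clean\n";
```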

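A few other things:

  1. When using regular expressions, use "extended mode", /x, so that you can include whitespace for readability
  2. You assume a successful match when digging into $definition; it is better to test the regex, assign on a match, and die on a mismatch
  3. Instead of storing the length in the hash (and losing the sequence itself), you can store the sequence and compute the length later;
  4. I renamed the variable $line to $chunk, since it contains multiple lines
  5. Everything to do with calculating the shortest and longest and printing the results needs to move out of the loop. Inside the loop you only need to populate the hash. As noted above, the hash values have to be arrays with two values, the accession and the organism.
  6. You remove the digits from the sequence in one command and the whitespace in another; they can be removed at the same time. While we are at it, we may as well remove the '//' at the end of the record.
  7. With the above changes, I get;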

    use v5.14;
    use warnings;
    
    print "Enter file path: ";
    chomp(my $filename = <>);
    open(IN, '<', $filename) or die "Error opening file '$filename': $!\n";
    
    $/ = "//" ;
    
    my %organisms ;
    while (my $chunk = <IN>)  {
        next if $chunk =~ /^\s*\n\s*$/ ;
        my ($definition , $sequence) = split "ORIGIN", $chunk ;
    
        my $organism ;
        $definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
            ? $organism = $1
            : die "Couldn't find ORGANISM line" ;
    
        my $accession ;
        $definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*))  [\n\s\t]+ /x
            ? $accession = $1
            : die "Can't find ACCESSION line" ;
    
        $sequence =~ s/[\d\n\s\t\/]//g;
        $organisms{ $sequence } = [ $accession , $organism ] ;
    }
    
    
    my @sorted_keys = sort { length $a  <=>  length $b } keys %organisms ;
    my $shortest = $sorted_keys[0];
    my $longest  = $sorted_keys[-1];
    
    say "this is the shortest: ",  $organisms{$shortest}->[0],
                            ", ",  $organisms{$shortest}->[1],
                       " size: ",  length $shortest, "\n",
                   " sequence: ",  $shortest ;
    
    say  "this is the longest: ",  $organisms{$longest}->[0],
                            ", ",  $organisms{$longest}->[1],
                       " size: ",  length $longest, "\n",
                   " sequence: ",  $longest ;
    
    exit;
    

    When run against the data, this produces:

    $ ./sequence.pl
    Enter file path: data.txt
    this is the shortest: NM_001323410, Homo sapiens size: 113
     sequence: actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcggcaagaaccagcgcaagagggaagcagagttatagctaccccggc
    this is the longest: NM_182854, Homo sapiens size: 120
     sequence: gggcgatcagaagcaggtcacacagcctgtttcctgttttcaaacggggaacttagaaagtggcagcccctcggcttgtcgccggagctgagaaccaagagctcgaaggggccatatgac
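
Points 1 and 2 of the list can also be sketched on their own. The helper extract_accession below is hypothetical (it is not part of the code above), and it uses a looser \w+ pattern than the answer's regex; it shows an /x regex with inline comments combined with an assign-or-die check.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the answer's code): pull the accession out
# of a record header, dying if the pattern does not match.
sub extract_accession {
    my ($definition) = @_;
    my ($accession) = $definition =~ m/
        ^ \s* ACCESSION \s+   # field name at the start of a line, then whitespace
        ( \w+ )               # capture the accession itself
    /xm
        or die "Can't find ACCESSION line\n";
    return $accession;
}

my $header = "LOCUS       NM_182854   2912 bp\nACCESSION   NM_182854\n";
print extract_accession($header), "\n";
```

The list assignment returns the number of captured values, so a failed match evaluates to 0 and triggers the die without a separate test.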
    

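    Update: The problem with the code above is that if the same sequence appears in two chunks, the data is overwritten in the hash and lost. Below is an updated version that stores the data in an array of arrays, which solves the problem. It produces the same output, apart from an extra comma before "size":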

    use v5.14;
    use warnings;
    
    print "Enter file path: ";
    chomp(my $filename = <>);
    open(IN, '<', $filename) or die "Error opening file '$filename': $!\n";
    
    $/ = "//" ;
    
    my @organisms ;
    while (my $chunk = <IN>)  {
        next if $chunk =~ /^\s*\n\s*$/ ;
        my ($definition , $sequence) = split "ORIGIN", $chunk ;
    
        my $organism ;
        $definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
            ? $organism = $1
            : die "Couldn't find ORGANISM line" ;
    
        my $accession ;
        $definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*))  [\n\s\t]+ /x
            ? $accession = $1
            : die "Can't find ACCESSION line" ;
    
        $sequence =~ s/[\d\n\s\t\/]//g;
        push @organisms, [$organism , $accession , $sequence] ;
    }
    
    
    my @sorted_organisms = sort { length $a->[2]  <=>  length $b->[2] }  @organisms ;
    
    my ($organism , $accession , $sequence) = @{ $sorted_organisms[0] };
    say "this is the shortest: $accession, $organism, size: ",
        length $sequence, "\n", " sequence: ",  $sequence ;
    
    ($organism , $accession , $sequence) = @{ $sorted_organisms[-1] };
    say "this is the longest: $accession, $organism, size: ",
        length $sequence, "\n", " sequence: ",  $sequence ;
    
    exit;
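
The overwrite problem described in the update can be reproduced in a small sketch. The two records below are invented for illustration (including the accessions and the second organism) and deliberately share the same sequence.

```perl
use strict;
use warnings;

# Two made-up records that happen to share the same sequence
my @records = (
    [ 'Homo sapiens', 'ACC_A', 'acgt' ],
    [ 'Mus musculus', 'ACC_B', 'acgt' ],
);

# Keyed by sequence, as in the first version: the second record replaces the first
my %by_sequence;
$by_sequence{ $_->[2] } = [ $_->[1], $_->[0] ] for @records;

# Array of arrays, as in the updated version: both records survive
my @organisms = map { [ @$_ ] } @records;

printf "hash keeps %d record(s), array keeps %d\n",
    scalar(keys %by_sequence), scalar(@organisms);
```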