Perl program error

Time: 2014-09-19 04:58:38

Tags: perl fasta dna-sequence blast

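I wrote a Perl program that takes an Excel sheet (converted to a text file by changing the extension from .xls to .txt) and a sequence file as input. The Excel sheet contains the start and end points of regions in the sequence file (plus 70 flanking values on either side of the matched region) that need to be cut out and extracted into a third output file. There are 300 values. The program reads in the start and end point of each sequence to be cut, but it repeatedly tells me that the value is outside the length of the input file when it clearly isn't. I just can't seem to get this fixed.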

This is the program:

use strict;
use warnings;

my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> );    # BLAST result file name
print "\n";

my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> );    # sequence file
print "\n";

open IN, "$blast" or die "Can not open file $blast: $!";

my @ids       = ();
my @seq_start = ();
my @seq_end   = ();

while (<IN>) {

    #spliting the result file based on each tab
    my @feilds = split( "\t", $_ );
    push( @ids, $feilds[0] );    #copying the name of sequence
         #coping the 6th tab value of the result which is the start point of from where a value should be cut.
    push( @seq_start, $feilds[6] );
    #coping the 7th tab value of the result file which is the end point of a value should be cut.
    push( @seq_end, $feilds[7] );
}
close IN;

open OUT, ">Result.fasta" or die "Can not open file $database: $!";

for ( $i = 0; $i <= $#ids; $i++ ) {

    ($sequence) = &block( $ids[$i] );

    ( $idline, $sequence ) = split( "\n", $sequence );

    #extracting the sequence from the start point to the end point
    my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;

    my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen );  #storing the extracted substring into $sequence

    $Nucleotides =~ s/(.{1,60})/$1\n/gs;

    print OUT "$idline\n";
    print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";

sub block {
    #block for id storage which is the first tab in the Blast output file.
    my $id1 = shift;
    print "$id1\n";
    my $start = ();

    open IN3, "$database" or die "Can not open file $database: $!";

    my $blockseq = "";
    while (<IN3>) {

        if ( ( $_ =~ /^>/ ) && ($start) ) {

            last;
        }

        if ( ( $_ !~ /^>/ ) && ($start) ) {

            chomp;
            $blockseq .= $_;
        }

        if (/^>$id1/) {

            my $start = $. - 1;
            my $blockseq .= $_;
        }
    }
    close IN3;

    return ($blockseq);
}

BLAST RESULT FILE:http://www.fileswap.com/dl/Ws7ehftejp/

SEQUENCE FILE:http://www.fileswap.com/dl/lPwuGh2oKM/

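Errors

substr outside of string at Nucleotide_Extractor.pl line 39.

Use of uninitialized value $Nucleotides in substitution (s///) at Nucleotide_Extractor.pl line 41.

Use of uninitialized value $Nucleotides in concatenation (.) or string at Nucleotide_Extractor.pl line 44.

Any help is much appreciated, and questions are always welcome.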

1 answer:

Answer 0 (score: 2)

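There were a number of problems with the existing code, and I ended up rewriting the script while fixing the errors. Your implementation is not very efficient because it opens, reads, and closes the sequence file for every ID in the Excel sheet. A better approach is to read and store the data from the sequence file once or, if memory is limited, to go through each entry in the sequence file and pick out the corresponding data from the Excel file. You are also better off using hashes instead of arrays; hashes store data as key-value pairs, so it is much easier to find what you are looking for. I have also used references throughout, as they make it easy to pass data into and out of subroutines.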
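The memory-limited alternative mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `for_each_fasta_record`, the in-memory filehandle, and the sample IDs and coordinates are all invented for the example, and the slice simply reuses the `substr` arithmetic from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: walk a FASTA filehandle one record at a time and
# call $callback->($id, $header, $seq) for each complete record, so only
# one sequence is held in memory at once.
sub for_each_fasta_record {
    my ($fh, $callback) = @_;
    my ($id, $header, $seq);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {
            # a new header: hand over the previous record, if any
            $callback->($id, $header, $seq) if defined $id;
            ($id, $header, $seq) = ($1, $line, '');
        }
        elsif (defined $id) {
            $seq .= $line;
        }
    }
    # don't forget the last record in the file
    $callback->($id, $header, $seq) if defined $id;
    return;
}

# Made-up coordinates keyed by sequence ID (would come from the BLAST table)
my %info = ( seq2 => { start => 2, end => 8 } );

# Made-up FASTA data, read from an in-memory filehandle for the demo
my $fasta_text = ">seq1 first\nAAAACCCC\nGGGG\n>seq2 second\nACGTACGT\nACGT\n";
open my $fh, '<', \$fasta_text or die "Can not open in-memory file: $!";

my %extracted;
for_each_fasta_record($fh, sub {
    my ($id, $header, $seq) = @_;
    my $range = $info{$id} or return;    # skip sequences not in the table
    my $len = $range->{end} - $range->{start} - 1;
    $extracted{$id} = substr($seq, $range->{start}, $len);
});
close $fh;

print "$_: $extracted{$_}\n" for sort keys %extracted;
```

With a real file you would open the FASTA path instead of the in-memory string; the trade-off is that the coordinate table must be loaded first, while the (usually much larger) sequence file is never held in memory as a whole.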

If you are not familiar with Perl data structures, have a look at perlfaq4 and perldsc; perlreftut has information on using references.
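For anyone new to these structures, here is a small sketch of a hash of hashes held in a reference and passed to a subroutine. The IDs, the coordinates, and the helper `region_length` are invented purely for illustration.

```perl
use strict;
use warnings;

# A hash reference whose values are themselves hash references
# (IDs and coordinates are made up for illustration).
my $info = {
    seqA => { start => 10, end => 50 },
    seqB => { start => 5,  end => 20 },
};

# Hypothetical helper: receives the whole structure as one scalar
# (the reference) plus an ID, and returns end - start for that ID,
# or nothing (undef in scalar context) if the ID is unknown.
sub region_length {
    my ($data, $id) = @_;
    return unless exists $data->{$id};
    return $data->{$id}{end} - $data->{$id}{start};
}

# Adding an entry later is a single assignment
$info->{seqC} = { start => 1, end => 4 };

for my $id (sort keys %$info) {
    print "$id: ", region_length($info, $id), "\n";
}
```

Checking with `exists` before dereferencing avoids accidentally creating an empty entry for an ID that was never in the table.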

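The main problem with the existing code was that the subroutine that gets the sequence from the fasta file was not returning anything. It is a good idea to put plenty of debugging statements in your code to make sure it is doing what you think it is doing. I have left my debugging statements in but commented them out. I have also heavily commented the code that I changed.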
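One likely reason the original `block` subroutine comes back empty is the `my` inside its `if (/^>$id1/)` branch: that declares new variables which exist only within that block, so the outer `$start` and `$blockseq` are never set. A stripped-down sketch of the effect follows; the subroutine names `shadowed` and `fixed` and the sample lines are invented for the example.

```perl
use strict;
use warnings;

my @lines = ( ">seq1 demo\n", "ACGT\n", "TTGA\n" );

# Mirrors the original pattern: 'my' inside the if-block creates fresh
# variables that vanish at the closing brace.
sub shadowed {
    my $start;
    my $blockseq = "";
    for my $line (@lines) {
        if ( $line !~ /^>/ && $start ) {
            chomp( my $copy = $line );
            $blockseq .= $copy;
        }
        if ( $line =~ /^>seq1/ ) {
            my $start    = 1;        # new variable, outer $start stays undef
            my $blockseq = $line;    # new variable, outer $blockseq stays ""
        }
    }
    return $blockseq;
}

# Same loop without the inner 'my': the outer variables are updated.
sub fixed {
    my $start;
    my $blockseq = "";
    for my $line (@lines) {
        if ( $line !~ /^>/ && $start ) {
            chomp( my $copy = $line );
            $blockseq .= $copy;
        }
        if ( $line =~ /^>seq1/ ) {
            $start    = 1;
            $blockseq = $line;       # header line, still ending in "\n"
        }
    }
    return $blockseq;
}

my ( $idline, $sequence ) = split /\n/, fixed();
print "fixed:    $idline / $sequence\n";
print "shadowed: '", shadowed(), "'\n";
```

When the subroutine returns an empty string, the later `split` leaves the sequence undefined, which is consistent with the "substr outside of string" and "uninitialized value" warnings in the question.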

#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a newline
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;

#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp(my $blast = <STDIN>);     # BLAST result file name ('my' is required under strict)
print "\n";

#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp(my $database = <STDIN>);  # sequence file ('my' is required under strict)
print "\n";

open IN,"$blast" or die "Can not open file $blast: $!";

# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
#   $info->{$id}{start}
#   $info->{$id}{end}
my $info;

while(<IN>){
    #splitting the result file based on each tab
    my @fields = split("\t",$_);
    # add the data to our $info hashref with the ID as the key:
    $info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;

#say "info: " . Dumper($info);

# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);

my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";

foreach my $id (keys %$info) {

    # check whether the sequence exists
    if ($sequence->{$id}) {
        #extracting the sequence from the start point to the end point
        my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;

        #say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};

        #storing the extracted substring into $sequence
        my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
        $nucleotides =~ s/(.{1,60})/$1\n/gs;
        #say "nucleotides: $nucleotides";
        print OUT $sequence->{$id}{header} . "\n";
        print OUT "$nucleotides\n";
    }
}
print "\nExtraction Completed...";

sub read_sequences {
    # fasta file
    my $fasta_file = shift;

    open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";

    # initialise two variables. We will store our sequence data in $fasta
    # and use $id to track the current sequence ID
    # the $fasta hash will look like this:
    # $fasta = {
    #   'gi|7212472|ref|NC_002387.2' => {
    #       header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
    #       seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
    #   }
    #}

    my ($fasta, $id);

    while(<IN3>){
        chomp;
        if (/^>/) {
            if (/^>(\S+) /){
                # the header line with the sequence info.
                $id = $1;
                # save the data to the $fasta hash, keyed by seq ID
                # we're going to build up an entry as we go along
                # set the header to the current line
                $fasta->{ $id }{ header } = $_;
            }
            else {
                # no ID found! Erk. Emit an error and undef $id.
                warn "Formatting error: $_";
                undef $id;
            }
        }
        ## ensure we're getting sequence lines...
        elsif (/^[ATGC]/) {
            # if $id is not defined, there's something weird going on, so
            # don't save the sequence. In a correctly-formatted file, this
            # should not be an issue.
            if ($id) {
                # if $id is set, add the line to the sequence.
                $fasta->{ $id }{ seq } .= $_;
            }
        }
    }
    close IN3;
    return $fasta;
}