我编写了一个PERL程序,它带有一个excel表(通过将扩展名从.xls更改为.txt来转换为文本文件)和一个用于输入的序列文件。 excel表包含序列文件中区域的起点和终点(以及匹配区域两侧的70个侧翼值),需要剪切并提取到第三个输出文件中。有300个值。程序读入每次需要切割的序列的起点和终点,但它反复告诉我,当输入文件显然没有时,该值超出了输入文件的长度。我似乎无法得到这个固定的
这是程序
use strict;
use warnings;
my $blast;
my $i;
my $idline;
my $sequence;
print "Enter Your BLAST result file name:\t";
chomp( $blast = <STDIN> ); # BLAST result file name
print "\n";
my $database;
print "Enter Your Gene list file name:\t";
chomp( $database = <STDIN> ); # sequence file
print "\n";
open IN, "$blast" or die "Can not open file $blast: $!";
my @ids = ();
my @seq_start = ();
my @seq_end = ();
while (<IN>) {
#spliting the result file based on each tab
my @feilds = split( "\t", $_ );
push( @ids, $feilds[0] ); #copying the name of sequence
#coping the 6th tab value of the result which is the start point of from where a value should be cut.
push( @seq_start, $feilds[6] );
#coping the 7th tab value of the result file which is the end point of a value should be cut.
push( @seq_end, $feilds[7] );
}
close IN;
open OUT, ">Result.fasta" or die "Can not open file $database: $!";
for ( $i = 0; $i <= $#ids; $i++ ) {
($sequence) = &block( $ids[$i] );
( $idline, $sequence ) = split( "\n", $sequence );
#extracting the sequence from the start point to the end point
my $seqlen = $seq_end[$i] - $seq_start[$i] - 1;
my $Nucleotides = substr( $sequence, $seq_start[$i], $seqlen ); #storing the extracted substring into $sequence
$Nucleotides =~ s/(.{1,60})/$1\n/gs;
print OUT "$idline\n";
print OUT "$Nucleotides\n";
}
print "\nExtraction Completed...";
sub block {
#block for id storage which is the first tab in the Blast output file.
my $id1 = shift;
print "$id1\n";
my $start = ();
open IN3, "$database" or die "Can not open file $database: $!";
my $blockseq = "";
while (<IN3>) {
if ( ( $_ =~ /^>/ ) && ($start) ) {
last;
}
if ( ( $_ !~ /^>/ ) && ($start) ) {
chomp;
$blockseq .= $_;
}
if (/^>$id1/) {
my $start = $. - 1;
my $blockseq .= $_;
}
}
close IN3;
return ($blockseq);
}
BLAST RESULT FILE:http://www.fileswap.com/dl/Ws7ehftejp/
SEQUENCE FILE:http://www.fileswap.com/dl/lPwuGh2oKM/
错误
在Nucleotide_Extractor.pl第39行的字符串之外的substr。
使用未初始化的值$ Nucleotides替换(s ///)at Nucleotide_Extractor.pl第41行。
在连接(。)或字符串中使用未初始化的值$ Nucleotides 在Nucleotide_Extractor.pl第44行。
非常感谢任何帮助,并始终邀请查询
答案 0 :(得分:2)
现有代码存在一些问题,我在修复错误时最终重写了脚本。您的实现效率不高,因为它打开,读取和关闭Excel工作表中每个ID的序列文件。更好的方法是从序列文件中读取和存储数据,或者,如果内存有限,则遍历序列文件中的每个条目,并从Excel文件中选择相应的数据。你最好还是使用哈希而不是数组;哈希以键值对存储数据,因此更容易找到您要查找的内容。我也一直使用引用,因为它们可以很容易地将数据传入和传出子例程。
如果您不熟悉perl数据结构,请查看perlfaq4和perldsc,perlreftut包含有关使用引用的信息。
现有代码的主要问题是从fasta文件获取序列的子例程没有返回任何内容。在代码中放置大量的调试语句是一个好主意,以确保它正在按照您的想法执行。我已经离开了我的调试语句,但对它们进行了评论。我也大量评论了我改变的代码。
#!/usr/bin/perl
use strict;
use warnings;
# enables 'say', which prints out your text and adds a carriage return
use feature ':5.10';
# a very useful module for dumping out data structures
use Data::Dumper;
#my $blast = 'infesmall.txt';
print "Enter Your BLAST result file name:\t";
chomp($blast = <STDIN>); # BLAST result file name
print "\n";
#my $database = 'infe.fasta';
print "Enter Your Gene list file name:\t";
chomp($database = <STDIN>); # sequence file
print "\n";
open IN,"$blast" or die "Can not open file $blast: $!";
# instead of using three arrays, let's use a hash reference!
# for each ID, we want to store the start and the end point. To do that,
# we'll use a hash of hashes. The start and end information will be in one
# hash reference:
# { start => $fields[6], end => $fields[7] }
# and we will use that hashref as the value in another hash, where the key is
# the ID, $fields[0]. This means we can access the start or end data using
# code like this:
# $info->{$id}{start}
# $info->{$id}{end}
my $info;
while(<IN>){
#splitting the result file based on each tab
my @fields = split("\t",$_);
# add the data to our $info hashref with the ID as the key:
$info->{ $fields[0] } = { start => $fields[6], end => $fields[7] };
}
close IN;
#say "info: " . Dumper($info);
# now read the sequence info from the fasta file
my $sequence = read_sequences($database);
#say "data from read_sequences:\n" . Dumper($sequence);
my $out = 'result.fasta';
open(OUT, ">" . $out) or die "Can not open file $out: $!";
foreach my $id (keys %$info) {
# check whether the sequence exists
if ($sequence->{$id}) {
#extracting the sequence from the start point to the end point
my $seqlen = $info->{$id}{end} - $info->{$id}{start} - 1;
#say "seqlen: $seqlen; stored seq length: " . length($sequence->{$id}{seq}) . "; start: " . $info->{$id}{start} . "; end: " . $info->{$id}{end};
#storing the extracted substring into $sequence
my $nucleotides = substr($sequence->{$id}{seq}, $info->{$id}{start}, $seqlen);
$nucleotides =~ s/(.{1,60})/$1\n/gs;
#say "nucleotides: $nucleotides";
print OUT $sequence->{$id}{header} . "\n";
print OUT "$nucleotides\n";
}
}
print "\nExtraction Completed...";
sub read_sequences {
# fasta file
my $fasta_file = shift;
open IN3, "$fasta_file" or die "Can not open file $fasta_file: $!";
# initialise two variables. We will store our sequence data in $fasta
# and use $id to track the current sequence ID
# the $fasta hash will look like this:
# $fasta = {
# 'gi|7212472|ref|NC_002387.2' => {
# header => '>gi|7212472|ref|NC_002387.2| Phytophthora...',
# seq => 'ATAAAATAATATGAATAAATTAAAACCAAGAAATAAAATATGTT...',
# }
#}
my ($fasta, $id);
while(<IN3>){
chomp;
if (/^>/) {
if (/^>(\S+) /){
# the header line with the sequence info.
$id = $1;
# save the data to the $fasta hash, keyed by seq ID
# we're going to build up an entry as we go along
# set the header to the current line
$fasta->{ $id }{ header } = $_;
}
else {
# no ID found! Erk. Emit an error and undef $id.
warn "Formatting error: $_";
undef $id;
}
}
## ensure we're getting sequence lines...
elsif (/^[ATGC]/) {
# if $id is not defined, there's something weird going on, so
# don't save the sequence. In a correctly-formatted file, this
# should not be an issue.
if ($id) {
# if $id is set, add the line to the sequence.
$fasta->{ $id }{ seq } .= $_;
}
}
}
close IN3;
return $fasta;
}