Question

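I need help debugging the code shown below. I have already asked a similar version of this question, but I still haven't been able to develop a working script. My input file looks like this: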

LINE1
AAAAAAAAAAAAAAA
LINE2
BBBBBBBBBBBBBBB
LINE3
CCCCCCCCCCCCCCC
LINE4
DDDDDDDDDDDDDDD

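I would like the script to randomly shuffle the lines in the file, keeping each header together with its sequence, for example: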

LINE2
BBBBBBBBBBBBBBB
LINE1
AAAAAAAAAAAAAAA
LINE4
DDDDDDDDDDDDDDD
LINE3
CCCCCCCCCCCCCCC

There are quite a lot of lines in the file (~1,000,000). Currently, I get the following errors:

Global symbol "$header_size" requires explicit package name at fasta_corrector9.pl line 40.

and

Global symbol "$header_size" requires explicit package name at fasta_corrector9.pl line 47.

I don't understand how to give $header_size an explicit package name. I am not a programmer, so I need a very basic explanation. Thanks in advance.

#! /usr/bin/perl

use strict;
use warnings;

print "Please enter filename (without extension): ";
my $input = <>;
chomp($input);

print "Please enter total no. of sequence in fasta file: ";
my $orig_size = <> * 2 - 1;
chomp($orig_size);

open(INFILE, "$input.fasta") or die "Error opening input file for shuffling!";
open(SHUFFLED, ">" . "$input" . "_shuffled.fasta")
    or die "Error creating shuffled output file!";

my @array  = (0);    # Need to initialise 1st element in array1&2 for the shift function
my @array2 = (0);
my $i      = 1;
my $index  = 0;
my $index2 = 0;

while (my @line = <INFILE>) {
    while ($i <= $orig_size) {

        $array[$i] = $line[$index];
        $array[$i] =~ s/(.)\s/$1/seg;

        $index++;
        $array2[$i] = $line[$index];
        $array2[$i] =~ s/(.)\s/$1/seg;

        $i++;
        $index++;
    }
}

my $array  = shift(@array);
my $array2 = shift(@array2);
for $i (reverse 0 .. $header_size) {
    my $j = int rand($i + 1);
    next if $i == $j;
    @array[$i,  $j] = @array[$j,  $i];
    @array2[$i, $j] = @array2[$j, $i];
}

while ($index2 <= $header_size) {
    print SHUFFLED "$array[$index2]\n";
    print SHUFFLED "$array2[$index2]\n";
    $index2++;
}
close(INFILE);
close(SHUFFLED);

Answer 1

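The simplest way to do this with a file of that size is to use Tie::File, which gives you random access to the lines of the data file.

Using the O_RDWR mode prevents the file from being created if it does not exist.

In addition, the shuffle function from List::Util lets you randomly reorder the indices of the original file's records.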

use strict;
use warnings;

use Tie::File;
use Fcntl 'O_RDWR';
use List::Util 'shuffle';

tie my @source, 'Tie::File', $ARGV[0], mode => O_RDWR, autochomp => 0
    or die "Unable to open file '$ARGV[0]': $!";

for my $line (shuffle 1 .. @source/2) {
  printf "line %d\n", $line;
  print $source[$line * 2 - 1];
}

The program should be run as

perl shuffle.pl infile > outfile
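
If the whole file fits in memory (which seems plausible for about a million short lines, though that is an assumption), a variant without Tie::File could read the lines, group them into header/sequence pairs, and shuffle the pairs. This is only a minimal sketch: the helper name shuffle_records is hypothetical, and it assumes the file strictly alternates one header line with one sequence line. Unlike the code above, it prints the original header lines rather than generated "line N" labels.

```perl
use strict;
use warnings;
use List::Util 'shuffle';

# Group a flat list of lines into [header, sequence] pairs and
# return those pairs in random order.
# Assumption: the input strictly alternates header and sequence lines.
sub shuffle_records {
    my @lines = @_;
    my @records;
    while (@lines) {
        push @records, [ splice @lines, 0, 2 ];
    }
    return shuffle @records;
}

# Only touch a file when a filename is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0]
        or die "Unable to open file '$ARGV[0]': $!";
    my @lines = <$in>;
    close $in;

    # Each record is an array reference holding the two original lines,
    # newlines included, so they can be printed as they are.
    print @$_ for shuffle_records(@lines);
}
```

It would be run the same way, with the input file as the first argument and the output redirected to a new file.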

Answer 2

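In short, you used $header_size in your code without ever telling Perl what $header_size is. This is exactly why use strict; is so highly recommended: without it, the variable would silently be treated as an undefined value (0 in numeric context) and the mistake would go unnoticed.

perldoc perldiag is helpful for understanding messages like this one:

Global symbol "%s" requires explicit package name

(F) You've said "use strict" or "use strict vars", which indicates that all variables must either be lexically scoped (using "my" or "state"), declared beforehand using "our", or explicitly qualified to say which package the global variable is in (using "::").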
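
The three options named in that diagnostic can be illustrated with a tiny script. The variable names here are made up for the illustration and have nothing to do with the script in the question.

```perl
use strict;
use warnings;

my $lexical = 1;         # lexically scoped, declared with "my"
our $declared = 2;       # package variable, declared beforehand with "our"
$main::qualified = 3;    # package variable, explicitly qualified with "::"

# All three forms compile under "use strict".
print $lexical + $declared + $main::qualified, "\n";
```

For everyday scripts, the first form (my) is almost always the one you want.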

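Applied to the problem at hand, $header_size has never been declared, let alone given a value. What you need to do is declare it before it is first used, with my $header_size = $some_value;, or, if you really want to leave it undefined, just my $header_size;.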
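
As a minimal sketch of that fix, the shuffle loop from the question could be written as follows. It assumes that $header_size was meant to be the last valid index of @array; the question does not say so, and the four sample values are only placeholders.

```perl
use strict;
use warnings;

# Placeholder data standing in for the header lines read from the file.
my @array = ('LINE1', 'LINE2', 'LINE3', 'LINE4');

# Declaring the variable with "my" is what satisfies "use strict".
# Assumption: $header_size is the last valid index of @array.
my $header_size = $#array;

# Fisher-Yates shuffle, same loop shape as in the question.
for my $i (reverse 0 .. $header_size) {
    my $j = int rand($i + 1);
    next if $i == $j;
    @array[$i, $j] = @array[$j, $i];
}

print "$_\n" for @array;
```

Note that the loop variable is also declared with my here, which keeps it local to the loop.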

Answer 3

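Judging by the name of your script (fasta_corrector9.pl) and the format of your file, I assume you are working with FASTA sequences. If so, I think you should really get to know the Bio namespace on CPAN. The whole point of having these open format specifications is that people write tools to manipulate the formats and make them freely available to you. In this case, you should strongly consider using Bio::DB::Fasta to access the FASTA file as structured data.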

use strict;
use warnings;
use Bio::DB::Fasta;    # part of BioPerl, installed from CPAN

my $stream = Bio::DB::Fasta->new('/path/to/files')->get_PrimarySeq_stream;
while (my $seq = $stream->next_seq) {
    # now you are streaming through your FASTA sequences in order.
    # You can accomplish shuffling with O(1) space complexity in this loop.
}

Difficulty with global symbols and explicit package names
