Question

My script is meant to read files in the following format:

fixedStep chrom=chr1 start=3 step=1
0.006
0.010
fixedStep chrom=chr1 start=9 step=1
0.002
0.004
0.005
fixedStep chrom=chr1 start=14 step=1
0.010
0.020
0.028
0.666
0.777
fixedStep chrom=chr1 start=22 step=1
0.005
0.009
0.012
0.555

The script works on short "practice files" like this one. Its output looks like this:

.....
.....
.....
0.006
0.010
.....
.....
.....
.....
0.002
0.004
0.005
.....
.....
0.010
0.020
0.028
0.666
0.777
.....
.....
.....
0.005
0.009
0.012
0.555
.....
.....
.....
.....
.....

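So what the script does is list, in a single column, two kinds of items derived from the original file. The first kind is all of the four-digit decimal numbers. The second kind is a variable number of ..... lines, each of which stands for a "missing" four-digit number. The number of ..... lines that appear before and after each run of consecutive decimal numbers is calculated from the information in the lines beginning with fixedStep.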

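The ultimate goal of the script is to convert a large version of the practice file shown here into a large version of this output. But as I said, my solution is slow. Any ideas for improving it? I have already written another script that reads the output, and that script expects exactly the format I just described.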

Here is the script:

#!/usr/bin/perl

use strict; use warnings;

unless(@ARGV) {
    exit;
}

my $chrpc = shift;
open( PHAST, "<$chrpc" );

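This just opens the file. Next, I slurp the original file. I know this is slow, but it is where my path to a solution started. I suspect it is the single biggest thing slowing the script down. I admit the script gets a bit convoluted later on and could be "cleaned up", hopefully with an effect on performance and not just aesthetics.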

my @wholething = ();
while ( <PHAST> ) {
    my $line = $_;
    chomp $line;
    push( @wholething, $line );
}

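Next, I start reorganizing the data. I also add a few things, such as commas or the word "end", which I expect to use in later steps to help split things apart or join them together. First, I create the container @chunked and push the first line of the file and a comma into it.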

my @chunked = ();

push ( @chunked, $wholething[ 0 ], ",");

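Then I loop through @wholething. Each line containing a decimal number is pushed into @chunked followed by a comma. Each line containing fixedStep is pushed followed by a comma and "end", and then pushed a second time followed by a comma.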

for ( my $i = 1; $i < scalar @wholething ; $i++ ) {
       if ( $wholething[ $i ]=~m/fixedStep/ ){
       chomp $wholething[ $i ];
       push ( @chunked, $wholething[ $i ],",", "end\n", $wholething[ $i ], ","  ); 
  }    

  else {
      chomp $wholething[ $i ];
      push ( @chunked, $wholething[ $i ], "," );
  }
}

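In the end, what we have is a "chunked" file in which each run of consecutive decimal numbers is bracketed by adjacent fixedStep lines, except for the last chunk, whose run of decimal numbers is bracketed only by the final fixedStep line that precedes it. With the original file chunked like this, I can use the information in the flanking lines to decide how many ..... lines to add to represent "missing" information. For the last chunk, I enter a value by hand to help make that decision. But for now, I join @chunked into one giant string and then split it at every occurrence of "end".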

my $bigstring = join ( "", @chunked );
my @chunked_array = split ( "end" , $bigstring );
#print "@chunked_array\n\n";

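Now that the file is reorganized, I start building the new file. I create a container @pc_array and define $last as some value. Recall that in the chunked form, every run of decimal numbers except the last is bracketed by adjacent fixedStep lines. The value given to $last is used to help bracket the end of the last chunk. Here, this number is huge. In case it matters, the value is the final position of the chromosome sequence. Every line of the output corresponds to a base position in the chromosome (which is why the file is large). For the practice file, set $last to a much smaller number.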

my @pc_array = ();
my $count = 1;
my $last = 61342429;  ## enter here value of final position for given chr.

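A for loop goes through each chunk and determines how many ..... lines to add between chunks. On the first pass through the loop, I calculate how many ..... lines to add to the array before the first decimal number. On the last pass, I use $last to help determine how many ..... lines to add at the end. For the rest, I push the decimal numbers into the array, followed by the appropriate number of ..... lines. I also generate some sanity checks in the output to make sure things are working. I will remove those at the end to produce the final form of the output.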

for ( my $i = 0; $i < scalar @chunked_array  ; $i++ ) { ## $i = chunk number

      my @lines = split ( "," , $chunked_array[ $i ]);

      my $distance = scalar @lines - 2 ; ## gives number of pc score lines 
      ## notice extra comma in @entries. 


      my ( $position_1, $position_2 ) = ($chunked_array[ $i ] =~ /start\=(\d+)/g); 
      my $post_fill = $position_2 - ( $position_1 + $distance ) ;

      if ( $i == 0 ){ ## when first chunk

           push ( @pc_array, 0, 0, ".....\n" );

           for ( my $j = 0; $j < $position_1 - 1 ; $j++ ){

                 ## fill in 'pre-missing' scores with .'s

             push ( @pc_array, $i, $count, ".....\n" ); 
             $count++;
       } 

        ## fill in pc scores
        for( my $j = 0; $j < $distance; $j++ ){

             push( @pc_array, $i, $count, "$lines[ 1 + $j ]\n" ); 

             $count++;
         }

         ## fill in post-missing pc scores with .'s
         for ( my $j = 0; $j < $post_fill  ; $j++ ){
               push ( @pc_array, $i, $count, ".....\n" ); 
               $count++;
         } 

  } 


  elsif ( $chunked_array[ $i ] eq $chunked_array[ -1 ] ) {
          ## when last chunk

          ## fill in pc scores
          for( my $j = 0; $j < $distance; $j++ ){

               push( @pc_array, $i, $count, "$lines[ 1 + $j ]\n" ); 

               $count++;
          }

          my $final_post_fill = $last - ( $position_1 + $distance ); 

          ## fill is post-missing pc scores with .'s
          for ( my $j = 0; $j < $final_post_fill + 1  ; $j++ ){
               push ( @pc_array, $i, $count, ".....\n" ); 
               $count++;
         }



  }



  else { ## when first or else not the last chunk

        ## fill is pc scores
        for ( my $j = 0; $j < $distance; $j++ ){

             push( @pc_array, $i, $count, "$lines[ 1 + $j ]\n" ); 

             $count++;
         }

         ## fill is post-missing pc scores with .'s
         for ( my $j = 0; $j < $post_fill  ; $j++ ){
               push ( @pc_array, $i, $count, ".....\n" ); 
               $count++;
         } 

   }

}

I take a look at the array. The first line of the output is blank; there is an extra space at the beginning.

print @pc_array;

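I do the following to remove that space, but mainly to remove the sanity checks from the output, so that I end up with the final form of the output I need.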

my @pc_col =();

for ( my $i = 2; $i < @pc_array; $i=$i+3 ) {
      chomp $pc_array[ $i ];
      print "$pc_array[ $i ]\n";
      push ( @pc_col, $pc_array[ $i ]."\n");
}

print @pc_col;
open( OUT, ">chr19_pc_col.txt");
print OUT @pc_col;

Like I said, the script works, but I could use some pointers on optimizing it.

Answer 1

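You have gotten yourself very tangled up.

As far as I can tell, this program does what you need. I have assumed that the step attribute is always 1, or at least can be ignored, and that the chrom field is likewise irrelevant.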

use strict;
use warnings;

open my $out, '>', 'chr19_pc_col.txt' or die $!;

my $last = 30;

my $line = 0;
while (<>) {
  if (/^fixedStep.*start=(\d+)/) {
    my $start = $1;
    while ($line < $start) {
      print $out ".....\n";
      ++$line;
    }
  }
  else {
    print $out $_;
    ++$line;
  }
}

print $out ".....\n" for $line .. $last;

close $out or die $!;

Output

.....
.....
.....
0.006
0.010
.....
.....
.....
.....
0.002
0.004
0.005
.....
.....
0.010
0.020
0.028
0.666
0.777
.....
.....
.....
0.005
0.009
0.012
0.555
.....
.....
.....
.....
.....
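
To check this approach against the practice data without writing a file, the same loop can be wrapped in a subroutine that reads from and writes to any filehandle. This is only a sketch: the name fill_missing and its arguments are hypothetical and not part of the answer above, and it keeps the same assumption that step is always 1.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer above): streams a fixedStep
# file from $in to $out, writing "....." for every position without a score.
# Assumes step=1 and a single chromosome, as the answer does.
sub fill_missing {
    my ($in, $out, $last) = @_;
    my $line = 0;                       # number of output lines written so far
    while (my $row = <$in>) {
        if ($row =~ /^fixedStep.*start=(\d+)/) {
            my $start = $1;
            # pad up to the start of this block
            while ($line < $start) {
                print {$out} ".....\n";
                ++$line;
            }
        }
        else {
            print {$out} $row;          # score line, copied through unchanged
            ++$line;
        }
    }
    # pad from the last score through position $last
    print {$out} ".....\n" for $line .. $last;
    return;
}

# Usage on in-memory data: the first block of the practice file, with $last = 7
my $input = "fixedStep chrom=chr1 start=3 step=1\n0.006\n0.010\n";
open my $in,  '<', \$input     or die $!;
open my $out, '>', \my $result or die $!;
fill_missing($in, $out, 7);
close $out or die $!;
print $result;
```

Because only one line is held in memory at a time, memory use should stay flat regardless of how large the input file is.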

Answer 2

Slurping can indeed cause performance problems with large files.

I won't do the whole thing for you, but a pattern like this might help you get started:

#buffer, holds a few lines of the input file
my @chunk_lines = ();

#read line-by-line until end of file
while (!eof $fh) {
    my $line = readline $fh;
    if ($line =~ /^fixedStep/) {      #if this line is the start of a new chunk...
        process_chunk(@chunk_lines);  #process data
        @chunk_lines = ();            #clear buffer
    }

    #either way, push this line onto the buffer
    push @chunk_lines, $line;
}

#process any remaining buffer
process_chunk(@chunk_lines);

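This works well if you can process each chunk on its own. All those places where you push a bunch of values into an array and then split it apart again for processing? That is where you can optimize.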

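If passing an empty @chunk_lines to process_chunk is a problem (this happens when the first fixedStep line is read, because the buffer is still empty), you can simply avoid it: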

process_chunk(@chunk_lines) if @chunk_lines;
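
As one way to fill in the pattern above, process_chunk could parse the start value from the chunk's header line, pad the gap since the previous chunk, and return the scores. This is a minimal sketch, assuming step is always 1. The $position counter and the body of process_chunk are illustrative and not part of the original answer, and the final padding up to the chromosome's last position is left out.

```perl
use strict;
use warnings;

my $position = 0;   # assumed state: number of output lines produced so far

# Hypothetical body for process_chunk: the first element is the
# fixedStep header line, the rest are score lines.
sub process_chunk {
    my ($header, @scores) = @_;
    my ($start) = $header =~ /start=(\d+)/
        or die "no start= in header: $header";
    my $out = '';
    # pad the gap between the previous chunk and this one
    if ($start > $position) {
        $out .= ".....\n" x ($start - $position);
        $position = $start;
    }
    $out .= join '', @scores;   # score lines still carry their newlines
    $position += @scores;       # array in numeric context: number of scores
    return $out;
}

# Usage: the first two headers of the practice file, with shortened score lists
print process_chunk("fixedStep chrom=chr1 start=3 step=1\n", "0.006\n", "0.010\n");
print process_chunk("fixedStep chrom=chr1 start=9 step=1\n", "0.002\n");
```

Returning a string per chunk keeps at most one chunk in memory; the caller can print it to the output filehandle straight away.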

Perl script works but is too slow
