Question

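I have many lines of text data representing events that occurred on different dates. Each date is associated with roughly 500 events. Each event needs to be evaluated in the context of the other events that occurred on that date, and only those events. Since reading all the data into one array and then breaking it into smaller arrays is not feasible memory-wise, I want to use the recommended while-loop approach.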

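What I want to do is: 1) fill an array with lines until the next line shows a different date; 2) process the array and clear it; 3) keep filling it with lines until the next date comes up, and so on.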

So far I am using the following code, but it seems to repeat too many things to be the most idiomatic solution:

my @chunk;
my $current;

while ( <FILEHANDLE> ) {
    my $date_of_this_line = ( split /\t/ )[0];
    unless ( defined $current and $current eq $date_of_this_line ) {
        do { &process(@chunk); undef @chunk } if @chunk;
        $current = $date_of_this_line;
    }
    push @chunk, $_;
}
do { &process(@chunk); undef @chunk } if @chunk;

Any ideas for a better way to handle this kind of problem? I ask because I'm sure I'm not the first person to do this!

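EDIT: I think I've got it! With help from the comments by ysth and FM (below), I was able to reduce the solution to a few lines of code with no repeated statements. The trade-off is that I have to declare one more lexical variable before entering the while loop.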

my @chunk;
my $current = 1;
my $date_of_line = 1;

while ( $date_of_line or @chunk ) {
    $date_of_line = defined( $_ = <$FILEHANDLE> ) ? ( split /\t/ )[0] : 0 and chomp;
    #the reason for 'and chomp'? chomp throws an error if $_ = <$FILEHANDLE> is not defined

    unless ( $current eq $date_of_line ) {
        process( splice( @chunk ) ) if @chunk;
        #thanks to ysth for pointing out how to process and clear @chunk in one stroke!
        $current = $date_of_line;
    }

    push @chunk, $_ if $date_of_line;
}

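Not bad, eh? If I define the subroutine 'process' as a handy little test, it confirms that the results are what we want (that is... until I add more data and it messes things up for me ;)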

sub process {
    my @batch = @_;
    my $size = @batch;
    print "size is $size\n"; #simply tells me I'm getting the right size chunks;

    # '+{' forces an anonymous hash, so keys counts the distinct dates
    my $dates = keys %{ +{ map { ( split /\t/ )[0] => undef } @batch } };
    print "number of different dates in batch: $dates\n"; #should only be 1
}

Answer 1

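No, that is pretty much how it is usually done. You can use &process( splice(@chunk) ) to pass and clear the array in one go. One possible variation on the loop: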

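my ( $current, $eof );
while ( !$eof || @chunk ) {
    # keep reading until end of file; once $eof is true, no further reads happen
    $eof ||= !defined( $_ = <FILEHANDLE> );
    my $date_of_this_line = $eof ? undef : ( split /\t/ )[0];
    # flush at end of file, on the first line, or whenever the date changes
    if ( $eof || !defined($current) || $current ne $date_of_this_line ) {
        &process( splice(@chunk) ) if @chunk;
        $current = $date_of_this_line;
    }
    push @chunk, $_ unless $eof;
}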

But that is a bit messy.
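The splice idiom mentioned in this answer can be tried in isolation. The following is only a minimal sketch: the process sub here is a hypothetical stand-in that returns the batch size, and the two sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Two sample lines sharing one date (illustrative data only).
my @chunk = ( "2011-04-19\tblabla\n", "2011-04-19\tblablub\n" );

# Hypothetical stand-in for the real process(): it only reports the batch size.
sub process {
    my @batch = @_;
    return scalar @batch;
}

# splice with only the array argument removes and returns every element,
# so the sub receives the whole batch and @chunk is left empty.
my $handled   = process( splice(@chunk) );
my $remaining = scalar @chunk;

print "handled $handled lines, $remaining left in \@chunk\n";
```

Because the array is emptied as a side effect of building the argument list, no separate undef @chunk statement is needed.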

Answer 2

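Here is a more verbose solution that illustrates the idea of a "chunk and commit" subroutine, which you can tailor to your own requirements so that it keeps state and executes a callback.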

use strict;
use warnings;

sub make_chunk_proc (&) {
    my( $callback ) = @_;
    my $grouping_key = ''; # start empty
    my @queue;
    return sub {
        if ( @_ ) { # add arguments to current chunk
            my $key = shift;
            return if $grouping_key and $key ne $grouping_key;
            $grouping_key = $key;
            push @queue, [ $key, @_ ];
            return 1;
        }
        else { # commit current chunk and reset state
            $callback->( \@queue );
            $grouping_key = '';
            @queue = ();
        }
    };
}

# ==== main ====

my $chunker = make_chunk_proc {
    my( $queue ) = @_;
    print "@$_\n" for @$queue;
    print '-' x 70, "\n";
};

while ( <> ) {
    chomp;
    my( $key, @rest ) = split /\t/;
    $chunker->( $key, @rest ) or do {
        $chunker->();
        $chunker->( $key, @rest );
    }
}
$chunker->(); # commit remaining stuff

With tab-separated data like this:

2011-04-19  blabla
2011-04-19  blablub
2011-04-20  super
2011-04-20  total super
2011-04-21  weiter
2011-04-22  immer weiter
2011-04-24  immer weiter weiter

The result looks like this:

$ perl chunks.pl < chunks.txt
2011-04-19 blabla
2011-04-19 blablub
----------------------------------------------------------------------
2011-04-20 super
2011-04-20 total super
----------------------------------------------------------------------
2011-04-21 weiter
----------------------------------------------------------------------
2011-04-22 immer weiter
----------------------------------------------------------------------
2011-04-24 immer weiter weiter
----------------------------------------------------------------------

Answer 3

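A few ideas:

(1) Initialize the previous date to some value (for example, the empty string), so you don't have to check whether it is defined inside the loop.

(2) Modify process() so that it simply returns when @chunk is empty.

(3) If @chunk is global, process() can reset it to empty.

(4) Other things worth considering: (a) use a lexical filehandle, (b) don't call process with a leading ampersand.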
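Ideas (1), (2) and (4) above might be combined roughly as follows. This is a sketch under assumptions, not the answerer's code: the in-memory sample data, the @sizes bookkeeping array and the body of process are all hypothetical, and splice is used to empty @chunk instead of the global reset from idea (3).

```perl
use strict;
use warnings;

# Sample input held in memory so the sketch runs on its own; a real script
# would open a file instead. Fields are tab-separated, date first.
my $data = join '',
    "2011-04-19\tblabla\n",
    "2011-04-19\tblablub\n",
    "2011-04-20\tsuper\n";

my @sizes;    # records each batch size so the result is easy to check

# Hypothetical process(): returns at once for an empty batch (idea 2).
sub process {
    my @batch = @_;
    return unless @batch;
    push @sizes, scalar @batch;
    return;
}

# Lexical filehandle instead of a bareword one (idea 4a).
open my $fh, '<', \$data or die "Cannot open in-memory data: $!";

my @chunk;
my $current = '';    # idea 1: never undefined, so no defined() test below

while ( my $line = <$fh> ) {
    my $date = ( split /\t/, $line )[0];
    if ( $date ne $current ) {
        process( splice @chunk );    # idea 4b: no leading ampersand
        $current = $date;
    }
    push @chunk, $line;
}
process( splice @chunk );    # flush the final date's lines
close $fh;

print "batch sizes: @sizes\n";
```

With the empty-batch check moved into process, the "if @chunk" guards disappear from the loop, although the final flush after the loop is still a second call.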

Process input from a "while" loop in chunks, based on information in each string?
