Question

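I am looking for a way to read from a filehandle line by line (and then run a function on each line), with the following twist: what I want to treat as a "line" should be terminated by any of several characters, not just by the single string I set in $/. I know that $INPUT_RECORD_SEPARATOR, or $/, supports neither regular expressions nor a list of characters as line terminators, and that is where my problem lies.

My filehandle is the standard output of a process. So I cannot seek within the filehandle, and the full content is not available at once; it is produced bit by bit as the process runs. I want to be able to do things like prepend a timestamp to each "line", which I do with a function that I call handler in my examples. Each line should be processed as soon as the program produces it.

Unfortunately, I can only come up with two approaches: one that calls the handler function immediately but looks very inefficient, and one that uses a buffer but leads to "grouped" calls of handler and therefore, for example, to wrong timestamps.

In fact, in my specific case the regular expression would be very simple: just /\n|\r/. So for this particular problem I do not even need full regex support, only the ability to treat more than one character as a line terminator. But $/ does not support that.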
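
To make the limitation concrete, here is a minimal sketch showing that $/ is taken as a literal string and not as a pattern. The helper name read_records and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: read all records from an in-memory string using the
# given value of $/ and return them as a list.
sub read_records {
    my ($data, $separator) = @_;
    local $/ = $separator;
    open(my $fh, '<', \$data) or die "cannot open in-memory handle: $!";
    my @records = <$fh>;
    close($fh);
    return @records;
}

my $data = "one\rtwo\nthree\n";

# With "\n" as the separator, "one\rtwo\n" stays together as one record.
my @by_newline = read_records($data, "\n");

# The single-quoted '\n|\r' is not a regex here. It is a literal six-character
# string that never occurs in $data, so everything comes back as one record.
my @by_pattern = read_records($data, '\n|\r');

print scalar(@by_newline), " records with a newline separator, ",
      scalar(@by_pattern), " record with the attempted pattern\n";
```

Neither setting splits at the "\r", which is the behaviour the question is trying to work around.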

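Is there an efficient way to solve this problem in Perl?

Here is some quick pseudo-Perl code to demonstrate my two approaches:

Reading the input filehandle byte by byte

This would look like this: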

my $acc = "";
# read one byte at a time so that handler() can run as soon as a line is complete
while (read($fd, my $byte, 1)) {
    $acc .= $byte;
    if ($acc =~ /someregex$/) {
        handler($acc);
        $acc = "";
    }
}

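The advantage here is that handler is dispatched as soon as enough bytes have been read. The disadvantage is that we perform a string append and a regex check for every single byte we read from $fd.

Reading the input filehandle in blocks of X bytes at a time

This would look like this: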

my $acc = "";
while (read($fd, my $buf, $bufsize)) {
    if ($buf =~ /someregex/) {
        # a limit of -1 keeps trailing empty fields, so a match always
        # yields at least two parts
        my @parts = split /someregex/, $buf, -1;
        my $first = shift @parts;
        handler($acc . $first);
        my $last = pop @parts;
        foreach my $part (@parts) {
            handler($part);
        }
        $acc = $last;
    } else {
        # no terminator in this chunk: keep accumulating
        $acc .= $buf;
    }
}

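The advantage here is that we are more efficient, because we only run the regex once per $bufsize bytes. The disadvantage is that the execution of handler has to wait until $bufsize bytes have been read.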

Answer 1

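Setting $INPUT_RECORD_SEPARATOR to a regex would not help anyway, because Perl's readline uses buffered I/O, too. The trick is to use your second approach, but with the unbuffered sysread instead of read. If you sysread from a pipe, the call returns as soon as data is available, even if the whole buffer could not be filled (at least on Unix).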
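
A small sketch of that behaviour, assuming a Unix-like system: the child process below is a made-up stand-in for the real program, and it pauses between two writes. With sysread, the parent would typically receive the two pieces as separate chunks, although the exact chunking depends on timing and is not guaranteed.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Assumption: a Unix-like system. The child is a hypothetical stand-in for the
# real process; it writes two pieces of output with a pause in between.
my $child_code = q{$| = 1; print "first\r"; select(undef, undef, undef, 0.3); print "second\n";};

open(my $fh, '-|', $^X, '-e', $child_code) or die "cannot start child: $!";

my @chunks;
while (1) {
    my $ret = sysread($fh, my $buf, 4096);
    die "sysread failed: $!" unless defined $ret;
    last if $ret == 0;                 # end of file
    push @chunks, [time(), $buf];      # remember when each chunk arrived
}
close($fh);

for my $chunk (@chunks) {
    my ($when, $data) = @$chunk;
    # make the terminators visible in the printed output
    (my $visible = $data) =~ s/\r/\\r/g;
    $visible =~ s/\n/\\n/g;
    printf "%.3f %s\n", $when, $visible;
}
```

The buffer size of 4096 is arbitrary here; it only caps how much a single sysread call may return.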

Answer 2

nwellnhof's suggestion allowed me to implement a solution to this problem:

my $acc = "";
while (1) {
    my $ret = sysread($fh, my $buf, 1000);
    if (!defined $ret) {
        # sysread returns undef on error
        die "sysread failed: $!";
    }
    if ($ret == 0) {
        # end of file
        last;
    }
    # we split with a capturing group so that we also retain which line
    # terminator was used
    # a negative limit is used to also produce trailing empty fields if
    # required
    my @parts = split /(\r|\n)/, $buf, -1;
    my $numparts = scalar @parts;
    if ($numparts == 1) {
        # line terminator was not found
        $acc .= $buf;
    } elsif ($numparts >= 3) {
        # first match needs special treatment as it needs to be
        # concatenated with $acc
        my $first = shift @parts;
        my $term = shift @parts;
        handler($acc . $first . $term);
        my $last = pop @parts;
        for (my $i = 0; $i < $numparts - 3; $i+=2) {
            handler($parts[$i] . $parts[$i+1]);
        }
        # the last part is put into the accumulator. This might
        # just be the empty string if $buf ended in a line
        # terminator
        $acc = $last;
    }
}
# if the output didn't end with a linebreak, handle the rest
if ($acc ne "") {
    handler($acc);
}

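My tests show that sysread does return before 1000 characters have been read if there is a pause in the input stream. The code above takes care of concatenating multiple messages of length 1000, and of correctly splitting messages that are shorter or that contain multiple terminators.

If you spot any errors in the code above, please shout.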
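
For comparison, here is a more compact variation of the same idea. It is a sketch rather than the code from this answer: the name process_records and the sample data are made up, and a pipe (assuming a Unix-like system) stands in for the process output. Instead of splitting each chunk, it appends the chunk to the accumulator and strips complete records from the front.

```perl
use strict;
use warnings;

# Hypothetical helper: read $fh with sysread and call $handler once for every
# record that ends in "\r" or "\n" (terminator included).
sub process_records {
    my ($fh, $handler) = @_;
    my $acc = "";
    while (1) {
        my $ret = sysread($fh, my $buf, 1000);
        die "sysread failed: $!" unless defined $ret;
        last if $ret == 0;             # end of file
        $acc .= $buf;
        # Remove complete records from the front of the accumulator.
        while ($acc =~ s/\A([^\r\n]*[\r\n])//) {
            my $record = $1;           # copy before calling the handler
            $handler->($record);
        }
    }
    # Output that did not end with a terminator.
    $handler->($acc) if $acc ne "";
    return;
}

# Usage with a pipe standing in for the process output.
pipe(my $reader, my $writer) or die "pipe failed: $!";
syswrite($writer, "10%\r50%\r100%\ndone");
close($writer);

my @seen;
process_records($reader, sub { push @seen, [time(), $_[0]] });
close($reader);
```

The trade-off is that the substitution rewrites the accumulator for every record, which is simpler to read but may be slower than the split-based version when a chunk contains many short records.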
