Question

I have this script, which combines two texts based on the parts they have in common.

use warnings;
use strict;
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");

my $f1 = 'input.txt';
my $f2 = 'add.txt';
my $f3 = 'output.txt';

my %ids;
my $fh;

open $fh, '<', $f2 or die "Can't read the file with replacements: $!";
while (<$fh>) {
    chomp;
    my ($name, $id) = split /=/;
    $ids{$name} = $id;
}
close $fh;

open my $fho, '>', $f3 or die "Can't write output file: $!";
open $fh, '<', $f1 or die "Can't read input file: $!";
while (<$fh>) {
    for my $name (keys %ids) {
        s/$name/${name} $ids{$name}/;
    }
    print $fho $_;
}

close $fh;
close $fho;

For example:

input.txt - a "text stream" with no special structure

random text random text, TARGET TEXT 1 — random
textTARGET TEXT 2! random text random text
random text random text random text
TARGET TEXT 3 random text random text TARGET TEXT 4 random text

add.txt - the list of text to add

TARGET TEXT 1=ADDITIONAL TEXT 1
TARGET TEXT 2=ADDITIONAL TEXT 2
TARGET TEXT 3=ADDITIONAL TEXT 3
TARGET TEXT 4=ADDITIONAL TEXT 4

output.txt will be:

random text random text, TARGET TEXT 1 ADDITIONAL TEXT 1 — random
textTARGET TEXT 2 ADDITIONAL TEXT 2! random text random text
random text random text random text
TARGET TEXT 3 ADDITIONAL TEXT 3 random text random text TARGET TEXT 4 ADDITIONAL TEXT 4 random text

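I have a fairly large text file to combine (~40 MB), and the script does its job extremely slowly. Is there a way to speed it up? Or maybe someone knows a tool that can do the same thing.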

Answer 1

Loops within loops are always suspect, especially when IO is involved.

while (<$fh>) {
    for my $name (keys %ids) {
        s/$name/${name} $ids{$name}/;
    }
    print $fho $_;
}

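The best performance improvement you can make here is to stop going line by line. Instead, read in the whole file and process it as a single piece of text. 40 MB isn't that much memory these days, and if you read the whole file as a single string you can do the whole thing in one go. That eliminates a lot of Perl and IO overhead.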

# Or use File::Slurp or Path::Tiny
my $text = do { local $/; <$fh> };

for my $name (keys %ids) {
    # The /g is important to replace all instances of each key
    $text =~ s/$name/${name} $ids{$name}/g;
}
print $fho $text;

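Clever buffering can make this more memory efficient. Rather than reading the whole file, you can use read() to read it in large chunks while making sure $text always ends on a newline. The general technique of reading a file in chunks deserves its own question, and probably already has an answer, so I'll leave that to you.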
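The chunked approach might look something like the sketch below. The sub name process_in_chunks, the 1 MB default chunk size, and the callback interface are all assumptions made for illustration. It reads blocks with read() and tops each block up to the next newline with readline, so no line is split across two chunks. This only helps if no key spans a line break.

```perl
use strict;
use warnings;

# Hypothetical helper: read $in in blocks of roughly $chunk_size characters,
# always ending a block on a line boundary, pass each block through the
# $transform callback, and print the result to $out.
sub process_in_chunks {
    my ($in, $out, $transform, $chunk_size) = @_;
    $chunk_size ||= 1024 * 1024;    # assumed default, tune as needed

    while (read($in, my $text, $chunk_size)) {
        # If the block stopped mid-line, append the rest of that line.
        if ($text !~ /\n\z/) {
            my $rest = <$in>;
            $text .= $rest if defined $rest;
        }
        print {$out} $transform->($text);
    }
    return;
}

# Demo with in-memory filehandles and a deliberately tiny chunk size.
my $input  = "aaa\nbbb bbb\nccc\n";
my $output = '';
open my $in,  '<', \$input  or die "Can't open in-memory input: $!";
open my $out, '>', \$output or die "Can't open in-memory output: $!";

process_in_chunks($in, $out, sub { my $t = shift; $t =~ s/bbb/BBB/g; $t }, 5);

close $in;
close $out;
print $output;
```

In the real script the callback would perform the key replacement, and $in and $out would be the handles opened on input.txt and output.txt.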

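The next improvement is to not loop over every key. Instead, combine all the keys into a single regex, get all the keys that match, and apply only those. Use Regexp::Assemble to do the combining.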

use List::Util qw(uniq);    # uniq() needs List::Util 1.45 or later
use Regexp::Assemble;

# add() treats each key as a regex pattern, so the keys are assumed
# to contain no regex metacharacters.
my $all_keys = Regexp::Assemble->new;
$all_keys->add( keys %ids );
my $all_keys_re = $all_keys->re;

# Get all the matched keys at once, the /g is important.
my @matches = $text =~ /($all_keys_re)/g;

# Replace all the matched keys. Use uniq to avoid doing the replacement twice.
for my $match (uniq @matches) {
    # Use /g to replace every copy of the key; \Q...\E keeps the match literal.
    $text =~ s/\Q$match\E/$match $ids{$match}/g;
}
print $fho $text;

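This will be a win if each file contains only a small percentage of the total possible keys. The single regex will be significantly faster because it uses a more efficient algorithm than brute-force rescanning of the text for each key. It also runs inside the regex engine, which is generally more efficient than Perl bytecode.

This can be made more efficient still by using the suggestion from the other answers and doing it all in one s///.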

my $text = do { local $/; <> };

$text =~ s{($all_keys_re)}{$1 $ids{$1}}g;

print $text;

Answer 2

Join your patterns (the keys) into one big regex:

/(a|b|c|d|...|zzz)/

Compile the big regex once, and use the captured group $1 as the key in the lookup.

s/$big_re/$1 . $addtext{$1}/ge;

(The /e flag makes the replacement an expression rather than text. You are writing $1 . $text, but you may want to do other things in the expression: call a function, lowercase it, add more formatting, and so on. See the documentation and look for the /e flag in the examples.)
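To make the role of the /e flag concrete, here is a small sketch. The sample %addtext data, the sub name annotate, and the bracketed lowercase formatting are all invented for illustration. The point is only that with /e the replacement side is ordinary Perl code, so it can call functions on the looked-up text.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the contents of add.txt.
my %addtext = (
    'TARGET TEXT 1' => 'ADDITIONAL TEXT 1',
    'TARGET TEXT 2' => 'ADDITIONAL TEXT 2',
);

# Build and compile the alternation once; quotemeta keeps the keys literal.
my $alternation = join '|', map { quotemeta } keys %addtext;
my $big_re      = qr/($alternation)/;

# Hypothetical formatter: because of /e the replacement is evaluated as
# Perl code, so it can call lc() and build the string with concatenation.
sub annotate {
    my ($text) = @_;
    $text =~ s/$big_re/$1 . ' [' . lc($addtext{$1}) . ']'/ge;
    return $text;
}

print annotate("see TARGET TEXT 1 and textTARGET TEXT 2!\n");
```

Without /e the same replacement would be treated as an interpolated string, and the dots and the lc(...) call would end up in the output literally.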

Answer 3

This is the bit you can speed up very easily:

for my $name (keys %ids) {
    s/$name/${name} $ids{$name}/;
}

Compile it into a single regex:

my $search = join "|", map {quotemeta} keys %ids; 
   $search = qr/\b($search)\b/;

Then, in the loop:

s/$search/$1 $ids{$1}/g;

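Note - I've added \b for word-boundary matching, because it makes it less likely that substrings and sort order will trip you up. Obviously, you don't have to.

But this means you no longer loop over a regex match per key on every line.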
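Putting the pieces together, the per-line logic could be sketched as below. The sub names build_search_re and add_text are assumptions, and sorting the keys longest first is an extra precaution that is not part of the code above: it stops a key that is a prefix of another key (such as TARGET TEXT 1 and TARGET TEXT 10) from matching first. The \b anchors are left out here because the sample input contains textTARGET TEXT 2, which has no word boundary before the key.

```perl
use strict;
use warnings;

# Assumed helper: compile all keys into a single alternation.
# Longest keys go first so a key that is a prefix of another cannot shadow it.
sub build_search_re {
    my ($ids) = @_;
    my $alternation = join '|',
        map  { quotemeta }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %$ids;
    return qr/($alternation)/;
}

# Assumed helper: append the mapped text after every key found in $line.
sub add_text {
    my ($line, $search, $ids) = @_;
    $line =~ s/$search/$1 $ids->{$1}/g;
    return $line;
}

# Made-up sample data with one key that is a prefix of the other.
my %ids = (
    'TARGET TEXT 1'  => 'ADDITIONAL TEXT 1',
    'TARGET TEXT 10' => 'ADDITIONAL TEXT 10',
);
my $search = build_search_re(\%ids);
print add_text("TARGET TEXT 10, then textTARGET TEXT 1!\n", $search, \%ids);
```

The regex is built once, before the file is read, so each line costs one pass of the regex engine regardless of how many keys add.txt contains.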

Answer 4

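While they seem plausible, the answers above assume that applying the replacement patterns does not depend on the order in which the patterns are defined in add.txt.

The original question should be more explicit so that it can be answered correctly.

For example, it should say whether input.txt can be changed only once, and so on.

What if one pattern in add.txt changes some line, and then another pattern in add.txt changes the text that was just changed?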
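The concern can be made concrete with a small sketch. The two rules below are invented for illustration: the text added for the first key contains the second key. Applying the rules one after another (as the loop in the question does, here with /g and a fixed order) rewrites text that was just inserted, while a single-pass alternation never rescans what it has inserted.

```perl
use strict;
use warnings;

# Invented rules: the text added for 'cat' contains the other key, 'dog'.
my @rules = ( [ 'cat' => 'likes dog' ], [ 'dog' => 'barks' ] );
my %ids   = map { $_->[0] => $_->[1] } @rules;

# Sequential application in a fixed order. The question iterates over
# keys %ids instead, whose order is not predictable.
sub apply_sequentially {
    my ($text) = @_;
    for my $rule (@rules) {
        my ($name, $add) = @$rule;
        $text =~ s/\Q$name\E/$name $add/g;
    }
    return $text;
}

# Single pass over the text: inserted text is never matched again.
sub apply_single_pass {
    my ($text) = @_;
    my $alternation = join '|', map { quotemeta } keys %ids;
    $text =~ s/($alternation)/$1 $ids{$1}/g;
    return $text;
}

print apply_sequentially("a cat"), "\n";    # a cat likes dog barks
print apply_single_pass("a cat"),  "\n";    # a cat likes dog
```

Which of the two results is wanted depends on the requirements, which is why the question needs to state them.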

Is there a way to speed up this Perl script?
