正则表达式简单替换字典哈希(Perl)中的文档

时间:2014-08-02 17:41:25

标签: regex perl rex

我需要尽可能快地从大型文档中的哈希中查找和替换关键字。 我厌倦了以下两种方法,一种方法更快320%,但我确信我这样做是错误的,确定有更好的方法。

这个想法我只想替换字典哈希中存在的关键字并保留那些不存在的关键字,所以我知道它不在字典中。

以下两种方法都会扫描两次,以便按照我的想法进行查找和替换。我相信正面或反面的正则表达式可以更快地优化它。

#!/usr/bin/perl

use strict;
use warnings;

use Benchmark qw(:all);

my %dictionary = (
            pollack => "pollard",
            polynya => "polyoma",
            pomaces => "pomaded",
            pomades => "pomatum",
            practic => "praetor",
            prairie => "praised",
            praiser => "praises",
            prajnas => "praline",
            quakily => "quaking",
            qualify => "quality",
            quamash => "quangos",
            quantal => "quanted", 
            quantic => "quantum",
    );

my $content =qq{
        Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
        can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
        Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
        one kind of problem {qualify} {doesNotExist} end.
    };

# just duplicate content many times
$content .= $content;

cmpthese(100000, {
    replacer_1 => sub {my $text = replacer1($content)},
    replacer_2 => sub {my $text = replacer2($content)},
});

print replacer1($content) , "\n--------------------------\n";
print replacer2($content) , "\n--------------------------\n";
exit;

sub replacer1 {
    my ($content) = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub replacer2 {
    my ($content) = shift;
    my @names = $content =~ /\{(.+?)\}/g;
    foreach my $name (@names) {
        if (exists $dictionary{$name}) {
            $content =~ s/\{$name\}/\[$dictionary{$name}\]/;
        }
    }
    return $content;
}

以下是基准测试结果:

              Rate replacer_2 replacer_1
replacer_2  5565/s         --       -76%
replacer_1 23397/s       320%         --

3 个答案:

答案 0 :(得分:3)

这是一种更快,更紧凑的方式:

sub replacer3 {
    my ($content) = shift;
    $content =~ s#\{(.+?)\}#"[".($dictionary{$1} // $1)."]"#ge;
    return $content;
}

在Perl 5.8中,如果您的字典值都不是“假”,则可以使用||而不是//

使用已包含大括号和括号的字典还可以获得一点:

sub replacer5 {
    my ($content) = shift;
    our %dict2;
    if (!%dict2) {
        %dict2 = map { "{".$_."}" => "[".$dictionary{$_}."]" } keys %dictionary
    }
    $content =~ s#(\{.+?\})#$dict2{$1} || $1#ge;
    return $content;
}

基准测试结果:

              Rate replacer_2 replacer_1 replacer_3 replacer_5
replacer_2  2908/s         --       -79%       -83%       -84%
replacer_1 14059/s       383%         --       -20%       -25%
replacer_3 17513/s       502%        25%         --        -7%
replacer_5 18741/s       544%        33%         7%         --

答案 1 :(得分:3)

它有助于构建一个预先匹配任何哈希键的正则表达式。喜欢这个

my $pattern = join '|', sort {length $b <=> length $a } keys %dictionary;
$pattern = qr/$pattern/;

sub replacer4 {
    my ($string) = @_;
    $string =~ s# \{ ($pattern) \} #"[$dictionary{$1}]"#gex;
    $string;
}

带有这些结果

              Rate replacer_2 replacer_1 replacer_3 replacer_4
replacer_2  4883/s         --       -80%       -84%       -85%
replacer_1 24877/s       409%         --       -18%       -22%
replacer_3 30385/s       522%        22%         --        -4%
replacer_4 31792/s       551%        28%         5%         --

如果您可以使用哈希中的大括号和括号,而不是每次都必须添加它们,那么它也会有所改进。

答案 2 :(得分:2)

我建议为您的基准测试子程序使用有意义的名称,它会使输出和意图更清晰。

以下再现了鲍罗丁和暴徒尝试过的一些内容,然后将它们结合起来。

#!/usr/bin/perl

use strict;
use warnings;
use feature 'state';

use Benchmark qw(:all);

# Data separated by paragraph mode.
my %dictionary = split ' ', do {local $/ = ''; <DATA>};
my $content = do {local $/; <DATA>};

# Quadruple Content
$content = $content x 4;

cmpthese(100_000, {
    original        => sub { my $text = original($content) },
    build_list      => sub { my $text = build_list($content) },
    xor_regex       => sub { my $text = xor_regex($content) },
    list_and_xor    => sub { my $text = list_and_xor($content) },
});

exit;

sub original {
    my $content = shift;
    $content =~ s/\{(.+?)\}/exists $dictionary{$1} ? "[$dictionary{$1}]": "\{$1\}"/gex;
    return $content;
}

sub build_list {
    my $content = shift;
    state $list = join '|', map quotemeta, keys %dictionary;
    $content =~ s/\{($list)\}/[$dictionary{$1}]/gx;
    return $content;
}

sub xor_regex {
    my $content = shift;

    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{.+?\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

sub list_and_xor {
    my $content = shift;

    state $list = join '|', map quotemeta, keys %dictionary;
    state $with_brackets = {
        map {("{$_}" => "[$dictionary{$_}]")} keys %dictionary
    };

    $content =~ s{(\{(?:$list)\})}{$with_brackets->{$1} // $1}gex;

    return $content;
}

__DATA__
pollack pollard
polynya polyoma
pomaces pomaded
pomades pomatum
practic praetor
prairie praised
praiser praises
prajnas praline
quakily quaking
qualify quality
quamash quangos
quantal quanted 
quantic quantum

Start this is the text that contains the words to replace. {quantal} A computer {pollack} is a general {pomaces} purpose device {practic} that 
can be {quakily} programmed to carry out a set {quantic} of arithmetic or logical operations automatically {quamash}.
Since a {prajnas} sequence of operations can {praiser} be readily changed, the computer {pomades} can solve more than {prairie}
one kind of problem {qualify} {doesNotExist} end.

输出:

                Rate     original    xor_regex   build_list list_and_xor
original     19120/s           --         -23%         -24%         -29%
xor_regex    24938/s          30%           --          -1%          -8%
build_list   25253/s          32%           1%           --          -7%
list_and_xor 27027/s          41%           8%           7%           --

我的解决方案大量使用state变量来避免重新初始化静态数据结构。但是,也可以使用闭包或our $var; $var ||= VAL

关于提高正则表达式LHS的附录

实际上,编辑LHS以使用显式列表是关于改进正则表达式。而这一变化表明速度提高了30%。

这可能不是任何神奇的解决方案。您有一个值列表,您想要替换它们。它并不像是有一些神秘的方法来简化这个目标的语言。

如果词典哈希中不存在单词,您可以使用LHS中的代码块来失败并跳过。但是,以下显示这实际上比原始方法慢36%:

sub skip_fail {
    my $content = shift;

    $content =~ s{\{(.+?)\}(?(?{! $dictionary{$1}})(*SKIP)(*FAIL))}{[$dictionary{$1}]}gx;

    return $content;
}

输出:

                Rate   skip_fail    original   xor_regex build_list list_and_xor
skip_fail     6769/s          --        -36%        -46%       -49%         -53%
original     10562/s         56%          --        -16%       -21%         -27%
xor_regex    12544/s         85%         19%          --        -6%         -14%
build_list   13355/s         97%         26%          6%         --          -8%
list_and_xor 14537/s        115%         38%         16%         9%           --