Question

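My question is how to make my script fast (I work with large files).

My script below adds "bbb" between words when those words appear in another file that contains word sequences. For example, file2.txt contains: i eat big pizza .my big pizza ... and file1.txt (the sequences) contains: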

                          eat big pizza
                          big pizza

The result in Newfile:

i eatbbbbigbbbpizza.my bigbbbpizza ...

My script:

use strict;
use warnings;
use autodie;

open Newfile ,">./newfile.txt" or die "Cannot create Newfile.txt";
 my %replacement;
my ($f1, $f2) = ('file1.txt', 'file2.txt');

open(my $fh, $f1);
my @seq;
foreach (<$fh> )
{
  chomp;
  s/^\s+|\s+$//g;
  push @seq, $_;
}
close $fh;

@seq = sort bylen @seq;

open($fh, $f2);
foreach (<$fh> ) {
  foreach my $r (@seq) {

    my $t = $r;
    $t =~ s/\h+/bbb/g;

    s/$r/$t/g;
  }
  print Newfile ;
}
close $fh;
close Newfile ;
exit 0;

sub bylen {
   length($b) <=> length($a);
}

Answer 1

Instead of an array

my @seq;

define your words as a hash.

my %seq;

Instead of pushing the words

push @seq, $_;

store the words in the hash. Precompute the replacement there, which moves it out of the loop.

my $t = $_;
$t =~ s/\h+/bbb/g;
$seq{$_} = $t;

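Compute the word list once, before the outer loop. Sorting with the question's bylen keeps the longest-first order, which matters because "eat big pizza" has to be tried before "big pizza":

my @seq = sort bylen keys %seq;

Use a hash lookup in the inner loop to find the replacement: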

my $t = $seq{$r};

This may be somewhat faster, but don't expect too much.
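Put together, the steps above might look like the following minimal sketch. The helper names build_replacements and replace_sequences are made up for illustration, and \Q...\E is an addition that makes each sequence match as literal text rather than as a regular expression.

```perl
use strict;
use warnings;

# Map each trimmed sequence to its precomputed replacement
# ("big pizza" => "bigbbbpizza"). Duplicates collapse automatically.
sub build_replacements {
    my @lines = @_;
    my %seq;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;          # trim, as in the question
        next if $line eq '';              # skip blank lines
        (my $t = $line) =~ s/\h+/bbb/g;   # computed once per sequence
        $seq{$line} = $t;
    }
    return \%seq;
}

# Apply every replacement to one line, in the order given.
sub replace_sequences {
    my ($line, $seq, $order) = @_;
    for my $r (@$order) {
        my $t = $seq->{$r};               # hash lookup instead of recomputing
        $line =~ s/\Q$r\E/$t/g;
    }
    return $line;
}

my $seq   = build_replacements("  eat big pizza\n", "  big pizza\n");
my @order = sort { length($b) <=> length($a) } keys %$seq;   # longest first
print replace_sequences("i eat big pizza .my big pizza ...\n", $seq, \@order);
# prints: i eatbbbbigbbbpizza .my bigbbbpizza ...
```

The inner loop still runs once per sequence for every line, which is why the gain is likely to be modest.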

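In most cases it is better to shrink the problem by preparing the input in some way, which makes the solution easier. For example, grep -f is much faster than a Perl loop. Use grep to find the lines that need a replacement, then do the replacement with Perl or sed.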

Another approach is to work in parallel. You can split the input into n parts and run n processes in parallel on n CPUs. See the GNU parallel tutorial.
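If GNU parallel is not at hand, the same split-and-run idea can be sketched in plain Perl with fork. This is only an illustration under assumptions, not what the tutorial prescribes: chunk_ranges and process_in_parallel are hypothetical helpers, the ".partN" temporary files are an arbitrary choice, and $transform stands for whatever per-line replacement you use. The input is cut at line boundaries so that no line is shared between two workers.

```perl
use strict;
use warnings;
use POSIX ();

# Cut $file into at most $n byte ranges, each ending on a line boundary.
sub chunk_ranges {
    my ($file, $n) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    my $size   = -s $fh;
    my @starts = (0);
    for my $i (1 .. $n - 1) {
        seek($fh, int($size * $i / $n), 0) or die "seek failed: $!";
        my $rest = <$fh>;                  # finish the line we landed in
        my $pos  = tell($fh);
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;
    return map { [ $starts[$_], $_ < $#starts ? $starts[$_ + 1] : $size ] }
           0 .. $#starts;
}

# Run $transform over every line of $in with one child process per range,
# then join the partial outputs, in order, into $out.
sub process_in_parallel {
    my ($in, $out, $n, $transform) = @_;
    my @ranges = chunk_ranges($in, $n);
    my @pids;
    for my $i (0 .. $#ranges) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid) { push @pids, $pid; next; }

        # Child: handle only its own byte range, then leave immediately.
        my $ok = eval {
            my ($start, $end) = @{ $ranges[$i] };
            open(my $src, '<', $in)            or die "open $in: $!";
            open(my $dst, '>', "$out.part$i")  or die "open part $i: $!";
            seek($src, $start, 0)              or die "seek: $!";
            while (tell($src) < $end) {
                my $line = <$src>;
                last unless defined $line;
                print {$dst} $transform->($line);
            }
            close($dst) or die "close part $i: $!";
            1;
        };
        POSIX::_exit($ok ? 0 : 1);             # skip END blocks inherited from the parent
    }
    for my $pid (@pids) {
        waitpid($pid, 0);
        die "worker $pid failed" if $? != 0;
    }
    open(my $final, '>', $out) or die "Cannot create $out: $!";
    for my $i (0 .. $#ranges) {
        open(my $part, '<', "$out.part$i") or die "Cannot open part $i: $!";
        while (my $line = <$part>) { print {$final} $line; }
        close $part;
        unlink "$out.part$i";
    }
    close($final) or die "Cannot close $out: $!";
}

# Possible call, assuming replace_line() is your own per-line substitution:
# process_in_parallel('file2.txt', 'newfile.txt', 4, \&replace_line);
```

Whether this helps depends on where the time goes: if the run is limited by disk reads rather than by regex work, extra processes may gain little.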

Answer 2

How about a regular expression like this (note that this approach can raise security issues)?

use strict;
use warnings;

open(my $Newfile, '>', 'newfile.txt') or die "Cannot create newfile.txt: $!";
my ($f1, $f2) = qw(file1.txt file2.txt);

open(my $fh, '<', $f1) or die "Can't open $f1 for reading: $!";
my @seq = map { split ' ', $_ } <$fh>;
close $fh;
# An improvement would be to use a hash to avoid duplicates.

# The words are interpolated into the pattern unescaped (the security issue noted above).
my $regexp = '(' . join('|', @seq) . ')';

open($fh, '<', $f2) or die "Can't open $f2 for reading: $!";
# Read line by line so a large file is never held in memory all at once.
while (my $line = <$fh>) {
    $line =~ s/$regexp/$1bbb/g;
    print $Newfile $line;
}
close $fh;
close $Newfile;
exit 0;
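Because the script above splits every sequence into single words, it appends "bbb" after each listed word instead of joining whole sequences. A variant of the same single-regex idea that keeps the sequences intact might look like the sketch below. It rests on assumptions: compile_sequences is a hypothetical helper, quotemeta is added so that metacharacters in file1.txt match literally (which addresses the security concern), and the alternatives are sorted longest first because Perl takes the first alternative that matches at a given position.

```perl
use strict;
use warnings;

# Build one compiled alternation plus a lookup table of replacements.
sub compile_sequences {
    my @lines = @_;
    my %replacement;
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;               # trim, as in the question
        next if $line eq '';
        (my $joined = $line) =~ s/\h+/bbb/g;
        $replacement{$line} = $joined;         # the hash also removes duplicates
    }
    # With no sequences at all, return a pattern that can never match.
    return (qr/(?!)/, \%replacement) unless %replacement;

    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } keys %replacement;
    return (qr/($alternation)/, \%replacement);
}

my ($regexp, $replacement) = compile_sequences("eat big pizza\n", "big pizza\n");
my $line = "i eat big pizza .my big pizza ...\n";
$line =~ s/$regexp/$replacement->{$1}/g;       # one pass over the line
print $line;
# prints: i eatbbbbigbbbpizza .my bigbbbpizza ...
```

Each line is scanned once regardless of how many sequences there are, instead of once per sequence. Unlike the question's loop, text that has already been replaced is not scanned again, so results can differ when sequences overlap.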

Perl: making a script fast for working with large files

2 answers: