Question

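I'm trying to get back into Perl and I'm having a hard time with an error in my code. I have a large source .DAT file (2 GB). I have another .TXT file containing the strings (almost 2,000 of them) that I want to search for in that .DAT file. I read the values from that .TXT file into an array.

I want to search for each string in the array efficiently and output the matches. Can anyone help straighten me out? Thanks in advance!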

my $source = "/KEYS.txt";
my $data= "/claims.dat";
my @array;
my $arraySize = scalar (@DESYarray);

open (DAT, $data) or die "Cannot open file!";
open (LOG, ">>/output.log");

open (TXT,$source);
while (my $searchValues = <TXT>) {
    push (@array, $searchValues);
}
close (TXT);


while (my $line = <DAT>) {      
for (my $x = 0; $x <= $arraySize; $x++) {
    if (my $line =~ /$array[$x]/) {
        print LOG $line;
    }
}
}

close (DAT);
close (LOG);

Answer 1

You are redeclaring my $line in the inner loop, which means the condition is equivalent to:

if (undef =~ /$array[$x]/) {

This will of course always fail. If you had been using use warnings, you would have gotten this warning:

Use of uninitialized value in pattern match (m//) at ...

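That makes me suspect you are not using warnings, which is a very bad idea.

Also, note that when you read the values into @array, each one keeps its trailing newline, so you are searching the DAT file for strings that end in \n, which is probably not what you want. For example, if you have foo\n, it will not match foo bar baz.

The fix is to chomp your data: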
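To see the effect in isolation, here is a minimal sketch. The sample line and the pattern are made up for illustration, and the warning handler is only there to capture the message. The inner my creates a fresh, undefined variable that hides the loop variable, so the match never succeeds.

```perl
use strict;
use warnings;

my @seen_warnings;                            # captured warnings (illustration only)
$SIG{__WARN__} = sub { push @seen_warnings, $_[0] };

my $matches = 0;
for my $line ("foo bar baz\n") {              # made-up sample line
    # BUG: this "my" declares a new, undefined $line that hides the loop variable
    if (my $line =~ /foo/) {
        $matches++;
    }
}

print "matches: $matches\n";
print "warning: $_" for @seen_warnings;
```

Dropping the inner my, so the test reads if ($line =~ /foo/), makes the condition look at the line that was actually read.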

chomp(my @array = <TXT>);

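Yes, you can chomp an array, and you can assign a whole file to an array this way.

You can, and should, improve your script a bit. Looping over array indexes is quite unnecessary unless you actually need the index.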
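As a small self-contained sketch of that idiom, the following reads from an in-memory filehandle instead of the real KEYS.txt. The three keys are invented for the example.

```perl
use strict;
use warnings;

# In-memory "file" standing in for KEYS.txt (contents are made up)
my $keys_text = "foo\nbar\nbaz\n";
open my $txt, '<', \$keys_text or die "Cannot open in-memory file: $!";

# Read every line in list context, then strip the trailing newline from each element
chomp(my @array = <$txt>);
close $txt;

print scalar(@array), " keys: @array\n";
```

After the chomp, each element holds only the key itself, so a pattern built from it no longer demands a newline right after the key.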

use strict;
use warnings;    # ALWAYS use these!
use autodie;     # handles the open statements for convenience

my $source = "/KEYS.txt";
my $data= "/claims.dat";

open my $txt, '<', $source;
chomp(my @array = <$txt>);
close $txt;

open my $dat, '<', $data;   # use three argument open and lexical file handle
open my $log, '>>', "/output.log";

while (<$dat>) {            # using $_ for convenience
    for my $word (@array) {
        if (/\Q$word/i) {   # adding /i modifier to match case insensitively
            print $log $_;   # also adding \Q to match literal strings
        }
    }
}

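Using \Q can be quite important, depending on what your KEYS.txt file contains. Regex metacharacters can cause subtle mismatches if you intend them to be matched literally. For example, if you have a word such as foo?, the regex /foo?/ will match foo, but it will also match for.

Also, you may want to decide whether to allow partial matches. For example, /foo/ will also match football. One way to overcome this is to use the word boundary escape sequence: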
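A quick sketch of the difference, using the foo? key from the example above. The test strings are invented.

```perl
use strict;
use warnings;

my $word = 'foo?';    # key containing a regex metacharacter

# Without \Q, "?" makes the second "o" optional, so "fo" is enough to match
my $unquoted = ('for' =~ /$word/) ? 1 : 0;

# With \Q, "?" is escaped and the text must contain a literal "foo?"
my $quoted  = ('for' =~ /\Q$word/) ? 1 : 0;
my $literal = ('is it foo?' =~ /\Q$word/) ? 1 : 0;

print "unquoted pattern matches 'for':      $unquoted\n";
print "quoted pattern matches 'for':        $quoted\n";
print "quoted pattern matches 'is it foo?': $literal\n";
```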

/\b\Q$word\E\b/i

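You need to put them outside the \Q .. \E sequence, or they will be interpreted literally.

ETA: As tchrist pointed out and Borodin suggested, building a single regex from all the words will save you from getting duplicate lines. For example, if you have the words "foo", "bar" and "baz" and the line foo bar baz, that line will be printed three times, once for each matching word.

That can be fixed afterwards by deduplicating your data in some suitable way. Only you know your data and whether this is a problem. For performance reasons I would hesitate to compile such a long regex, but you can try it and see whether it works for you.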
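A short sketch of partial versus whole-word matching, with made-up sample lines:

```perl
use strict;
use warnings;

my $word  = 'foo';
my @lines = ('football season', 'a foo bar', 'FOO!');   # made-up sample lines

# Without boundaries, the key also matches inside "football"
my @partial = grep { /\Q$word\E/i } @lines;

# With \b on both sides (outside \Q..\E), only whole words match
my @whole = grep { /\b\Q$word\E\b/i } @lines;

print "partial: ", scalar(@partial), " line(s)\n";
print "whole:   ", scalar(@whole), " line(s)\n";
```

Note that \b only marks a position between a word character and a non-word character, so it behaves this way only for keys that begin and end with a word character. A key such as foo? followed by a space would not satisfy the trailing \b.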
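The duplicate-line effect can be sketched with the foo, bar, baz example. The arrays collecting the output are hypothetical stand-ins for the log file, and the middle variant (leaving the inner loop with last after the first hit) is an extra option for illustration, not something suggested in the comments mentioned above.

```perl
use strict;
use warnings;

my @array = qw(foo bar baz);
my $line  = "foo bar baz\n";

# One test per word: the line is emitted once for every word that matches
my @per_word;
for my $word (@array) {
    push @per_word, $line if $line =~ /\Q$word/i;
}

# Same loop, but stop at the first matching word so the line is emitted once
my @first_match;
for my $word (@array) {
    if ($line =~ /\Q$word/i) {
        push @first_match, $line;
        last;
    }
}

# Single alternation built from all (escaped) words: one test per line
my $alternation = join '|', map { quotemeta } @array;
my $regex       = qr/$alternation/i;
my @combined;
push @combined, $line if $line =~ $regex;

print "per-word loop:  ", scalar(@per_word), " copies\n";
print "loop with last: ", scalar(@first_match), " copy\n";
print "single regex:   ", scalar(@combined), " copy\n";
```

Which variant is faster on a 2 GB file with about 2,000 keys depends on the data, so it is worth timing them on a sample before committing to one.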

Answer 2

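You should always start your programs with use strict and use warnings, especially if you are asking for help with your code. They are a great help in debugging and will often catch simple mistakes that are easily overlooked.

How long are the strings in KEYS.txt? It may be feasible to build a regex from them using join '|', @array. By the way, the loop you have written is equivalent to @array = <TXT>, and don't forget to chomp the contents!

I suggest something like this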

use strict;
use warnings;

my $source = "/KEYS.txt";
my $data= "/claims.dat";

open my $dat, '<', $data or die "Cannot open data file: $!";
open my $log, '>>', '/output.log' or die "Cannot open output file: $!";

open my $txt, '<', $source or die "Cannot open keys file: $!";
my @keys = <$txt>;
chomp @keys;
close $txt;

my $regex = join '|', map quotemeta, @keys;
$regex = qr/$regex/i;

while (my $line = <$dat>) {
  next unless $line =~ $regex;
  print $log $line;
}

close $log or die "Unable to close log file: $!";

Answer 3

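I have used Regexp::Assemble in the past to take a list of tokens, create an optimized regular expression, and use it to filter large amounts of text. Once we moved from a |-delimited regexp to Regexp::Assemble, we saw a nice performance improvement.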

Regexp::Assemble

What is an efficient way to search a large file for strings?
