Question

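This is hard to describe, but it would be useful for extracting data from the output I'm working with (I hope to use this code for a variety of purposes).

Here is an example: say I have a text file containing words and some special characters ($, #, !, etc.) that reads: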

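blah blah blah add this word to the list: 1234.56 blah blah
blah blah blah now don't forget to add this word to the list: PINAPPLE blah blah blah and for bonus points,
it would be nice to know that the script
would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah
blah blah blah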

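As the example shows, I want to add any "word" (defined in this context as any string that contains no whitespace) following the phrase to some form of list, so that I can extract the elements of the list as list[2], list[3] or list(4), list(5), or something along those lines.

This would be very general-purpose, and after some questioning in another thread and another forum, I'm hoping that doing it in Perl will make it execute relatively fast, so that it works well even for large text files. I intend to use it to read data from output files generated by different programs, regardless of the structure of the output file, i.e. as long as I know the string to search for, I can get the data.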

Answer 1

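I think there are some missing words in your question :) But this sounds like what you want (assuming even the "large text files" fit in memory; if not, you'd loop through line by line, pushing onto @list).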

use File::Slurp;    # provides read_file

my $filecontents = File::Slurp::read_file("filename");
my @list = $filecontents =~ /add this word to the list: (\S+)/g;
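
If the file is too large to slurp, the line-by-line variant mentioned above might look like the following minimal sketch. The helper name collect_words is a placeholder for illustration and not part of the original answer, and the sketch assumes the marker and its word always sit on the same line. It reads from an in-memory file so that it runs without any input file.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the word after each marker, one line at a time.
sub collect_words {
    my ($fh) = @_;
    my @list;
    while (my $line = <$fh>) {
        # In list context, /g returns every captured word on this line.
        push @list, $line =~ /add this word to the list: (\S+)/g;
    }
    return @list;
}

# Demo data held in memory; a real script would open the output file instead.
my $text = "blah add this word to the list: 1234.56 blah\n"
         . "blah add this word to the list: PINAPPLE blah\n";
open my $fh, '<', \$text or die "can't open in-memory file: $!";
my @list = collect_words($fh);
close $fh;

# Elements are then available by index: $list[0], $list[1], ...
print "$list[1]\n";
```

Only one line is held in memory at a time, so the size of the file does not matter.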

Answer 2

If the search string is always the same, let Perl do the processing by using the search phrase as the input record separator:

open my $fh, '<', 'test.dat' or die "can't open $!"; # usual way of opening a file

my @list;                                            # declare empty array 'list' (results)
$/= 'add this word to the list:';                    # define custom input record separator

while( <$fh> ) {                                     # read records one by one
   push @list, $1 if /(\S\S*)/
}
close $fh;                                           # that's it, close the file!

print join "\n", @list;                              # this will list the results

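The above is "almost OK": it will store the first word of the file in $list[0], because of the way the records are processed. But this approach is very easy to understand (imho). For the example file it prints: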

blah                 <== first word of the file
1234.56
PINAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf
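
If the stray first entry is unwanted, one possible way to avoid it is to skip the first record, which always holds the text before the first marker. The sketch below is only an illustration under that assumption; the helper name words_after_marker is hypothetical, and it uses an in-memory file so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: read marker-delimited records and skip the first one.
sub words_after_marker {
    my ($fh, $marker) = @_;
    local $/ = $marker;        # custom input record separator, restored on return
    my @list;
    my $first = 1;
    while (my $record = <$fh>) {
        chomp $record;         # chomp strips the trailing marker (the current $/)
        if ($first) {          # the first record is the text before the first marker
            $first = 0;
            next;
        }
        push @list, $1 if $record =~ /^\s*(\S+)/;
    }
    return @list;
}

my $sample = "blah blah add this word to the list: 1234.56 blah\n"
           . "more add this word to the list: PINAPPLE blah\n";
open my $fh, '<', \$sample or die "can't open in-memory file: $!";
my @list = words_after_marker($fh, 'add this word to the list:');
close $fh;

print join("\n", @list), "\n";
```

Using local for $/ keeps the changed separator from affecting any other file reads in the same program.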

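Q: Why not simply run one regular expression over the whole data to find the strings (as already suggested here)? Because in my experience, record-wise processing with a per-record regular expression (which may be a very complicated regular expression in a real use case) will be faster, especially on very large files. That's the reason.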

Real-world test

To back up this claim, I ran some tests with a 200MB data file containing your markers. The test source is as follows:

use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use File::Slurp;
# 'data.dat', a 200MB data file, containing 10_000
# markers: 'add this word to the list:' and
# one of several different data items after each.

my $t = timethese(10,
 {
  'readline+regex' => sub { # trivial reading line-by-line
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
                     my @list;                                            
                     while(<$fh>) { 
                        push @list,$1 if /add this word to the list:\s*(\S+)/
                     }
                     close $fh;                                           
                     return scalar @list;   
                  },
  'readIRS+regex' => sub { # treat each 'marker' as start of an input record
                     open my $fh, '<', 'data.dat' or die "can't open $!"; 
                     local $/ = 'add this word to the list:';   # new IRS, scoped to this sub
                     my @list;                                            
                     while(<$fh>) { push @list, $1 if /(\S+)/ }       
                     close $fh;                                           
                     return scalar @list;   
                  },
  'slurp+regex' => sub { # read the whole file and apply regular expression
                     my $filecontents = File::Slurp::read_file('data.dat');
                     my @list = $filecontents =~ /add this word to the list:\s*(\S+)/g;
                     return scalar @list;
                  },
 }
);
cmpthese( $t ) ;

This prints the following timing results:

Benchmark: timing 10 iterations of readIRS+regex, readline+regex, slurp+regex...
readIRS+regex: 43 wallclock secs (37.11 usr +  5.48 sys = 42.59 CPU) @  0.23/s (n=10)
readline+regex: 42 wallclock secs (36.47 usr +  5.49 sys = 41.96 CPU) @  0.24/s (n=10)
slurp+regex: 142 wallclock secs (135.85 usr +  4.98 sys = 140.82 CPU) @  0.07/s (n=10)
               s/iter    slurp+regex  readIRS+regex readline+regex
slurp+regex      14.1             --           -70%           -70%
readIRS+regex    4.26           231%             --            -1%
readline+regex   4.20           236%             1%             --

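This basically means that simple line-wise reading and block-wise reading with a custom IRS are both about 230% faster (roughly 3.3 times the rate, at ~4 seconds per pass versus ~14 seconds) than slurping the file and scanning it with a regular expression.

What this basically says is: if you process files of this size on a system like mine ;-), you should read line by line if your search problem is confined to a single line, and read with a custom input record separator if your search problem spans multiple lines (my $0.02).

Want to run the test yourself? Use this one: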

use strict;
use warnings;

sub getsomerandomtext {
    my ($s, $n) = ('', (shift));
    while($n --> 0) {
        $s .= chr( rand(80) + 30 );
        $s .= "\n" if rand($n) < $n/10
    }
    $s x 10
}

my @stuff = (
 q{1234.56}, q{PINEAPPLE}, q{1!@#$%^&*()[]{};:'",<.>/?asdf}
);

my $fn = 'data.dat';
open my $fh, '>', $fn or die $!;

my $phrase='add this word to the list:';
my $x = 10000;

while($x --> 0) {
   print $fh
      getsomerandomtext(1000), ' ',
      $phrase, ' ', $stuff[int(rand(@stuff))], ' ',
      getsomerandomtext(1000), "\n",
}

close $fh;
print "done.\n";

This creates the 200MB input file 'data.dat'.

Regards

RBO

Answer 3

How about:

my(@list);
my $rx = qr/.*add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}

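This allows for long lines containing more than one "add" marker. If there can definitely be only one, replace the inner while with an if. (Except, of course, that I used a greedy '.*', which swallows everything up to the last occurrence of the match...)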

my(@list);
my $rx = qr/(?:.*?)add this word to the list: +(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}
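
To make the difference between the two patterns concrete, here is a small self-contained sketch that applies both to one line holding two markers. The sample line and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample line with two markers on it.
my $line = "x add this word to the list: one y add this word to the list: two z";

my $greedy = qr/.*add this word to the list: +(\S+)/;
my $lazy   = qr/.*?add this word to the list: +(\S+)/;

# A match in list context returns the captured word.
my ($from_greedy) = $line =~ $greedy;   # '.*' runs to the last marker
my ($from_lazy)   = $line =~ $lazy;     # '.*?' stops at the first marker

print "greedy: $from_greedy\n";
print "lazy:   $from_lazy\n";
```

With the greedy pattern the first pass captures the word after the last marker, and the substitution then deletes everything before it, so the earlier words on the line are lost. The non-greedy pattern consumes one marker per pass.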

With a selectable marker:

my $marker = "add this word to the list:";
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1;
          s/$rx//;
     }
}
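
One caveat once the marker lives in a variable: any regex metacharacters in it are interpreted as part of the pattern. A possible safeguard, sketched below with a made-up marker and sample line, is to quote the marker with \Q...\E so that it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical marker containing regex metacharacters (square brackets).
my $marker = 'price [USD]:';

# \Q...\E quotes the marker, so [USD] is matched as literal text
# rather than as a character class.
my $rx = qr/\Q$marker\E\s+(\S+)/;

my $line = "item A price [USD]: 1234.56 in stock";
my @list;
push @list, $1 while $line =~ /$rx/g;

print "@list\n";
```

For the marker used in this thread the quoting changes nothing, since it contains no metacharacters.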

Without duplicates:

my $marker = "add this word to the list:";
my(%hash);
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
     while (m/$rx/)
     {
          push @list, $1 unless defined $hash{$1};
          $hash{$1} = 1;
          s/$rx//;
     }
}

Etc.

And, as @ysth points out, you (I) don't need the substitution: Perl DWIMs correctly with a g-qualified match in the inner loop:

#!/bin/perl -w
use strict;
my(@list);
my(%hash);
my($marker) = "add this word to the list:";
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
    while (m/$rx/g)
    {
        push @list, $1 unless defined $hash{$1};
        $hash{$1} = 1;
    }
}

foreach my $i (@list)
{
    print "$i\n";
}
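
The duplicate check in the loop above can also be folded into a single expression with a post-incremented hash entry. This is just an alternative spelling under the same assumptions, shown on made-up in-memory lines, together with the indexed access the question asks for.

```perl
use strict;
use warnings;

my $marker = "add this word to the list:";
my $rx     = qr/\Q$marker\E\s+(\S+)/;

# Made-up sample lines; the second one repeats an earlier word.
my @lines = (
    "blah add this word to the list: 1234.56 blah",
    "add this word to the list: PINAPPLE and add this word to the list: 1234.56",
);

my (%seen, @list);
for my $line (@lines) {
    while ($line =~ /$rx/g) {
        # $seen{$1}++ is false only the first time a word shows up.
        push @list, $1 unless $seen{$1}++;
    }
}

# Individual elements are available by index.
print "first: $list[0], second: $list[1]\n";
```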

Perl code to generate a list of all words that follow a given string in a text file
