查找文本文件中非空字段的索引

时间:2012-02-13 21:00:29

标签: regex perl parsing

我有一个非常大的文本文件,其行是以逗号分隔的值。缺少一些值。对于每一行,我想打印所有非空字段的索引和值。

例如,一行可能看起来像

,,10.3,,,,5.2,3.1,,,,,,,

在这种情况下我想要的输出是

2,10.3,6,5.2,7,3.1

我知道如何通过首先将输入拆分为数组,然后使用for循环遍历数组来实现此目的,但这些是巨大的文件(数GB)我想知道是否有更快的方法。 (例如使用一些高级正则表达式)

3 个答案:

答案 0 :(得分:2)

我还没有对它进行基准测试,但我会假设

my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
    map {join ",", @$_}
    grep $_->[1],
    map {[$index++, $_]}
    split ",", $line;

比某些高级正则表达式快。

问题在于,只要你必须知道索引,你仍然必须以某种方式跟踪那些缺失的条目。

这样的事情可能不会太慢​​:

my ($i, @vars);

while ($line =~ s/^(,*)([^,]+)//) {
    push @vars, $i += length($1), $2;
}

print join ",", @vars;

您可能会遗漏第一个捕获组并使用pos()来计算索引。

以下是我的两个建议和罪恶与1M迭代的比较:

           Rate flesk1    sin flesk2
flesk1  87336/s     --    -8%   -27%
sin     94518/s     8%     --   -21%
flesk2 120337/s    38%    27%     --

似乎我的正则表达式比我想象的要好。

答案 1 :(得分:1)

您可以混合使用正则表达式和代码 -

$line =~ /(?{($cnt,@ary)=(0,)})^(?:([^,]+)(?{push @ary,$cnt; push @ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', @ary);

扩展 -

$line =~ /
  (?{($cnt,@ary)=(0,)})
  ^(?:
      ([^,]+) (?{push @ary,$cnt; push @ary,$^N})
    | , (?{$cnt++})
   )+
/x
and print join( ',', @ary);

一些基准

稍微调整一下flesk和sln(寻找fleskNew和slnNew),
当取代运算符被移除时,获胜者是fleskNew。

代码 -

use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;

cmpthese( -5, {

    flesk1 => sub{
                    $index = 0;
                    join ",",
                       map {join ",", @$_}
                       grep $_->[1],
                       map {[$index++, $_]}
                       split ",", $line;
           },

    flesk2 => sub{
              ($i, @vars) = (0,);
              while ($line =~ s/^(,*)([^,]+)//) {
                  push @vars, $i += length($1), $2;
              }
              $line = $samp;
           },

    fleskNew => sub{
              ($i, @vars) = (0,);
              while ($line =~ /(,*)([^,]+)/g) {
                  push @vars, $i += length($1), $2;
              }
           },

    sln1 => sub{
              $line =~ /
                 (?{($cnt,@ary)=(0,)})
                 ^(?:
                     ([^,]+) (?{push @ary,$cnt; push @ary,$^N})
                   | , (?{$cnt++})
                  )+
               /x
           },

    slnNew => sub{
              $line =~ /
                 (?{($cnt,@ary)=(0,)})
                 (?:
                     (,*) (?{$cnt += length($^N)})
                     ([^,]+) (?{push @ary, $cnt,$^N})
                   )+
               /x
           },

} );

数字 -

            Rate   flesk1     sln1   flesk2   slnNew fleskNew
flesk1   20325/s       --     -51%     -52%     -56%     -60%
sln1     41312/s     103%       --      -1%     -10%     -19%
flesk2   41916/s     106%       1%       --      -9%     -17%
slnNew   45978/s     126%      11%      10%       --      -9%
fleskNew 50792/s     150%      23%      21%      10%       --

一些基准2

添加Birei的在线替换和修剪(一体化)解决方案。

Abberations:

修改Flesk1以删除最终的“加入”,因为它不包括在中 其他正则表达式解决方案。这使它有机会更好地适应。

Birei在替补席上偏离,因为它将原始字符串修改为最终解决方案 那个方面不能被拿出来。 Birei1和BireiNew之间的区别在于 新的删除最后的','。

Flesk2,Birei1和BireiNew有额外的恢复原始字符串的开销 由于替代算子。

获胜者仍然看起来像FleskNew ..

代码 -

use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;

cmpthese( -5, {

    flesk1a => sub{
                $index = 0;
                map {join ",", @$_}
                   grep $_->[1],
                   map {[$index++, $_]}
                   split ",", $line;
       },

    flesk2 => sub{
          ($i, @vars) = (0,);
          while ($line =~ s/^(,*)([^,]+)//) {
              push @vars, $i += length($1), $2;
          }
          $line = $samp;
       },

    fleskNew => sub{
          ($i, @vars) = (0,);
          while ($line =~ /(,*)([^,]+)/g) {
              push @vars, $i += length($1), $2;
          }
       },

    sln1 => sub{
          $line =~ /
             (?{($cnt,@ary)=(0,)})
             ^(?:
                 ([^,]+) (?{push @ary,$cnt; push @ary,$^N})
               | , (?{$cnt++})
              )+
           /x
       },

    slnNew => sub{
          $line =~ /
             (?{($cnt,@ary)=(0,)})
             (?:
                 (,*) (?{$cnt += length($^N)})
                 ([^,]+) (?{push @ary, $cnt,$^N})
             )+
           /x
       },


    Birei1 => sub{
          $i = -1;
          $line =~
          s/
           (?(?=,+)
               ( (?: , (?{ ++$i }) )+ )
             | (?<no_comma> [^,]+ ,? ) (?{ ++$i })
           )
          /
          defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
          /xge;

          $line = $samp;
       },

    BireiNew => sub{
          $i = 0;
          $line =~ 
          s/
            (?: , (?{++$i}) )*
            (?<data> [^,]* )
            (?: ,*$ )?
            (?= (?<trailing_comma> ,?) )
          /
            length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
          /xeg;

          $line = $samp;
       },

} );

结果 -

            Rate BireiNew   Birei1  flesk1a   flesk2     sln1   slnNew fleskNew
BireiNew  6030/s       --     -18%     -74%     -85%     -86%     -87%     -88%
Birei1    7389/s      23%       --     -68%     -82%     -82%     -84%     -85%
flesk1a  22931/s     280%     210%       --     -44%     -45%     -51%     -54%
flesk2   40933/s     579%     454%      79%       --      -2%     -13%     -17%
sln1     41752/s     592%     465%      82%       2%       --     -11%     -16%
slnNew   47088/s     681%     537%     105%      15%      13%       --      -5%
fleskNew 49563/s     722%     571%     116%      21%      19%       5%       --

答案 2 :(得分:1)

使用正则表达式(虽然我确信它可以更简单):

s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;

说明:

s/PATTERN/REPLACEMENT/ge              # g -> Apply to all occurrences
                                      # e -> Evaluate replacement as a expression.
(?
  (?=,+)                              # Check for one or more commas.
  ((?:,(?{ ++$i }))+)                 # If (?=,+) was true, increment variable '$i' with each comma found.                
  |
  (?<no_comma>[^,]+,?)(?{ ++$i })     # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma}                  # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma}             # insert the position just before it.
qq[]                                  # If wasn't set, it means that pattern matched only commas, so remove then.

我的测试:

script.pl的内容:

use warnings;
use strict;

while ( <DATA> ) { 
    our $i = -1; 
    chomp;
    printf qq[Orig = $_\n];
    s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
#    s/,\Z//;
    printf qq[Mod = $_\n\n];

}

__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,

运行如下脚本:

perl script.pl

输出:

Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,

Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,

Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1

Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,

正如您所看到的,它保留了最后一个逗号。我不知道如何在没有额外正则表达式的情况下删除它,只需在以前的代码中取消注释s/,\Z//;