我有一个非常大的文本文件,其行是以逗号分隔的值。缺少一些值。对于每一行,我想打印所有非空字段的索引和值。
例如,一行可能看起来像
,,10.3,,,,5.2,3.1,,,,,,,
在这种情况下我想要的输出是
2,10.3,6,5.2,7,3.1
我知道如何通过首先将输入拆分为数组,然后使用for循环遍历数组来实现此目的,但这些是巨大的文件(数GB)我想知道是否有更快的方法。 (例如使用一些高级正则表达式)
答案 0 :(得分:2)
我还没有对它进行基准测试,但我会假设
my $line = ",,10.3,,,,5.2,3.1,,,,,,,";
my $index = 0;
print join ",",
map {join ",", @$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
比某些高级正则表达式快。
问题在于,只要你必须知道索引,你仍然必须以某种方式跟踪那些缺失的条目。
这样的事情可能不会太慢:
my ($i, @vars);
while ($line =~ s/^(,*)([^,]+)//) {
push @vars, $i += length($1), $2;
}
print join ",", @vars;
您可能会遗漏第一个捕获组并使用pos()
来计算索引。
以下是我的两个建议和罪恶与1M迭代的比较:
Rate flesk1 sin flesk2
flesk1 87336/s -- -8% -27%
sin 94518/s 8% -- -21%
flesk2 120337/s 38% 27% --
似乎我的正则表达式比我想象的要好。
答案 1 :(得分:1)
您可以混合使用正则表达式和代码 -
$line =~ /(?{($cnt,@ary)=(0,)})^(?:([^,]+)(?{push @ary,$cnt; push @ary,$^N})|,(?{$cnt++}))+/x
and print join( ',', @ary);
扩展 -
$line =~ /
(?{($cnt,@ary)=(0,)})
^(?:
([^,]+) (?{push @ary,$cnt; push @ary,$^N})
| , (?{$cnt++})
)+
/x
and print join( ',', @ary);
一些基准
稍微调整一下flesk和sln(寻找fleskNew和slnNew),
当取代运算符被移除时,获胜者是fleskNew。
代码 -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1 => sub{
$index = 0;
join ",",
map {join ",", @$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, @vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push @vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, @vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push @vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,@ary)=(0,)})
^(?:
([^,]+) (?{push @ary,$cnt; push @ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,@ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push @ary, $cnt,$^N})
)+
/x
},
} );
数字 -
Rate flesk1 sln1 flesk2 slnNew fleskNew
flesk1 20325/s -- -51% -52% -56% -60%
sln1 41312/s 103% -- -1% -10% -19%
flesk2 41916/s 106% 1% -- -9% -17%
slnNew 45978/s 126% 11% 10% -- -9%
fleskNew 50792/s 150% 23% 21% 10% --
一些基准2
添加Birei的在线替换和修剪(一体化)解决方案。
Abberations:
修改Flesk1以删除最终的“加入”,因为它不包括在中 其他正则表达式解决方案。这使它有机会更好地适应。
Birei在替补席上偏离,因为它将原始字符串修改为最终解决方案 那个方面不能被拿出来。 Birei1和BireiNew之间的区别在于 新的删除最后的','。
Flesk2,Birei1和BireiNew有额外的恢复原始字符串的开销 由于替代算子。
获胜者仍然看起来像FleskNew ..
代码 -
use Benchmark qw( cmpthese ) ;
$samp = "x,,10.3,,q,,5.2,3.1,,,ghy,g,,l,p";
$line = $samp;
cmpthese( -5, {
flesk1a => sub{
$index = 0;
map {join ",", @$_}
grep $_->[1],
map {[$index++, $_]}
split ",", $line;
},
flesk2 => sub{
($i, @vars) = (0,);
while ($line =~ s/^(,*)([^,]+)//) {
push @vars, $i += length($1), $2;
}
$line = $samp;
},
fleskNew => sub{
($i, @vars) = (0,);
while ($line =~ /(,*)([^,]+)/g) {
push @vars, $i += length($1), $2;
}
},
sln1 => sub{
$line =~ /
(?{($cnt,@ary)=(0,)})
^(?:
([^,]+) (?{push @ary,$cnt; push @ary,$^N})
| , (?{$cnt++})
)+
/x
},
slnNew => sub{
$line =~ /
(?{($cnt,@ary)=(0,)})
(?:
(,*) (?{$cnt += length($^N)})
([^,]+) (?{push @ary, $cnt,$^N})
)+
/x
},
Birei1 => sub{
$i = -1;
$line =~
s/
(?(?=,+)
( (?: , (?{ ++$i }) )+ )
| (?<no_comma> [^,]+ ,? ) (?{ ++$i })
)
/
defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]
/xge;
$line = $samp;
},
BireiNew => sub{
$i = 0;
$line =~
s/
(?: , (?{++$i}) )*
(?<data> [^,]* )
(?: ,*$ )?
(?= (?<trailing_comma> ,?) )
/
length $+{data} ? "$i,$+{data}$+{trailing_comma}" : ""
/xeg;
$line = $samp;
},
} );
结果 -
Rate BireiNew Birei1 flesk1a flesk2 sln1 slnNew fleskNew
BireiNew 6030/s -- -18% -74% -85% -86% -87% -88%
Birei1 7389/s 23% -- -68% -82% -82% -84% -85%
flesk1a 22931/s 280% 210% -- -44% -45% -51% -54%
flesk2 40933/s 579% 454% 79% -- -2% -13% -17%
sln1 41752/s 592% 465% 82% 2% -- -11% -16%
slnNew 47088/s 681% 537% 105% 15% 13% -- -5%
fleskNew 49563/s 722% 571% 116% 21% 19% 5% --
答案 2 :(得分:1)
使用正则表达式(虽然我确信它可以更简单):
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
说明:
s/PATTERN/REPLACEMENT/ge # g -> Apply to all occurrences
# e -> Evaluate replacement as a expression.
(?
(?=,+) # Check for one or more commas.
((?:,(?{ ++$i }))+) # If (?=,+) was true, increment variable '$i' with each comma found.
|
(?<no_comma>[^,]+,?)(?{ ++$i }) # If (?=,+) was false, get number between comma and increment the $i variable only once.
)
defined $+{no_comma} # If 'no_comma' was set in 'pattern' expression...
$i . qq[,] . $+{no_comma} # insert the position just before it.
qq[] # If wasn't set, it means that pattern matched only commas, so remove then.
我的测试:
script.pl
的内容:
use warnings;
use strict;
while ( <DATA> ) {
our $i = -1;
chomp;
printf qq[Orig = $_\n];
s/(?(?=,+)((?:,(?{ ++$i }))+)|(?<no_comma>[^,]+,?)(?{ ++$i }))/defined $+{no_comma} ? $i . qq[,] . $+{no_comma} : qq[]/ge;
# s/,\Z//;
printf qq[Mod = $_\n\n];
}
__DATA__
,,10.3,,,,5.2,3.1,,,,,,,
10.3,,,,5.2,3.1,,,,,,,
,10.3,,,,5.2,3.1
,,10.3,5.2,3.1,
运行如下脚本:
perl script.pl
输出:
Orig = ,,10.3,,,,5.2,3.1,,,,,,,
Mod = 2,10.3,6,5.2,7,3.1,
Orig = 10.3,,,,5.2,3.1,,,,,,,
Mod = 0,10.3,4,5.2,5,3.1,
Orig = ,10.3,,,,5.2,3.1
Mod = 1,10.3,5,5.2,6,3.1
Orig = ,,10.3,5.2,3.1,
Mod = 2,10.3,3,5.2,4,3.1,
正如您所看到的,它保留了最后一个逗号。我不知道如何在没有额外正则表达式的情况下删除它,只需在以前的代码中取消注释s/,\Z//;
。