For every line in a file (about 30,000 of them), I want to find the number of characters at the start of the current line that are the same as on the previous line. For example, for the input:
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/
I expect:
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/
I got this working in perl by unpacking each string into characters and counting until the first mismatch, but I wonder whether there is some less slow way using built-in functions of awk or perl.
Update: I have added my attempt as an answer.
Answer 0 (score: 2)
Like this, perhaps?
Written in Perl:
use strict;
use warnings 'all';
my $prev = "";
while ( my $line = <DATA> ) {
    chomp $line;
    my $max = 0;
    ++$max until $max > length($line) or substr($prev, 0, $max) ne substr($line, 0, $max);
    printf "%-2d %s\n", $max-1, $line;
    $prev = $line;
}
__DATA__
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/
Output:
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/
Answer 1 (score: 1)
There is no built-in function that will do this for you, but rather than going only one character at a time, you could compare half of each string at a time in a kind of binary search, something like (rough awk pseudo-code):
prev = curr
lgthPrev = lgthCurr
curr = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    partCurr = substr(curr,1,partLgth)
    partPrev = substr(prev,1,partLgth)
    if ( partCurr == partPrev ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}
You exit the loop above when there are no more substrings to compare, and at that point the result is the final partLgth.
This would use far fewer iterations than a char-by-char comparison, but as written it does a string comparison rather than a character comparison on every iteration, so idk what the net performance result would be. You could speed it up by doing a character comparison first on each iteration and only doing the string comparison when the characters at the current position match:
prev = curr
lgthPrev = lgthCurr
curr = $0
lgthCurr = length(curr)
partLgth = (lgthPrev > lgthCurr ? lgthCurr : lgthPrev)
while ( got strings to work with ) {
    if ( substr(curr,partLgth,1) == substr(prev,partLgth,1) ) {
        isMatch = (substr(curr,1,partLgth) == substr(prev,1,partLgth) ? 1 : 0)
    }
    else {
        isMatch = 0
    }
    if ( isMatch ) {
        # add on half of the rest of each string and try again
        partLgth = partLgth * 1.5
    }
    else {
        # subtract half of these strings and try again
        partLgth = partLgth * 0.5
    }
}
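For what it's worth, here is one concrete way the binary-search idea above could be fleshed out in plain awk. This is only a sketch of the approach described in this answer, not code from it; the lo/hi/mid bisection over the candidate prefix length is my own choice of detail:

awk '{
    curr = $0
    # the common prefix can be at most as long as the shorter line
    lo = 0
    hi = (length(prev) < length(curr) ? length(prev) : length(curr))
    # bisect on the prefix length: "first k chars match" is monotone in k
    while (lo < hi) {
        mid = int((lo + hi + 1) / 2)
        if (substr(prev, 1, mid) == substr(curr, 1, mid))
            lo = mid        # prefixes of length mid match, try longer
        else
            hi = mid - 1    # mismatch, try shorter
    }
    print lo, curr
    prev = curr
}' file

Each line then costs a logarithmic number of substring comparisons, so whether this actually beats the char-by-char loop depends on the awk implementation, as the caveat above notes.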
Answer 2 (score: 1)
Using gawk:
awk -v FS="" 'p{
pl=0;
split(p,a,r);
for(i=1;i in a; i++)
if(a[i]==$i){ pl++ }else { break }
}
{
print pl+0,$0; p=$0
}' file
Or:
awk -v FS="" 'p{
pl=0;
for(i=1;i<=NF; i++)
if(substr(p,i,1)==$i){ pl++ }else { break }
}
{
print pl+0,$0; p=$0
}' file
Input:
$ cat file
#to
#top
/0linyier
/10000001659/item/1097859586891251/
/10000001659/item/1191085827568626/
/10000121381/item/890759920974460/
/10000154478/item/1118425481552267/
/10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
/1175332/item/10150825241495757/
/806123/item/10210653847881125/
/51927642128/item/488930816844251927642128/341878905879428/
Output:
$ awk -v FS="" 'p{pl=0; split(p,a,r); for(i=1;i in a; i++)if(a[i]==$i){ pl++ }else { break }}{ print pl+0,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/
Explanation:
awk -v FS="" ' # call awk set field sep=""
p{
pl=0; # reset variable pl
split(p,a,r); # split variable p
for(i=1;i in a; i++) # loop through array
if(a[i]==$i){ # check array element with current field
pl++ # if matched then increment pl
}else {
break # else its over break loop
}
}
{
print pl+0,$0; # print count, and current record
p=$0 # store current record in variable p
}
' file
Note that if you assign an empty string to FS, the standard says the result is unspecified. Some versions of awk will produce the output shown above for your example. The OS/X version of awk emits this warning along with its output:
awk: field separator FS is empty
So the special meaning of setting FS to an empty string does not work in every awk.
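If portability is a concern, a variant that avoids FS="" entirely and indexes both lines with substr() should behave the same in any POSIX awk. This is only a sketch along the lines of the second version above (reusing its pl and p variable names), not part of the original answer:

awk '{
    pl = 0
    # only compare up to the length of the shorter of the two lines
    n = (length(p) < length($0) ? length(p) : length($0))
    # count matching leading characters one at a time via substr()
    while (pl < n && substr(p, pl + 1, 1) == substr($0, pl + 1, 1))
        pl++
    print pl, $0
    p = $0
}' file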
Answer 3 (score: 0)
A perl script:
#!/usr/bin/perl -ln
$c = [ unpack "C*" ]; #current record
$i = 0;
$i++ while $p->[$i] == $c->[$i]; # count till mismatch
print "$i $_";
$p = $c #save current record for next time
The same thing without the command-line flags:
#!/usr/bin/perl
while (<>) {
    chomp;
    $c = [ unpack "C*" ];
    $i = 0;
    $i++ while $p->[$i] == $c->[$i];
    print "$i $_\n";
    $p = $c
}
The same as a one-liner:
perl -lne '$c=[unpack "C*"]; $i=0; $i++ while $p->[$i] == $c->[$i]; print "$i $_"; $p = $c'
Pass a file containing the lines as an argument, or pipe the data to the command.
On my actual data, it runs about as fast as Borodin's solution:
$ xzcat href.xz |wc -l
33150
$ time xzcat href.xz | ./borodin.pl >borodin.out
real 0m2.437s
user 0m2.684s
sys 0m0.080s
$ time xzcat href.xz | ./pk.pl > pk.out
real 0m2.305s
user 0m2.564s
sys 0m0.088s
$ diff pk.out borodin.out
Answer 4 (score: 0)
In awk:
$ awk -F '' '{n=split(p,a,"");for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++);print --i,$0; p=$0}' file
0 #to
3 #top
0 /0linyier
1 /10000001659/item/1097859586891251/
19 /10000001659/item/1191085827568626/
6 /10000121381/item/890759920974460/
7 /10000154478/item/1118425481552267/
3 /10897504949/pic/89875494927073741108975049493956/108987352826059/?lang=3
2 /1175332/item/10150825241495757/
1 /806123/item/10210653847881125/
1 /51927642128/item/488930816844251927642128/341878905879428/
Explanation:
awk -F '' '{ # each char on its own field
n=split(p,a,"") # split prev record p each char in own a cell
for(i=1;i<=(NF<n?NF:n)&&a[i]==$i;i++); # compare while $i == a[i]
print --i,$0 # print comparison count (--fix)
p=$0 # store record to p(revious)
}' file
Answer 5 (score: -1)
You can do it directly with gawk. It simply compares the current line with the previous one and counts the number of common leading characters:
BEGIN{
    prev="";
}
{
    curr=$1;
    n = length(curr);
    m = length(prev);
    s = n<m?n:m;
    cnt = 0;
    for(i = 1; i <= s; i++){
        if(substr(curr, i, 1) == substr(prev, i, 1)){
            cnt++;
        }else{
            break;
        }
    }
    print(cnt, curr);
    prev=curr;
}
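To try it, one would typically save the script to a file and pass it to awk with -f; the name prefix.awk below is just a placeholder. Note that because the sample lines contain no whitespace, $1 here is the whole line; using $0 instead would also handle lines containing spaces.

# hypothetical script name; "file" is the same input file as above
gawk -f prefix.awk file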