Say I have several input lines like these:
blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432
...
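Each line starts with a prefix that may or may not be shared by all lines. In my case, that depends on how far some other code has gotten so far. What I want to do is strip all leading characters that are identical across all strings. In this case, I want: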
ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432
...
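I have a solution that produces the correct result, but it is very slow. I only need a bash solution. Any help would be much appreciated.
[Update] I edited my initial script to demonstrate the solutions proposed in this thread so far.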
#!/bin/bash
# setup test data
tempf=$( mktemp )
echo "blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432" > "$tempf"
# BASELINE by myself
find_index_baseline () {
    longest_line=$( wc -L < "$tempf" )   # determine end of iteration sequence (wc -L is not in POSIX)
    for i in $( seq 1 "$longest_line" )  # iterate over char at position i
    do
        # find number of different chars by
        # - cutting the i'th character out of every line
        # - unique sort resulting character set
        # - count resulting characters
        diffchars=$( cut -c"${i}" "$tempf" | sort -u | wc -l )
        [ "$diffchars" -ge 2 ] && break  # if more than 1 character, then break
    done
    idx=$(( i - 1 ))                     # save index
    # IFS= and -r keep whitespace and backslashes in each line intact
    while IFS= read -r line; do echo "${line:$idx}"; done < "$tempf"
}
# OPTIMIZED by anishsane
find_index_anishsane () {
    awk 'NR==1{a=$0; next}          # Record first line
    NR==FNR{                        # For entire first pass,
        while(match($0, a)!=1)      # Find the common part in string
            a=substr(a,1,length(a)-1);
        next;
    }
    # In second pass
    FNR==1{a=length(a)}             # This is just an optimization. You could also use sub/gensub based logic
    {print substr($0,a+1)}          # Print the substring
    ' "$tempf" "$tempf"
}
# OPTIMIZED by 123
find_index_123 () {
    awk 'NR==1{
        pos=split($0,a,"")
    }
    NR==FNR{
        split($0,b,"")
        for(i=1;i<=pos;i++)
            if(b[i]!=a[i]){
                pos=i
                break
            }
        next
    }
    NR!=FNR{
        print substr($0,pos)
    }' "$tempf" "$tempf"
}
echo "--- BASELINE (run once)"
time find_index_baseline > /dev/null # even slow when running once :)
echo "---- ANISHSANE x100"
time for i in {1..100}; do find_index_anishsane > /dev/null; done
echo "---- 123 x100"
time for i in {1..100}; do find_index_123 > /dev/null; done
rm -f "$tempf"
The output is:
--- BASELINE (run once)
real 0m1.186s
user 0m0.481s
sys 0m1.283s
---- ANISHSANE x100
real 0m2.277s
user 0m1.024s
sys 0m1.301s
---- 123 x100
real 0m1.984s
user 0m0.772s
sys 0m1.092s
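Answer 0 (score: 2)
Use two passes over the file: the first pass finds how far the common prefix extends, and the second pass prints each line from that position on.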
awk 'NR==1{
    pos=split($0,a,"")
}
NR==FNR{
    split($0,b,"")
    for(i=1;i<=pos;i++)
        if(b[i]!=a[i]){
            pos=i
            break
        }
    next
}
NR!=FNR{
    print substr($0,pos)
}' file{,}
It should be quite fast:
$ for i in {1..10000};do echo -e "blablabla this is always the same 123\nblablabla this is always the same 321\nblablabla this is always the same 4242\nblablabla this is al 242\nblablabla this is always 2432" >> test;done
$ wc -l < test
50000
Timing for 50000 lines on my machine:
real 0m1.444s
user 0m0.888s
sys 0m0.080s
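One edge case may be worth guarding against: if no line ever differs from the first one within its length (for example, all lines are identical, or the first line is a prefix of all the others), pos stays at the length of the first line and its last character is still printed. Below is a minimal sketch of a variant that starts pos one position past the end of the first line. The function name strip_common_prefix is made up for this illustration, and substr() is used for the per-character comparison instead of split() with an empty separator, which POSIX leaves unspecified.

```shell
#!/bin/bash
# strip_common_prefix is a hypothetical helper name, not part of the answer above.
strip_common_prefix () {
    # "$1" "$1" is what file{,} expands to: the same file, read twice.
    awk '
        NR == 1   { first = $0; pos = length($0) + 1 }   # pos = first position after the candidate prefix
        NR == FNR {                                       # first pass: move pos back to the first mismatch
            for (i = 1; i < pos; i++)
                if (substr($0, i, 1) != substr(first, i, 1)) { pos = i; break }
            next
        }
        { print substr($0, pos) }                         # second pass: print what follows the prefix
    ' "$1" "$1"
}

# Usage with a throwaway file
demo=$(mktemp)
printf '%s\n' 'abc 1' 'abc 22' 'abd' > "$demo"
strip_common_prefix "$demo"
rm -f "$demo"
```

The loop condition is i < pos rather than i <= pos, so a line that agrees with the first line everywhere leaves pos untouched and the whole shared prefix is removed.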
Answer 1 (score: 2)
Here is a Python solution that does the job:
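from itertools import takewhile
import sys

def allEqual(x):
    # True if the sequence is empty or all of its elements are equal
    return not x or len(x) == x.count(x[0])

lines = sys.stdin.read().splitlines()
# zip(*...) yields one tuple per character column of the distinct lines;
# count the leading columns in which every character agrees
prefixLen = sum(1 for _ in takewhile(allEqual, zip(*set(lines))))
for l in lines:
    print(l[prefixLen:])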
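The allEqual function tells whether all elements of a given sequence (e.g. a tuple or a list) are equal (or whether the sequence is empty). The prefixLen expression zips the distinct lines together column by column and counts the leading columns in which all characters agree, which gives the length of the longest common prefix. Finally, the main program reads from stdin, determines the length of the longest common prefix, and prints every input line with that prefix removed.
So far, this seems to be faster than the awk-based solutions, for example: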
$ for i in {1..10000};do echo -e "blablabla this is always the same 123\nblablabla this is always the same 321\nblablabla this is always the same 4242\nblablabla this is al 242\nblablabla this is always 2432" >> testdata.txt;done
$ time awk -f 123.awk testdata.txt{,} > /dev/null
real 0m3.858s
user 0m3.826s
sys 0m0.030s
$ time awk -f anishane.awk testdata.txt testdata.txt > /dev/null
real 0m0.517s
user 0m0.511s
sys 0m0.005s
$ time python frerich.py < testdata.txt > /dev/null
real 0m0.099s
user 0m0.082s
sys 0m0.014s
They also produce the same output:
$ awk -f anishane.awk testdata.txt testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ awk -f 123.awk testdata.txt{,} | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ python frerich.py < testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3
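Since the question asks for a bash-only solution, the longest-common-prefix idea can also be sketched with shell built-ins alone. Note that this is a different technique from the column-wise comparison in the Python code: it takes the first line as a candidate prefix and shortens it until every line starts with it. The function name strip_common_prefix_sh is made up for this sketch, and it has not been benchmarked here; shell read loops tend to be slow on large inputs.

```shell
#!/bin/bash
# strip_common_prefix_sh is a hypothetical name chosen for this sketch.
strip_common_prefix_sh () {
    local file=$1 prefix= line first=yes
    # Pass 1: start with the first line as the candidate prefix and
    # shorten it until every line starts with it.
    while IFS= read -r line; do
        if [ -n "$first" ]; then prefix=$line; first=; continue; fi
        while [ -n "$prefix" ]; do
            case $line in
                "$prefix"*) break ;;        # quoted, so the prefix is matched literally
                *) prefix=${prefix%?} ;;    # drop the last character and retry
            esac
        done
    done < "$file"
    # Pass 2: print each line with the common prefix removed.
    while IFS= read -r line; do
        printf '%s\n' "${line#"$prefix"}"
    done < "$file"
}

# Usage with a throwaway file
demo=$(mktemp)
printf '%s\n' 'blablabla this is always the same 123' 'blablabla this is al 242' > "$demo"
strip_common_prefix_sh "$demo"
rm -f "$demo"
```

Quoting "$prefix" inside the case pattern and inside the parameter expansion keeps characters such as * or [ in the data from being treated as glob patterns.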
Answer 2 (score: 0)
Using awk:
awk 'NR==1{a=$0; next}          # Record first line
NR==FNR{                        # For entire first pass,
    while(match($0, a)!=1)      # Find the common part in string
        a=substr(a,1,length(a)-1);
    next;
}
# In second pass
FNR==1{a=length(a)}             # This is just an optimization. You could also use sub/gensub based logic
{print substr($0,a+1)}          # Print the substring
' test-input.log test-input.log # Pass the file twice
ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432
Output of time:
Bash based code:
real 0m0.055s
user 0m0.008s
sys 0m0.000s
awk based code:
real 0m0.005s
user 0m0.000s
sys 0m0.004s
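One caveat with the awk program in this answer: match() interprets its second argument as a regular expression, so a first line containing characters such as . or [ may be matched too loosely or raise a regex error. The following is a minimal sketch of a variant that compares literally with substr() instead; the function name strip_prefix_literal is an assumption made for this example.

```shell
#!/bin/bash
# strip_prefix_literal is a hypothetical name for this sketch.
strip_prefix_literal () {
    awk '
        NR == 1   { a = $0; next }                  # record first line as the candidate prefix
        NR == FNR {                                  # first pass
            # literal comparison: shorten a until it is a prefix of this line
            while (substr($0, 1, length(a)) != a)
                a = substr(a, 1, length(a) - 1)
            next
        }
        FNR == 1  { n = length(a) }                  # second pass begins: remember the prefix length
        { print substr($0, n + 1) }                  # print what follows the prefix
    ' "$1" "$1"
}

# Usage: with the regex-based version, "a.b" would also match "axb"
demo=$(mktemp)
printf '%s\n' 'a.b 1' 'axb 2' > "$demo"
strip_prefix_literal "$demo"
rm -f "$demo"
```

The loop always terminates, because once a has shrunk to the empty string, substr($0, 1, 0) is empty as well and the comparison succeeds.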