Fastest way to find the index of the first character that differs across all strings

Date: 2015-12-18 10:30:18

Tags: bash shell optimization

Say I have several input lines like these:

blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432
...

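There is a prefix at the beginning that may or may not be the same for all of the strings; in my case that depends on some code that runs before this point. What I want to do is strip all leading characters that are identical across all strings. In this case, I would like to get: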

ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432
...

I have a solution that outputs the correct result, but it is very slow. I need a bash-only solution. Any help would be appreciated.

[Update] I edited my initial script to demonstrate the current solutions from this thread.

#!/bin/bash

# setup test data 
tempf=$( mktemp )
echo "blablabla this is always the same 123
blablabla this is always the same 321
blablabla this is always the same 4242
blablabla this is al 242
blablabla this is always 2432" > $tempf 

# BASELINE by myself 
find_index_baseline () {

    longest_line=$( wc -L < "$tempf" )  # longest line length ends the iteration (wc -L is a GNU extension) 
    for i in $( seq 1 "$longest_line" ) # iterate over char at position i 
    do
        # find number of different chars by 
        #  - cutting the i'th character out of every line 
        #  - unique sort resulting character set 
        #  - count resulting characters 
        diffchars=$( cut -c"${i}" "$tempf" | sort -u | wc -l )
        [ "$diffchars" -ge 2 ] && break # if more than 1 character, then break 
    done
    idx=$(( i - 1 )) # save index 
    # IFS= and -r keep leading blanks and backslashes in each line intact
    while IFS= read -r line; do echo "${line:$idx}"; done < "$tempf"
}

# OPTIMIZED by anishsane 
find_index_anishsane () {

   awk 'NR==1{a=$0; next} #Record first line
     NR==FNR{ #For entire first pass,
         while(match($0, a)!=1) #Find the common part in string
             a=substr(a,1,length(a)-1); 
         next;
     }
     # In second pass
     FNR==1{a=length(a)} # This is just an optimization. You could also use sub/gensub based logic

     {print substr($0,a+1)} # Print the substring 
     ' $tempf $tempf
}

# OPTIMIZED by 123 
find_index_123 () {
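    awk 'NR==1{
           # one element per character (gawk behaviour for an empty separator);
           # start one past the end so a fully shared first line is handled
           pos=split($0,a,"")+1
     }
     NR==FNR{
          split($0,b,"")
          for(i=1;i<pos;i++)
             if(b[i]!=a[i]){
                pos=i   # first index at which this line differs
                break
           }
           next
        }
    NR!=FNR{
       print substr($0,pos)
    }' "$tempf" "$tempf"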
}

echo "--- BASELINE (run once)"
time find_index_baseline > /dev/null # even slow when running once :) 
echo "---- ANISHSANE x100"
time for i in {1..100}; do find_index_anishsane > /dev/null; done
echo "---- 123 x100"
time for i in {1..100}; do find_index_123 > /dev/null; done

rm -f $tempf

The output is:

--- BASELINE (run once)

real    0m1.186s
user    0m0.481s
sys     0m1.283s
---- ANISHSANE x100

real    0m2.277s
user    0m1.024s
sys     0m1.301s
---- 123 x100

real    0m1.984s
user    0m0.772s
sys     0m1.092s

3 answers:

Answer 0 (score: 2)

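This makes two passes over the file. The first pass finds how far the common prefix extends; the second pass prints each line from that position onward.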

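awk 'NR==1{
           # one element per character (gawk behaviour for an empty separator);
           # start one past the end so a fully shared first line is handled
           pos=split($0,a,"")+1
     }
     NR==FNR{
          split($0,b,"")
          for(i=1;i<pos;i++)
             if(b[i]!=a[i]){
                pos=i   # first index at which this line differs
                break
           }
           next
        }
    NR!=FNR{
       print substr($0,pos)
    }' file{,}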

It should be fairly fast.

TEST

$ for i in {1..10000};do echo -e "blablabla this is always the same 123\nblablabla this is always the same 321\nblablabla this is always the same 4242\nblablabla this is al 242\nblablabla this is always 2432" >> test;done

$ wc -l < test
  50000

Timing for the 50000 lines on my machine:

real    0m1.444s
user    0m0.888s
sys     0m0.080s
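
To make the two-pass idiom easier to follow, here is a minimal sketch. The show_passes helper and the demo file are made up for illustration and are not part of the answer above. In bash, file{,} is brace-expansion shorthand for "file file", so awk reads the same file twice; the sketch simply passes the path twice.

```shell
#!/bin/bash
# show_passes FILE: hypothetical helper that labels each record with its pass.
# NR counts records across both reads, while FNR restarts at 1 for each file
# argument, so NR==FNR is true only while the first copy is being read.
show_passes() {
    awk 'NR==FNR { print "pass 1: " $0; next }
                 { print "pass 2: " $0 }' "$1" "$1"
}

demo=$(mktemp)                       # throwaway demo input
printf '%s\n' 'foo one' 'foo two' > "$demo"
show_passes "$demo"
rm -f "$demo"
```

Anything that has to be known for the whole file (here, the prefix length) goes into the NR==FNR block; the printing happens in the second read.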

Answer 1 (score: 2)

Here is a Python solution that does the job:

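from itertools import takewhile
import sys

def allEqual(x):
    # True if the sequence is empty or every element equals the first one
    return not x or len(x) == x.count(x[0])

lines = sys.stdin.read().splitlines()
# zip(*...) yields one tuple per character column; set() drops duplicate lines
prefixLen = sum(1 for _ in takewhile(allEqual, zip(*set(lines))))
for l in lines:
    print(l[prefixLen:])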

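The allEqual function tells whether all elements of a given sequence (e.g. a tuple or list) are equal, or whether the sequence is empty. The takewhile expression walks the lines column by column and counts the leading columns in which every character agrees, which gives the length of the longest common prefix. The program as a whole reads from stdin, determines that length, and prints every input line with the common prefix removed.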
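
The same read-everything-then-print structure can also be sketched in awk, which avoids passing the file twice and works on a pipe. This is only a sketch under my own assumptions: the strip_common_prefix name is hypothetical, and all lines are held in memory, just as in the Python version.

```shell
#!/bin/bash
# strip_common_prefix: hypothetical helper, reads lines on stdin.
strip_common_prefix() {
    awk '
    NR == 1 { prefix = $0 }          # candidate prefix: the whole first line
    {
        line[NR] = $0                # keep every line in memory
        # shorten the candidate until this line starts with it (literal compare)
        while (substr($0, 1, length(prefix)) != prefix)
            prefix = substr(prefix, 1, length(prefix) - 1)
    }
    END {
        for (i = 1; i <= NR; i++)
            print substr(line[i], length(prefix) + 1)
    }'
}

printf '%s\n' 'blablabla this is always 2432' 'blablabla this is al 242' |
    strip_common_prefix
```

I have not timed this variant against the numbers below.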

So far, this seems to be faster than the awk-based solutions, for example:

$ for i in {1..10000};do echo -e "blablabla this is always the same 123\nblablabla this is always the same 321\nblablabla this is always the same 4242\nblablabla this is al 242\nblablabla this is always 2432" >> testdata.txt;done
$ time awk -f 123.awk testdata.txt{,} > /dev/null

real    0m3.858s
user    0m3.826s
sys 0m0.030s
$ time awk -f anishane.awk testdata.txt testdata.txt > /dev/null

real    0m0.517s
user    0m0.511s
sys 0m0.005s
$ time python frerich.py < testdata.txt > /dev/null

real    0m0.099s
user    0m0.082s
sys 0m0.014s

They also produce the same output:

$ awk -f anishane.awk testdata.txt testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ awk -f 123.awk testdata.txt{,} | md5
8a3880cb99a388092dd549c8dc4a9cc3
$ python frerich.py < testdata.txt | md5
8a3880cb99a388092dd549c8dc4a9cc3

Answer 2 (score: 0)

Using awk:

awk 'NR==1{a=$0; next} #Record first line
     NR==FNR{ #For entire first pass,
         while(match($0, a)!=1) #Find the common part in string
             a=substr(a,1,length(a)-1); 
         next;
     }
     # In second pass
     FNR==1{a=length(a)} # This is just an optimization. You could also use sub/gensub based logic

     {print substr($0,a+1)} # Print the substring 
     ' test-input.log test-input.log # Pass the file twice


ways the same 123
ways the same 321
ways the same 4242
 242
ways 2432
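
One caveat: match() treats its second argument as a regular expression, so a prefix containing characters such as "." or "[" may be matched too loosely. With the two lines "a.c1" and "abc2", for instance, the candidate "a.c" matches "abc2" as a regex even though the literal common prefix is only "a". Below is a sketch of a variant that compares literally with substr(); the strip_prefix_literal name and the sample input are my own assumptions for illustration.

```shell
#!/bin/bash
# strip_prefix_literal FILE: hypothetical helper. Two passes over FILE, as in
# the answer above, but the prefix is compared literally instead of as a regex.
strip_prefix_literal() {
    awk '
    NR == 1 { a = $0 }                       # candidate prefix: first line
    NR == FNR {                              # first pass: shrink the candidate
        while (substr($0, 1, length(a)) != a)
            a = substr(a, 1, length(a) - 1)
        next
    }
    { print substr($0, length(a) + 1) }      # second pass: drop the prefix
    ' "$1" "$1"
}

sample=$(mktemp)                             # throwaway demo input
printf '%s\n' 'a.c1' 'abc2' > "$sample"
strip_prefix_literal "$sample"               # prints ".c1" then "bc2"
rm -f "$sample"
```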

Output of time:

Bash based code:
real    0m0.055s
user    0m0.008s
sys     0m0.000s

awk based code:
real    0m0.005s
user    0m0.000s
sys     0m0.004s
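
Since the question asks for bash only, here is a sketch that uses nothing but bash built-ins. It is not the code that was timed above; the strip_common_prefix_bash name is hypothetical, and mapfile assumes bash 4 or newer.

```shell
#!/bin/bash
# strip_common_prefix_bash: hypothetical helper; reads lines on stdin and uses
# only bash built-ins (no awk, cut, or sort).
strip_common_prefix_bash() {
    local -a lines
    local line prefix
    mapfile -t lines                     # read all of stdin into an array
    (( ${#lines[@]} )) || return 0       # nothing to do for empty input
    prefix=${lines[0]}
    for line in "${lines[@]}"; do
        # drop trailing characters until the line starts with the candidate;
        # the quoted "$prefix" is matched literally, only * acts as a wildcard
        while [[ $line != "$prefix"* ]]; do
            prefix=${prefix%?}
        done
    done
    for line in "${lines[@]}"; do
        printf '%s\n' "${line:${#prefix}}"
    done
}

strip_common_prefix_bash <<'EOF'
blablabla this is always the same 123
blablabla this is al 242
EOF
```

The shell loops will likely be slower than awk on large inputs, but no external process is started for each character position, which is where the baseline spends most of its time.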