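I have been playing with the Boyer-Moore string search algorithm. Starting from a base code set by Shriphani Palakodety, I created 2 additional versions (v2 and v3), each with some modifications: first removing the len() calls from the loop, then refactoring the while/if conditions. From v1 to v2 I see roughly a 10%-15% improvement, and from v1 to v3 a 25%-30% improvement (significant).
My question is: does anyone have additional mods that would improve performance further (submitted as a v4, if you can), while keeping the base 'algorithm' true to Boyer-Moore?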
#!/usr/bin/env python
import time

bcs = {} #the table

def goodSuffixShift(key):
    for i in range(len(key)-1, -1, -1):
        if key[i] not in bcs.keys():
            bcs[key[i]] = len(key)-i-1
#---------------------- v1 ----------------------
def searchv1(text, key):
    """base from Shriphani Palakodety fixed for single char"""
    i = len(key)-1
    index = len(key) -1
    j = i
    while True:
        if i < 0:
            return j + 1
        elif j > len(text):
            return "not found"
        elif text[j] != key[i] and text[j] not in bcs.keys():
            j += len(key)
            i = index
        elif text[j] != key[i] and text[j] in bcs.keys():
            j += bcs[text[j]]
            i = index
        else:
            j -= 1
            i -= 1
#---------------------- v2 ----------------------
def searchv2(text, key):
    """removed string len functions from loop"""
    len_text = len(text)
    len_key = len(key)
    i = len_key-1
    index = len_key -1
    j = i
    while True:
        if i < 0:
            return j + 1
        elif j > len_text:
            return "not found"
        elif text[j] != key[i] and text[j] not in bcs.keys():
            j += len_key
            i = index
        elif text[j] != key[i] and text[j] in bcs.keys():
            j += bcs[text[j]]
            i = index
        else:
            j -= 1
            i -= 1
#---------------------- v3 ----------------------
def searchv3(text, key):
    """from v2 plus modified 3rd if condition
    breaking down the comparison for efficiency,
    modified the while loop to include the first
    if condition (opposite of it)
    """
    len_text = len(text)
    len_key = len(key)
    i = len_key-1
    index = len_key -1
    j = i
    while i >= 0 and j <= len_text:
        if text[j] != key[i]:
            if text[j] not in bcs.keys():
                j += len_key
                i = index
            else:
                j += bcs[text[j]]
                i = index
        else:
            j -= 1
            i -= 1

    if j > len_text:
        return "not found"
    else:
        return j + 1
key_list = ["POWER", "HOUSE", "COMP", "SCIENCE", "SHRIPHANI", "BRUAH", "A", "H"]
text = "SHRIPHANI IS A COMPUTER SCIENCE POWERHOUSE"

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv1(text, key)
    searchv1(text, key)
    bcs = {}
t2 = time.clock()
print('v1 took %0.5f ms' % ((t2-t1)*1000.0))

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv2(text, key)
    searchv2(text, key)
    bcs = {}
t2 = time.clock()
print('v2 took %0.5f ms' % ((t2-t1)*1000.0))

t1 = time.clock()
for key in key_list:
    goodSuffixShift(key)
    #print searchv3(text, key)
    searchv3(text, key)
    bcs = {}
t2 = time.clock()
print('v3 took %0.5f ms' % ((t2-t1)*1000.0))
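Answer 0 (score: 4)
Using "in bcs.keys()" builds a list (in Python 2) and then does an O(N) search of that list; just use "in bcs".
Do the goodSuffixShift(key) step inside the search function. Two benefits: the caller has only one API to use, and you avoid having bcs as a global (horrid ** 2).
Your indentation is incorrect in several places.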
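As a rough sketch of the first two points, the table can be built and returned by a helper instead of living in a global, using plain dict membership tests. The name build_shift_table is hypothetical; the shift values follow the same rule as goodSuffixShift in the question, so this changes where the table lives, not what it contains.

```python
def build_shift_table(key):
    """Return a dict mapping each character of key to its distance from the end.

    Hypothetical helper: same rule as goodSuffixShift in the question,
    but the table is local and returned rather than stored in a global.
    """
    table = {}
    last = len(key) - 1
    # Walk right to left so the rightmost occurrence of each character wins.
    for i in range(last, -1, -1):
        if key[i] not in table:      # dict membership test, no .keys() needed
            table[key[i]] = last - i
    return table


table = build_shift_table("POWER")
print(table["P"], table["R"])   # distances of 'P' and 'R' from the end of the key
print("Z" in table)             # characters absent from the key are not stored
```

A search function can then call this helper on entry, so callers never touch the table directly.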
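Update
This is not the Boyer-Moore algorithm (which uses TWO lookup tables). It looks more like the Boyer-Moore-Horspool algorithm, which uses only the first BM table.
A probable speedup: add the line 'bcsget = bcs.get' after setting up the bcs dict. Then replace: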
if text[j] != key[i]:
    if text[j] not in bcs.keys():
        j += len_key
        i = index
    else:
        j += bcs[text[j]]
        i = index
with:
if text[j] != key[i]:
    j += bcsget(text[j], len_key)
    i = index
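A small sketch of why the two forms are interchangeable: dict.get(k, default) returns the stored value when k is present and the default otherwise, which is what the if/else computed. The helper names shift_branching and shift_get are made up for this comparison, and the table is the one goodSuffixShift would produce for the key "POWER".

```python
bcs = {"R": 0, "E": 1, "W": 2, "O": 3, "P": 4}   # table for the key "POWER"
len_key = 5
bcsget = bcs.get   # bind the method once, outside the search loop


def shift_branching(ch):
    """Shift computed with the original two-branch lookup."""
    if ch not in bcs:
        return len_key
    else:
        return bcs[ch]


def shift_get(ch):
    """Shift computed with a single dict.get call and a default."""
    return bcsget(ch, len_key)


# Both forms agree for characters inside and outside the table.
for ch in "POWERXYZ ":
    assert shift_branching(ch) == shift_get(ch)
print(shift_get("O"), shift_get("X"))
```

Binding bcs.get to a local name also avoids repeating the attribute lookup on every mismatch, which is where any speedup would come from.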
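Update 2: back to basics, i.e. getting the code correct before you optimise
Version 1 has some bugs, which you have carried forward into versions 2 and 3. Some suggestions:
Change the not-found response from "not found" to -1. This makes it compatible with text.find(key), which you can use to check your results.
Get some more text values, e.g. "R" * 20, "X" * 20, and "XXXSCIENCEYYY", for use with your existing key values.
Lash up a test harness, something like this: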
func_list = [searchv1, searchv2, searchv3]
# the original text plus the extra text values suggested above
text_list = [text, "R" * 20, "X" * 20, "XXXSCIENCEYYY"]

def test():
    for text in text_list:
        print('==== text is %r' % text)
        for func in func_list:
            for key in key_list:
                # compute the expected value first so the except branch can report it
                expected = text.find(key)
                try:
                    result = func(text, key)
                except Exception as e:
                    print("EXCEPTION: %r expected:%d func:%s key:%r" % (e, expected, func.__name__, key))
                    continue
                if result != expected:
                    print("ERROR actual:%r expected:%d func:%s key:%r" % (result, expected, func.__name__, key))
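Run that, fix the errors in v1, carry those fixes forward, and run the tests again until they all pass. Then you can tidy up your timing harness along the same lines and see what the performance is. Then you can report back here, and I'll give you my idea of what a searchv4 function should look like ;-)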
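For comparison while working through the tests, here is a minimal textbook-style Boyer-Moore-Horspool sketch that agrees with text.find on the inputs above. It is neither the asker's code nor the searchv4 mentioned in the answer; the name horspool_find and its window-start formulation are assumptions made for illustration. The shift table skips the last character of the key so that every stored shift is at least 1.

```python
def horspool_find(text, key):
    """Return the index of the first occurrence of key in text, or -1."""
    len_text = len(text)
    len_key = len(key)
    if len_key == 0:
        return 0                      # mirrors "abc".find("") == 0
    last = len_key - 1
    # Build the shift table from every character except the last one,
    # so a stored shift can never be zero.
    shift = {}
    for i in range(last):
        shift[key[i]] = last - i
    shift_get = shift.get
    pos = 0                           # index in text where the window starts
    while pos + len_key <= len_text:
        i = last
        # Compare the window against the key from right to left.
        while i >= 0 and text[pos + i] == key[i]:
            i -= 1
        if i < 0:
            return pos
        # Shift by the entry for the character under the window's last slot;
        # characters not in the table move the window by the full key length.
        pos += shift_get(text[pos + last], len_key)
    return -1


key_list = ["POWER", "HOUSE", "COMP", "SCIENCE", "SHRIPHANI", "BRUAH", "A", "H"]
text_list = ["SHRIPHANI IS A COMPUTER SCIENCE POWERHOUSE",
             "R" * 20, "X" * 20, "XXXSCIENCEYYY"]
for text in text_list:
    for key in key_list:
        assert horspool_find(text, key) == text.find(key)
print("all results match str.find")
```

Because it returns -1 on failure, it can be dropped into func_list in the harness above without changes.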