If I have a collection of strings, is there a data structure or function that could speed up checking whether any element of the collection is a substring of my main string?
Right now I'm looping through my array of strings and using the in operator. Is there a faster way?
import timing
## string match in first do_not_scan
## 0:00:00.029332
## string not in do_not_scan
## 0:00:00.035179
def check_if_substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False
## string match in first do_not_scan
## 0:00:00.046530
## string not in do_not_scan
## 0:00:00.067439
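def index_of():
    for x in do_not_scan:
        try:
            string.index(x)  # raises ValueError when x is not found
            return True
        except ValueError:
            # Keep checking the remaining items; the version that was timed
            # above returned False here, after the first miss.
            continue
    return False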
## string match in first do_not_scan
## 0:00:00.047654
## string not in do_not_scan
## 0:00:00.070596
def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False
string = '/usr/documents/apps/components/login'
do_not_scan = ['node_modules','bower_components']
for x in range(100000):
    find_def()
    index_of()
    check_if_substring()
Answer 0 (score: 2)
def check():
    if any(w in string for w in do_not_scan):
        return True
    else:
        return False
Or simpler:
def check():
    return any(w in string for w in do_not_scan)
As mentioned by @Two-Bit Alchemist.
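Answer 1 (score: 2)
No, there is no faster built-in way.
If you have a large number of strings to test for, you may be better off using a third-party Aho-Corasick package.
With the built-in methods, the worst case is that there is no match, which means you have tested every item in the list and nearly every offset for each item.
Fortunately, the in operator is very fast (at least in CPython) and was nearly three times faster than the alternatives in my tests: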
0.3364804992452264 # substring()
0.867534976452589 # any_substring()
0.8401796016842127 # find_def()
0.9342398950830102 # index_of()
2.7920695478096604 # re implementation
Here is the script I used for testing:
from timeit import timeit
import re
def substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False
def any_substring():
    return any(x in string for x in do_not_scan)
def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False
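def index_of():
    for x in do_not_scan:
        try:
            string.index(x)  # raises ValueError when x is not found
            return True
        except ValueError:
            # Move on to the next item; the timed version returned False here,
            # after the first miss.
            continue
    return False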
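def re_match():
    for x in do_not_scan:
        # re.search takes (pattern, string); the timed version had the
        # arguments swapped. re.escape treats x as a literal substring.
        if re.search(re.escape(x), string):
            return True
    return False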
string = 'a'
do_not_scan = ['node_modules','bower_components']
print(timeit('substring()', setup='from __main__ import substring'))
print(timeit('any_substring()', setup='from __main__ import any_substring'))
print(timeit('find_def()', setup='from __main__ import find_def'))
print(timeit('index_of()', setup='from __main__ import index_of'))
print(timeit('re_match()', setup='from __main__ import re_match'))
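As a side note to the re variant above, the standard library can also test all items in one pass by joining them into a single compiled alternation. This is only a sketch: the name check_with_regex is made up for illustration, and nothing here shows it beats the in loop for two short items, so it would need to be measured on real data.

```python
import re

do_not_scan = ['node_modules', 'bower_components']

# Compile once, outside the hot path. re.escape makes every item a literal,
# so characters such as '.' or '+' in an item are not treated as regex syntax.
pattern = re.compile('|'.join(re.escape(w) for w in do_not_scan))


def check_with_regex(path):
    """Return True if any item of do_not_scan occurs in path."""
    return pattern.search(path) is not None
```

Compiling the pattern once matters here; calling re.search with a fresh pattern string on every check adds lookup overhead on each call.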
Answer 2 (score: 2)
I don't have a large dataset to try this on, but maybe something like this would work?
Python 3:
from builtins import any
import timeit
do_not_scan = ['node_modules', 'bower_components']
string = 'a'
def check_if_substring():
    return any(string in x for x in do_not_scan)
result = timeit.Timer("check_if_substring()", "from __main__ import check_if_substring")
count = 10000
print(result.timeit(count)/count)
Or the other way around:
def check_if_substring():
    return any(x in string for x in do_not_scan)
My result: 6.48119201650843e-07
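Answer 3 (score: 2)
Yes, there is a faster way to perform found = any(s in main_string for s in collection_of_strings). For example, the Aho-Corasick algorithm improves the any()-based O(n*m*k) algorithm to O(n + m*k) time, where n is len(main_string), m is len(collection_of_strings), and k represents the individual lengths of the strings in the collection.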
#!/usr/bin/env python
import noaho  # $ pip install noaho
trie = noaho.NoAho()
for s in collection_of_strings:
    trie.add(s)
found = trie.find_short(main_string)[0] is not None
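To make the idea behind the algorithm concrete without a third-party dependency, here is a minimal pure-Python sketch of an Aho-Corasick automaton. It is not noaho's implementation, and the names build_automaton and contains_any are invented for this illustration. It only answers "does any item occur?" and will likely be slower than a C extension in practice.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and match flags for patterns."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # fail[state] -> state for the longest proper suffix in the trie
    is_match = [False]  # is_match[state] -> some pattern ends at this state

    # Phase 1: insert every pattern into the trie.
    for pattern in patterns:
        if not pattern:
            raise ValueError("empty patterns are not supported in this sketch")
        state = 0
        for ch in pattern:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                is_match.append(False)
            state = nxt
        is_match[state] = True

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fail back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            fallback = fail[state]
            while fallback and ch not in goto[fallback]:
                fallback = fail[fallback]
            fail[nxt] = goto[fallback].get(ch, 0)
            # A state also matches if a pattern ends at its suffix state.
            is_match[nxt] = is_match[nxt] or is_match[fail[nxt]]
    return goto, fail, is_match


def contains_any(text, automaton):
    """Scan text once and report whether any pattern occurs in it."""
    goto, fail, is_match = automaton
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if is_match[state]:
            return True
    return False
```

The automaton is built once from the collection and then reused for every main string, which is where the single pass over the text comes from.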
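Note: if you are interested in Big-O behavior, there is no point in measuring time performance on tiny strings such as string = 'a'. Either use a more representative sample for the benchmark, or accept that you do not need an (asymptotically) faster algorithm in your case.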