Python3 Fast Way To Find If Any Elements In Collections Are Substring Of String

Date: 2016-03-04 18:03:38

Tags: python algorithm python-3.x big-o string-algorithm

If I have a collection of strings, is there a data structure or function that could speed up checking whether any element of the collection is a substring of my main string?

Right now I'm looping through my array of strings and using the in operator. Is there a faster way?

import timing

## string match in first do_not_scan
## 0:00:00.029332

## string not in do_not_scan
## 0:00:00.035179
def check_if_substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

## string match in first do_not_scan
## 0:00:00.046530

## string not in do_not_scan
## 0:00:00.067439
def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # x is not in string; move on to the next item
            continue
    return False

## string match in first do_not_scan
## 0:00:00.047654

## string not in do_not_scan
## 0:00:00.070596
def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

string = '/usr/documents/apps/components/login'
do_not_scan = ['node_modules','bower_components']

for x in range(100000):
    find_def()
    index_of()
    check_if_substring()

4 Answers:

Answer 0 (score: 2)

def check():
    if any(w in string for w in do_not_scan):
        return True
    else:
        return False

Or simpler:

def check():
    return any(w in string for w in do_not_scan)

as mentioned by @Two-Bit Alchemist
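
One detail worth spelling out: any() over a generator expression stops at the first truthy item, so it short-circuits just like the explicit loop in the question. A minimal sketch of that behaviour follows; `contains` and `checked` are hypothetical names added only to record which items were actually tested, and the path is made up for illustration.

```python
do_not_scan = ['node_modules', 'bower_components']
path = '/usr/documents/node_modules/login'  # illustrative path, not from the question

checked = []  # records every item that was actually tested


def contains(needle, haystack):
    """Substring test that also logs which needle was examined."""
    checked.append(needle)
    return needle in haystack


# The generator is consumed lazily, so any() stops at the first match.
found = any(contains(w, path) for w in do_not_scan)
print(found, checked)
```

Passing a list comprehension instead of a generator expression would test every item before any() runs, which loses the early exit.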

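Answer 1 (score: 2)

No, there is no faster built-in way.

If you have a large number of strings to test, you may be better off using a third-party Aho-Corasick package, as another answer here shows.

With the built-in methods, the worst case is when there is no match, which means you have tested every item in the list and nearly every offset for each item.

Fortunately, the in operator is very fast (at least in CPython) and was nearly three times faster in my tests: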

0.3364804992452264  # substring()
0.867534976452589   # any_substring()
0.8401796016842127  # find_def()
0.9342398950830102  # index_of()
2.7920695478096604  # re implementation

Here is the script I used for testing:

from timeit import timeit
import re

def substring():
    for x in do_not_scan:
        if x in string:
            return True
    return False

def any_substring():
    return any(x in string for x in do_not_scan)

def find_def():
    for x in do_not_scan:
        if string.find(x) != -1:
            return True
    return False

def index_of():
    for x in do_not_scan:
        try:
            string.index(x)
            return True
        except ValueError:
            # x is not in string; move on to the next item
            continue
    return False

def re_match():
    for x in do_not_scan:
        # re.search takes the pattern first; escape x so it is matched literally
        if re.search(re.escape(x), string):
            return True
    return False

string = 'a'
do_not_scan = ['node_modules','bower_components']

print(timeit('substring()', setup='from __main__ import substring'))
print(timeit('any_substring()', setup='from __main__ import any_substring'))
print(timeit('find_def()', setup='from __main__ import find_def'))
print(timeit('index_of()', setup='from __main__ import index_of'))
print(timeit('re_match()', setup='from __main__ import re_match'))
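
The re variant in the script runs one search per item. A different approach, not part of the answer above, is to combine all items into one precompiled alternation so the main string is scanned in a single call. The sketch below uses only the standard re module; `re_any` is a hypothetical name, and whether it beats the plain in loop depends on the data, so it should be measured rather than assumed.

```python
import re

do_not_scan = ['node_modules', 'bower_components']

# Escape each item so characters such as '.' or '+' are matched literally,
# then join them into a single alternation: node_modules|bower_components
pattern = re.compile('|'.join(re.escape(s) for s in do_not_scan))


def re_any(string):
    """Return True if any item of do_not_scan occurs in string."""
    return pattern.search(string) is not None


print(re_any('/usr/documents/apps/components/login'))
print(re_any('/usr/lib/node_modules/npm'))
```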

Answer 2 (score: 2)

I don't have a large dataset to try this on.

But maybe something like this would work?

python3

from builtins import any
import timeit

do_not_scan = ['node_modules', 'bower_components']
string = 'a'


def check_if_substring():
    return any(string in x for x in do_not_scan)


result = timeit.Timer("check_if_substring()", "from __main__ import check_if_substring")
count = 10000
print(result.timeit(count)/count)

Or the other way around:

def check_if_substring():
    return any(x in string for x in do_not_scan)

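My result: 6.48119201650843e-07

Answer 3 (score: 2)

Yes, there is a faster way to perform found = any(s in main_string for s in collection_of_strings). For example, the Aho-Corasick algorithm improves the any()-based O(n*m*k) algorithm to O(n + m*k) time, where n is len(main_string), m is len(collection_of_strings), and k represents the individual lengths of the strings in the collection.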

#!/usr/bin/env python
import noaho # $ pip install noaho

trie = noaho.NoAho()
for s in collection_of_strings:
    trie.add(s)
found = trie.find_short(main_string)[0] is not None
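
To make the idea behind the automaton more concrete, here is a minimal pure-Python sketch of Aho-Corasick. It is not noaho's implementation: `build_automaton` and `contains_any` are hypothetical names, and the sketch only answers "is any pattern present?" without reporting positions. The trie is built once from the collection, after which the main string is scanned in a single left-to-right pass.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and match flags."""
    goto = [{}]         # goto[state][char] -> next state
    fail = [0]          # fail[state] -> state of the longest proper suffix
    is_match = [False]  # is_match[state] -> some pattern ends at this state

    # Phase 1: insert every pattern into the trie.
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                is_match.append(False)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        is_match[state] = True

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states keep fail == 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # A state also matches when its suffix state matches.
            is_match[nxt] = is_match[nxt] or is_match[fail[nxt]]
    return goto, fail, is_match


def contains_any(text, automaton):
    """Return True if any pattern of the automaton occurs in text."""
    goto, fail, is_match = automaton
    state = 0
    if is_match[state]:  # an empty pattern matches everything
        return True
    for ch in text:
        # Follow failure links until a transition on ch exists (or root).
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        if is_match[state]:
            return True
    return False


automaton = build_automaton(['node_modules', 'bower_components'])
print(contains_any('/usr/documents/apps/components/login', automaton))
print(contains_any('/usr/lib/node_modules/npm', automaton))
```

In pure Python the constant factors are high, so for a small collection the plain in loop may well stay faster; the sketch is meant to show the structure, not to replace a compiled package.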

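Note: if you are interested in Big-O behavior, there is no point in measuring time performance on tiny strings such as string = 'a'. Either use a more representative sample for the benchmark, or you don't need a faster (asymptotically) algorithm in your case.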
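
As a rough illustration of what a more representative benchmark could look like, the sketch below times the any()-based check on a long random string against a few hundred random patterns. The sizes, the seed, and the names `random_word`, `any_in` and `benchmark` are assumptions chosen for illustration, not values from the question; real data with realistic match rates would be a better sample.

```python
import random
import string
import timeit

random.seed(0)  # fixed seed so the sample is reproducible


def random_word(length):
    """Return a random lowercase string of the given length."""
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))


# Hypothetical sizes: n = 100000 characters, m = 500 patterns, k = 12.
main_string = random_word(100000)
collection_of_strings = [random_word(12) for _ in range(500)]


def any_in():
    return any(s in main_string for s in collection_of_strings)


def benchmark(number=5):
    """Return the average time of one any_in() call, in seconds."""
    return timeit.timeit(any_in, number=number) / number


print(benchmark())
```

Random 12-letter patterns are very unlikely to occur in the main string, so this mostly exercises the no-match worst case described earlier.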