Optimizing a search through a long list of strings

Date: 2018-12-21 09:30:08

Tags: python search list-comprehension

I have the following MWE, in which I use a list comprehension to search the list ls for entries containing each of the strings in strings:

import numpy as np

strings = ["ASD", "DSA", "ABC", "ABQ"]
ls     = np.asarray(["ASD", "DSA", "ASD", "ABC", "ABQ","ASD", "DSA", "ASD", "ABC", "ABQ","ASD", "DSA", "ASD", "ABC", "ABQ"])

for string in strings:
    print(len(ls[[string in s for s in ls]]))  
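    # prints 6, 3, 3 and 3 (one count per line) for this small example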

This works as expected, but the problem is that my list ls is very long (10^9 entries) and the list comprehension takes a considerable amount of time.

Is there a way to optimize the code above?


EDIT: I am looking for a solution that lets me record the individual counts, i.e. 6, 3, 3 and 3.
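
For concreteness, a sketch of what I mean by recording the counts, reusing the MWE above (the dict counts_per_string is just an illustrative name):

counts_per_string = {}
for string in strings:
    counts_per_string[string] = len(ls[[string in s for s in ls]])
print(counts_per_string)
# {'ASD': 6, 'DSA': 3, 'ABC': 3, 'ABQ': 3}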

2 answers:

Answer 0 (score: 4):

Use np.unique with return_counts=True, together with boolean indexing via np.in1d, to keep only those unique values and counts that correspond to entries of strings present in ls:

l, counts = np.unique(ls, return_counts=True)
mask = np.in1d(l,strings)

l[mask]
#array(['ABC', 'ABQ', 'ASD', 'DSA'], dtype='<U3')

counts[mask]
#array([3, 3, 6, 3], dtype=int64)
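
If the counts are needed in the same order as strings (6, 3, 3 and 3, as asked for in the question's edit), one possible follow-up, reusing l, counts and mask from above, is to map them through a dict (count_map is just an illustrative name):

count_map = dict(zip(l[mask], counts[mask]))
print([int(count_map.get(s, 0)) for s in strings])
#[6, 3, 3, 3]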

Answer 1 (score: 1):

I suggest using the idea proposed in this post; the best option is to use a collections.Counter.
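
A minimal sketch of the idea, applied directly to the strings and ls defined in the question (this assumes exact matches are sufficient, as they are in the MWE, rather than substring matches):

import collections

c = collections.Counter(ls)       # one pass over ls
print([c[s] for s in strings])    # cheap lookup per search string
# [6, 3, 3, 3]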

The counter is built once, and the individual elements you want to count can then be looked up cheaply. A full comparison against the original approach and a couple of alternatives could look like this:

import collections
import numpy as np
import timeit

def _get_data(as_numpy):
    data = []
    for _ in range(10**6):
        data.extend(["ASD", "DSA", "ASD", "ABC", "ABQ"])
    if as_numpy:
        data = np.asarray(data)
    return data

def f1(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []
    for search_str in search_list:
        result_list.append(
            len(data[[search_str in s for s in data]]))
    return result_list

def f2(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []
    c = collections.Counter(data)
    for search_str in search_list:
        result_list.append(c[search_str])
    return result_list

def f3(data):
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    result_list = []
    c = collections.Counter(data)  # note: unused here; f3 counts via list.count instead
    for search_str in search_list:
        result_list.append(data.count(search_str))
    return result_list

def f4(data):
    # suggestion by user 'nixon' in another answer to this question
    search_list = ["ASD", "DSA", "ABC", "ABQ"]
    l, counts = np.unique(data, return_counts=True)
    # 'l' and 'counts' are in a different order than 'search_list'
    result_list = [
        counts[np.where(l == search_str)[0][0]]
        for search_str in search_list]
    return result_list

To make sure all of these approaches return the same result:

data1 = _get_data(as_numpy=True)
data2 = _get_data(as_numpy=False)
assert f1(data1) == f2(data2) == f3(data2) == f4(data1)

Comparing the timings, I get:

print(timeit.timeit(
    'f(data)',
    'from __main__ import f1 as f, _get_data; data = _get_data(as_numpy=True)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f2 as f, _get_data; data = _get_data(as_numpy=False)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f3 as f, _get_data; data = _get_data(as_numpy=False)',
    number=10))
print(timeit.timeit(
    'f(data)',
    'from __main__ import f4 as f, _get_data; data = _get_data(as_numpy=True)',
    number=10))

# f1 48.2 sec
# f2  1.7 sec
# f3  3.8 sec
# f4  9.7 sec

As you can see, the timings differ by more than an order of magnitude: the Counter-based f2 makes a single pass over the data and then answers each query with a constant-time lookup, whereas f1 re-scans all 5 * 10**6 entries for every search string.

Does this work for your case?


EDIT: added the numpy.unique approach (f4), similar to what @nixon suggested in another answer to this question; it still seems to be slower than using collections.Counter.