Searching for (numeric) string matches in a huge list in Python

Time: 2013-10-11 02:34:32

Tags: python performance algorithm list search

I have a sorted list of numbers (10 million of them) stored as strings; every entry is exactly 15 characters long. Think of:

100000000000000
100000000000001
...
100000010000000

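Now I want to split this list into regular subdivisions to see how the entries accumulate across different ranges. The output might look like this: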

100000000xxxxxx, 523121 entries
100000001xxxxxx, 32231 entries

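So far I have tried reading the whole list into a set and then searching it. I tried both string and int formats. The integer version is about 3 times faster than the string version. The code is as follows: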

import sys

inputfile = sys.argv[1]
collection_str = set(line.strip() for line in open(inputfile))
collection_int = set(int(line.strip()) for line in open(inputfile))

def find_str(look_for, ourset):
    count = 0
    for entry in ourset:
        if entry.startswith(look_for):
            count += 1
    return count

def find_int(look_for, ourset):
    search_min = int(str(look_for) + "000000")
    search_max = int(str(look_for+1) + "000000")

    count = 0
    for entry in ourset:
        if entry >= search_min and entry < search_max:
            count += 1
    return count

The results are as follows:

"int version"
100000100 27401 (0.515992sec)
100000101 0 (0.511334sec)
100000102 0 (0.510956sec)
100000103 0 (0.510467sec)
100000104 0 (0.512834sec)
100000105 0 (0.511501sec)

"string version"
100000100 27401 (1.794804sec)
100000101 0 (1.794449sec)
100000102 0 (1.802035sec)
100000103 0 (1.797590sec)
100000104 0 (1.793691sec)
100000105 0 (1.796785sec)

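I'd like to know whether I can make this faster somehow. Even at 0.5 seconds per range, it still takes a long time if I want to run this often to produce periodic statistics. From searching around I found that some people use bisect for similar things, but I can't seem to understand how it is supposed to work.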

2 Answers:

Answer 0 (score: 2)

Put it into a numpy array. Then you can use nice, fast vectorized operations :)

from random import randint
import numpy
ip = numpy.array(['1{0:014d}'.format(randint(0, 10000000)) for x in range(10000000)], dtype=numpy.int64)

numpy.sum(ip <= 100000000010000)
# 9960
%timeit numpy.sum(ip <= 100000000010000)
# 10 loops, best of 3: 35 ms per loop

Adapted to your search function:

import numpy

def find_numpy(look_for, ourset):
    search_min = int('{0:0<15s}'.format(str(look_for)))
    search_max = int('{0:0<15s}'.format(str(look_for+1)))
    return numpy.sum((ourset >= search_min) & (ourset < search_max))

with open('path/to/your/file.txt', 'r') as f:
    ip = numpy.array([line.strip() for line in f], dtype=numpy.int64)

find_numpy(1000000001, ip)
# 99686
%timeit find_numpy(1000000001, ip)
# 10 loops, best of 3: 86.6 ms per loop
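
If the goal is the full breakdown rather than a single range, one vectorized pass can produce every bucket count at once. The following is a minimal sketch, assuming the values are already in an int64 array like `ip` above and that a bucket is defined by dropping the last six digits. The helper name `bucket_counts` and the sample data are made up for illustration.

```python
import numpy as np

def bucket_counts(values, suffix_digits=6):
    """Count entries per prefix, where the prefix is the value with its
    last `suffix_digits` decimal digits removed (hypothetical helper)."""
    prefixes = np.asarray(values, dtype=np.int64) // 10 ** suffix_digits
    # np.unique sorts the prefixes and reports how often each one occurs
    uniq, counts = np.unique(prefixes, return_counts=True)
    return dict(zip(uniq.tolist(), counts.tolist()))

# Tiny made-up sample standing in for the 10 million entries
sample = np.array([100000000000000, 100000000000001,
                   100000000999999, 100000001000000], dtype=np.int64)
for prefix, n in bucket_counts(sample).items():
    print("{0}xxxxxx, {1} entries".format(prefix, n))
```

Only prefixes that actually occur show up in the result, so empty ranges would have to be filled in with 0 separately if they are needed in the report.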

Answer 1 (score: 1)

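If the list is sorted, bisect uses a bisection (binary) search to find the index that satisfies the condition. bisect looks much faster than using a numpy array here, because it only needs O(log n) comparisons while the numpy comparison scans every element.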

import numpy as np
import bisect
from random import randint
from timeit import Timer

ip = ['1{0:014d}'.format(randint(0, 10000000)) for x in range(10000000)]
ip = sorted(ip)
print(bisect.bisect(ip, '100000000010000'))
# 9869
t = Timer("bisect.bisect(ip, '100000000010000')", 'from __main__ import bisect, ip')
print(t.timeit(100))
# 0.000268309933485 seconds

ip_int = list(map(int, ip))
print(bisect.bisect(ip_int, 100000000010000))
# 9869
t = Timer("bisect.bisect(ip_int, 100000000010000)", 'from __main__ import bisect, ip_int')
print(t.timeit(100))
# 0.000137443078672 seconds

ip_numpy = np.array(ip_int)
print(np.sum(ip_numpy <= 100000000010000))
# 9869
t = Timer("np.sum(ip_numpy <= 100000000010000)", 'from __main__ import np, ip_numpy')
print(t.timeit(100))
# 8.23690123071 seconds
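
To count the entries in one range, as the question asks, two bisections are enough: one for the lower bound and one for the upper bound. This is a rough sketch, assuming a sorted list of fixed-width, digits-only strings. The helper name `count_prefix` and the sample entries are illustrative and not part of the original code.

```python
import bisect

def count_prefix(sorted_entries, prefix):
    """Count strings in a sorted list that start with `prefix`
    (hypothetical helper; assumes fixed-width, digits-only entries)."""
    width = len(sorted_entries[0]) if sorted_entries else len(prefix)
    low = prefix.ljust(width, "0")    # smallest possible entry with this prefix
    high = prefix.ljust(width, "9")   # largest possible entry with this prefix
    start = bisect.bisect_left(sorted_entries, low)
    end = bisect.bisect_right(sorted_entries, high)
    return end - start

entries = sorted(["100000000000000", "100000000000001",
                  "100000000999999", "100000001000000"])
print(count_prefix(entries, "100000000"))  # 3
print(count_prefix(entries, "100000001"))  # 1
print(count_prefix(entries, "100000002"))  # 0
```

Each call costs two binary searches, so the time per range depends on the logarithm of the list length and not on how many entries fall inside the range.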

Binary search algorithm
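
NumPy also offers a binary search on sorted arrays through `numpy.searchsorted`, so the same idea works without leaving NumPy. A small sketch, assuming the array is sorted in ascending order; the function name `count_range_sorted` and the sample values are hypothetical.

```python
import numpy as np

def count_range_sorted(sorted_arr, search_min, search_max):
    """Count values v with search_min <= v < search_max in a sorted array."""
    # side="left" returns the first index whose value is >= the probe
    start = np.searchsorted(sorted_arr, search_min, side="left")
    end = np.searchsorted(sorted_arr, search_max, side="left")
    return int(end - start)

arr = np.sort(np.array([100000000000001, 100000001000000,
                        100000000000000, 100000000999999], dtype=np.int64))
print(count_range_sorted(arr, 100000000000000, 100000001000000))  # 3
```

Unlike `np.sum(arr <= x)`, this does not scan the whole array on every query, so it should be much closer to the bisect timings than the full-scan comparison is.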