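I have a big sorted list, "alist", of about 10 million integers. For some of these integers (taken from a second list, "blist"), I need the smallest distance to their neighbours in alist. I do this by finding the position of the integer I'm looking for, taking the items before and after it, and measuring the differences: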
alist=[1, 4, 30, 1000, 2000] #~10 million integers
blist=[4, 30, 1000] #~8 million integers
for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])
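This operation has to be repeated millions of times and, unfortunately, it takes ages on my machine. Is there any way to improve the performance of this code? I use Python 2.6, and Python 3 is not an option.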
Answer 0 (score: 4)
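I suggest using binary search. It is much faster, needs no extra memory, and requires only a small change. Instead of alist.index(b), just use bisect_left(alist, b) from the standard bisect module.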
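A minimal sketch of that change, assuming alist is sorted and b actually occurs in it. The helper name neighbor_distance and the boundary handling (using only the one existing neighbour at either end of the list) are assumptions added for illustration; the loop in the question indexes position-1 and position+1 directly.

```python
from bisect import bisect_left


def neighbor_distance(alist, b):
    """Return the smallest distance from b to its neighbours in sorted alist.

    Hypothetical helper: assumes alist is sorted and has at least two items.
    """
    position = bisect_left(alist, b)  # O(log n), where list.index() is O(n)
    if position == len(alist) or alist[position] != b:
        raise ValueError("%r is not in alist" % (b,))
    candidates = []
    if position > 0:  # a left neighbour exists
        candidates.append(b - alist[position - 1])
    if position + 1 < len(alist):  # a right neighbour exists
        candidates.append(alist[position + 1] - b)
    return min(candidates)


alist = [1, 4, 30, 1000, 2000]
blist = [4, 30, 1000]
distances = [neighbor_distance(alist, b) for b in blist]
```

Like alist.index(b), bisect_left returns the leftmost matching position, so lists with repeated values behave the same way as in the original loop.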
If your blist is sorted as well, you could also use a very simple incremental search: search for the current b not from the beginning of alist, but from the index of the previous b.
Benchmarks with Python 2.7.11 and lists containing 10 million and 8 million integers:
389700.01 seconds Andy_original (time estimated)
377100.01 seconds Andy_no_lists (time estimated)
     6.30 seconds Stefan_binary_search
     2.15 seconds Stefan_incremental_search
     3.57 seconds Stefan_incremental_search2
     1.21 seconds Jacquot_NumPy
    (0.74 seconds Stefan_only_search_no_distance)
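Andy's original would take about 4.5 days, so I only ran it on every 100000th entry of blist and scaled the time up. Binary search is much faster, incremental search is faster still, and NumPy beats them all, though all of them take only seconds.
The last entry, taking 0.74 seconds, is my incremental search without the distance = min(...) line, so it is not comparable to the others. But it shows that the search takes only about 34% of the total 2.15 seconds. So there is not much more I can do, since the distance = min(...) calculation now accounts for most of the time.
The results with Python 3.5.1 are similar:
509819.56 seconds Andy_original (time estimated)
505257.32 seconds Andy_no_lists (time estimated)
     8.35 seconds Stefan_binary_search
     4.61 seconds Stefan_incremental_search
     4.53 seconds Stefan_incremental_search2
     1.39 seconds Jacquot_NumPy
    (1.45 seconds Stefan_only_search_no_distance)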
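The two figures quoted for the Python 2.7.11 run (about 4.5 days, and about 34% of the time spent searching) can be checked with a little arithmetic. The variable names below are illustrative only; the numbers are taken from the benchmark output.

```python
SECONDS_PER_DAY = 24 * 60 * 60

estimated_seconds = 389700.01              # Andy_original, scaled-up estimate
days = estimated_seconds / SECONDS_PER_DAY  # roughly 4.5 days

# The estimate is the time measured on every 100000th entry, times 100000.
measured_seconds = estimated_seconds / 100000

# Share of the incremental search spent on searching alone.
search_share = 0.74 / 2.15                 # roughly 0.34

print('%.2f days, %.0f%% searching' % (days, search_share * 100))
```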
The complete code with all versions and tests:
def Andy_original(alist, blist):
    for b in blist:
        position = alist.index(b)
        distance = min([b-alist[position-1], alist[position+1]-b])
def Andy_no_lists(alist, blist):
    for b in blist:
        position = alist.index(b)
        distance = min(b-alist[position-1], alist[position+1]-b)
from bisect import bisect_left
def Stefan_binary_search(alist, blist):
    for b in blist:
        position = bisect_left(alist, b)
        distance = min(b-alist[position-1], alist[position+1]-b)
def Stefan_incremental_search(alist, blist):
    position = 0
    for b in blist:
        # continue from the position of the previous b
        while alist[position] < b:
            position += 1
        distance = min(b-alist[position-1], alist[position+1]-b)
def Stefan_incremental_search2(alist, blist):
    position = 0
    for b in blist:
        # same idea, but let list.index do the scanning
        position = alist.index(b, position)
        distance = min(b-alist[position-1], alist[position+1]-b)
import numpy as np
def Jacquot_NumPy(alist, blist):
    a_array = np.asarray(alist)
    b_array = np.asarray(blist)
    a_index = np.searchsorted(a_array, b_array) # gives the indexes of the elements of b_array in a_array
    a_array_left = a_array[a_index - 1]
    a_array_right = a_array[a_index + 1]
    distance_left = np.abs(b_array - a_array_left)
    distance_right = np.abs(a_array_right - b_array)
    min_distance = np.min([distance_left, distance_right], axis=0)
def Stefan_only_search_no_distance(alist, blist):
    position = 0
    for b in blist:
        while alist[position] < b:
            position += 1
from time import time
alist = list(range(10000000))
# skip the first and last element so that both neighbours always exist
blist = [i for i in alist[1:-1] if i % 5]
blist_small = blist[::100000]
# too slow for the full blist: time every 100000th entry and scale up
for func in Andy_original, Andy_no_lists:
    t0 = time()
    func(alist, blist_small)
    t = time() - t0
    print('%9.2f seconds %s (time estimated)' % (t * 100000, func.__name__))
for func in Stefan_binary_search, Stefan_incremental_search, Stefan_incremental_search2, Jacquot_NumPy, Stefan_only_search_no_distance:
    t0 = time()
    func(alist, blist)
    t = time() - t0
    print('%9.2f seconds %s' % (t, func.__name__))
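The benchmark functions above compute each distance and then discard it. As a sketch of how the incremental search might be used in practice, the following variant collects the distances and guards both ends of alist. The function name incremental_distances and the use of None for a value that does not occur in alist are assumptions for illustration; both lists are assumed to be sorted in ascending order.

```python
def incremental_distances(alist, blist):
    """Smallest neighbour distance in alist for each b in blist.

    Hypothetical helper: both lists must be sorted in ascending order.
    """
    distances = []
    position = 0
    n = len(alist)
    for b in blist:
        # Resume from where the previous b was found instead of from index 0.
        while position < n and alist[position] < b:
            position += 1
        if position == n or alist[position] != b:
            distances.append(None)  # b does not occur in alist
            continue
        left = b - alist[position - 1] if position > 0 else None
        right = alist[position + 1] - b if position + 1 < n else None
        candidates = [d for d in (left, right) if d is not None]
        distances.append(min(candidates) if candidates else None)
    return distances


result = incremental_distances([1, 4, 30, 1000, 2000], [1, 4, 30, 1000, 2000])
```

Because position only ever moves forward, the whole run makes a single pass over alist, which is why it needs blist to be sorted.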
Answer 1 (score: 1)
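I really like the NumPy module for this kind of computation.
In your case, that would be (this is a long-winded version that could be condensed to be more efficient):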
import numpy as np
alist = [1, 4, 30, 1000, 2000]
blist = [4, 30, 1000]
a_array = np.asarray(alist)
b_array = np.asarray(blist)
a_index = np.searchsorted(a_array, b_array) # gives the indexes of the elements of b_array in a_array
a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]
distance_left = np.abs(b_array - a_array_left)
distance_right = np.abs(a_array_right - b_array)
min_distance = np.min([distance_left, distance_right], axis=0)
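This will not work if the first element of blist is the first element of alist (a_index - 1 then wraps around to the last element), or if the last element of blist is the last element of alist (a_index + 1 is then out of range). I guess that
alist = [blist[0] - 1] + alist + [blist[-1] + 1]
is a dirty workaround.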
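Padding changes the data and gives the edge elements an artificial neighbour at distance 1. One possible alternative, sketched here, leaves alist untouched and ignores the missing side at each edge instead. The function name min_neighbor_distance is hypothetical, and the sketch assumes that a_array is sorted, has at least two elements, and contains every value of b_array.

```python
import numpy as np


def min_neighbor_distance(a_array, b_array):
    """Smallest neighbour distance in sorted a_array for each value in b_array.

    Hypothetical helper: assumes every value of b_array occurs in a_array.
    """
    a_array = np.asarray(a_array)
    b_array = np.asarray(b_array)
    last = len(a_array) - 1
    a_index = np.searchsorted(a_array, b_array)  # leftmost matching index
    # Clip so that the indexing below never wraps around or goes out of range.
    left_index = np.clip(a_index - 1, 0, last)
    right_index = np.clip(a_index + 1, 0, last)
    distance_left = b_array - a_array[left_index]
    distance_right = a_array[right_index] - b_array
    # At an edge the clipped neighbour is not a real one: use the other side.
    distance_left = np.where(a_index == 0, distance_right, distance_left)
    distance_right = np.where(a_index == last, distance_left, distance_right)
    return np.minimum(distance_left, distance_right)


result = min_neighbor_distance([1, 4, 30, 1000, 2000], [1, 4, 30, 1000, 2000])
```

np.minimum compares the two arrays element by element, so it avoids stacking them into a temporary two-row array the way np.min([...], axis=0) does.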
Benchmark
"Still running" below may just be my computer's fault..
import bisect  # needed by the binary search cells below
alist = sorted(list(np.random.randint(0, 10000, 10000000)))
blist = sorted(list(alist[1000000:9000001]))
a_array = np.asarray(alist)
b_array = np.asarray(blist)
Vectorized solution
%%timeit
a_index = np.searchsorted(a_array, b_array)
a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]
min_distance = np.min([b_array - a_array_left, a_array_right - b_array], axis=0)
1 loop, best of 3: 591 ms per loop
Binary search solution
%%timeit
for b in blist:
    position = bisect.bisect_left(alist, b)
    distance = min([b-alist[position-1],alist[position+1]-b])
Still running..
OP's solution
%%timeit
for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])
Still running..
Smaller input
alist = sorted(list(np.random.randint(0, 10000, 1000000)))
blist = sorted(list(alist[100000:900001]))
a_array = np.asarray(alist)
b_array = np.asarray(blist)
Vectorized solution
%%timeit
a_index = np.searchsorted(a_array, b_array)
a_array_left = a_array[a_index - 1]
a_array_right = a_array[a_index + 1]
min_distance = np.min([b_array - a_array_left, a_array_right - b_array], axis=0)
10 loops, best of 3: 53.2 ms per loop
Binary search solution
%%timeit
for b in blist:
    position = bisect.bisect_left(alist, b)
    distance = min([b-alist[position-1],alist[position+1]-b])
1 loop, best of 3: 1.57 s per loop
OP's solution
%%timeit
for b in blist:
    position=alist.index(b)
    distance=min([b-alist[position-1],alist[position+1]-b])
Still running..