I'm trying to find matching elements across a large number of large numpy arrays. My current approach (keeping the arrays in a dictionary and looping over them several times) is very slow, and I'm looking for a faster way to solve this.
Here is a self-contained snippet that reproduces the problem:
import numpy as np

data = {}
for frame in xrange(100):
    data[frame] = np.random.randint(100, size=(100, 3))
# data structure is 100 'pages' each containing an array of 100 elements
# trying to find matching elements from arrays on different pages
for page in data:
    for point in data[page]:
        for page_to_match in data:
            if page_to_match == page:  # exclude matches on the same page
                pass
            else:
                for point_to_match in data[page_to_match]:
                    if np.array_equal(point, point_to_match):
                        print 'Found a match -', point, '- pages:', page, page_to_match
# should find about 0 - 3 matches per page
You can see that it works, but it is very slow.
EDIT: Here is a minimal version of the code. It works on small arrays like this, but is slow on large arrays like the ones above. Replace the first three lines of the snippet above with the following:
data = {}
data[0] = np.array([[1,2],[3,4]])
data[1] = np.array([[5,6],[7,8]])
data[2] = np.array([[3,4],[5,8]])
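For reference, here is a Python 3 rendering of the nested-loop search above run on this minimal data (a sketch; the original snippet is Python 2). It should find exactly one matching row, `[3, 4]`, shared by pages 0 and 2, reported once from each direction:

```python
import numpy as np

# The minimal data from the edit above.
data = {0: np.array([[1, 2], [3, 4]]),
        1: np.array([[5, 6], [7, 8]]),
        2: np.array([[3, 4], [5, 8]])}

matches = []
for page in data:
    for point in data[page]:
        for other_page in data:
            if other_page == page:  # skip matches on the same page
                continue
            for candidate in data[other_page]:
                if np.array_equal(point, candidate):
                    matches.append((tuple(int(x) for x in point), page, other_page))

print(matches)  # -> [((3, 4), 0, 2), ((3, 4), 2, 0)]
```

Note that the same match is reported twice, once from each page's perspective, which is also true of the original code.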
Answer 0 (score: 2)
If memory is not an issue, I would solve this problem like this:
>>> data = {}
>>> for frame in range(100):
...     data[frame] = np.random.randint(100, size=(100, 3))
...
>>> from collections import defaultdict
>>> grouper = defaultdict(list)
>>> for k, v in data.items():
...     for row in v:
...         grouper[tuple(row)].append(k)
...
>>> for k, v in grouper.items():
...     if len(v) > 1:
...         print("Duplicate row", np.array(k), "found in pages:")
...         print(*v)
...
Result:
Duplicate row [89 50 83] found in pages:
13 96
Duplicate row [76 88 29] found in pages:
32 56
Duplicate row [74 26 81] found in pages:
11 99
Duplicate row [19 4 53] found in pages:
17 44
Duplicate row [56 20 4] found in pages:
41 91
Duplicate row [92 62 50] found in pages:
6 45
Duplicate row [86 9 39] found in pages:
12 62
Duplicate row [64 47 84] found in pages:
5 51
Duplicate row [52 74 44] found in pages:
9 19
Duplicate row [60 21 39] found in pages:
14 47
Duplicate row [80 42 33] found in pages:
65 82
Duplicate row [ 4 63 85] found in pages:
8 24
Duplicate row [70 84 35] found in pages:
42 69
Duplicate row [54 14 31] found in pages:
43 47
Duplicate row [38 81 38] found in pages:
51 67
Duplicate row [55 59 10] found in pages:
29 54
Duplicate row [84 77 37] found in pages:
51 53
Duplicate row [76 27 54] found in pages:
33 39
Duplicate row [52 64 20] found in pages:
1 37
Duplicate row [65 97 45] found in pages:
61 80
Duplicate row [69 52 8] found in pages:
60 85
Duplicate row [51 2 37] found in pages:
1 52
Duplicate row [31 36 53] found in pages:
50 84
Duplicate row [24 57 1] found in pages:
34 88
Duplicate row [87 89 0] found in pages:
11 78
Duplicate row [94 38 17] found in pages:
40 89
Duplicate row [46 25 46] found in pages:
54 87
Duplicate row [34 15 14] found in pages:
11 92
Duplicate row [ 3 68 1] found in pages:
48 78
Duplicate row [ 9 17 21] found in pages:
21 62
Duplicate row [17 73 66] found in pages:
1 60
Duplicate row [42 15 24] found in pages:
39 78
Duplicate row [90 8 95] found in pages:
41 61
Duplicate row [19 0 51] found in pages:
30 43
Duplicate row [65 99 51] found in pages:
57 81
Duplicate row [ 5 79 61] found in pages:
17 80
Duplicate row [49 74 71] found in pages:
0 57
Duplicate row [77 3 3] found in pages:
18 57
Duplicate row [37 54 66] found in pages:
5 13
Duplicate row [64 64 82] found in pages:
19 23
Duplicate row [64 6 21] found in pages:
27 39
Duplicate row [39 92 82] found in pages:
8 98
Duplicate row [99 10 15] found in pages:
39 68
Duplicate row [45 16 57] found in pages:
8 53
Duplicate row [12 62 98] found in pages:
29 96
Duplicate row [78 73 56] found in pages:
3 79
Duplicate row [62 87 18] found in pages:
84 97
Duplicate row [46 72 87] found in pages:
5 10
Duplicate row [27 75 25] found in pages:
50 57
Duplicate row [92 62 38] found in pages:
0 87
Duplicate row [27 95 86] found in pages:
15 80
Note that my solution is written in Python 3; on Python 2 you should use d.iteritems() instead of d.items().
There may be an even better solution using numpy, but this is quick and works in linear rather than polynomial time.
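One way such a numpy solution might look (a sketch, not part of the original answer): stack all pages into a single array and let np.unique group identical rows in one vectorized pass. Shown here on the minimal three-page data from the question's edit:

```python
import numpy as np

# The minimal three-page data from the question's edit.
data = {0: np.array([[1, 2], [3, 4]]),
        1: np.array([[5, 6], [7, 8]]),
        2: np.array([[3, 4], [5, 8]])}

# Stack every page into one array and record each row's page of origin.
rows = np.vstack(list(data.values()))
pages = np.repeat(list(data.keys()), [len(v) for v in data.values()])

# np.unique with axis=0 groups identical rows in a single vectorized pass
# (available since NumPy 1.13).
uniq, inverse, counts = np.unique(rows, axis=0,
                                  return_inverse=True, return_counts=True)
inverse = inverse.ravel()  # inverse shape varies across NumPy versions

for idx in np.flatnonzero(counts > 1):
    match_pages = sorted(set(pages[inverse == idx].tolist()))
    if len(match_pages) > 1:  # report only matches across different pages
        print("Duplicate row", uniq[idx], "found in pages:", *match_pages)
```

On this data it reports the single cross-page duplicate, `[3 4]` on pages 0 and 2. Note that, unlike the defaultdict version, a row repeated within a single page is filtered out by the `len(match_pages) > 1` check.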
Answer 1 (score: 0)
Assuming you want to speed up your code, it looks like a good fit for Cython. Python's for loops carry some overhead, and depending on what you are comparing, you may not need Python's rich comparison machinery at all, which leaves a lot of potential for speedup. With Cython you can compile slightly modified Python code, so your loops run as fast as native C loops. Here is a small example:
cdef int i, I
cdef int x[1000]

i, I = 0, 1000
for i in range(1000):
    x[i] = i
When you compile with Cython, C code is generated from the Cython source first; the compiled result can then run far faster than the pure Python equivalent.
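A typical build setup is sketched below, assuming the snippet above is saved as fast_loop.pyx and Cython is installed (fast_loop is a hypothetical module name):

```python
# setup.py - compile fast_loop.pyx into a C extension
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("fast_loop.pyx"))
```

Running `python setup.py build_ext --inplace` generates the C code and compiles it; afterwards the module can be imported like any other Python module.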