Question

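I have two 2D arrays of the same shape: given_array and reference_array. For each unique value in reference_array, I have to write to a file the mean of the given_array elements at the positions where reference_array holds that value.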

import numpy as np

given_array = np.array([[2,4,5,8,9,11,15],[1,2,3,4,5,6,7]])

reference_array = np.array([[2,2,2,8,8,8,15],[2,2,2,4,8,8,9]])

unique_value = np.unique(reference_array)

file_out = open('file_out', 'w')

for unique in unique_value:
    index = reference_array == unique
    mean = np.mean(given_array[index])
    file_out.write(str(unique) + ',' + str(mean) + '\n')

file_out.close()

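The code above works, but in my real problem both arrays are read from raster images and are extremely large, so the processing takes a few days to complete.

I would be grateful if someone could suggest the fastest way to produce the same result.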

Answer 1

It might be faster to go over the arrays only once, even if that is done in pure Python:

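from collections import defaultdict

# Python 3 version: the built-in zip() is lazy (no itertools.izip needed),
# lambdas cannot unpack tuple parameters, and dict.iteritems() is now items().
def add(acc, value):
    sum_, count = acc
    return (sum_ + value, count + 1)

# unique maps each reference value to a (sum, count) pair
unique = defaultdict(lambda: (0, 0))
for ref, value in zip(reference_array.flat, given_array.flat):
    unique[ref] = add(unique[ref], float(value))

with open('file.out', 'w') as out:
    for ref, (sum_, count) in unique.items():
        out.write('%f,%f\n' % (ref, sum_ / count))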

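In contrast to the OP's solution, finding the unique values and calculating the means is done in a single loop. unique is a dictionary whose keys are the reference values and whose values are pairs holding the sum and the count of all given values that share that reference value. After the loop, not only have all unique reference values been put into the dictionary unique, but every given element has also been accumulated into the sum and count of its reference value, from which the mean is easily calculated in a second step.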

The complexity of the problem is reduced from size_of_array * number_of_unique_values to size_of_array + number_of_unique_values.
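
To make the one-pass idea concrete, here is a minimal self-contained sketch for Python 3 that uses the arrays from the question. The function name group_means_one_pass is made up for illustration and is not part of the code above; it keeps the sums and counts in two dictionaries rather than in one dictionary of pairs, which is only a stylistic choice.

```python
from collections import defaultdict

import numpy as np


def group_means_one_pass(reference, given):
    """Return {reference value: mean of the matching given values}.

    Hypothetical helper, written only to illustrate the single-pass idea.
    """
    sums = defaultdict(float)   # running sum per reference value
    counts = defaultdict(int)   # number of elements per reference value
    # One pass over both arrays, element by element: size_of_array steps
    for ref, value in zip(reference.flat, given.flat):
        key = ref.item()        # convert the numpy scalar to a plain Python int
        sums[key] += float(value)
        counts[key] += 1
    # Second, much shorter pass: number_of_unique_values steps
    return {key: sums[key] / counts[key] for key in sums}


given_array = np.array([[2, 4, 5, 8, 9, 11, 15], [1, 2, 3, 4, 5, 6, 7]])
reference_array = np.array([[2, 2, 2, 8, 8, 8, 15], [2, 2, 2, 4, 8, 8, 9]])

means = group_means_one_pass(reference_array, given_array)
```

For the sample arrays this yields one entry for each of the reference values 2, 4, 8, 9 and 15. The loop itself still runs at Python speed, so on raster-sized arrays it may remain slow even though it no longer rescans the data once per unique value.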

Answer 2

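You can get the whole thing to work in numpy using unique and bincount. Because numpy's unique uses sorting, it has linearithmic (O(n log n)) complexity, but it typically still beats pure Python code that uses dictionaries, even though the dictionary approach is linear.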

If you are using numpy 1.9 or newer:

>>> # ravel() first so that inv is 1-D, as bincount requires, on any numpy version
>>> unq, inv, cnts = np.unique(reference_array.ravel(), return_inverse=True,
...                            return_counts=True)
>>> means = np.bincount(inv, weights=given_array.ravel()) / cnts

>>> unq
array([ 2,  4,  8,  9, 15])
>>> means
array([  2.83333333,   4.        ,   7.8       ,   7.        ,  15.        ])

With an older numpy it will be slightly slower, but you would do something like:

>>> unq, inv = np.unique(reference_array, return_inverse=True)
>>> cnts = np.bincount(inv)
>>> means = np.bincount(inv, weights=given_array.ravel()) / cnts
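
To connect this back to the file output the question asks for, here is a hedged sketch that wraps the bincount approach in a function and writes the same "value,mean" lines as the original loop. The function name write_group_means is hypothetical, and an in-memory io.StringIO stands in for the output file so that the example has no side effects. The reference array is flattened before calling np.unique on the assumption that this keeps inv one-dimensional regardless of numpy version, which np.bincount requires.

```python
import io

import numpy as np


def write_group_means(reference, given, out):
    """Write one 'value,mean' line per unique reference value to `out`.

    `out` is any writable text stream (an open file or io.StringIO).
    Hypothetical helper name, used only for this illustration.
    """
    # Flatten first so the inverse index array is 1-D
    unq, inv = np.unique(reference.ravel(), return_inverse=True)
    cnts = np.bincount(inv)                               # group sizes
    sums = np.bincount(inv, weights=given.ravel())        # group sums
    means = sums / cnts
    for value, mean in zip(unq, means):
        out.write('%s,%s\n' % (value, mean))
    return unq, means


given_array = np.array([[2, 4, 5, 8, 9, 11, 15], [1, 2, 3, 4, 5, 6, 7]])
reference_array = np.array([[2, 2, 2, 8, 8, 8, 15], [2, 2, 2, 4, 8, 8, 9]])

buffer = io.StringIO()
unq, means = write_group_means(reference_array, given_array, buffer)
```

Passing a real file works the same way, for example inside a with open('file_out', 'w') as out: block.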

EDIT

For more elaborate operations you need to replicate the guts of np.unique. First, sort both flattened arrays based on the content of reference_array:

>>> sort_idx = np.argsort(reference_array, axis=None)
>>> given_sort = given_array.ravel()[sort_idx]
>>> ref_sort = reference_array.ravel()[sort_idx]

Then compute the number of items in each group:

>>> first_mask = np.concatenate(([True], ref_sort[:-1] != ref_sort[1:]))
>>> first_idx, = np.nonzero(first_mask)
>>> cnts = np.diff(np.concatenate((first_idx, [ref_sort.size])))
>>> cnts
array([6, 1, 5, 1, 1])
>>> unq = ref_sort[first_mask]
>>> unq
array([ 2,  4,  8,  9, 15])

Finally, use ufuncs and their reduceat method to do your per-group calculations, e.g. for the group max:

>>> np.maximum.reduceat(given_sort, first_idx)
array([ 5,  4, 11,  7, 15])
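
The three steps (sort, find group starts, reduceat) can be bundled into one reusable function. The following is a rough sketch under the assumption that both arrays are non-empty and have the same shape. The name group_reduce is invented here and does not come from numpy.

```python
import numpy as np


def group_reduce(reference, given, ufunc):
    """Reduce `given` with `ufunc` within groups of equal `reference` values.

    Hypothetical helper; assumes non-empty arrays of identical shape.
    Returns (unique reference values, one reduced value per group).
    """
    # Step 1: sort both flattened arrays by the reference values
    sort_idx = np.argsort(reference, axis=None)
    ref_sort = reference.ravel()[sort_idx]
    given_sort = given.ravel()[sort_idx]
    # Step 2: a group starts wherever the sorted reference value changes
    first_mask = np.concatenate(([True], ref_sort[:-1] != ref_sort[1:]))
    first_idx = np.flatnonzero(first_mask)
    # Step 3: reduce each slice given_sort[first_idx[i]:first_idx[i + 1]]
    return ref_sort[first_mask], ufunc.reduceat(given_sort, first_idx)


given_array = np.array([[2, 4, 5, 8, 9, 11, 15], [1, 2, 3, 4, 5, 6, 7]])
reference_array = np.array([[2, 2, 2, 8, 8, 8, 15], [2, 2, 2, 4, 8, 8, 9]])

unq, group_max = group_reduce(reference_array, given_array, np.maximum)
_, group_min = group_reduce(reference_array, given_array, np.minimum)
_, group_sum = group_reduce(reference_array, given_array, np.add)
```

Any binary ufunc should work as the third argument. Calling the function once per statistic repeats the sort, so for several statistics on large arrays it would be cheaper to sort once and reuse given_sort and first_idx.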
