numpy.sum performance depending on axis

Time: 2018-06-04 16:55:39

Tags: python performance numpy

When summing over a dimension in a numpy array, is there a performance difference between the first and the last axis?

Specifically, given the following code, which of sum1 and sum2 will be computed faster?

import numpy as np
a = np.ones((1000,200))
b = np.ones((200,1000))
sum1 = np.sum(a, axis=0)
sum2 = np.sum(b, axis=-1)

I believe this question really boils down to how NumPy internally stores dimensions, and that this can be overridden to use a row-wise or column-wise format. However, when using the default setting, which of these will be faster? Also, what about N-dimensional arrays?
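For reference, this is roughly what I mean by overriding the layout (a quick sketch; the array names here are just illustrative):

import numpy as np

c = np.ones((1000, 200))               # default: C (row-major) order
f = np.ones((1000, 200), order='F')    # Fortran (column-major) order
print(c.flags['C_CONTIGUOUS'])         # True
print(f.flags['F_CONTIGUOUS'])         # True

# np.asfortranarray gives a Fortran-ordered array (copying only if needed),
# and works for N-dimensional arrays as well:
g = np.asfortranarray(np.ones((4, 5, 6)))
print(g.flags['F_CONTIGUOUS'])         # True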

1 Answer:

Answer 0 (score: 4):

It is quite easy to check whether or not there is a performance difference (in IPython; I increased the numbers a bit to make the difference more noticeable):

import numpy as np

a = np.ones((10000, 2000))
b = np.ones((2000, 10000))
%timeit np.sum(a, axis=0)
# 27.6 ms ± 541 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(b, axis=-1)
# 34.6 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Now, by the time you have an actual performance issue with np.sum you will probably have run out of memory anyway, but yes, there is a difference. By default, NumPy arrays are stored in row-major order: first comes the first row, then the second, and so on. It makes sense, then, that summing (or operating) over the outer dimension is faster, because the cache will be used far more effectively.

Simply put, in the first case, when you fetch the first element of the array, a chunk of contiguous data comes into the cache with it, so when you want to sum the next elements they are already there. In the second case, on the other hand, the elements to sum are quite far away from each other (2000 elements apart, in fact), so the cache won't be helping much when going column-wise. That is not to say the cache won't help at all: since you are summing all the columns, cached data will still be reused to a degree, just not as effectively.

This is a rather gross approximation. In general there are several cache levels, some shared among cores and some not, and understanding the exact effect that one piece of code or another has on them is a complicated topic, but the general idea holds.
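To see the layout itself (a rough sketch; timings vary by machine, so I won't quote any): the strides tell you how many bytes apart consecutive elements along each axis are, and order='F' makes the first axis the contiguous one, so you would expect the fast and slow axes to swap as well.

import numpy as np

a = np.ones((10000, 2000))              # C order: elements within a row are contiguous
print(a.strides)                        # (16000, 8): 16000 bytes to the next row, 8 bytes to the next column

af = np.ones((10000, 2000), order='F')  # Fortran order: elements within a column are contiguous
print(af.strides)                       # (8, 80000)

# If the cache argument above holds, the roles of the axes should swap for the
# Fortran-ordered array; in IPython this can be checked with, e.g.:
# %timeit np.sum(a, axis=0)
# %timeit np.sum(af, axis=0)
# %timeit np.sum(a, axis=1)
# %timeit np.sum(af, axis=1)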