Question

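There seems to be a widespread belief that np.take is considerably faster than array indexing. See, for example, http://wesmckinney.com/blog/numpy-indexing-peculiarities/, "Fast numpy fancy indexing", and "Fast(er) numpy fancy indexing and reduction?". There are also suggestions that np.ix_ is better in some cases.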
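For reference, here is a minimal sketch of the two row-selection forms being compared. The array shape, index length, and seed are illustrative assumptions, not values from the benchmarks in this question.

```python
import numpy as np

# Illustrative sizes and seed (assumptions, not the benchmark settings)
rng = np.random.default_rng(0)
arr = rng.standard_normal((1000, 30))
idx = np.sort(rng.integers(0, arr.shape[0], size=100))

# Fancy (advanced) indexing: select whole rows with an integer array
by_fancy = arr[idx]

# The same selection via the take method and via the np.take function
by_take = arr.take(idx, axis=0)
by_np_take = np.take(arr, idx, axis=0)

# All three forms return a new array holding the same rows
print(by_fancy.shape, np.array_equal(by_fancy, by_take))
```

Both forms return copies, so any difference between them is purely in how the copy is performed.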

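I have done some profiling, and this does seem to be the case in most situations, although the difference decreases as arrays get bigger.
Performance is affected by the size of the array, the length of the index (for rows), and the number of columns taken. The number of rows seems to have the biggest effect, and the number of columns in the array also matters even when the index is 1D. Changing the size of the index does not seem to affect the relative performance of the methods much.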

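So, the question is two-fold: 1. Why is there such a large difference in performance between the methods? 2. When does it make sense to use one method over another? Are there array types, orderings, or shapes for which one will always work better?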

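There are a lot of things that could affect the performance, so I have shown a few of them below and included the code used, to try to make this reproducible.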

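EDIT: I have updated the y-axis on the plots to show the full range of values. This makes it clearer that the difference is smaller than it appeared for the 1D data.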

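1D index

Looking at runtime versus the number of rows shows that indexing is fairly steady, with a slight upward trend. take is consistently slower as the number of rows increases.

As the number of columns increases, both become slower, but take slows down more (and this is still for a 1D index).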
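The column-count comparison can also be reproduced without plotting. The following is a rough sketch, not the code used for the figures: the helper name time_row_selection and its default sizes are hypothetical, and the timings it prints will vary by machine and NumPy version.

```python
import timeit
import numpy as np


def time_row_selection(n_rows=5000, n_cols=300, n_taken=1000, repeats=3, number=20):
    """Return the best per-call time in seconds for each row-selection method.

    Hypothetical helper for illustration; all default sizes are assumptions.
    """
    rng = np.random.default_rng(0)
    arr = rng.standard_normal((n_rows, n_cols))
    idx = np.sort(rng.integers(0, n_rows, size=n_taken))
    candidates = {
        "fancy": lambda: arr[idx],
        "take": lambda: arr.take(idx, axis=0),
    }
    # Take the minimum over several repeats to reduce noise from other processes
    return {
        name: min(timeit.repeat(fn, repeat=repeats, number=number)) / number
        for name, fn in candidates.items()
    }


# Vary only the number of columns, keeping the 1D row index fixed in length
for n_cols in (5, 50, 500):
    print(n_cols, time_row_selection(n_cols=n_cols))
```

Using the minimum of several repeats rather than a single run tends to give more stable numbers for short operations like these.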

2D index

With 2D data the results are similar. Using ix_ is also shown, and it seems to have the worst performance overall.
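The three 2D variants compared here select the same sub-block of rows and columns. A small sketch, with illustrative sizes chosen as assumptions, shows that they are equivalent:

```python
import numpy as np

# Illustrative sizes and seed (assumptions); cidx matches the benchmark code
rng = np.random.default_rng(0)
arr = rng.standard_normal((200, 30))
idx = np.sort(rng.integers(0, arr.shape[0], size=50))
cidx = np.array([1, 3, 7, 15, 29])

rows = idx.reshape(-1, 1)  # column vector, broadcasts against cidx

# Fancy indexing with broadcast row and column index arrays
fancy2d = arr[rows, cidx]

# take on the flattened array: flat index = row * number_of_columns + column
take2d = arr.take(rows * arr.shape[1] + cidx)

# np.ix_ builds the broadcastable (rows, columns) index pair
ix2d = arr[np.ix_(idx, cidx)]

print(fancy2d.shape, np.array_equal(fancy2d, take2d), np.array_equal(fancy2d, ix2d))
```

Note that the take variant has to compute the flat indices first, so that arithmetic is part of what gets timed.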

Code for figures

from pylab import *
import timeit


def get_test(M, T, C):
    """
    Returns an array and random sorted index into rows
    M : number of rows
    T : rows to take
    C : number of columns
    """
    arr = randn(M, C)
    idx = sort(randint(0, M, T))
    return arr, idx


def draw_time(call, N=10, V='M', T=1000, M=5000, C=300, **kwargs):
    """
    call : function to do indexing, accepts (arr, idx)
    N : number of times to run timeit
    V : string indicating to evaluate number of rows (M) or rows taken (T), or columns created(C)
    ** kwargs : passed to plot
    """
    pts = {
        'M': [10, 20, 50, 100, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000, 200000, 500000, ],
        'T': [10, 50, 100, 500, 1000, 5000, 10000, 50000],
        'C': [5, 10, 20, 50, 100, 200, 500, 1000],
    }
    res = []

    kw = dict(T=T, M=M, C=C) ## Default values
    for v in pts[V]:
        kw[V] = v
        try:
            arr, idx = get_test(**kw)
        except (ValueError, MemoryError):
            # Skip sizes for which the test data cannot be generated
            res.append(None)
        else:
            res.append(timeit.timeit(lambda :call(arr, idx), number=N))

    plot(pts[V], res, marker='x', **kwargs)
    xscale('log')
    ylabel('runtime [s]')

    if V == 'M':
        xlabel('size of array [rows]')
    elif V == 'T':
        xlabel('number of rows taken')
    elif V == 'C':
        xlabel('number of columns created')

funcs1D = {
    'fancy':lambda arr, idx: arr[idx],
    'take':lambda arr, idx: arr.take(idx, axis=0),
}

cidx = r_[1, 3, 7, 15, 29]
funcs2D = {
    'fancy2D':lambda arr, idx: arr[idx.reshape(-1, 1), cidx],
    'take2D':lambda arr, idx: arr.take(idx.reshape(-1, 1)*arr.shape[1] + cidx),
    'ix_':lambda arr, idx: arr[ix_(idx, cidx)],
}

def test(funcs, N=100, **kwargs):
    for descr, f in funcs.items():
        # Pass N through so the argument is actually honoured
        draw_time(f, label="{}".format(descr), N=N, **kwargs)
    legend()

figure()
title('1D index, 30 columns in data')
test(funcs1D, V='M', C=30)
ylim(0, 0.25)
# savefig('perf_1D_arraysize')

figure()
title('1D index, 5000 rows in data')
test(funcs1D, V='C', M=5000)
ylim(0, 0.07)
# savefig('perf_1D_numbercolumns')

figure()
title('2D index, 300 columns in data')
test(funcs2D, V='M')
ylim(0, 0.01)
# savefig('perf_2D_arraysize')

figure()
title('2D index, 30 columns in data')
test(funcs2D, V='M', C=30)
ylim(0, 0.01)
# savefig('perf_2D_arraysize_C30')

Answer 1

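The answer is very low level and has to do with the C compiler and CPU cache optimizations. See the active discussion on this numpy issue with Sebastian Berg and Max Bolingbroke (both numpy contributors).

Fancy indexing tries to be "smart" about how memory is read and written (C-order vs. F-order), while .take always keeps C-order. This means that fancy indexing will usually be much faster for F-ordered arrays, and should in any case be faster for huge arrays. As of now, numpy decides what the "smart" way is without considering the size of the array or the particular hardware it is running on. Therefore, for smaller arrays, choosing the "wrong" memory order may actually give better performance, because reads make better use of the CPU cache.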
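One way to look at the memory-order behaviour described above is to inspect the layout flags of the results for a C-ordered and an F-ordered copy of the same data. This is only a sketch under assumed sizes; the flags printed for fancy indexing may depend on the NumPy version, so no particular output is claimed here.

```python
import numpy as np

# Illustrative sizes and seed (assumptions)
rng = np.random.default_rng(0)
c_arr = rng.standard_normal((2000, 300))   # C-ordered (row-major)
f_arr = np.asfortranarray(c_arr)           # same values, F-ordered (column-major)
idx = np.sort(rng.integers(0, c_arr.shape[0], size=500))

for name, a in (("C-ordered", c_arr), ("F-ordered", f_arr)):
    fancy = a[idx]
    taken = a.take(idx, axis=0)
    # The values agree; only the memory layout of the results may differ
    print(
        name,
        "| fancy: C =", fancy.flags.c_contiguous, "F =", fancy.flags.f_contiguous,
        "| take: C =", taken.flags.c_contiguous, "F =", taken.flags.f_contiguous,
        "| equal:", np.array_equal(fancy, taken),
    )
```

Timing both methods on c_arr and f_arr separately is a reasonable next step when deciding which one suits a particular array layout.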
