Question

If I have two parallel lists and want to sort them by the order of the elements in the first, it's very easy:

>>> a = [2, 3, 1]
>>> b = [4, 6, 7]
>>> a, b = zip(*sorted(zip(a, b)))
>>> print(a)
(1, 2, 3)
>>> print(b)
(7, 4, 6)

How can I do the same using numpy arrays without unpacking them into conventional Python lists?

Answer 1

b[a.argsort()] should do the trick.

Here's how it works. First you need to find a permutation that sorts a. argsort is a method that computes this:

>>> import numpy
>>> a = numpy.array([2, 3, 1])
>>> p = a.argsort()
>>> p
array([2, 0, 1])

You can easily check that this is right:

>>> a[p]
array([1, 2, 3])

Now apply the same permutation to b.

>>> b = numpy.array([4, 6, 7])
>>> b[p]
array([7, 4, 6])
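
Putting the steps above together, here is a minimal sketch. The helper name sort_by_first is hypothetical, and passing kind="stable" is an added choice (the answer uses the default) so that equal keys keep their original relative order:

```python
import numpy as np

def sort_by_first(a, b):
    """Return copies of a and b, both reordered by the sorted order of a."""
    p = np.argsort(a, kind="stable")  # permutation that sorts a
    return a[p], b[p]

a = np.array([2, 3, 1])
b = np.array([4, 6, 7])
a_sorted, b_sorted = sort_by_first(a, b)
print(a_sorted)  # [1 2 3]
print(b_sorted)  # [7 4 6]
```

Fancy indexing returns new arrays, so the inputs are left untouched.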

Answer 2

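Here's an approach that creates no intermediate Python lists, though it does require a NumPy "record array" to use for the sorting. If your two input arrays are actually related (like columns in a spreadsheet), then this might open up an advantageous way of dealing with your data in general, rather than keeping two distinct arrays around all the time. In that case you'd already have a record array, and your original problem would be answered simply by calling sort() on the array.

After packing the two arrays into a record array, this does an in-place sort: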

>>> from numpy import array, rec
>>> a = array([2, 3, 1])
>>> b = array([4, 6, 7])
>>> c = rec.fromarrays([a, b])
>>> c.sort()
>>> c.f1   # fromarrays adds field names beginning with f0 automatically
array([7, 4, 6])

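Edited to use rec.fromarrays() for simplicity, skip the redundant dtype, use the default sort key, and use default field names instead of specifying them (based on this example).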
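
If you prefer explicit field names and an explicit sort key over the defaults, a sketch along these lines should work. The field names key and val are illustrative assumptions, not part of the answer above:

```python
import numpy as np

a = np.array([2, 3, 1])
b = np.array([4, 6, 7])

# "key" and "val" are illustrative field names chosen for this sketch.
c = np.rec.fromarrays([a, b], names="key,val")
c.sort(order="key")  # in-place sort, comparing the "key" field first

print(c.key)  # [1 2 3]
print(c.val)  # [7 4 6]
```

According to the NumPy documentation for the order parameter, fields not listed there are still used to break ties, in the order they appear in the dtype.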

Answer 3

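This might be the simplest and most general way to do what you want. (I used three arrays here, but this will work on arrays of any shape, whether two columns or two hundred.)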

import numpy as NP
fnx = lambda : NP.random.randint(0, 10, 6)
a, b, c = fnx(), fnx(), fnx()
abc = NP.column_stack((a, b, c))
keys = (abc[:,0], abc[:,1])          # sort on 2nd column, resolve ties using 1st col
indices = NP.lexsort(keys)        # create index array
ab_sorted = NP.take(abc, indices, axis=0)

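One quirk of lexsort is that you have to specify the keys in reverse order, i.e., put your primary key second and your secondary key first. In my example, I wanted to sort using the 2nd column as the primary key, so I listed it second; the 1st column only resolves ties, but it is listed first.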
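
The key order is easier to see on a small deterministic input. The following sketch uses made-up arrays named primary and secondary (both names are illustrative):

```python
import numpy as np

primary = np.array([2, 1, 2, 1])
secondary = np.array([5, 9, 3, 7])

# The LAST key passed to lexsort is the primary sort key;
# earlier keys are only consulted to break ties.
idx = np.lexsort((secondary, primary))

print(idx)             # [3 1 2 0]
print(primary[idx])    # [1 1 2 2]
print(secondary[idx])  # [7 9 3 5]
```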

Answer 4

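I came across the same question and wondered about the performance of the different ways of sorting one array and reordering another accordingly.

Performance comparison: two-array case

I think the list of solutions mentioned here is comprehensive, but I was also wondering about performance. So I implemented all the algorithms and did a performance comparison.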

Sorting using zip twice

import numpy as np

def zip_sort(s, p):
    ordered_s, ordered_p = zip(*sorted(list(zip(s, p))))
    return np.array(ordered_s, dtype=s.dtype), np.array(ordered_p, dtype=p.dtype)

Sorting using argsort. This does not use the other array as a secondary sort key.

def argsort(s, p):
    indexes = s.argsort()
    return s[indexes], p[indexes]

Sorting using numpy recarrays

def recarray_sort(s, p):
    rec = np.rec.fromarrays([s, p])
    rec.sort()
    return rec.f0, rec.f1

Sorting using numpy lexsort

def lexsort(s, p):
    indexes = np.lexsort([p, s])
    return s[indexes], p[indexes]

Sorting two arrays, p and q, of 100000 random integers each yields the following timings:

zip_sort
258 ms ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

argsort
9.67 ms ± 278 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

recarray_sort
86.4 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

lexsort
12.4 ms ± 288 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

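So argsort is the fastest, but it can also yield slightly different results from the other algorithms: where the first array contains equal values, it does not use the second array to order them. If secondary sorting is not needed, use argsort.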
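
A small made-up input shows where argsort and lexsort disagree. In this sketch kind="stable" is an added choice to make the argsort output deterministic; with the default kind, the order of tied entries is not guaranteed:

```python
import numpy as np

s = np.array([1, 0, 1, 0])
p = np.array([9, 8, 7, 6])

# argsort looks only at s; kind="stable" keeps tied entries in input order.
i_arg = s.argsort(kind="stable")
# lexsort uses s as the primary key (listed last) and p to break ties.
i_lex = np.lexsort((p, s))

print(s[i_arg], p[i_arg])  # [0 0 1 1] [8 6 9 7]
print(s[i_lex], p[i_lex])  # [0 0 1 1] [6 8 7 9]
```

Both results are sorted by s; only the order of p within each group of equal s values differs.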

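Performance comparison: multi-array case

Next, one might need to sort several arrays this way. Modified to handle multiple arrays, the algorithms look like this:

Sorting using zip twice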

def zip_sort(*arrays):
    ordered_lists = zip(*sorted(list(zip(*arrays))))
    return tuple(
        (np.array(l, dtype=arrays[i].dtype) for i, l in enumerate(ordered_lists))
    )

Sorting using argsort. This does not use the other arrays as secondary sort keys.

def argsort(*arrays):
    indexes = arrays[0].argsort()
    return tuple((a[indexes] for a in arrays))

Sorting using numpy recarrays

def recarray_sort(*arrays):
    rec = np.rec.fromarrays(arrays)
    rec.sort()
    return tuple((getattr(rec, field) for field in rec.dtype.names))

Sorting using numpy lexsort

def lexsort(*arrays):
    indexes = np.lexsort(arrays[::-1])
    return tuple((a[indexes] for a in arrays))

Sorting a list of 100 arrays with 100000 random integers each (arrays = [np.random.randint(10, size=100000) for _ in range(100)]) now yields the following timings:

zip_sort
13.9 s ± 570 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

argsort
49.8 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

recarray_sort
491 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

lexsort
881 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

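argsort is still the fastest, which seems logical since it ignores secondary sorting. Among the other algorithms, which do sort on the secondary columns, the recarray-based solution now outperforms the lexsort variant.

Disclaimer: results may differ for other data types and also depend on the randomness of the array data. I used 42 as the seed.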
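
For anyone who wants to rerun a comparison like this, here is a rough timing sketch for the two fastest two-array variants. It is not the original benchmark: the seeding call, the value range 0 to 9, the helper names argsort_pair and lexsort_pair, and the repeat counts are all assumptions made for illustration.

```python
import timeit
import numpy as np

# Assumption: the original seeding call is not shown; default_rng(42) is one option.
rng = np.random.default_rng(42)
s = rng.integers(0, 10, size=100_000)
p = rng.integers(0, 10, size=100_000)

def argsort_pair(s, p):
    idx = s.argsort()  # sorts by s only, ties left unordered
    return s[idx], p[idx]

def lexsort_pair(s, p):
    idx = np.lexsort([p, s])  # s is the primary key, p breaks ties
    return s[idx], p[idx]

timings = {}
for fn in (argsort_pair, lexsort_pair):
    # Best of 3 repeats with 5 calls each, reported per call in milliseconds.
    best = min(timeit.repeat(lambda: fn(s, p), number=5, repeat=3)) / 5
    timings[fn.__name__] = best * 1e3
    print(f"{fn.__name__}: {best * 1e3:.2f} ms per call")
```

Absolute numbers will vary with hardware, NumPy version, and data, so only the relative ordering is worth comparing.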

How can I "zip sort" parallel numpy arrays?

4 answers:
