Question

Is there a fast way to get the unique elements in numpy? My code is similar to this (the last line is the relevant one):

import numpy

tab = numpy.arange(100000000)

indices1 = numpy.random.permutation(10000)
indices2 = indices1.copy()
indices3 = indices1.copy()
indices4 = indices1.copy()

result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))

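This is just an example; in my case indices1, indices2, ..., indices4 contain different sets of indices and have different sizes. The last line is executed many times, and I noticed that it is actually the bottleneck in my code (specifically, the profiler reports {numpy.core.multiarray.arange}). Also, ordering does not matter, and the elements of the index arrays are of type int32. I was thinking about using a hash table with the element values as keys, and tried: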

seq = itertools.chain(tab[indices1].flatten(), tab[indices2].flatten(), tab[indices3].flatten(), tab[indices4].flatten())
myset = {}
map(myset.__setitem__, seq, [])
result = numpy.array(myset.keys())

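But that is even worse.

Is there any way to speed this up? I guess the performance penalty comes from the 'fancy indexing', which copies the array, but I only need to read the resulting elements (I don't modify anything).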

Answer 1

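[What follows is actually partially incorrect (see the PS):]

The following way of obtaining the unique elements in all the sub-arrays is very fast: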

seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
result = set(seq)

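Note that flat (which returns an iterator) is used instead of flatten() (which returns a full array), and that set() can be called directly (instead of using map() and a dictionary, as in your second approach).

Here are the timing results (obtained in the IPython shell):

>>> %timeit result = numpy.unique(numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]]))
100 loops, best of 3: 8.04 ms per loop
>>> seq = itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat)
>>> %timeit set(seq)
1000000 loops, best of 3: 223 ns per loop

On this example, the set/flat method is 40 times faster.

PS: The timing of set(seq) is actually not representative. The first loop of the timing empties the seq iterator, and every subsequent set() evaluation returns an empty set! The correct timing test is the following:

>>> %timeit set(itertools.chain(tab[indices1].flat, tab[indices2].flat, tab[indices3].flat, tab[indices4].flat))
100 loops, best of 3: 9.12 ms per loop

which shows that the set/flat method is actually not faster.

PPS: Here is an (unsuccessful) exploration of mtrw's suggestion. Finding the unique indices beforehand might be a good idea, but I could not find a way to implement it that is faster than the approach above:

>>> %timeit set(indices1).union(indices2).union(indices3).union(indices4)
100 loops, best of 3: 11.9 ms per loop
>>> %timeit set(itertools.chain(indices1.flat, indices2.flat, indices3.flat, indices4.flat))
100 loops, best of 3: 10.8 ms per loop

Thus, finding the set of all distinct indices is itself quite slow.

PPPS: numpy.unique(<concatenated array of indices>) is actually 2-3 times faster than set(<concatenated array of indices>). This is the key to the speed-up obtained in Bago's answer (unique(concatenate((…)))). The reason is probably that letting NumPy handle its arrays by itself is generally faster than interfacing pure Python (set) with NumPy arrays.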

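Conclusion: this answer therefore only documents failed attempts that should not be followed as they stand, along with a possibly useful remark about timing code that uses iterators...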
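The iterator pitfall described in the PS can be reproduced with a small, self-contained sketch. The array sizes below are arbitrary and much smaller than in the question, and unique_via_set is a hypothetical helper name used only for illustration:

```python
import itertools
import timeit

import numpy as np

tab = np.arange(1000)
indices1 = np.random.permutation(100)
indices2 = indices1.copy()

# An itertools.chain object is a one-shot iterator.
seq = itertools.chain(tab[indices1].flat, tab[indices2].flat)

first = set(seq)   # consumes the iterator
second = set(seq)  # nothing left: the iterator is already exhausted

print(len(first), len(second))  # 100 0

# For a meaningful timing, rebuild the iterator inside the timed code.
def unique_via_set():
    return set(itertools.chain(tab[indices1].flat, tab[indices2].flat))

elapsed = timeit.timeit(unique_via_set, number=100)
print("seconds for 100 runs:", elapsed)
```

Timing set(seq) on a pre-built seq only measures the first run honestly; every later run measures building a set from an empty iterator.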

Answer 2

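Sorry, I don't completely understand your question, but I'll do my best to help.

First, {numpy.core.multiarray.arange} is numpy.arange, not fancy indexing. Unfortunately, fancy indexing does not show up as a separate line item in the profiler. If you are calling np.arange inside the loop, you should see whether you can move it outside.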

In [27]: prun tab[tab]
     2 function calls in 1.551 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    1.551    1.551    1.551    1.551 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

In [28]: prun numpy.arange(10000000)
     3 function calls in 0.051 CPU seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.047    0.047    0.047    0.047 {numpy.core.multiarray.arange}
    1    0.003    0.003    0.051    0.051 <string>:1(<module>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
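As a rough sketch of what moving np.arange out of the loop could look like: the loop structure, the sizes, and both function names below are assumptions made for illustration, since the question does not show the surrounding loop.

```python
import numpy as np

n = 1_000_000
# Hypothetical stand-ins for the index arrays used on each iteration.
index_sets = [np.random.permutation(1000) for _ in range(5)]

def unique_rebuilding_table():
    """Slow pattern: the lookup table is rebuilt on every iteration."""
    results = []
    for idx in index_sets:
        tab = np.arange(n)  # repeated allocation, reported as arange by the profiler
        results.append(np.unique(tab[idx]))
    return results

def unique_with_hoisted_table():
    """Same computation, with the table built once outside the loop."""
    tab = np.arange(n)
    return [np.unique(tab[idx]) for idx in index_sets]
```

Both functions return the same arrays; only the number of np.arange calls differs.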

Second, I assume that tab is not np.arange(a, b) in your code, because if it is, then tab[index] == index + a; I assume that was just for your example.
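The identity can be checked directly; the values of a, b, and index here are arbitrary choices for the check:

```python
import numpy as np

a, b = 5, 50
tab = np.arange(a, b)             # tab[i] == i + a for every valid i
index = np.array([0, 3, 10, 44])  # arbitrary valid positions in tab

looked_up = tab[index]
computed = index + a

# When tab is a plain arange, the lookup can be replaced by an addition.
assert np.array_equal(looked_up, computed)
```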

Third, np.concatenate is about 10 times faster than np.array:

In [47]: timeit numpy.array([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
100 loops, best of 3: 5.11 ms per loop

In [48]: timeit numpy.concatenate([tab[indices1], tab[indices2], tab[indices3], tab[indices4]])
1000 loops, best of 3: 544 us per loop

(Also, np.concatenate gives a (4*n,) array while np.array gives a (4, n) array, where n is the length of indices[1-4]. The latter only works if indices1-4 all have the same length.)
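A small sketch of the shape difference, using two index arrays instead of four and arbitrary sizes:

```python
import numpy as np

tab = np.arange(1000)
indices1 = np.random.permutation(100)
indices2 = np.random.permutation(100)

stacked = np.array([tab[indices1], tab[indices2]])       # 2-D: one row per index array
joined = np.concatenate([tab[indices1], tab[indices2]])  # 1-D: rows laid end to end

print(stacked.shape)  # (2, 100)
print(joined.shape)   # (200,)

# np.unique flattens its input by default, so both layouts give the same values.
same_values = np.array_equal(np.unique(stacked), np.unique(joined))

# np.concatenate also accepts pieces of different lengths.
short = np.random.permutation(30)
ragged_joined = np.concatenate([tab[indices1], tab[short]])
print(ragged_joined.shape)  # (130,)
```

Since the index arrays in the question have different sizes, np.concatenate is the form that applies to the real case.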

Lastly, you can save even more time if you are able to do the following:

indices = np.unique(np.concatenate((indices1, indices2, indices3, indices4)))
result = tab[indices]

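Doing it in this order is faster because you reduce the number of indices that need to be looked up in tab, but it only works if you know that the elements of tab are unique (otherwise you could get repeats in result even if the indices are unique).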
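The caveat about repeated values can be seen on a toy example. The arrays below are made up for illustration, and the final np.unique call is one possible fallback rather than part of the suggestion above:

```python
import numpy as np

indices1 = np.array([0, 2, 4], dtype=np.int32)
indices2 = np.array([2, 3], dtype=np.int32)

# Deduplicate the indices first, then do a single lookup.
indices = np.unique(np.concatenate((indices1, indices2)))  # [0 2 3 4]

# Case 1: tab has unique elements, so the shortcut matches the original approach.
tab_unique = np.arange(10, 20)
result = tab_unique[indices]  # [10 12 13 14]
reference = np.unique(
    np.concatenate((tab_unique[indices1], tab_unique[indices2]))
)

# Case 2: tab contains repeated values, so unique indices are not enough.
tab_repeats = np.array([7, 7, 8, 8, 9])
with_repeats = tab_repeats[indices]     # [7 8 8 9]
deduplicated = np.unique(with_repeats)  # [7 8 9]
```

In the second case a further np.unique on the looked-up values is still needed, although it then runs on a smaller array than the full concatenation.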

Hope that helps.

Fast duplicate removal in numpy and Python
