Question

I have a large numpy array (call it X) with a lot of ID values:

X:
id   rating
1    88
2    99
3    77
4    66
...

and so on. I also have another numpy array of "bad IDs", which identifies the rows I want to remove from X.

B: [2, 3]

So when I'm done, I want:

X:
id   rating
1    88
4    66

What's the cleanest way to do this, without iterating?

Answer 1

This is the fastest way I could come up with:

import numpy

x = numpy.arange(1000000, dtype=numpy.int32).reshape((-1,2))
bad = numpy.arange(0, 1000000, 2000, dtype=numpy.int32)

print(x.shape)
print(bad.shape)

# drop every row whose first column (the id) appears in bad
cleared = numpy.delete(x, numpy.where(numpy.in1d(x[:,0], bad)), 0)
print(cleared.shape)

Prints:

(500000, 2)
(500,)
(499500, 2)

And it runs much faster than a ufunc. It uses some extra memory, but whether that's OK for you depends on how big your array is.

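Explanation

numpy.in1d returns an array the same length as x[:,0], containing True where the element is in the bad array and False otherwise.
numpy.where turns that True/False array into an array of integers holding the index positions where the array was True.
Those index positions are then passed to numpy.delete, telling it to delete along the first axis (0).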
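
As a rough illustration of the three steps above, here is a minimal sketch on the four-row example from the question. It assumes a NumPy version that provides numpy.isin, which the NumPy docs recommend over numpy.in1d for new code; on older versions numpy.in1d can be substituted for this 1-D case.

```python
import numpy as np

# Data from the question: column 0 is the id, column 1 is the rating.
x = np.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad = np.array([2, 3])

# Step 1: boolean mask, True where the id appears in bad.
mask = np.isin(x[:, 0], bad)

# Step 2: integer positions of the True entries.
rows = np.where(mask)[0]

# Step 3: delete those rows along axis 0 (numpy.delete returns a copy).
cleared = np.delete(x, rows, axis=0)
print(cleared)
```

Because numpy.delete returns a new array rather than modifying x in place, this is where the extra memory mentioned above comes from.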

Answer 2

Reproducing the problem specification from the OP:

import numpy as NP

X = NP.array('1 88 2 99 3 77 4 66'.split(), dtype=int).reshape(4, 2)
bad_ids = [3,2]
bad_ids = set(bad_ids)    # see jterrance comment below this Answer

Vectorize Python's built-in membership test, i.e., the X in Y syntax:

@NP.vectorize
def filter_bad_ids(id):
    return id not in bad_ids


>>> X_clean = X[filter_bad_ids(X[:,0])]
>>> X_clean                                # result
   array([[ 1, 88],
          [ 4, 66]])

Answer 3

If you want to remove the information for the bad IDs completely, try this:

x = x[numpy.in1d(x[:,0], bad, invert=True)]

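This solution uses fairly little memory and should be very fast. (bad is converted to a numpy array, so it should not be a set for this to work; see the note in http://docs.scipy.org/doc/numpy/reference/generated/numpy.in1d.html)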
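
To make the caveat about sets concrete, here is a small sketch. The names ids and bad_set are made up for illustration, and numpy.isin is assumed to be available (numpy.in1d takes the same arguments for 1-D input). The set is converted to a list before the membership test, because NumPy would otherwise wrap the whole set in a one-element object array instead of treating it as a collection of ids.

```python
import numpy as np

ids = np.array([1, 2, 3, 4])   # hypothetical id column
bad_set = {2, 3}               # hypothetical bad ids stored as a set

# Convert the set to a list first so each bad id is compared individually.
keep = np.isin(ids, list(bad_set), invert=True)
print(ids[keep])
```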
If bad is very small, it might be a bit faster to use this instead:

from functools import reduce
x = x[~reduce(numpy.logical_or, (x[:,0] == b for b in bad))]

Note: the first line is only needed in Python 3. Because it uses a generator, this also uses very little memory.
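
As a quick sanity check, the sketch below runs both variants from this answer on the example data from the question and compares the results. It assumes numpy.isin is available as a stand-in for numpy.in1d; everything else follows the snippets above.

```python
from functools import reduce

import numpy as np

x = np.array([[1, 88], [2, 99], [3, 77], [4, 66]])
bad = [2, 3]

# One boolean comparison per bad id, OR-ed together one at a time.
is_bad = reduce(np.logical_or, (x[:, 0] == b for b in bad))
via_reduce = x[~is_bad]

# The same selection through a single membership test.
via_isin = x[np.isin(x[:, 0], bad, invert=True)]

print(np.array_equal(via_reduce, via_isin))
```

One thing to watch with the reduce variant: with no initial value given, reduce raises a TypeError if bad is empty, so that case would need separate handling.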

Efficiently deleting rows in NumPy
