Question

I have a bottleneck in my program, caused by the following:

import numpy

A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = numpy.array([1,4,5,6,7,8,9])

C = numpy.array([i for i in A if i in B])

The expected result for C is the following:

C = [4 6 7 1 5 4 1 1 9]

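Is there a more efficient way to perform this operation?

Note that array A contains duplicate values, and these need to be kept. I can't use set intersection, because taking the intersection drops the duplicates and returns only [1,4,5,6,7,9].

Also note that this is only a simple demonstration. The actual arrays can hold anywhere from thousands to well over millions of elements.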

Answer 1

You can use np.in1d:

>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])

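np.in1d returns a boolean array indicating whether each value of A also appears in B. This array can then be used to index A and return the common values.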
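To make the two steps visible, here is a minimal sketch on the question's arrays. It uses np.isin, which recent NumPy releases recommend in place of np.in1d for the same membership test; the name `mask` is only illustrative.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B = np.array([1, 4, 5, 6, 7, 8, 9])

# Step 1: one boolean per element of A, True where that value occurs in B.
mask = np.isin(A, B)

# Step 2: boolean indexing keeps the True positions, so the duplicates
# and the original order of A are both preserved.
C = A[mask]

print(mask)
print(C)
```

Because the mask has the same length as A, the same mask can also be applied to any other array aligned with A.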

It's not relevant to your example, but it's worth mentioning that if A and B each contain only unique values, np.in1d can be sped up by setting assume_unique=True:

np.in1d(A, B, assume_unique=True)
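A small sketch of the case where the flag applies, assuming both inputs really are duplicate-free. The arrays `A_u` and `B_u` are made up for illustration, and np.isin is used as the newer name for the same test. If either array can contain duplicates, as A does in the question, the flag is best left at its default, since the result is then not guaranteed to be correct.

```python
import numpy as np

# Hypothetical inputs: neither array contains a repeated value.
A_u = np.array([10, 4, 6, 7, 1])
B_u = np.array([1, 4, 5, 6, 7, 8, 9])

# With unique inputs, NumPy may skip its internal de-duplication step.
fast = np.isin(A_u, B_u, assume_unique=True)

# The default call is the reference result to compare against.
safe = np.isin(A_u, B_u)

print(A_u[fast])
print(np.array_equal(fast, safe))
```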

You may also be interested in np.intersect1d, which returns an array of the unique values common to both arrays (sorted by value):

>>> np.intersect1d(A, B)
array([1, 4, 5, 6, 7, 9])

Answer 2

Use numpy.in1d:

>>> A[np.in1d(A, B)]
array([4, 6, 7, 1, 5, 4, 1, 1, 9])

Answer 3

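If you only check for existence in B (if i in B), then you can obviously use a set for this. It doesn't matter how many fours there are in B, as long as there is at least one. You are of course right that you can't use two sets and an intersection. But even a single set should improve performance, since a set lookup takes O(1) time on average, whereas searching an array takes O(n):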

import numpy

A = numpy.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
B = {1, 4, 5, 6, 7, 8, 9}  # a set, so each "i in B" test is a hash lookup

C = numpy.array([i for i in A if i in B])
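A variation on the same idea, as a sketch: both arrays are first converted to plain Python objects with tolist(), so the loop handles Python ints rather than NumPy scalars. Whether this is actually faster is an assumption worth timing on real data; `B_set` is an illustrative name.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B = np.array([1, 4, 5, 6, 7, 8, 9])

# Build the lookup set once, from plain Python ints.
B_set = set(B.tolist())

# Filter A in order; duplicates in A survive because only B is a set.
C = np.array([x for x in A.tolist() if x in B_set])

print(C)
```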

Answer 4

1. Use set intersection, since it is very fast in this case

c = set(a).intersection(b)

2. Use numpy's intersect1d method, which is faster than a loop but slower than the first method

c = numpy.intersect1d(a,b)
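For reference, a quick check of what these two calls return on the question's arrays (the names `c_set` and `c_np` are illustrative). Both yield only the unique common values, so the duplicates the question asks to keep are dropped, and the set version has no defined order.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B = np.array([1, 4, 5, 6, 7, 8, 9])

# Method 1: Python set intersection; the result is an unordered set.
c_set = set(A.tolist()).intersection(B.tolist())

# Method 2: np.intersect1d; the result is a sorted array of unique values.
c_np = np.intersect1d(A, B)

print(sorted(c_set))
print(c_np)
```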

Answer 5

We can use np.searchsorted for a performance boost, especially when the lookup array already holds sorted, unique values:

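import numpy as np

def intersect1d_searchsorted(A, B, assume_unique=False):
    # searchsorted needs a sorted lookup array; np.unique sorts and de-duplicates B.
    if not assume_unique:
        B_ar = np.unique(B)
    else:
        B_ar = B
    # Leftmost insertion index of each element of A in B_ar.
    idx = np.searchsorted(B_ar, A)
    # An index of len(B_ar) means the value exceeds max(B_ar); point it at 0, where the test below is False.
    idx[idx == len(B_ar)] = 0
    # Keep the elements of A that equal the value found at their index.
    return A[B_ar[idx] == A]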

The assume_unique flag lets it handle both the general case and the special case where B is already unique and sorted.

Sample run:

In [89]: A = np.array([10,4,6,7,1,5,3,4,24,1,1,9,10,10,18])
    ...: B = np.array([1,4,5,6,7,8,9])

In [90]: intersect1d_searchsorted(A,B,assume_unique=True)
Out[90]: array([4, 6, 7, 1, 5, 4, 1, 1, 9])
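To see why this works, the intermediate arrays can be traced on the same sample. This is only an illustrative walk-through of the steps inside the function; the variable names `hit` and `C` are not part of the answer's code.

```python
import numpy as np

A = np.array([10, 4, 6, 7, 1, 5, 3, 4, 24, 1, 1, 9, 10, 10, 18])
B_ar = np.array([1, 4, 5, 6, 7, 8, 9])  # already sorted and unique

# For each value in A, the leftmost index at which it could be inserted
# into B_ar while keeping B_ar sorted.
idx = np.searchsorted(B_ar, A)
print(idx)

# Values larger than every element of B_ar (10, 24, 18) get index 7,
# which is out of range, so redirect those to index 0.
idx[idx == len(B_ar)] = 0

# A value is present in B_ar exactly when the element at its index equals it.
hit = B_ar[idx] == A
C = A[hit]
print(C)
```

The value 3 shows the role of the final comparison: its insertion index is 1, but B_ar[1] is 4, so it is correctly rejected.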

Timings on large arrays for both cases, compared against the other vectorized solution based on np.in1d (listed in the other answers):

In [103]: A = np.random.randint(0,10000,(1000000))

In [104]: B = np.random.randint(0,10000,(1000000))

In [105]: %timeit A[np.in1d(A, B)]
     ...: %timeit A[np.in1d(A, B, assume_unique=False)]
     ...: %timeit intersect1d_searchsorted(A,B,assume_unique=False)
1 loop, best of 3: 197 ms per loop
10 loops, best of 3: 190 ms per loop
10 loops, best of 3: 151 ms per loop

In [106]: B = np.unique(np.random.randint(0,10000,(5000)))

In [107]: %timeit A[np.in1d(A, B)]
     ...: %timeit A[np.in1d(A, B, assume_unique=True)]
     ...: %timeit intersect1d_searchsorted(A,B,assume_unique=True)
10 loops, best of 3: 130 ms per loop
1 loop, best of 3: 218 ms per loop
10 loops, best of 3: 80.2 ms per loop
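To repeat the first comparison outside IPython, something like the following sketch could be used. It relies on the standard timeit module, uses np.isin as the newer name for the np.in1d test, and fixes an arbitrary random seed; the absolute numbers depend on the machine and NumPy version, so none are shown here.

```python
import timeit

import numpy as np


def intersect1d_searchsorted(A, B, assume_unique=False):
    # Same logic as the function above, repeated so this script is self-contained.
    B_ar = B if assume_unique else np.unique(B)
    idx = np.searchsorted(B_ar, A)
    idx[idx == len(B_ar)] = 0
    return A[B_ar[idx] == A]


rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
A = rng.integers(0, 10000, 1_000_000)
B = rng.integers(0, 10000, 1_000_000)

# Check that both approaches agree before comparing their speed.
res_isin = A[np.isin(A, B)]
res_ss = intersect1d_searchsorted(A, B)
assert np.array_equal(res_isin, res_ss)

candidates = [
    ("np.isin mask", lambda: A[np.isin(A, B)]),
    ("searchsorted", lambda: intersect1d_searchsorted(A, B)),
]
for label, fn in candidates:
    # Best of three single runs, similar in spirit to %timeit's "best of 3".
    best = min(timeit.repeat(fn, number=1, repeat=3))
    print(f"{label}: {best * 1e3:.1f} ms")
```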

Efficient way to compute intersecting values between two numpy arrays

5 answers: