I have a numpy array of indices stored with shape (n, 2). For example:
[[0, 1],
[2, 3],
[1, 2],
[4, 2]]
I then do some processing and create an array of shape (m, 2), where n > m. For example:
[[2, 3]
[4, 2]]
Now I want to delete every row of the first array that can also be found in the second array. So the result I want is:
[[0, 1],
[1, 2]]
My current solution is the following:
result = first_array
for row in second_array:
    # Compare against the current row and keep accumulating the deletions
    result = np.delete(result, np.where(np.all(result == row, axis=1))[0], axis=0)
However, this is quite time-consuming if the second array is large. Does anyone know a numpy-only solution that does not require a loop?
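Answer 0 (score: 2)
Here is one approach that exploits the fact that the numbers are positive, using matrix multiplication to reduce each row to a single scalar: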
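def setdiff_nd_positivenums(a,b):
    s = np.maximum(a.max(0)+1,b.max(0)+1)
    # Mixed-radix weights [1, s[0], s[0]*s[1], ...] so distinct rows get distinct keys
    w = np.append(1, s[:-1]).cumprod()
    return a[~np.isin(a.dot(w),b.dot(w))]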
Sample run:
In [82]: a
Out[82]:
array([[0, 1],
[2, 3],
[1, 2],
[4, 2]])
In [83]: b
Out[83]:
array([[2, 3],
[4, 2]])
In [85]: setdiff_nd_positivenums(a,b)
Out[85]:
array([[0, 1],
[1, 2]])
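The row-to-scalar reduction can also be delegated to np.ravel_multi_index, which maps each row of non-negative integers to its flat index in a grid of the given dimensions, so distinct rows always get distinct keys. This is only a minimal sketch under that non-negativity assumption; row_keys is a hypothetical helper name, not part of the answer above.

```python
import numpy as np

def row_keys(arr, dims):
    # Flat index of every row inside a grid of shape `dims`
    # (requires 0 <= arr[:, j] < dims[j] for every column j).
    return np.ravel_multi_index(tuple(arr.T), dims)

a = np.array([[0, 1], [2, 3], [1, 2], [4, 2]])
b = np.array([[2, 3], [4, 2]])

# Per-column upper bound shared by both arrays
dims = tuple(int(d) for d in np.maximum(a.max(0), b.max(0)) + 1)

result = a[~np.isin(row_keys(a, dims), row_keys(b, dims))]
print(result)
```

For the sample arrays this keeps the rows [0, 1] and [1, 2].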
Also, it seems that the second array b is a subset of a. So we can exploit that with np.searchsorted to boost performance further, like so:
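def setdiff_nd_positivenums_searchsorted(a,b):
    s = np.maximum(a.max(0)+1,b.max(0)+1)
    # Mixed-radix weights so distinct rows get distinct 1D keys
    w = np.append(1, s[:-1]).cumprod()
    a1D,b1D = a.dot(w),b.dot(w)
    b1Ds = np.sort(b1D)
    # Clip so that keys larger than b1Ds[-1] do not index past the end
    idx = np.minimum(np.searchsorted(b1Ds,a1D), len(b1Ds)-1)
    return a[b1Ds[idx] != a1D]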
Timings:
In [146]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [147]: %timeit setdiff_nd_positivenums(a,b)
...: %timeit setdiff_nd_positivenums_searchsorted(a,b)
10 loops, best of 3: 101 ms per loop
10 loops, best of 3: 70.9 ms per loop
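One detail of the searchsorted lookup is worth isolating: np.searchsorted returns len(sorted_ref) for any key larger than every reference key, which would index out of bounds. A minimal sketch of a guarded membership test on 1D keys follows; in_sorted is a hypothetical helper, and it assumes the reference array is non-empty.

```python
import numpy as np

def in_sorted(keys, sorted_ref):
    # Position where each key would be inserted to keep sorted_ref sorted
    idx = np.searchsorted(sorted_ref, keys)
    # Keys beyond the last reference value get index len(sorted_ref);
    # redirect them to 0, where the equality test below is False anyway.
    idx[idx == len(sorted_ref)] = 0
    return sorted_ref[idx] == keys

keys = np.array([1, 11, 6, 18, 30])
sorted_ref = np.sort(np.array([18, 11]))
mask = in_sorted(keys, sorted_ref)
print(mask)
```

The key 30 is larger than every reference key and is reported as absent rather than raising an IndexError.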
For generic numbers, here is another one using views:
# https://stackoverflow.com/a/45313353/ @Divakar
def view1D(a, b): # a, b are arrays with the same dtype and number of columns
    a = np.ascontiguousarray(a)
    b = np.ascontiguousarray(b)
    # One opaque (void) element per row, so whole rows compare as raw bytes
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel(), b.view(void_dt).ravel()

def setdiff_nd(a,b):
    # a,b are the nD input arrays
    A,B = view1D(a,b)
    return a[~np.isin(A,B)]
Sample run:
In [94]: a
Out[94]:
array([[ 0, 1],
[-2, -3],
[ 1, 2],
[-4, -2]])
In [95]: b
Out[95]:
array([[-2, -3],
[ 4, 2]])
In [96]: setdiff_nd(a,b)
Out[96]:
array([[ 0, 1],
[ 1, 2],
[-4, -2]])
Timings:
In [158]: np.random.seed(0)
...: a = np.random.randint(0,9,(1000000,2))
...: b = a[np.random.choice(len(a), 10000, replace=0)]
In [159]: %timeit setdiff_nd(a,b)
1 loop, best of 3: 352 ms per loop
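When the second array is small, a plain broadcasting comparison is a handy cross-check for the faster functions, since it works for any dtype and sign. This is a different technique from the views approach, sketched here as a reference only: it builds an (n, m, k) boolean array, so memory grows with n * m, and setdiff_rows_broadcast is a hypothetical name.

```python
import numpy as np

def setdiff_rows_broadcast(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # (n, 1, k) == (1, m, k) broadcasts to (n, m, k):
    # all() over k -> row a[i] equals row b[j]; any() over m -> a[i] is in b
    in_b = (a[:, None, :] == b[None, :, :]).all(axis=-1).any(axis=-1)
    return a[~in_b]

a = np.array([[0, 1], [-2, -3], [1, 2], [-4, -2]])
b = np.array([[-2, -3], [4, 2]])
result = setdiff_rows_broadcast(a, b)
print(result)
```

On the sample with negative numbers this keeps [0, 1], [1, 2] and [-4, -2].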
Answer 1 (score: 1)
Here is a function that works with 2D integer arrays of any shape and accepts both positive and negative numbers:
import numpy as np

# Gets a boolean array of rows of a that are in b
def isin_rows(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Subtract minimum value per column
    mins = np.minimum(a.min(0), b.min(0))
    a = a - mins
    b = b - mins
    # Number of distinct values per column (maximum + 1 after the shift)
    sizes = np.maximum(a.max(0), b.max(0)) + 1
    # Compute multiplicative base for each column: [1, sizes[0], sizes[0]*sizes[1], ...]
    base = np.roll(sizes, 1)
    base[0] = 1
    base = np.cumprod(base)
    # Make flattened version of arrays
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Check elements of a in b
    return np.isin(a_flat, b_flat)

# Test
a = np.array([[0, 1],
[2, 3],
[1, 2],
[4, 2]])
b = np.array([[2, 3],
[4, 2]])
a_in_b_mask = isin_rows(a, b)
a_not_in_b = a[~a_in_b_mask]
print(a_not_in_b)
# [[0 1]
# [1 2]]
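One caveat of flattening rows into a single integer: the largest key is the product of the per-column value ranges minus one, which can overflow a fixed-width integer when the ranges are wide or there are many columns. The following is a hedged sketch of a pre-check using Python's arbitrary-precision integers; flat_key_fits is a hypothetical helper, not part of the answer.

```python
import numpy as np

def flat_key_fits(a, b, dtype=np.int64):
    a = np.asarray(a)
    b = np.asarray(b)
    lo = np.minimum(a.min(0), b.min(0)).tolist()
    hi = np.maximum(a.max(0), b.max(0)).tolist()
    # Multiply the per-column range sizes as Python ints (no overflow)
    total = 1
    for l, h in zip(lo, hi):
        total *= h - l + 1
    # The largest flattened key is total - 1
    return total - 1 <= np.iinfo(dtype).max

small = np.array([[0, 1], [4, 3]], dtype=np.int64)
wide = np.array([[0, 0], [2**40, 2**40]], dtype=np.int64)
print(flat_key_fits(small, small), flat_key_fits(wide, wide))
```

If the check fails, a row-wise comparison that does not build integer keys is the safer choice.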
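Edit: Considering the number of possible rows in b suggests a possible optimization. If b has more rows than the number of possible combinations, then you can find its unique elements first so that np.isin is faster: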
import numpy as np

def isin_rows_opt(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    mins = np.minimum(a.min(0), b.min(0))
    a = a - mins
    b = b - mins
    # Number of distinct values per column (maximum + 1 after the shift)
    sizes = np.maximum(a.max(0), b.max(0)) + 1
    base = np.roll(sizes, 1)
    base[0] = 1
    base = np.cumprod(base)
    a_flat = (a * base).sum(1)
    b_flat = (b * base).sum(1)
    # Count number of possible different rows for b
    num_possible_b = np.prod(b.max(0) - b.min(0) + 1)
    if len(b_flat) > num_possible_b:  # May tune this condition
        b_flat = np.unique(b_flat)
    return np.isin(a_flat, b_flat)

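The condition len(b_flat) > num_possible_b should probably be tuned better, so that you only look for unique elements when it is really worth it (maybe len(b_flat) > 2 * num_possible_b or len(b_flat) > num_possible_b + CONSTANT). It seems to give some improvement for big arrays with few distinct values: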
import numpy as np
# Test setup from @Divakar
np.random.seed(0)
a = np.random.randint(0, 9, (1000000, 2))
b = a[np.random.choice(len(a), 10000, replace=0)]
print(np.all(isin_rows(a, b) == isin_rows_opt(a, b)))
# True
%timeit isin_rows(a, b)
# 100 ms ± 425 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit isin_rows_opt(a, b)
# 81.2 ms ± 324 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
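When the number of possible rows is small, as in this test setup, another option is to skip np.isin and mark the rows of b in a boolean lookup table indexed by the shifted column values. This is a different technique from the one in this answer, sketched under the assumption that the table (one cell per possible row) fits comfortably in memory; isin_rows_table is a hypothetical name.

```python
import numpy as np

def isin_rows_table(a, b):
    a = np.asarray(a)
    b = np.asarray(b)
    # Shift both arrays so every column starts at 0
    lo = np.minimum(a.min(0), b.min(0))
    sizes = np.maximum(a.max(0), b.max(0)) - lo + 1
    # One boolean cell per possible row
    table = np.zeros(tuple(int(s) for s in sizes), dtype=bool)
    # Mark the rows present in b, then look up every row of a
    table[tuple((b - lo).T)] = True
    return table[tuple((a - lo).T)]

a = np.array([[0, 1], [2, 3], [1, 2], [4, 2]])
b = np.array([[2, 3], [4, 2]])
mask = isin_rows_table(a, b)
print(a[~mask])
```

The lookup costs time proportional to len(a) + len(b), but the table grows with the product of the column ranges, so it only suits narrow value ranges.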
Answer 2 (score: 1)
The numpy-indexed package (disclaimer: I am its author) is designed to perform operations of this kind efficiently on nd-arrays.
import numpy_indexed as npi
# if the output should consist of unique values and there is no need to preserve ordering
result = npi.difference(first_array, second_array)
# otherwise:
result = first_array[~npi.in_(first_array, second_array)]