Question

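I have two (sorted) arrays, A and B, of different lengths. Each contains unique labels that are repeated a number of times. The count of each label in A is less than or equal to its count in B. Every label in A also appears in B, but some labels in B do not appear in A.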

I need an object of the same length as B where, for each label i in A (which occurs k_i times), the first k_i occurrences of label i in B are set to False. All other elements should be True.

The following code gives me what I need, but it could take a long time if A and B are large:

import numpy as np

# The labels and their frequency
A = np.array((1,1,2,2,3,4,4,4))
B = np.array((1,1,1,1,1,2,2,3,3,4,4,4,4,4,5,5))

A_uniq, A_count = np.unique(A, return_counts = True)
new_ind = np.ones(B.shape, dtype = bool)
for i in range(len(A_uniq)):
    new_ind[np.where(B == A_uniq[i])[0][:A_count[i]]] = False

print(new_ind)
#[False False  True  True  True False False False  True False False False
#  True  True  True  True]

Is there a faster or more efficient way to do this? I feel like I might be missing some obvious broadcasting or vectorized solution.

Answer 1

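Here's one with np.searchsorted:

# Start index in B of each unique label from A (B is sorted)
idx = np.searchsorted(B, A_uniq)
# One extra slot so that A_count + idx == len(B) stays in bounds
id_ar = np.zeros(len(B) + 1, dtype=int)
id_ar[idx] = 1               # +1 where a run of False begins
id_ar[A_count + idx] -= 1    # -1 just past the end of each run
out = id_ar[:-1].cumsum() == 0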

We can optimize further by using the sorted nature of A to compute A_uniq and A_count, instead of calling np.unique:

mask_A = np.r_[True,A[:-1]!=A[1:],True]
A_uniq, A_count = A[mask_A[:-1]], np.diff(np.flatnonzero(mask_A))
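
Putting the two snippets together, a minimal self-contained sketch might look like the following. The function name first_k_mask is made up for illustration and is not part of the answer. It assumes both arrays are sorted and that every label in A occurs in B at least as often as it does in A. The helper array is padded by one slot so that a run ending exactly at the end of B stays in bounds.

```python
import numpy as np

def first_k_mask(A, B):
    """Hypothetical helper: True everywhere except the first k_i
    occurrences in B of each label i that occurs k_i times in A."""
    A = np.asarray(A)
    B = np.asarray(B)
    if A.size == 0:
        # Nothing to mark, so every element of B stays True
        return np.ones(B.shape, dtype=bool)
    # True where a new label starts in sorted A, plus a sentinel at the end
    mask_A = np.r_[True, A[:-1] != A[1:], True]
    A_uniq = A[mask_A[:-1]]
    A_count = np.diff(np.flatnonzero(mask_A))
    # Start index in B of each unique label from A
    idx = np.searchsorted(B, A_uniq)
    # +1 at the start of each run, -1 just past its end
    id_ar = np.zeros(len(B) + 1, dtype=int)
    id_ar[idx] = 1
    id_ar[idx + A_count] -= 1
    # The running sum is 1 inside a run and 0 outside it
    return id_ar[:-1].cumsum() == 0

A = np.array((1, 1, 2, 2, 3, 4, 4, 4))
B = np.array((1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5))
print(first_k_mask(A, B))
```

For the arrays from the question this should reproduce the mask printed by the original loop.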

Answer 2

An example without numpy:

A = [1,1,2,2,3,4,4,4]
B = [1,1,1,1,1,2,2,3,3,4,4,4,4,4,5,5]

a_i = b_i = 0
while a_i < len(A):
  if A[a_i] == B[b_i]:
    a_i += 1
    B[b_i] = False
  else:
    B[b_i] = True
  b_i += 1
# fill the rest of B with True
B[b_i:] = [True] * (len(B) - b_i)
# [False, False, True, True, True, False, False, False, True, False, False, False, True, True, True, True]
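
The loop above overwrites B in place. If the original labels are still needed afterwards, the same two-pointer walk can write into a separate list instead. This is only a sketch: the name first_k_mask_py is hypothetical, and it makes the same assumptions as the question (both lists sorted, each label of A present in B at least as often as in A).

```python
def first_k_mask_py(A, B):
    """Hypothetical helper: same two-pointer idea, but B is left untouched."""
    mask = [True] * len(B)   # default is True; matches are flipped below
    a_i = 0                  # position of the next unmatched element of A
    for b_i, label in enumerate(B):
        if a_i == len(A):
            break            # all of A is matched; the rest stays True
        if A[a_i] == label:
            mask[b_i] = False
            a_i += 1
    return mask

A = [1, 1, 2, 2, 3, 4, 4, 4]
B = [1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5]
print(first_k_mask_py(A, B))
```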

Answer 3

This solution is inspired by @Divakar's solution, using itertools.groupby:

import numpy as np
from itertools import groupby
A = np.array((1, 1, 2, 2, 3, 4, 4, 4))
B = np.array((1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5))

indices = [key + i for key, group in groupby(np.searchsorted(B, A)) for i, _ in enumerate(group)]
result = np.ones_like(B, dtype=bool)  # the np.bool alias was removed in NumPy 1.24
result[indices] = False

print(result)

Output

[False False  True  True  True False False False  True False False False
  True  True  True  True]

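The idea is to use np.searchsorted to find the insertion position in B of each element of A. Equal elements get the same insertion position, so each one has to be shifted by its rank within its group (0, 1, 2, ...), hence the groupby. Then create an array of True and set the values at indices to False.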
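
The shift described above can also be computed without a Python-level loop. The following is a sketch of one possible variant, not the answer's own code: because A is sorted, np.searchsorted(A, A) gives the position where each element's run of equal labels starts, so subtracting it from the element's own position yields its rank within the run. The variable names start_in_B and rank_in_run are made up for illustration.

```python
import numpy as np

A = np.array((1, 1, 2, 2, 3, 4, 4, 4))
B = np.array((1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5))

# Index in B where each element's label first appears
start_in_B = np.searchsorted(B, A)
# Rank of each element within its own run of equal labels in A
rank_in_run = np.arange(len(A)) - np.searchsorted(A, A)
indices = start_in_B + rank_in_run

result = np.ones(B.shape, dtype=bool)
result[indices] = False
print(result)
```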

If you can use pandas, compute indices like this:

import pandas as pd

values = np.searchsorted(B, A)
indices = pd.Series(values).groupby(values).cumcount() + values

For each label in one array, set the first k occurrences in another array to False
