Question

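In a loop that collects samples, I need to obtain some statistics about their sorted indices every now and then, and argsort returns exactly what I need for that. However, each iteration adds only a single sample, and passing the whole samples array to argsort every time is a huge waste of resources, especially since the samples array is very large. Isn't there an efficient incremental technique equivalent to argsort?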

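I believe an efficient incremental argsort function can be implemented by maintaining an ordered list of samples, which is searched for the proper insertion indices once a new sample arrives. Such indices can then be used both to keep the samples list ordered and to generate the desired incremental argsort-like output. So far, I have used @Divakar's searchsorted2d function, slightly modified to obtain the insertion indices, and built a routine that produces the desired output when it is called after each single sample insertion (b = 1). This is inefficient, though, and I would like to call the routine only once a batch of samples has been collected (e.g. b = 10). For bulk insertions, searchsorted2d seems to return incorrect indices, and that is where I stopped!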

import time
import numpy as np

# By Divakar
# See https://stackoverflow.com/a/40588862
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(m)[:,np.newaxis]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(b.shape)
    return p #- n * (np.arange(m)[:,np.newaxis])

# The following works with batch size b = 1,
# but that is not efficient ...
# Can we make it work for any b > 0 value?
class incremental(object):
    def __init__(self, shape):
        # Express each row offset
        self.ranks_offset = np.tile(np.arange(shape[1]).reshape(1, -1),
                                    (shape[0], 1))
        # Storage for sorted samples
        self.a_sorted = np.empty((shape[0], 0))
        # Storage for sort indices
        self.a_ranks = np.empty((shape[0], 0), int)  # np.int was removed in NumPy 1.24

    def argsort(self, a):
        if self.a_sorted.shape[1] == 0: # Use np.argsort for initialization
            self.a_ranks = a.argsort(axis=1)
            self.a_sorted = np.take_along_axis(a, self.a_ranks, 1)
        else: # In later iterations,
            # searchsorted the input increment
            indices = searchsorted2d(self.a_sorted, a)
            # insert the stack pos to track the sorting indices
            self.a_ranks = np.insert(self.a_ranks, indices.ravel(),
                                     self.ranks_offset.ravel() +
                                     self.a_ranks.shape[1]).reshape((a.shape[0], -1))
            # insert the increments to maintain a sorted input array
            self.a_sorted = np.insert(self.a_sorted, indices.ravel(),
                                      a.ravel()).reshape((a.shape[0], -1))
        return self.a_ranks

M = 1000 # number of iterations
n = 16   # vector size
b = 10   # vectors batch size

# Storage for samples
samples = np.zeros((n, M)) * np.nan

# The proposed approach
inc = incremental((n, b))

c = 0 # iterations counter
tick = time.time()
while c < M:
    if c % b == 0: # Perform batch computations
        #sample_ranks = samples[:,:c].argsort(axis=1)
        sample_ranks = inc.argsort(samples[:,max(0,c-b):c]) # Incremental argsort

        ######################################################
        # Utilize sample_ranks in some magic statistics here #
        ######################################################

    samples[:,c] = np.random.rand(n) # collect a sample
    c += 1 # increment the counter
tock = time.time()

last = ((c-1) // b) * b
sample_ranks_GT = samples[:,:last].argsort(axis=1) # Ground truth
print('Compatibility: {0:.1f}%'.format(
      100 * np.count_nonzero(sample_ranks == sample_ranks_GT) / sample_ranks.size))
print('Elapsed time: {0:.1f}ms'.format(
      (tock - tick) * 1000))

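I expect 100% compatibility with the argsort function, yet the solution needs to be more efficient than calling argsort. As for the execution time of an incremental approach, around 15 ms should be more than enough for the given example. So far, each technique I have explored meets only one of these two conditions.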

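Long story short, the algorithm shown above seems to be a variant of an order-statistic tree used to estimate data ranks, but it fails to do so when samples are added in bulk (b > 1). So far, it only works when samples are inserted one by one (b = 1). However, the arrays are copied every time insert is called, which causes a huge overhead and forms a bottleneck, so samples should be added in batches rather than individually.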

Can you suggest a more efficient incremental argsort algorithm, or at least figure out how to support bulk insertion (b > 1) in the approach above?

If you choose to start from where I stopped, the problem can be reduced to fixing the bug in the following snapshot:

import numpy as np

# By Divakar
# See https://stackoverflow.com/a/40588862
def searchsorted2d(a, b):
    m, n = a.shape
    max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * np.arange(m)[:,np.newaxis]
    p = np.searchsorted((a + r).ravel(), (b + r).ravel()).reshape(b.shape)
    # It seems the bug is around here...
    #return p - b.shape[0] * np.arange(b.shape[1])[np.newaxis]
    #return p - b.shape[1] * np.arange(b.shape[0])[:,np.newaxis]
    return p

n = 16  # vector size
b = 2   # vectors batch size

a = np.random.rand(n, 1) # Samples array
a_ranks = a.argsort(axis=1) # Initial ranks
a_sorted = np.take_along_axis(a, a_ranks, 1) # Initial sorted array

new_data = np.random.rand(n, b) # New block to append into the samples array
a = np.hstack((a, new_data)) #Append new block

indices = searchsorted2d(a_sorted, new_data) # Compute insertion indices
ranks_offset = np.tile(np.arange(b).reshape(1, -1), (a_ranks.shape[0], 1)) + a_ranks.shape[1] # Ranks to insert
a_ranks = np.insert(a_ranks, indices.ravel(), ranks_offset.ravel()).reshape((n, -1)) # Insert ranks according to their indices
a_ransk_GT = a.argsort(axis=1) # Ranks ground truth

mask = (a_ranks == a_ransk_GT)
print(mask) # Why are they not all True?
assert(np.all(mask)), 'Oops!' # This should not fail, but it does :(

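My initial thoughts suggest that the problem is related to the bulk insertion itself, and that searchsorted2d is not to blame. Take a sorted array a = [1, 2, 5] and two new elements to be inserted, block = [3, 4]. If we iterate and insert, then np.searchsorted(a, block[i]) returns [2] and [3], which is fine. However, if we call np.searchsorted(a, block) (the desired behavior, equivalent to iterating without inserting), we get [2, 2]. This is problematic for implementing an incremental argsort, since even np.searchsorted(a, block[::-1]) yields the same result. Any ideas?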
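The example in the paragraph above can be reproduced in a few lines. This is only an illustrative sketch; the variable names other than `a` and `block` are made up for the demonstration.

```python
import numpy as np

a = np.array([1, 2, 5])        # already sorted
block = np.array([3, 4])       # new elements, in ascending order
rev = block[::-1]              # the same elements, out of order

# A single batched call cannot "see" earlier insertions,
# so both elements get the same index in either order.
idx_fwd = np.searchsorted(a, block)   # [2 2]
idx_rev = np.searchsorted(a, rev)     # [2 2]

# np.insert keeps the given order of values that share an index.
merged_fwd = np.insert(a, idx_fwd, block)  # [1 2 3 4 5], still sorted
merged_rev = np.insert(a, idx_rev, rev)    # [1 2 4 3 5], no longer sorted

print(idx_fwd, merged_fwd)
print(idx_rev, merged_rev)
```

The insertion indices are identical in both cases, so whether the result stays sorted depends only on the order of the values inside the block.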

Answer 1

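It turns out that the indices returned by searchsorted are not enough to guarantee a sorted array when dealing with batch inputs. If the block to be inserted contains two entries that are out of order but end up adjacent in the target array, they receive exactly the same insertion index and are therefore inserted in their current order, which causes the glitch. The input block itself therefore needs to be sorted before insertion. See the last paragraph of the question for a numerical example.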
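For a single vector, the fix described above might look like the following minimal sketch. The helper name `insert_block` and its signature are hypothetical, chosen only for illustration; `side="right"` and `kind="stable"` are assumptions made so that ties are resolved the same way as a stable argsort.

```python
import numpy as np

def insert_block(a_sorted, a_ranks, block):
    """Merge an unsorted 1-D block into a sorted array and its argsort.

    Hypothetical helper: a_sorted holds the samples seen so far in
    ascending order, a_ranks holds their original positions.
    """
    # Sort the block first, so entries sharing an insertion index
    # are already in ascending order when np.insert places them.
    order = np.argsort(block, kind="stable")
    block_sorted = block[order]
    # Original position of each block entry in the full samples array.
    new_ranks = a_ranks.size + order
    idx = np.searchsorted(a_sorted, block_sorted, side="right")
    return (np.insert(a_sorted, idx, block_sorted),
            np.insert(a_ranks, idx, new_ranks))

# Usage: grow a vector in blocks of three and compare with a full argsort.
rng = np.random.default_rng(0)
samples = rng.random(5)
a_ranks = np.argsort(samples, kind="stable")
a_sorted = samples[a_ranks]
for _ in range(4):
    block = rng.random(3)
    a_sorted, a_ranks = insert_block(a_sorted, a_ranks, block)
    samples = np.concatenate([samples, block])

matches = np.array_equal(a_ranks, np.argsort(samples, kind="stable"))
print(matches)
```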

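By sorting the input block and adapting the remaining parts accordingly, a solution that is 100.0% compatible with argsort is obtained, and it is very efficient (the elapsed time is 15.6 ms for inserting 1000 entries in blocks of ten, b = 10). This can be reproduced by replacing the buggy incremental class found in the question with one that sorts each incoming block before computing its insertion indices.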
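One possible shape for such a batch-aware class is sketched below. It is not necessarily the answerer's exact code: the class name `IncrementalArgsort`, its constructor argument, and the per-row loop over `np.searchsorted` (used here instead of `searchsorted2d`, on the assumption that the number of rows is small) are all choices made for this illustration, and no timing claim is attached to it.

```python
import numpy as np

class IncrementalArgsort:
    """Hypothetical batch-aware variant of the question's `incremental` class."""

    def __init__(self, n_rows):
        self.a_sorted = np.empty((n_rows, 0))                # sorted samples
        self.a_ranks = np.empty((n_rows, 0), dtype=np.intp)  # sort indices

    def argsort(self, block):
        m, seen = self.a_ranks.shape
        if block.shape[1] == 0:          # nothing new to insert
            return self.a_ranks
        # Sort the incoming block row by row before searching.
        order = np.argsort(block, axis=1, kind="stable")
        block_sorted = np.take_along_axis(block, order, axis=1)
        new_ranks = order + seen         # column positions in the full array
        # Per-row insertion points (plain loop; assumes few rows).
        idx = np.stack([
            np.searchsorted(self.a_sorted[i], block_sorted[i], side="right")
            for i in range(m)
        ])
        # Convert per-row indices into indices of the flattened arrays.
        flat = (idx + seen * np.arange(m)[:, np.newaxis]).ravel()
        self.a_sorted = np.insert(self.a_sorted.ravel(), flat,
                                  block_sorted.ravel()).reshape(m, -1)
        self.a_ranks = np.insert(self.a_ranks.ravel(), flat,
                                 new_ranks.ravel()).reshape(m, -1)
        return self.a_ranks

# Usage: feed the samples in blocks of b columns.
rng = np.random.default_rng(0)
n, b, M = 16, 10, 1000
samples = rng.random((n, M))
inc = IncrementalArgsort(n)
for c in range(0, M, b):
    sample_ranks = inc.argsort(samples[:, c:c + b])

compatible = np.array_equal(sample_ranks,
                            samples.argsort(axis=1, kind="stable"))
print('Compatible with argsort:', compatible)
```

Every row receives exactly `b` new entries per call, which is why the flattened arrays can be reshaped back to `m` rows after insertion. Each call still copies the stored arrays through `np.insert`, so larger batches mean fewer copies.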

How to efficiently argsort vectors incrementally in Python?
